Developing a Unified Framework for Contextual Multi-Modal Reasoning in Document Understanding
Published:
03 December 2025
by MDPI
in The 6th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence
Abstract:
Understanding real-world documents is no longer just about reading text. Modern documents combine written content with layouts, tables, forms, and even images, creating a level of complexity that traditional natural language processing tools struggle to handle. Many existing approaches treat text, structure, and visuals separately, or fuse them too late, and therefore miss the context that arises from how these elements interact. This research presents a unified framework that integrates text, layout, and visuals into a single, explainable system. Built on a transformer backbone, the framework uses cross-modal attention and graph-based reasoning to capture relationships across modalities more naturally. The framework will be tested on benchmark datasets, including FUNSD, DocVQA, SROIE, RVL-CDIP, and PubLayNet, which cover tasks such as key-value extraction, document classification, and visual question answering. We will evaluate results using standard metrics (F1, exact match, IoU) and add interpretability checks to ensure transparency. We expect the system to deliver stronger accuracy, adaptability, and interpretability than current methods. Beyond these technical gains, this research aims to support practical automation in precision-critical domains such as healthcare, legal services, and government, while advancing the foundations of multimodal reasoning in Artificial Intelligence.
Keywords: Multimodal Document Understanding; Contextual Reasoning in AI; Cross-Modal Attention; Transformer-Based Frameworks; Graph-Based Reasoning
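To make the cross-modal attention idea from the abstract concrete, the following is a minimal sketch in PyTorch of how text tokens could attend over visual/layout region features. It is illustrative only: the module name, dimensions, number of heads, and the residual fusion step are assumptions for this sketch, not the published architecture.

```python
# Minimal sketch of cross-modal attention between text and visual/layout features.
# All names, sizes, and the residual fusion choice are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Lets text tokens attend over visual/layout region features."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, n_text, dim) from a transformer text encoder
        # region_feats: (batch, n_regions, dim) from a visual/layout encoder
        fused, _ = self.attn(query=text_tokens, key=region_feats, value=region_feats)
        return self.norm(text_tokens + fused)  # residual fusion of the two modalities

# Example usage with random features: 4 documents, 128 text tokens, 32 regions.
text = torch.randn(4, 128, 768)
regions = torch.randn(4, 32, 768)
print(CrossModalAttention()(text, regions).shape)  # torch.Size([4, 128, 768])
```

In a full system along the lines described here, a stack of such blocks would sit between the unimodal encoders and the graph-based reasoning layer, with task heads (key-value extraction, classification, VQA) on top.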
