Developing a Unified Framework for Contextual Multi-Modal Reasoning in Document Understanding
* 1 , * 2
1  Computer Science Department, Nile University of Nigeria, Abuja 900108, Nigeria
2  Software Engineering Department, Nile University of Nigeria, Abuja 900108, Nigeria
Academic Editor: Francesco Arcadio

Abstract:

Understanding real-world documents is no longer just about reading text. Modern documents combine written content with layouts, tables, forms, and images, creating a level of complexity that traditional natural language processing tools struggle to handle. Many existing approaches process text, structure, and visuals separately, or fuse them too late, and so miss the context that comes from how these elements interact. This research presents a unified framework that integrates text, layout, and visuals into a single, explainable system. Built on a transformer backbone, the framework uses cross-modal attention and graph-based reasoning to capture relationships across modalities. The framework will be tested on benchmark datasets, including FUNSD, DocVQA, SROIE, RVL-CDIP, and PubLayNet, which cover tasks such as key-value extraction, document classification, and visual question answering. We will evaluate results using standard metrics (F1, exact match, IoU) and add interpretability checks to ensure transparency. We expect the system to deliver higher accuracy, adaptability, and interpretability than current methods. Beyond technical gains, this research aims to support practical automation in high-precision domains such as healthcare, legal services, and government, while advancing the foundations of multimodal reasoning in Artificial Intelligence.
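As an illustration of the metrics named in the abstract (F1, exact match, IoU), the following is a minimal Python sketch under stated assumptions, not the authors' evaluation code. The function names, the lowercase and whitespace normalization, and the (x0, y0, x1, y1) box format are all hypothetical choices made for this example; official benchmark scorers may normalize and aggregate differently.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing (assumed normalization)."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over whitespace-separated, lowercased tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def box_iou(a, b) -> float:
    """Intersection over union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

In this sketch, exact match and token-level F1 would apply to extracted field values or answers, while IoU would apply to predicted layout regions compared against annotated ones.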

Keywords: Multimodal Document Understanding; Contextual Reasoning in AI; Cross-Modal Attention; Transformer-Based Frameworks; Graph-Based Reasoning