Developing a Unified Framework for Contextual Multi-Modal Reasoning in Document Understanding
Published:
03 December 2025
by MDPI
in The 6th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence
Abstract:
Understanding real-world documents is no longer just about reading text. Modern documents combine written content with layouts, tables, forms, and even images, creating a level of complexity that traditional natural language processing tools struggle to handle. Many existing approaches treat text, structure, and visuals separately, or fuse them too late, and therefore miss the context that arises from how these elements interact. This research presents a unified framework that integrates text, layout, and visuals into a single, explainable system. Built on a transformer backbone, the framework uses cross-modal attention and graph-based reasoning to capture relationships across modalities more naturally. The framework will be tested on benchmark datasets, including FUNSD, DocVQA, SROIE, RVL-CDIP, and PubLayNet, which cover tasks such as key-value extraction, document classification, and visual question answering. We will evaluate results using standard metrics (F1, exact match, IoU) and add interpretability checks to ensure transparency. We expect the system to deliver stronger accuracy, adaptability, and interpretability than current methods. Beyond these technical gains, this research aims to support practical automation in precision-critical domains such as healthcare, legal services, and government, while advancing the foundations of multimodal reasoning in Artificial Intelligence.
Keywords: Multimodal Document Understanding; Contextual Reasoning in AI; Cross-Modal Attention; Transformer-Based Frameworks; Graph-Based Reasoning
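To make the cross-modal attention idea from the abstract concrete, the following is a minimal sketch in PyTorch of how text tokens could attend over visual/layout region features. It is illustrative only: the module name, dimensions, number of heads, and the residual fusion step are assumptions for this sketch, not the published architecture.

```python
# Minimal sketch of cross-modal attention between text and visual/layout features.
# All names, sizes, and the residual fusion choice are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Lets text tokens attend over visual/layout region features."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, n_text, dim) from a transformer text encoder
        # region_feats: (batch, n_regions, dim) from a visual/layout encoder
        fused, _ = self.attn(query=text_tokens, key=region_feats, value=region_feats)
        return self.norm(text_tokens + fused)  # residual fusion of the two modalities

# Example usage with random features: 4 documents, 128 text tokens, 32 regions.
text = torch.randn(4, 128, 768)
regions = torch.randn(4, 32, 768)
print(CrossModalAttention()(text, regions).shape)  # torch.Size([4, 128, 768])
```

In a full system along the lines described here, a stack of such blocks would sit between the unimodal encoders and the graph-based reasoning layer, with task heads (key-value extraction, classification, VQA) on top.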
