Harnessing Vision-Language Models for Improved X-ray Interpretation and Diagnosis
1  Department of Artificial Intelligence, Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, 23460 Topi, Khyber Pakhtunkhwa, Pakistan.
2  Department of Business, University of Europe for Applied Sciences, Think Campus, 14469 Potsdam, Germany.
3  Artificial Intelligence Research (AIR) Group, Department of Artificial Intelligence, Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, 23460 Topi, Khyber Pakhtunkhwa, Pakistan.
Academic Editor: Eugenio Vocaturo

Abstract:

As AI and transformer architectures advance, Vision-Language Models (VLMs) are poised to have a substantial impact on medical diagnostics wherever image-based data must be interpreted.
X-rays play a key role in radiological diagnosis, helping clinicians detect a wide range of illnesses.
This research trains a Vision-Language Model, VisualBERT, to diagnose diseases from written patient information combined with X-ray images.
In recent years, researchers worldwide have applied various AI models to identify different health problems; however, because Vision-Language Models are comparatively new, they have mostly been used for general-purpose tools such as chatbots.
These models could nonetheless be valuable for automating disease diagnosis from X-rays.
The proposed methodology merges three separate datasets into a single comprehensive dataset covering many diseases detectable in X-ray images. By bringing together different datasets, the model learns to recognize a wide range of conditions, which improves its diagnostic capability and practical usefulness, as the sketch below illustrates.
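A minimal sketch of such a merge, assuming each source ships as a CSV with an image path, a free-text note, and a disease label; the file names and column names below are hypothetical placeholders, not the specific datasets used in this paper:

import pandas as pd

# Each hypothetical source provides an image path, a clinical note, and a
# disease label under its own column names; rename them to a shared schema.
sources = [
    ("xray_dataset_a.csv", {"img_file": "image_path", "report": "text", "finding": "label"}),
    ("xray_dataset_b.csv", {"path": "image_path", "clinical_note": "text", "diagnosis": "label"}),
    ("xray_dataset_c.csv", {"image": "image_path", "notes": "text", "disease": "label"}),
]

frames = []
for csv_path, rename_map in sources:
    df = pd.read_csv(csv_path).rename(columns=rename_map)
    frames.append(df[["image_path", "text", "label"]])

# Concatenate and drop records duplicated across sources.
combined = pd.concat(frames, ignore_index=True).drop_duplicates()
combined.to_csv("combined_xray_dataset.csv", index=False)
print(len(combined), "examples,", combined["label"].nunique(), "disease labels")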
This method shows how VLMs can improve medical diagnostics, giving clinicians a tool to detect diseases faster and more consistently.
VisualBERT is trained on the combined dataset to diagnose diseases from visual as well as textual data.
Accuracy is used as the evaluation metric, and after fine-tuning the model on the combined dataset we obtained a 60% accuracy score.
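A minimal sketch of one fine-tuning step, assuming the publicly available Hugging Face VisualBERT checkpoint and detector-extracted 2048-dimensional region features; the label count, random visual features, and dummy target are placeholders:

import torch
from torch import nn
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
backbone = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
num_labels = 14  # hypothetical number of disease classes in the merged dataset
classifier = nn.Linear(backbone.config.hidden_size, num_labels)

# Tokenize the textual patient information.
inputs = tokenizer("Patient reports persistent cough and chest pain.",
                   return_tensors="pt")

# Random placeholders standing in for real object-detector output of
# shape (batch, num_regions, 2048), e.g. from a Faster R-CNN backbone.
inputs.update({
    "visual_embeds": torch.randn(1, 36, 2048),
    "visual_attention_mask": torch.ones(1, 36, dtype=torch.long),
    "visual_token_type_ids": torch.ones(1, 36, dtype=torch.long),
})

outputs = backbone(**inputs)
logits = classifier(outputs.pooler_output)               # (1, num_labels)
loss = nn.functional.cross_entropy(logits, torch.tensor([3]))  # dummy target
loss.backward()  # an optimizer step would follow in a real training loop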
To demonstrate the model, a front-end application was built with Streamlit, making the system easier for end users to interact with.
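A minimal sketch of such a front end, with a placeholder predict() standing in for the fine-tuned VisualBERT pipeline:

import streamlit as st
from PIL import Image

def predict(image, text):
    # Placeholder: the real app would extract region features from the
    # X-ray and run them, with the text, through the fine-tuned VisualBERT.
    return "pneumonia (placeholder)"

st.title("X-ray Diagnosis with VisualBERT")
uploaded = st.file_uploader("Upload a chest X-ray", type=["png", "jpg", "jpeg"])
notes = st.text_area("Patient information (symptoms, history)")

if uploaded is not None and st.button("Diagnose"):
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded X-ray")
    st.success(f"Predicted finding: {predict(image, notes)}")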

Keywords: Vision Language Model; Medical Imaging; Computer Vision; Natural Language Processing