Evaluating the quality of generative artificial intelligence in healthcare: a systematic review

Published: 04 December 2024
by MDPI
in The 5th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence

Abstract: The burgeoning use of Large Language Models (LLMs) in healthcare has spurred a need for robust methods to evaluate the quality, reliability, and efficacy of their outputs. This systematic review aims to map the landscape of existing methods for evaluating texts and other outputs generated by LLMs in the healthcare domain. The review protocol was registered on PROSPERO. A comprehensive search was conducted across multiple databases, including PubMed, IEEE Xplore, Google Scholar, and Scopus, focusing on studies published between 2010 and 2023. The inclusion criteria encompassed original articles discussing methodologies for assessing the performance of LLMs in generating clinical and healthcare-related content. The review identifies a variety of evaluation techniques, broadly categorized into quantitative and qualitative methods. Quantitative assessments typically rely on metrics such as accuracy, precision, recall, and F1 score, particularly in tasks like clinical documentation, diagnostic support, and patient communication. Qualitative methods, by contrast, emphasize human judgment, focusing on aspects such as coherence, adequacy, relevance, and readability, often through expert panel reviews and user satisfaction surveys. Additionally, the review highlights challenges unique to the healthcare context, such as the need for domain-specific knowledge, the handling of sensitive patient data, and the potential for bias in AI-generated content. The findings underscore the importance of interdisciplinary collaboration in developing and validating evaluation frameworks that measure not only technical performance but also ethical and practical implications. In conclusion, this review provides a comprehensive overview of current evaluation methods for generative AI in healthcare, identifies gaps in the existing literature, and proposes directions for future research to enhance the assessment of these advanced technologies in medical settings.

Keywords: Large Language Models (LLMs); healthcare evaluation methods; quantitative and qualitative assessments; healthcare content; interdisciplinary collaboration
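As a concrete illustration of the quantitative metrics the abstract names (accuracy, precision, recall, and F1 score), the sketch below computes them from a binary confusion matrix, e.g. an LLM's yes/no diagnostic-support answers scored against reference labels. The function name and the example counts are hypothetical, chosen only for illustration; they do not come from the reviewed studies.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (accuracy, precision, recall, F1) for binary counts.

    tp/fp/fn/tn are true-positive, false-positive, false-negative,
    and true-negative counts from a confusion matrix.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts for illustration only.
acc, p, r, f1 = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints "accuracy=0.850 precision=0.889 recall=0.800 f1=0.842"
```

In practice, libraries such as scikit-learn provide equivalent implementations; the hand-rolled version above only makes the arithmetic behind the reported metrics explicit.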