Evaluating the quality of generative artificial intelligence in healthcare: a systematic review
Published:
04 December 2024
by MDPI
in The 5th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence
Abstract:
The burgeoning use of Large Language Models (LLMs) in healthcare has spurred a need for robust evaluation methods to assess the quality, reliability, and efficacy of their outputs. This systematic review maps the landscape of existing methods for evaluating texts and other outputs generated by LLMs in the healthcare domain. The review protocol was registered on PROSPERO. A comprehensive search was conducted across multiple databases, including PubMed, IEEE Xplore, Google Scholar, and Scopus, focusing on studies published between 2010 and 2023. The inclusion criteria encompassed original articles discussing methodologies for assessing the performance of LLMs in generating clinical and healthcare-related content. The review identifies a variety of evaluation techniques, broadly categorized into quantitative and qualitative methods. Quantitative assessments often rely on metrics such as accuracy, precision, recall, and F1 score, particularly in tasks like clinical documentation, diagnostic support, and patient communication. Qualitative methods, by contrast, emphasize human judgment, focusing on aspects such as coherence, adequacy, relevance, and readability, often through expert panel reviews and user satisfaction surveys. The review also highlights challenges unique to the healthcare context, such as the need for domain-specific knowledge, the handling of sensitive patient data, and the potential for bias in AI-generated content. The findings underscore the importance of interdisciplinary collaboration in developing and validating evaluation frameworks that measure not only technical performance but also ethical and practical implications. In conclusion, this review provides a comprehensive overview of current evaluation methods for generative AI in healthcare, identifies gaps in the existing literature, and proposes directions for future research to enhance the assessment of these advanced technologies in medical settings.
Keywords: Large Language Models (LLMs); healthcare evaluation methods; quantitative and qualitative assessments; healthcare content; interdisciplinary collaboration
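The quantitative metrics the abstract names (accuracy, precision, recall, F1 score) can be illustrated with a minimal sketch in plain Python. The gold labels and LLM predictions below are hypothetical binary outcomes invented for illustration (1 = condition present), not data from any study in the review.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary task."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical gold labels vs. LLM predictions for 8 cases.
    gold = [1, 0, 1, 1, 0, 0, 1, 0]
    pred = [1, 0, 1, 0, 0, 1, 1, 0]
    print(classification_metrics(gold, pred))
```

In practice such metrics are usually computed with an established library rather than by hand; the point here is only to make explicit what each score measures when an LLM's output is compared against clinician-annotated ground truth.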