Evaluating the quality of generative artificial intelligence in healthcare: a systematic review

Published: 04 December 2024
by MDPI
in The 5th International Electronic Conference on Applied Sciences
session Computing and Artificial Intelligence

Abstract: The burgeoning use of Large Language Models (LLMs) in healthcare has spurred a need for robust methods to evaluate the quality, reliability, and efficacy of their outputs. This systematic review aims to map the landscape of existing methods for evaluating texts and other outputs generated by LLMs in the healthcare domain. The review protocol was registered on PROSPERO. A comprehensive search was conducted across multiple databases, including PubMed, IEEE Xplore, Google Scholar, and Scopus, focusing on studies published between 2010 and 2023. The inclusion criteria encompassed original articles discussing methodologies for assessing the performance of LLMs in generating clinical and healthcare-related content. The review identifies a variety of evaluation techniques, broadly categorized into quantitative and qualitative methods. Quantitative assessments typically rely on metrics such as accuracy, precision, recall, and F1 score, particularly in tasks like clinical documentation, diagnostic support, and patient communication. Qualitative methods, by contrast, emphasize human judgment, focusing on aspects such as coherence, adequacy, relevance, and readability, often through expert panel reviews and user satisfaction surveys. Additionally, the review highlights challenges unique to the healthcare context, such as the need for domain-specific knowledge, the handling of sensitive patient data, and the potential for bias in AI-generated content. The findings underscore the importance of interdisciplinary collaboration in developing and validating evaluation frameworks that measure not only technical performance but also ethical and practical implications. In conclusion, this review provides a comprehensive overview of current evaluation methods for generative AI in healthcare, identifies gaps in the existing literature, and proposes directions for future research to enhance the assessment of these advanced technologies in medical settings.

Keywords: Large Language Models (LLMs); healthcare evaluation methods; quantitative and qualitative assessments; healthcare content; interdisciplinary collaboration
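As a concrete illustration of the quantitative metrics the abstract names (accuracy, precision, recall, and F1 score), the sketch below computes them from a binary confusion matrix, e.g. an LLM's yes/no diagnostic-support answers scored against reference labels. The function name and the example counts are hypothetical, chosen only for illustration; they do not come from the reviewed studies.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (accuracy, precision, recall, F1) for binary counts.

    tp/fp/fn/tn are true-positive, false-positive, false-negative,
    and true-negative counts from a confusion matrix.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts for illustration only.
acc, p, r, f1 = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints "accuracy=0.850 precision=0.889 recall=0.800 f1=0.842"
```

In practice, libraries such as scikit-learn provide equivalent implementations; the hand-rolled version above only makes the arithmetic behind the reported metrics explicit.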