An Evaluation of Artificial Hallucination Rates in Large Language Models for Educational Use
1  School of Applied Science, Nanyang Polytechnic, Singapore, 180 Ang Mo Kio Avenue 8 Singapore 569830, Singapore
2  Medical Affairs, Lucence Diagnostics Pte. Ltd., Singapore, 211 Henderson Road, #04-02, 211 Henderson, Singapore 159552, Singapore
3  GSK Asia House, Singapore, 23 Rochester Park Singapore 139234, Singapore
Academic Editor: Mike Joy

Abstract:

Objective: Artificial hallucination, defined as the “generation of plausible but factually incorrect information”, complicates the use of large language models (LLMs). When LLMs are used as educational tools, inappropriate reliance on them may undermine critical thinking and distort academic understanding, particularly among foundational learners in evidence-based disciplines. Systematic evaluation is therefore required to assess the extent of hallucination in publicly available LLMs used for education.

Methods: This study evaluated the hallucination rates of four public LLMs: ChatGPT-4o, Microsoft Copilot, Google Gemini 1.5, and Claude Sonnet 3.5. A standardised, expert-validated set of multiple-choice questions, along with a standardised prompting approach, was used to assess answer accuracy, reasoning validity, and citation relevance as binary outcomes. Responses were independently reviewed by two subject experts, with discrepancies resolved by consensus. Categorical analysis, using Chi-square or Fisher–Freeman–Halton tests, was applied to compare these outcomes across the models, and qualitative thematic analysis was used to characterise citation-related errors.
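As an illustration of the categorical comparison described above, the following is a minimal sketch of a Pearson Chi-square test of independence on a 4 × 2 table (four models, one binary outcome). The counts are made-up placeholders, not data from this study, and the function names are hypothetical. The p-value uses the closed-form survival function for 3 degrees of freedom, which only applies to a 4 × 2 table. The Fisher–Freeman–Halton exact test, which would be preferred when expected counts are small, is not implemented here.

```python
import math


def chi_square_stat(table):
    """Pearson Chi-square statistic for an r x c table of observed counts.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a Chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical counts for illustration only: [correct, incorrect] per model.
illustrative_counts = [
    [40, 10],  # model A
    [37, 13],  # model B
    [35, 15],  # model C
    [38, 12],  # model D
]

stat = chi_square_stat(illustrative_counts)
p_value = chi2_sf_df3(stat)  # df = (4 - 1) * (2 - 1) = 3
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

A p-value above 0.05 from such a test means that a difference between models was not detected at that significance level; it does not by itself show that the models perform equally.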

Results & Discussion: Among the four LLMs, ChatGPT-4o demonstrated the highest answer accuracy (84%) and reasoning validity (72%). While ChatGPT-4o produced the greatest number of citations, Copilot yielded the highest proportion of valid (91.8%) and relevant (48.7%) references. The most common sources of hallucination were invalid links and non-scholarly online content. However, no statistically significant difference in hallucination rates was observed among the four models (p > 0.05), suggesting comparable susceptibility to misinformation.

Conclusion: Current LLMs do not replace human reasoning. The findings underscore the need for expert oversight, citation verification, and a robust validation approach to ensure reliability and mitigate misinformation in evidence-based domains. Integrating a human-over-the-loop approach into LLM literacy initiatives may strengthen learners' epistemic practice and promote responsible AI adoption in higher education.

Keywords: Artificial Hallucination; Large Language Models; Higher Education; Educational Technology

 
 