Objective: Artificial hallucination, defined as the “generation of plausible but factually incorrect information”, is a known complication in the use of large language models (LLMs). When LLMs are used as educational tools, inappropriate reliance on them may undermine critical thinking and distort academic understanding, particularly among foundational learners in evidence-based disciplines. Systematic evaluation is therefore required to assess the extent of hallucination in publicly available LLMs used for education.
Methods: This study evaluated the hallucination rates of four public LLMs: ChatGPT-4o, Microsoft Copilot, Google Gemini 1.5, and Claude Sonnet 3.5. A standardised, expert-validated set of multiple-choice questions was administered with a standardised prompting approach, and each response was scored on three binary outcomes: answer accuracy, reasoning validity, and citation relevance. Responses were independently reviewed by two subject experts, with discrepancies resolved by consensus. Outcomes were compared across models using Chi-square or Fisher–Freeman–Halton tests, and qualitative thematic analysis was used to characterise citation-related errors.
Results & Discussion: Among the four LLMs, ChatGPT-4o demonstrated the highest answer accuracy (84%) and reasoning validity (72%). While ChatGPT-4o produced the greatest number of citations, Copilot yielded the highest proportion of valid (91.8%) and relevant (48.7%) references. The most common sources of hallucination were invalid links and non-scholarly online content. However, no statistically significant difference in hallucination rates was observed among the four models (p > 0.05), suggesting comparable susceptibility to misinformation.
Conclusion: Current LLMs do not replace human reasoning. The findings underscore the need for expert oversight, citation verification, and a robust validation approach to ensure reliability and mitigate misinformation in evidence-based domains. Integrating a human-over-the-loop approach into LLM literacy initiatives may support sound epistemic practice and promote responsible AI adoption in higher education.
An Evaluation of Artificial Hallucination Rates in Large Language Models for Educational Use
Published: 10 June 2026 by MDPI in The 1st International Online Conference on Education Sciences, session Technology Enhanced Education
Abstract:
Keywords: Artificial Hallucination; Large Language Models; Higher Education; Educational Technology
