This submission belongs to the session S1. Technology Enhanced Education of the event The 1st International Online Conference on Education Sciences

Published date

10 Jun, 2026

Academic Editor

Mike Joy

Citation

Cheng Keat Tan, Yin Ni Annie Ng, Qing Hao Ng, Seh Yi Joseph Tan, An Evaluation of Artificial Hallucination Rates in Large Language Models for Educational Use, in Proceedings of The 1st International Online Conference on Education Sciences, 15 June–17 June 2026, MDPI: Basel, Switzerland

An Evaluation of Artificial Hallucination Rates in Large Language Models for Educational Use

Cheng Keat Tan ¹

Yin Ni Annie Ng ¹

Qing Hao Ng ²

Seh Yi Joseph Tan ³

1. School of Applied Science, Nanyang Polytechnic, 180 Ang Mo Kio Avenue 8, Singapore 569830, Singapore

2. Medical Affairs, Lucence Diagnostics Pte. Ltd., 211 Henderson Road, #04-02, 211 Henderson, Singapore 159552, Singapore

3. GSK Asia House, 23 Rochester Park, Singapore 139234, Singapore

Abstract

Objective: Artificial hallucination, defined as the “generation of plausible but factually incorrect information”, is a known complication in the use of large language models (LLMs). When LLMs are used as educational tools, inappropriate reliance on them may undermine critical thinking and distort academic understanding, particularly among foundational learners in evidence-based disciplines. Systematic evaluation is therefore required to assess the extent of hallucination in publicly available LLMs used for education.

Methods: This study evaluated the hallucination rates of four public LLMs: ChatGPT-4o, Microsoft Copilot, Google Gemini 1.5, and Claude 3.5 Sonnet. A standardised, expert-validated set of multiple-choice questions was presented to each model using a standardised prompting approach, and each response was scored on three binary outcomes: answer accuracy, reasoning validity, and citation relevance. Responses were independently reviewed by two subject experts, with discrepancies resolved by consensus. The binary outcomes were compared across models using Chi-square or Fisher–Freeman–Halton tests, and qualitative thematic analysis was used to characterise citation-related errors.
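As an illustration of the categorical comparison described in the Methods, the following is a minimal sketch of a Pearson Chi-square test of independence on a 4 × 2 table (model × correct/incorrect). The counts are invented placeholders, not the study's data; the function names and the 0.05 threshold usage are assumptions for illustration; and the "expected count below 5" check is a common rule of thumb for preferring an exact test such as Fisher–Freeman–Halton, not necessarily the criterion the authors applied.

```python
import math

# HYPOTHETICAL counts for illustration only (not the study's data):
# one row per model, columns are [correct, incorrect].
table = [
    [40, 10],
    [38, 12],
    [36, 14],
    [39, 11],
]


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


def chi2_sf_3dof(x):
    """Upper-tail probability of a Chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


statistic, dof, expected = chi_square_independence(table)
# The closed form above is only valid for 3 degrees of freedom (a 4 x 2 table).
p_value = chi2_sf_3dof(statistic)
# Rule of thumb (assumption): small expected counts suggest an exact test instead.
use_exact_test = any(e < 5 for row in expected for e in row)

print(f"chi2 = {statistic:.3f}, dof = {dof}, p = {p_value:.3f}")
print("prefer exact test" if use_exact_test else "chi-square approximation acceptable")
```

In practice a statistics package would be used for both tests; the hand-written version is shown only to make the calculation explicit.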

Results & Discussion: Among the four LLMs, ChatGPT-4o demonstrated the highest answer accuracy (84%) and reasoning validity (72%). While ChatGPT-4o produced the greatest number of citations, Copilot yielded the highest proportion of valid (91.8%) and relevant (48.7%) references. The most common sources of hallucination were invalid links and non-scholarly online content. However, no statistically significant difference in hallucination rates was observed among the four models (p > 0.05), suggesting comparable susceptibility to misinformation.

Conclusion: Current LLMs do not replace human reasoning. The findings underscore the need for expert oversight, citation verification, and a robust validation approach to ensure reliability and mitigate misinformation in evidence-based domains. Integrating a human-over-the-loop approach into LLM literacy initiatives may strengthen learners' epistemic practices and promote responsible AI adoption in higher education.

Keywords

Artificial Hallucination

Large Language Models

Higher Education

Educational Technology

Poster

Artificial Hallucination Rates in LLMs.pdf