Introduction: Real-time vision–language models (VLMs) can exhibit “cognitive-like” failure patterns, including temporally unstable judgments, persistence of incorrect hypotheses, and overconfident confabulations under uncertainty. Conventional single-image benchmarks do not isolate these time-dependent behaviors. We present CMC (Confabulation Mitigation & Calibration), a lightweight reliability wrapper, and evaluate it within a shared cognitive testing platform that probes interpretable perceptual and metacognitive functions in both humans and AI.
Methods: The platform targets three functions relevant to hallucination-like errors: visual perception and orientation discrimination, temporal stability of belief across successive observations, and confidence calibration under time constraints. We used a Tumbling-E orientation task with randomized staircase difficulty to stress perceptual decision-making while recording response time, timeouts/abstentions, and step-to-step consistency. CMC combines selective re-analysis triggered by change detection, a risk score that integrates temporal-instability signals with uncertainty features, and a confirmation stage that routes high-risk outputs to one of three calibrated actions: verify, hedge, or abstain. Human trials used a 3-second response budget; AI trials used a longer budget to distinguish perceptual failure from system latency.
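To make the pipeline concrete, the sketch below shows one way this kind of per-step risk scoring and routing could be wired together. It is a minimal illustration under stated assumptions, not CMC's actual implementation: the names (StepObservation, risk_score, route), the 0.5/0.5 weighting, the 0.3 change-detection trigger, and the verify/hedge thresholds are all placeholders introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepObservation:
    """One step of a Tumbling-E trial (field names are illustrative assumptions)."""
    label: str                      # current orientation guess, e.g. "up"/"down"/"left"/"right"
    confidence: float               # model-reported confidence in [0, 1]
    frame_change: float             # change-detection score vs. the previous frame, in [0, 1]
    prev_label: Optional[str] = None


def risk_score(obs: StepObservation, flip_rate: float) -> float:
    """Combine temporal-instability and uncertainty signals into a single risk value.

    flip_rate is the fraction of recent steps whose label changed (temporal instability).
    The equal weighting below is a placeholder assumption, not a fitted parameter.
    """
    flipped = obs.prev_label is not None and obs.label != obs.prev_label
    instability = min(flip_rate + (0.5 if flipped else 0.0), 1.0)
    uncertainty = 1.0 - obs.confidence
    return 0.5 * instability + 0.5 * uncertainty


def needs_reanalysis(obs: StepObservation, change_thresh: float = 0.3) -> bool:
    """Selective re-analysis trigger: only re-run perception when change detection fires."""
    return obs.frame_change > change_thresh


def route(obs: StepObservation, flip_rate: float, time_left_s: float,
          verify_thresh: float = 0.7, hedge_thresh: float = 0.4) -> str:
    """Confirmation stage: map risk and remaining time budget to an action."""
    r = risk_score(obs, flip_rate)
    if r >= hedge_thresh and time_left_s < 0.5:
        return "abstain"   # no budget left to verify; decline rather than confabulate
    if r >= verify_thresh:
        return "verify"    # escalate: re-check the evidence before committing
    if r >= hedge_thresh:
        return "hedge"     # answer, but with an explicit low-confidence qualifier
    return "answer"        # low risk: commit to the current label


if __name__ == "__main__":
    obs = StepObservation(label="left", confidence=0.40, frame_change=0.1, prev_label="up")
    print(route(obs, flip_rate=0.4, time_left_s=1.2))   # prints "verify"
```

The design point the sketch illustrates is that the abstain branch consults the remaining response budget, so the wrapper declines rather than emitting an overconfident answer when a timeout is likely.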
Results: In 40 Tumbling-E trials, baseline accuracy was 50% (20/40) without CMC and improved to 75% (30/40) with CMC v1. CMC further reduced temporally unstable behaviors by escalating high-risk steps to verification and suppressing overconfident outputs when evidence was weak or timeouts were likely.
Conclusions: When hallucination-like failures are evaluated as a cognitive testing problem rather than a static captioning task, CMC improves both accuracy and cognitively faithful reliability, supporting rigorous, neuroscience-aligned research and safer deployment of real-time VLM systems.
