Abstract
Patients and physicians are increasingly using artificial intelligence (AI) (1-2). We assess AI responses to patient inquiries regarding possible cancer symptoms. Physicians ranked (1-4) and graded (P/F) the responses to four hypothetical questions to assess whether AI systems differ in quality. We also examine whether physician specialty affects these assessments. We found that AI response quality differed regardless of physician specialty (3-5). Our study indicates that AI refinement is needed, though some responses were deemed acceptable as initial replies to patient inquiries.
Methods
We evaluated four AI models [AI-1/Gemini 2.0, AI-2/ChatGPT turbo, AI-3/Claude 3.7, AI-4/ChatGPT3.5] responding to patient inquiries about possible cancer symptoms (6-9). We examined whether the systems differed in guiding patients to seek care, and whether assessments differed by physician specialty (2 oncologists and 2 internists). Physicians ranked (1-4) and graded (pass/fail) the responses to hypothetical patient inquiries:
"What should I do?"
Q1: "There was blood on the toilet paper in the bathroom."
Q2: "I have a lump in my breast."
Q3: "A mole changed color and bleeds sometimes."
Q4: "I chew tobacco and a sore in my mouth won't heal."
Results
Friedman test: χ²=11.77 (p=0.0008). AI rankings differed overall, but no pairwise comparison was significant (Wilcoxon signed-rank test).
Chi-square test: χ²=18.04 (p=0.00043). AI grades differed overall; in pairwise comparisons, AI-1 (p=0.0049) and AI-2 (p=0.036) each differed from AI-4 (Fisher's exact test).
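The relationship between the two test statistics above and their p-values can be sketched in a few lines. The Python below is an illustration only, not the authors' analysis code. It assumes that both statistics are referred to a chi-square distribution with 3 degrees of freedom (four AI systems minus one), and the function name chi2_sf_df3 is hypothetical.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Uses the closed form that exists for df = 3:
        Q(x) = erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    """
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Statistics as reported in the Results; df = 3 is an assumption of this sketch.
for label, stat in [("Friedman", 11.77), ("Chi-square", 18.04)]:
    print(f"{label}: statistic={stat}, p={chi2_sf_df3(stat):.5f}")
```

Under the df = 3 assumption, the sketch gives roughly 0.00043 for the chi-square statistic, in line with the reported value, and roughly 0.008 for the Friedman statistic, so the reported Friedman p-value may merit a check against the original analysis output.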
Specialties did not differ in ranking or grading (Mann–Whitney U test for ranks, Fisher's exact test for grades). Physician comments indicated that "safety" and referral to medical professionals were important for grading.
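Fisher's exact test, used above for the pass/fail comparisons, can be sketched for a 2x2 table as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The function name fisher_exact_two_sided is hypothetical, and the counts in the example are invented for demonstration. They are not the study's data.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same margins
    whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# HYPOTHETICAL counts for illustration only (not the study's data):
# rows are specialties, columns are pass / fail grades.
oncologists_pass, oncologists_fail = 24, 8
internists_pass, internists_fail = 26, 6
p = fisher_exact_two_sided(oncologists_pass, oncologists_fail,
                           internists_pass, internists_fail)
print(f"two-sided p = {p:.3f}")
```

The two-sided rule used here (summing all tables no more probable than the observed one) is one common convention. Other definitions of the two-sided p-value exist, so results from a statistics package may differ slightly.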
Discussion
Physicians may not advocate AI use, but patients do use it. We found variable quality in AI responses, which may lack context sensitivity or produce "hallucinations", generating misinformation (1-2). Question 4 scored lowest and included a risk factor, which suggests that context and complexity affect AI responses. Future research may assess real-world patient interactions (3). Our results suggest that Gemini and Claude do not differ significantly from ChatGPT4, which prior studies suggest was superior (10-11). All of the AI programs had been upgraded as of mid-2025.
