Comparing LLM Output to Physician Notes for Pharmacogenomics
1  Department of Medicine, Stanford University School of Medicine, Palo Alto, CA 94305, USA
2  Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
3  Department of Pharmacy, Stanford Health Care, Palo Alto, CA 94305, USA
4  Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, CA 94305, USA
5  Department of Biomedical Data Science, Stanford University School of Medicine, Palo Alto, CA 94305, USA
6  Division of Immunology & Rheumatology, Department of Medicine, Stanford University School of Medicine, Palo Alto, CA 94305, USA
7  Department of Neurosurgery, Stanford University School of Medicine, Palo Alto, CA 94305, USA
Academic Editor: Kenneth Pritzker

Abstract:

Introduction

Pharmacogenomic (PGx)-guided prescribing can reduce adverse drug reactions and improve efficacy, yet clinical uptake is limited. A key barrier is reliance on complicated, text-dense reports, which are incompatible with routine clinical workflows. Large language models (LLMs) show promise in generating concise, context-specific recommendations, but prior work has largely focused on concept-based question-answering rather than realistic case interpretation. In this study, we examine LLMs' efficacy in interpreting PGx reports.

Methods

GPT-4.5 was deployed in Stanford’s HIPAA-compliant SecureGPT environment and paired with retrieval-augmented generation to inject CPIC guidance at inference. To build the test dataset, we created synthetic PGx laboratory reports and accompanying medical histories covering a range of genotypes and clinical contexts. An expert human evaluator interpreted each case and authored a gold-standard consult note. The LLM-generated consult notes were compared against the gold-standard notes quantitatively using ROUGE-L and BERTScore. Qualitative evaluation (LLM-as-a-judge and human evaluators) used a 5-point Likert scale across five quality domains (accuracy, clinical relevance, bias, risk management, and hallucination).
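
As an illustration of the quantitative comparison, the minimal sketch below computes ROUGE-L and BERTScore between LLM-generated and gold-standard notes. It assumes the open-source rouge-score and bert-score Python packages and is not the study's actual evaluation pipeline.

```python
# A minimal sketch of the quantitative comparison described above, assuming
# the open-source `rouge-score` and `bert-score` packages; the study's
# actual evaluation pipeline may differ.
from statistics import mean, stdev

from bert_score import score as bert_score
from rouge_score import rouge_scorer


def mean_sd(values):
    """Summarize a list of per-case scores as (mean, standard deviation)."""
    return mean(values), stdev(values)


def evaluate_notes(llm_notes, gold_notes):
    """Compare LLM-generated consult notes against gold-standard notes."""
    # ROUGE-L: lexical overlap via longest common subsequence.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge = [scorer.score(gold, pred)["rougeL"]
             for gold, pred in zip(gold_notes, llm_notes)]

    # BERTScore: semantic similarity from contextual token embeddings.
    precision, recall, f1 = bert_score(llm_notes, gold_notes, lang="en")

    return {
        "rougeL_precision": mean_sd([r.precision for r in rouge]),
        "rougeL_recall": mean_sd([r.recall for r in rouge]),
        "bertscore_precision": mean_sd(precision.tolist()),
        "bertscore_recall": mean_sd(recall.tolist()),
        "bertscore_f1": mean_sd(f1.tolist()),
    }
```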

Results

Human expert ratings achieved overall scores of 0.91 ± 0.16, while the LLM-as-a-judge scoring produced 0.708 ± 0.17. Semantic similarity metrics showed a BERTScore precision of 0.822 ± 0.012, recall of 0.788 ± 0.018, and F1 of 0.805 ± 0.013. The direct lexical overlap was lower, with a ROUGE-L precision of 0.207 ± 0.084 and a recall of 0.270 ± 0.122.
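
For context on the 0–1 ranges reported for the Likert-based ratings, the hypothetical sketch below shows one way 5-point ratings across the five quality domains could be linearly normalized and summarized as mean ± standard deviation; the study's exact aggregation scheme is not described in this abstract, so the mapping here is an assumption.

```python
# A hypothetical illustration of how 5-point Likert ratings across the five
# quality domains (accuracy, clinical relevance, bias, risk management,
# hallucination) could be mapped onto a 0-1 scale and summarized as
# mean +/- SD; the actual aggregation scheme is an assumption here.
from statistics import mean, stdev


def normalize_likert(ratings):
    """Linearly map 1-5 Likert ratings onto [0, 1] (assumed scheme)."""
    return [(r - 1) / 4 for r in ratings]


# Illustrative ratings for a single case, one value per quality domain.
case_ratings = [5, 5, 4, 5, 4]
normalized = normalize_likert(case_ratings)
print(f"overall score: {mean(normalized):.2f} +/- {stdev(normalized):.2f}")
```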

Conclusion

LLM-generated PGx consult notes can approach the quality of those authored by human experts. However, performance varied across evaluation metrics. NLP-based evaluation metrics miss nuanced clinical content and lack flexibility in interpretation, which raises concerns about their reliability for evaluating high-impact clinical information and underscores the necessity of human evaluation. Key limitations included an inability to account for phenoconversion and to decisively synthesize information for drugs influenced by multiple pharmacogenes. Further validation with real-world patient data is in progress.

Keywords: AI; large language models; preventative pharmacogenomics; pharmacogenomics panels

 
 