Artificial intelligence (AI) technologies have found a burgeoning role in modern healthcare, promising enhancements in diagnostic accuracy and clinical decision-making. Recent research from West Virginia University (WVU) propels this promise into the emergency department setting, where rapid and precise diagnosis is critical yet often challenging. WVU scientists, led by Gangqing “Michael” Hu, assistant professor at the WVU School of Medicine, have conducted a pioneering evaluation of multiple iterations of ChatGPT, a state-of-the-art AI language model, assessing its performance in diagnosing emergency department patients based on physicians’ clinical notes. Their findings, published in Scientific Reports, underscore both the potential and current limitations of AI in emergency diagnostics, particularly in the context of symptom presentation.
The core objective of Hu’s study was to examine how different versions of ChatGPT handle diagnostic tasks given real-world clinical data. Using de-identified physician notes from 30 emergency department cases, the research team prompted various ChatGPT model iterations—including GPT-3.5, GPT-4, GPT-4o, and the o1 series—to generate their top three diagnostic suggestions. The study’s methodological rigor involved comparing the models’ diagnostic precision and accuracy against actual clinical outcomes to draw a comprehensive performance profile. This approach provides a window into how AI tools can supplement, but not yet replace, human clinical judgment.
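To make the workflow concrete, the sketch below shows how such an evaluation loop might look in Python using the OpenAI SDK. It is not the authors’ actual code: the model names, prompt wording, and the cases.json file layout are illustrative assumptions.

```python
# Hypothetical sketch of the evaluation loop described above: prompt each
# model iteration with a de-identified note and ask for its top three
# diagnoses. Model list, prompt wording, and file layout are illustrative.
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]  # stand-ins for the iterations compared

PROMPT = (
    "You are assisting an emergency physician. Based on the clinical note "
    "below, list your top three most likely diagnoses, most likely first.\n\n"
    "Clinical note:\n{note}"
)

def top_three_diagnoses(model: str, note: str) -> str:
    """Ask one model for its top three diagnostic suggestions for one note."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # 'cases.json' is a placeholder: a list of {"note": ..., "diagnosis": ...}
    with open("cases.json") as f:
        cases = json.load(f)
    for model in MODELS:
        for case in cases:
            print(model, top_three_diagnoses(model, case["note"]))
```

The models’ ranked suggestions would then be scored against the documented final diagnoses, as discussed below.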
A key insight from this investigation is the gap in AI performance between cases with classic, textbook symptoms and those with atypical or “challenging” presentations. For patients exhibiting hallmark signs of disease, ChatGPT models demonstrated promising diagnostic assistance capabilities, supporting physicians by suggesting accurate differential diagnoses. However, when confronted with complex cases lacking traditional symptomatic cues—such as pneumonia without accompanying fever—the models’ ability to identify the correct diagnosis declined notably. These failures illustrate the difficulty AI models face when operating beyond the typical patterns in their training data, underscoring the need for richer, more diverse datasets.
The researchers note that current AI diagnostic models primarily ingest unstructured text input—in this case, physicians’ notes—without access to multimodal clinical information. Consequently, ChatGPT’s diagnostic reasoning is limited by the breadth and variability of its textual training corpora and the information provided. Hu posits that enhancing future AI frameworks with additional clinical data streams—such as imaging results, laboratory findings, and comprehensive patient histories—could improve the fidelity and robustness of AI-assisted diagnoses in emergency contexts. Integration of these heterogeneous data types would transform AI from a purely linguistic interpreter to a more holistic clinical decision support system.
Comparing successive ChatGPT iterations reveals a cautiously encouraging trajectory of improvement. Newer models showed no statistically significant gain in how often the correct diagnosis appeared anywhere among their top three suggestions, but the accuracy of the single top-ranked diagnosis improved by approximately 15 to 20 percent relative to earlier versions. This modest enhancement suggests iterative refinement in model capabilities while highlighting the persistent challenge of reaching the consistently high precision required for clinical reliability.
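The distinction between these two measures, often called top-1 and top-3 accuracy, is easy to see in code. The following minimal example uses made-up predictions and final diagnoses purely to illustrate the metric; it does not reproduce the study’s data.

```python
# Illustrative top-1 vs. top-3 accuracy computation for the comparison
# described above; the example predictions and labels are invented.
def top_k_accuracy(predictions, truths, k):
    """predictions: list of ranked diagnosis lists; truths: list of final diagnoses."""
    hits = sum(1 for ranked, truth in zip(predictions, truths)
               if truth in ranked[:k])
    return hits / len(truths)

ranked_lists = [
    ["pneumonia", "bronchitis", "pulmonary embolism"],
    ["appendicitis", "gastroenteritis", "ovarian torsion"],
    ["migraine", "tension headache", "subarachnoid hemorrhage"],
]
final_diagnoses = ["pneumonia", "ovarian torsion", "subarachnoid hemorrhage"]

print("top-1 accuracy:", top_k_accuracy(ranked_lists, final_diagnoses, k=1))  # 0.33
print("top-3 accuracy:", top_k_accuracy(ranked_lists, final_diagnoses, k=3))  # 1.0
```

A model can thus improve substantially on the top-ranked suggestion without moving the top-three figure, which is the pattern the study reports across versions.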
The study underscores a key principle in the deployment of AI-assisted diagnostic tools: the indispensability of human oversight. Given the models’ current inadequate performance on complex cases, physician expertise remains essential to interpret AI outputs critically and corroborate or refute AI-generated hypotheses. This interplay forms a hybrid intelligence paradigm, wherein AI accelerates data synthesis and hypothesis generation while clinicians provide contextual judgment, ensuring that patient care remains both accurate and personalized.
Beyond diagnostic accuracy, Hu envisions AI tools evolving toward greater transparency and explainability. He stresses the importance of AI systems that do not merely generate results but also reveal their reasoning pathways, enabling clinicians to understand and trust their recommendations. Such “explainable AI” is critical to fostering confidence among healthcare providers, enhancing AI’s integration into clinical workflows, and ultimately improving patient outcomes. Achieving this level of transparency will require methodological innovations in how AI models represent and communicate uncertainty and rationale.
Moreover, Hu’s research team is exploring ways to augment diagnostic reasoning through multi-agent AI simulations. Drawing on prior work in which ChatGPT-4 was deployed in role-playing scenarios—emulating specialists such as physiotherapists, psychologists, and nutritionists engaged in panel discussions—this approach aims to replicate the collaborative diagnostic processes typical of clinical environments. The idea is that dynamic interaction among diverse AI agents could yield more nuanced, accurate diagnostic assessments, mirroring the interdisciplinary deliberation of human medical teams; a rough sketch of such a setup follows.
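The sketch below illustrates the general shape of such a multi-agent panel, again assuming the OpenAI SDK. The roles, prompts, and model name are illustrative assumptions, not the configuration used in the cited work.

```python
# Minimal sketch of the multi-agent "panel discussion" idea described above;
# roles, prompts, and the model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # stand-in for a GPT-4-class model

ROLES = ["emergency physician", "radiologist", "clinical pharmacist"]

def agent_reply(role: str, transcript: str, note: str) -> str:
    """One panelist reads the note plus the discussion so far and responds."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": f"You are a {role} in a diagnostic panel discussion."},
            {"role": "user",
             "content": f"Clinical note:\n{note}\n\nDiscussion so far:\n{transcript}\n\n"
                        "Give your assessment in two or three sentences."},
        ],
    )
    return response.choices[0].message.content

def run_panel(note: str, rounds: int = 2) -> str:
    """Let each role speak in turn for a few rounds, then summarize."""
    transcript = ""
    for _ in range(rounds):
        for role in ROLES:
            transcript += f"\n{role}: {agent_reply(role, transcript, note)}"
    # A final moderator turn distills the discussion into ranked diagnoses.
    return agent_reply("moderator who summarizes the panel into a ranked "
                       "list of three diagnoses", transcript, note)
```

Whether such simulated panels actually improve accuracy over a single well-prompted model is precisely the kind of question this line of research is designed to test.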
Despite these promising strides, the researchers caution that current AI systems, including ChatGPT, are not certified medical devices and should not be used as standalone diagnostic tools. In clinical settings that incorporate expanded data types such as imaging, AI models would need to run on secure, privacy-compliant hospital computing clusters, for example as locally hosted open-source systems, rather than routing patient data to external services. Compliance with regulatory standards and patient confidentiality laws remains a non-negotiable prerequisite for AI deployment in healthcare institutions.
The study acknowledges support from the National Science Foundation and the National Institutes of Health, underscoring the role of federally funded research in advancing AI applications in medicine. Additional contributors include postdoctoral fellow Jinge Wang, lab volunteer Kenneth Shue, and Li Liu from Arizona State University, reflecting a multidisciplinary collaboration spanning computer science, bioinformatics, and clinical medicine.
Looking ahead, Hu advocates for future research to focus not only on enhancing AI’s diagnostic performance but also on its capacity to articulate reasoning in clinically meaningful ways. He suggests that improved explainability could facilitate critical emergency department decisions such as triage prioritization and treatment pathway selection, augmenting both efficiency and patient safety.
In summary, the pioneering evaluation of ChatGPT models in emergency diagnostics performed by WVU scientists reveals a nuanced landscape marked by AI’s emerging utility balanced against intrinsic challenges. While encouraging diagnostic accuracy for prototypical cases validates the promise of language models as assistive tools, persistent deficiencies in recognizing atypical disease presentations underscore the imperative for richer data integration, transparent reasoning, and robust human-AI collaboration. This research not only advances scientific understanding of AI capabilities at the clinical frontline but also charts a thoughtful course towards responsible integration of AI in patient-centered care.
Subject of Research:
Evaluation of ChatGPT AI model iterations for diagnostic assistance in emergency department patients using clinical notes.
Article Title:
Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics
News Publication Date:
26-Mar-2025
Web References:
- https://www.wvu.edu/
- https://directory.hsc.wvu.edu/Profile/60888
- https://medicine.wvu.edu/
- https://medicine.wvu.edu/micro/
- https://health.wvu.edu/research-and-graduate-education/research/core-facilities/bioinformatics-core/
- https://www.nature.com/articles/s41598-025-95233-1#citeas
- http://dx.doi.org/10.1038/s41598-025-95233-1
- https://mededu.jmir.org/2024/1/e51157/
References:
Hu, G. M., Wang, J., Shue, K., & Liu, L. (2025). Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Scientific Reports. https://doi.org/10.1038/s41598-025-95233-1
Image Credits:
WVU Photo/Greg Ellis
Keywords:
Artificial intelligence, Disease prevention, Clinical medicine, Medical tests, Artificial consciousness, Artificial neural networks, Cognitive robotics, Forward chaining, Generative AI, Genetic algorithms, Logic based AI, Adaptive systems, Cybernetics, Robotics, Computer science