Scienmag
Doctor GPT: AI Achieves Nearly 76% Accuracy in Answering Healthcare Queries

May 28, 2026
in Technology and Engineering

In recent years, the rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has opened new frontiers in multiple fields, including healthcare. A groundbreaking study led by researchers at Penn State has now provided a rigorous evaluation of how AI-powered chatbots respond to everyday health-related inquiries posed by the general public. The study uncovers that these AI systems achieve an accuracy rate of approximately 76% when addressing routine health questions, a figure that simultaneously highlights both the promise and the perils of deploying such technologies in real-world medical contexts.

The research focuses on the perspective of the average internet user, a group that frequently turns to AI as a modern-day symptom checker, much as people once turned to Google for preliminary health information. This user-centered approach matters because prior studies predominantly examined LLMs through expert or academic lenses, often overlooking practical consumer interactions. By concentrating on typical health queries submitted by laypersons, the study offers insight into the effectiveness and safety of AI-based medical advice in daily life.

To gather authentic data reflecting real-world usage, the research team organized an innovative event known as the “Diagnose-a-thon” at Penn State. This competition attracted 34 participants spanning faculty, staff, and students across various academic levels. Participants generated a substantial dataset of 212 health-related prompts, encompassing both genuine and hypothetical conditions, crafted from patient and clinician viewpoints. They then queried four distinct state-of-the-art LLMs: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b. By allowing participants to select their preferred AI model without constraints, the study faithfully replicated the autonomous and diverse usage patterns found in natural settings.

An essential part of the study was an evaluation stage in which nine board-certified physicians assessed the treatments and information returned by the LLMs. The evaluation covered both the clinical accuracy of the AI-generated answers and the potential harm they posed, with ratings given on a six-point scale from very low to very high. This detailed scoring showed how AI diagnostic responses vary across medical specialties and contexts, a level of granularity rarely seen in previous AI investigations.

The findings showed a variable performance landscape across medical disciplines. Obstetrics, gynecology, and otolaryngology yielded the highest levels of correct information with minimal risks, showcasing scenarios where LLMs currently excel. Conversely, fields such as internal medicine, neurology, and dermatology demonstrated more significant challenges for AI systems, where inaccuracies and higher harm potentials were more prevalent. These results underscore an important reality: certain specialized medical domains demand more caution when leveraging AI tools, especially if these tools are employed by untrained individuals.

The study also found that prompts between 60 and 250 characters long tended to produce more accurate AI responses. This suggests that how a question is framed and articulated plays a crucial role in steering AI models toward clinically valid outputs. Highly specialized or narrowly focused questions, by contrast, posed difficulties, indicating that broad generalist models still face significant hurdles when addressing deeply technical or nuanced medical issues.
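As a rough illustration of the prompt-length finding, the short Python sketch below checks whether a question falls inside the reported 60 to 250 character range. The function name and the example questions are hypothetical, and treating both bounds as inclusive is an assumption; the article does not say how the researchers measured length.

```python
# Bounds taken from the article's reported range; inclusivity is an assumption.
MIN_CHARS = 60
MAX_CHARS = 250


def in_reported_length_range(prompt: str) -> bool:
    """Return True if the trimmed prompt is 60 to 250 characters long."""
    return MIN_CHARS <= len(prompt.strip()) <= MAX_CHARS


# Hypothetical example questions, not taken from the study's dataset.
examples = [
    "Headache?",
    "I have had a dull headache behind my left eye for three days, "
    "and it is worse in the morning. What could be causing this?",
]

for text in examples:
    print(len(text), in_reported_length_range(text))
```

A check like this says nothing about whether an answer will be correct; it only flags questions that are much shorter or longer than the range the study associated with better results.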

Beyond evaluating off-the-shelf AI models, the research team experimented with a novel augmentation approach by retraining the base LLMs using an extensive corpus of medical textbooks, clinical guidelines, and peer-reviewed literature typical of medical school curricula. The goal was to determine whether such domain-specific tuning could enhance clinical validity while reducing harmful outputs. Surprisingly, medical professionals and trainees reviewing these augmented models showed a preference for responses from the original Gemini and Llama bases over the retrained versions. No statistically significant preference was observed regarding ChatGPT’s base versus augmented models. This counterintuitive result suggests that current fine-tuning strategies may not straightforwardly translate into improved clinical communication by AI.

The implications of these findings are profound for the future integration of AI into healthcare delivery. As Dr. Jennifer Kraschnewski, a co-author of the study and a practicing physician, articulates, AI represents a transformative force with the potential to augment clinician capabilities rather than replace human doctors. The challenge lies in harnessing AI tools in ways that bolster medical professionals’ diagnostic processes, reduce cognitive burdens, and improve patient outcomes without exposing patients to the risks of AI errors in unsupervised contexts.

Crucially, the study emphasizes that despite accuracy scores in the mid-70s percent range, the AI models still exhibited an error rate exceeding 20%. This rate is approximately double that of human physicians and highlights the potential for AI to propagate misinformation that leads to harm if patients use it uncritically. These figures call for cautious and responsible deployment of AI technologies in healthcare and underscore the necessity of preserving human clinical oversight.
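The arithmetic behind these figures can be worked through in a few lines of Python. This is a back-of-the-envelope sketch only: the 76% accuracy is the approximate value quoted in the article, while the physician figure is not reported here and is merely what "approximately double" would imply.

```python
def error_rate(accuracy: float) -> float:
    """Error rate is the complement of accuracy."""
    return 1.0 - accuracy


# "Approximately 76%" as quoted in the article.
reported_accuracy = 0.76
llm_error = error_rate(reported_accuracy)

# Assumption: the article says the LLM error rate is "approximately double"
# that of physicians, so halving it gives only an implied, illustrative value.
implied_physician_error = llm_error / 2

print(f"LLM error rate: {llm_error:.0%}")
print(f"Implied physician error rate: {implied_physician_error:.0%}")
```

The implied physician value should be read as a rough consequence of the article's wording, not as a measured result from the study.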

The study also offers a nuanced view on AI’s evolving role: rather than supplanting the physician’s role, AI could serve as a catalyst to “upskill” clinicians by providing rapid evidence summaries, differential diagnosis suggestions, and decision support, streamlining care processes. The research community is thus encouraged to focus on developing AI systems tailored to professional use, with interfaces and interpretability tuned for clinical environments.

Penn State’s research ecosystem facilitated this multidisciplinary collaboration, bringing together expertise in informatics, intelligent systems, clinical medicine, and AI ethics. Their participatory research design, which mimics user autonomy and real-world interaction dynamics, sets a new methodological standard for evaluating AI systems in societally critical domains. It also expands the discourse on AI accountability and transparency by highlighting the tangible benefits and limitations observed when AI systems engage with health-related content.

Given that AI tools are likely to remain part of healthcare, public education and digital literacy emerge as pivotal. The study’s co-authors advocate for initiatives that enhance consumer understanding of AI’s strengths and weaknesses in medical diagnosis. Such literacy efforts would empower users to critically appraise AI-generated advice, reducing overreliance and potential misuse.

In summary, this Penn State study, to be presented at the 2026 ACM Fairness, Accountability, and Transparency (FAccT) conference, marks an important step in understanding how large language models intersect with everyday healthcare. Its findings carry a dual message: AI holds tremendous promise to improve medical diagnostics and patient care when stewarded responsibly, but it also harbors non-negligible risks, particularly if accessed without proper clinical guidance. As artificial intelligence advances, the path forward must balance innovation with prudence, ensuring these systems enhance rather than undermine the intricate art of medicine.


Subject of Research: Evaluation of large language models’ accuracy and safety in responding to everyday health-related queries by general users.

Article Title: Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

News Publication Date: 25-Jun-2026

Web References:
10.48550/arXiv.2506.13805
2026 ACM FAccT Conference

References:
The study data is derived from peer evaluations by board-certified physicians, augmented training on medical textbooks and peer-reviewed articles, and participatory crowdsourced clinical cases generated during the Diagnose-a-thon event hosted by Penn State’s Center for Socially Responsible Artificial Intelligence.

Keywords

Generative AI, Artificial Intelligence, Large Language Models, Healthcare, Medical Diagnosis, Clinical Accuracy, AI Ethics, Doctor-Patient Relationship, AI Safety, Medical Informatics, Healthcare Technology, AI in Medicine

Tags: AI and patient information accuracy, AI chatbot accuracy in health queries, AI in healthcare, AI medical advice safety, consumer-focused AI health tools, healthcare AI evaluation study, large language models for medical advice, limitations of AI in medicine, Penn State Diagnose-a-thon event, public engagement with health AI, real-world AI healthcare applications, symptom checking with AI