In recent years, the integration of artificial intelligence (AI) into healthcare has promised unprecedented advancements in patient care, diagnostics, and clinical decision-making. Among AI technologies, large language models (LLMs) have emerged as powerful tools capable of interpreting and generating human-like text, potentially revolutionizing the way clinicians access and process medical information. However, a groundbreaking new study published in Pediatric Research raises critical questions about the reliability of these models in performing clinical calculations, emphasizing that errors are not limited to human practitioners but extend to the machines designed to assist them.
Large language models such as GPT-4 and its successors have demonstrated remarkable capabilities in understanding complex medical queries, synthesizing evidence-based recommendations, and providing instant explanations. These traits have naturally led to enthusiasm around their deployment in clinical settings, from administrative tasks to direct patient interaction. Nevertheless, the study led by Kilpatrick, Greenberg, Boyce, and colleagues meticulously dissects instances where LLMs falter—particularly in executing clinical calculations that require numerical precision and contextual judgment.
The core of the study underscores a subtle but significant vulnerability: LLMs, while adept at language-processing tasks, are fundamentally pattern-recognition systems rather than arithmetic engines. Clinical calculations, such as dosage adjustments based on patient weight, renal function, or laboratory values, demand an exactness that often eludes language models, which rely on probabilistic token prediction rather than deterministic computation. Even seemingly minor errors can have cascading consequences in pediatric care, where dosing windows are narrow and the margin for error remarkably small.
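To make that distinction concrete, the arithmetic a clinician expects is deterministic and unit-aware, something a few lines of conventional code capture exactly. The sketch below is purely illustrative and is not drawn from the study; the dose-per-kilogram value and the cap are hypothetical placeholders, not clinical recommendations.

```python
# Illustrative sketch only: a deterministic weight-based dose calculation.
# The mg/kg value and maximum dose below are hypothetical placeholders,
# not clinical recommendations from the study.

def weight_based_dose_mg(weight_kg: float, mg_per_kg: float, max_mg: float) -> float:
    """Return a single dose in milligrams, capped at a maximum dose."""
    if weight_kg <= 0 or mg_per_kg <= 0:
        raise ValueError("weight and dose-per-kg must be positive")
    return min(weight_kg * mg_per_kg, max_mg)

# Example: a 14.5 kg child at a hypothetical 15 mg/kg, capped at 500 mg.
print(weight_based_dose_mg(14.5, 15, 500))  # -> 217.5
```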
In a series of rigorously designed experiments, the research team tested multiple prominent LLMs on a battery of pediatric clinical calculation tasks. These ranged from estimating body surface area and calculating medication dosages to interpreting laboratory indices critical for therapeutic decision-making. The results were eye-opening: errors occurred not only in simple arithmetic but also in the application of clinical formulas, such as the Schwartz equation for estimating glomerular filtration rate, highlighting how inconsistently the models perform as clinical complexity increases.
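For context, the formulas in question are simple and fully deterministic when implemented conventionally. The sketch below shows the widely used bedside Schwartz estimate (eGFR ≈ 0.413 × height in cm ÷ serum creatinine in mg/dL) and the Mosteller body surface area formula; whether these are the exact variants evaluated in the study is not specified here, and the example values are illustrative only.

```python
from math import sqrt

def egfr_bedside_schwartz(height_cm: float, serum_creatinine_mg_dl: float) -> float:
    """Bedside Schwartz estimate of eGFR in mL/min/1.73 m^2."""
    return 0.413 * height_cm / serum_creatinine_mg_dl

def bsa_mosteller(height_cm: float, weight_kg: float) -> float:
    """Mosteller body surface area in m^2."""
    return sqrt(height_cm * weight_kg / 3600)

# Example values (illustrative only):
print(round(egfr_bedside_schwartz(110, 0.5), 1))  # 0.413 * 110 / 0.5 = 90.9
print(round(bsa_mosteller(110, 19), 2))           # sqrt(110 * 19 / 3600) is about 0.76
```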
Interestingly, the nature of these errors varied. Some stemmed from fundamental mathematical mistakes—adding or multiplying incorrectly—while others arose from misinterpretations of clinical context, such as confusing units or applying adult-centric formulas in pediatric scenarios. For practitioners trusting AI-based tools, these pitfalls are alarming. They underscore the fact that while AI can augment clinical workflows, it remains an imperfect assistant that requires vigilant oversight.
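A concrete illustration of how consequential a unit slip can be: a dose written in micrograms that is silently read as milligrams is off by a factor of one thousand. The defensive sketch below, with hypothetical values, simply makes the unit explicit rather than leaving it to be inferred from surrounding text.

```python
# Illustrative sketch: making units explicit so a mcg/mg mix-up cannot pass silently.
UNIT_TO_MG = {"mg": 1.0, "mcg": 0.001, "g": 1000.0}

def to_mg(value: float, unit: str) -> float:
    """Convert a dose to milligrams, rejecting unknown units."""
    if unit not in UNIT_TO_MG:
        raise ValueError(f"unknown unit: {unit}")
    return value * UNIT_TO_MG[unit]

print(to_mg(250, "mcg"))  # 0.25 mg -- a 1000-fold difference from reading it as 250 mg
```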
The researchers place their findings within the broader framework of human error in medicine, a well-documented source of adverse events in hospitals worldwide. Traditional clinical practice acknowledges that humans, despite experience and training, are prone to mistakes, especially under stress or fatigue. AI technologies were introduced partly to mitigate these risks. However, the study’s message is clear: machines are not exempt from error, and in some cases, their shortcomings can mirror or even exacerbate human fallibility.
One crucial implication is that reliance on LLMs without appropriate safeguards could be hazardous. For instance, clinicians using natural language interfaces for quick medication dosing recommendations might receive plausible but incorrect answers. The linguistic fluency of these models could inadvertently foster misplaced confidence, as coherent explanations may mask underlying inaccuracies in calculations. Hence, the study advocates for systematic validation and integration of AI outputs with human expertise rather than uncritical acceptance.
The technical architecture of LLMs contributes to this dilemma. These models are trained on vast text corpora, including some medical literature, but they are not specifically built for numerical reasoning, which leaves them prone to "hallucinations": fluent, plausible-sounding output that is factually wrong. While progress has been made in enhancing AI's capabilities in math and logic, clinical calculations represent a particularly challenging category, combining precise numeracy with context-dependent decision rules.
Moreover, the study sheds light on the ethical and legal dimensions of incorporating AI in medicine. When an AI tool errs in clinical calculations that result in patient harm, determining accountability becomes complex. Is the fault with the software developers, the healthcare institution adopting the tool, or the clinician who deployed it? These questions are at the forefront of ongoing debates about AI governance in health systems and are exacerbated by studies like this one exposing real-world risks.
To address these challenges, Kilpatrick and colleagues suggest multiple pathways forward. First, embedding specialized numerical reasoning modules within LLM frameworks could improve accuracy in calculation-heavy tasks. Second, creating hybrid models that integrate deterministic algorithms for clinical formulas alongside generative language components may strike a better balance between linguistic sophistication and computational precision. Finally, rigorous external validation standards and transparent reporting of AI limitations must become prerequisites for clinical deployment.
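The authors do not prescribe a particular implementation, but the hybrid approach they describe is broadly compatible with the tool-use pattern already common in LLM applications: the language model extracts structured inputs from free text, and a deterministic routine performs the arithmetic. The sketch below is an assumption-laden illustration of that routing, with a hard-coded extract_parameters placeholder standing in for the model call.

```python
# Hypothetical sketch of a hybrid pipeline: the language model only extracts
# structured inputs; a deterministic formula does the arithmetic.
from dataclasses import dataclass

@dataclass
class SchwartzInputs:
    height_cm: float
    serum_creatinine_mg_dl: float

def extract_parameters(note: str) -> SchwartzInputs:
    """Placeholder for the LLM step: parse height and creatinine from free text.
    In a real system this would be a constrained, validated model call."""
    # Hard-coded here purely for illustration.
    return SchwartzInputs(height_cm=110.0, serum_creatinine_mg_dl=0.5)

def compute_egfr_schwartz(p: SchwartzInputs) -> float:
    """Deterministic bedside Schwartz calculation (mL/min/1.73 m^2)."""
    return 0.413 * p.height_cm / p.serum_creatinine_mg_dl

note = "4-year-old, height 110 cm, serum creatinine 0.5 mg/dL"
print(round(compute_egfr_schwartz(extract_parameters(note)), 1))  # -> 90.9
```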
Importantly, the study also emphasizes the continued necessity of human expertise. Rather than viewing AI as a replacement for clinicians, the authors argue for a model of synergy—using AI to augment human reasoning but reinforcing the clinician’s role as the ultimate arbiter of patient care decisions. This partnership can harness the speed and scalability of LLMs while hedging against their vulnerabilities through human judgment and experience.
The timing of this research is particularly relevant as health systems worldwide face increasing patient volumes and complex cases. AI offers alluring solutions for alleviating cognitive loads on healthcare workers, but this study serves as a timely reminder that technology is fallible. Careful integration and cautious skepticism should guide the ongoing adoption of AI tools in medicine to safeguard patient safety.
In conclusion, the findings presented by Kilpatrick, Greenberg, Boyce, and their team represent an important milestone in the evolving narrative of AI’s role in healthcare. Their meticulous assessment of large language models in pediatric clinical calculations reveals a nuanced picture: while AI can vastly enhance accessibility and efficiency, inherent limitations in numerical reasoning persist, necessitating caution and continuous improvement. As the landscape of medicine increasingly entwines with AI, balancing innovation with patient safety remains paramount.
As clinicians, researchers, and technologists collaborate to refine AI tools, the overarching lesson is clear—both humans and machines are fallible. Identifying where and why errors occur enables the design of safer systems that harness the best qualities of both. The future of medicine lies not in replacing human intellect but in complementing it with intelligent technologies that recognize and compensate for their own imperfections.
Subject of Research: The reliability and limitations of large language models in performing clinical calculations in pediatric medicine.
Article Title: Large language models and clinical calculations: to err is human and machines are not exempt.
Article References:
Kilpatrick, R., Greenberg, R.G., Boyce, D. et al. Large language models and clinical calculations: to err is human and machines are not exempt. Pediatr Res (2025). https://doi.org/10.1038/s41390-025-04166-y