<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>EHR data querying by non-technical staff &#8211; Science</title>
	<atom:link href="https://scienmag.com/tag/ehr-data-querying-by-non-technical-staff/feed/" rel="self" type="application/rss+xml" />
	<link>https://scienmag.com</link>
	<description></description>
	<lastBuildDate>Thu, 07 May 2026 21:48:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://scienmag.com/wp-content/uploads/2024/07/cropped-scienmag_ico-32x32.jpg</url>
	<title>EHR data querying by non-technical staff &#8211; Science</title>
	<link>https://scienmag.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">73899611</site>	<item>
		<title>Study Reveals AI Language Models Encounter Challenges with Basic Hospital Data Tasks</title>
		<link>https://scienmag.com/study-reveals-ai-language-models-encounter-challenges-with-basic-hospital-data-tasks/</link>
		
		<dc:creator><![CDATA[SCIENMAG]]></dc:creator>
		<pubDate>Thu, 07 May 2026 21:48:33 +0000</pubDate>
				<category><![CDATA[Technology and Engineering]]></category>
		<category><![CDATA[administrative data tasks in hospitals]]></category>
		<category><![CDATA[AI for patient load monitoring]]></category>
		<category><![CDATA[AI language models in healthcare]]></category>
		<category><![CDATA[AI performance on hospital resource allocation]]></category>
		<category><![CDATA[challenges of AI in hospital administration]]></category>
		<category><![CDATA[democratizing data access in healthcare]]></category>
		<category><![CDATA[EHR data querying by non-technical staff]]></category>
		<category><![CDATA[GPT-4o and Llama in medical data analysis]]></category>
		<category><![CDATA[healthcare operational reporting with AI]]></category>
		<category><![CDATA[large language models for electronic health records]]></category>
		<category><![CDATA[limitations of LLMs in clinical workflows]]></category>
		<category><![CDATA[natural language processing for healthcare data]]></category>
		<guid isPermaLink="false">https://scienmag.com/study-reveals-ai-language-models-encounter-challenges-with-basic-hospital-data-tasks/</guid>

					<description><![CDATA[A recent investigation into the practical capabilities of large language models (LLMs) reveals significant limitations in their use for routine administrative tasks within hospital environments. Conducted by Eyal Klang and colleagues at the Icahn School of Medicine at Mount Sinai in New York, the study critically evaluates the performance of state-of-the-art LLMs on essential number-crunching [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>A recent investigation into the practical capabilities of large language models (LLMs) reveals significant limitations in their use for routine administrative tasks within hospital environments. Conducted by Eyal Klang and colleagues at the Icahn School of Medicine at Mount Sinai in New York, the study critically evaluates the performance of state-of-the-art LLMs on essential number-crunching operations that healthcare administrators depend on daily. Published in PLOS Digital Health, these findings provide essential technical insights into the challenges facing AI implementation in clinical administrative workflows.</p>
<p>Hospitals today rely heavily on electronic health records (EHRs): structured datasets that capture patient information, resource availability, and care events. Administrators use these data to monitor patient loads, allocate resources, and generate operational reports. Traditionally, such tasks fall to specialized data analysts working with programming languages and database queries, a process often fraught with delays when decisions require rapid answers. The promise of LLMs like GPT-4o and Llama has been to democratize data access by letting non-technical staff query these datasets directly using natural language prompts.</p>
<p>In the study, researchers subjected nine leading LLMs to a rigorous battery of tests designed to emulate two foundational administrative functions: counting how many patients meet a specific clinical condition and filtering records based on multiple inclusion criteria simultaneously. The data itself was sourced from a substantial real-world dataset of over 50,000 emergency department visits within the Mount Sinai Health System, grounding the evaluation in practical, messy clinical data rather than synthetic or simplified examples.</p>
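<p>The two task types evaluated in the study can be illustrated with a small pandas sketch on a synthetic visits table (the column names and values here are illustrative stand-ins, not those of the Mount Sinai dataset):</p>

```python
import pandas as pd

# Synthetic stand-in for an emergency department visits table.
# All columns and values are hypothetical.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "admitted":   [True, False, True, True, False],
    "age":        [67, 34, 71, 45, 82],
    "triage":     ["urgent", "routine", "urgent", "urgent", "routine"],
})

# Task 1: count how many patients meet a single clinical condition.
n_admitted = int(visits["admitted"].sum())

# Task 2: filter records on multiple inclusion criteria simultaneously.
cohort = visits[
    visits["admitted"] & (visits["age"] >= 65) & (visits["triage"] == "urgent")
]

print(n_admitted)   # 3
print(len(cohort))  # 2  (patients 1 and 3)
```

<p>Both operations are trivial for a data analyst; the study's question is whether an LLM can answer them reliably from the raw table alone.</p>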
<p>The initial experiments employed straightforward prompting, in which models were simply asked direct questions such as “How many patients were admitted from this table?” All tested LLMs demonstrated subpar accuracy, failing to provide reliable answers to these structured queries. This underlines a fundamental disconnect between how LLMs are trained and the numerical and logical operations that real-world healthcare datasets demand.</p>
<p>To enhance performance, the researchers explored a chain-of-thought prompting approach. This method instructs the model to transparently reason through the problem step-by-step before arriving at the final answer, theoretically enabling more accurate and consistent outputs. However, the results were underwhelming; only modest improvements were observed on smaller tables, and as the size and complexity of the data increased, accuracy declined precipitously. For instance, even GPT-4o, the best performing model under this regime, saw accuracy plummet from approximately 95% on small datasets to below 60% when confronted with larger tables.</p>
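<p>A chain-of-thought prompt of the kind described might be assembled as in the sketch below; the wording is illustrative, not the paper's actual prompt template:</p>

```python
# Hypothetical chain-of-thought prompt for a structured-table query.
# The table is serialized into the prompt, which is where accuracy
# degrades as tables grow larger.
table_csv = "patient_id,admitted\n1,True\n2,False\n3,True"

cot_prompt = (
    "You are given the following table of emergency department visits:\n"
    f"{table_csv}\n\n"
    "Question: How many patients were admitted?\n"
    "Think step by step: first identify the relevant column, then check "
    "each row against the condition, then count the matches. State your "
    "reasoning before giving the final number."
)
print(cot_prompt)
```

<p>Because the entire table must fit into the prompt and be reasoned over row by row, this style of prompting scales poorly, which is consistent with the accuracy drop the researchers observed on larger tables.</p>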
<p>Recognizing that prompting alone may not suffice, the researchers shifted focus to a tool-based execution approach. Here, LLMs were tasked with generating executable code, such as SQL or Python scripts, to process the data programmatically. This method leverages the LLM’s natural-language understanding to translate a question into precise machine-readable commands, which are then run directly against the EHR data, so the counting and filtering are performed deterministically rather than by the model itself. This approach substantially improved results for the most advanced models: GPT-4o and Qwen-2.5-72B demonstrated near-perfect accuracy under these conditions, successfully navigating complex filters and large datasets.</p>
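<p>The tool-based pattern, in which the model emits a query that is executed rather than answering in prose, can be sketched with sqlite3; the SQL string below stands in for model output, and the table schema is hypothetical:</p>

```python
import sqlite3

# In-memory stand-in for an EHR table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, admitted INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 1, 67), (2, 0, 34), (3, 1, 71), (4, 1, 45)],
)

# Pretend this string came back from the LLM instead of a prose answer.
generated_sql = "SELECT COUNT(*) FROM visits WHERE admitted = 1 AND age >= 65"

# Executing the query yields an exact, reproducible count: the database
# engine does the arithmetic, not the language model.
(count,) = conn.execute(generated_sql).fetchone()
print(count)  # 2
```

<p>The model only needs to translate the question into SQL correctly once; the count itself never depends on the model's arithmetic.</p>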
<p>Despite these successes, not all models fared well. LLMs optimized for speed and efficiency, such as distilled variants of DeepSeek, struggled to produce usable outputs even when provided with the ability to generate and run code. Furthermore, the Llama-3.1-8B model encountered major difficulties, failing to produce functional results in the majority of assessments and being ultimately excluded from further analysis. These discrepancies highlight the diverse capabilities within the current LLM ecosystem and caution against broad assumptions regarding their utility in structured data environments.</p>
<p>The study’s findings carry critical implications for the future deployment of LLMs in healthcare administration. Benjamin Glicksberg, one of the authors, emphasized that without integrating tool-based strategies—combining LLM-generated code with actual execution—large language models remain fundamentally unsuitable for standalone use in clinical administrative settings. Clinical workflows frequently involve complex structured data requiring absolute reliability and precision, conditions under which straightforward natural language query processing by LLMs falls short.</p>
<p>This work also underscores the need for “agentic” approaches. Agentic AI systems act semi-autonomously, leveraging external tools and code-execution capabilities to keep results consistent and verifiable. By integrating LLMs with backend code-execution engines, hospitals could dramatically accelerate administrative processes while maintaining data integrity. Such hybrid solutions may bridge the gap between cutting-edge AI capabilities and the stringent accuracy demands of healthcare operations.</p>
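<p>A minimal form of this agentic pattern is a wrapper that executes model-generated SQL only after basic safety checks, rejecting anything that is not a single read-only query; the function name and checks below are a hypothetical sketch, not the study's implementation:</p>

```python
import sqlite3

def run_generated_query(conn, sql: str):
    """Execute model-generated SQL only if it is a single read-only SELECT."""
    stripped = sql.strip().rstrip(";")
    if not stripped.upper().startswith("SELECT") or ";" in stripped:
        raise ValueError("refusing non-SELECT or multi-statement SQL")
    return conn.execute(stripped).fetchall()

# Toy table standing in for an EHR extract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, admitted INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])

rows = run_generated_query(conn, "SELECT COUNT(*) FROM visits WHERE admitted = 1")
print(rows)  # [(2,)]
```

<p>Real deployments would layer on more than a prefix check, such as read-only database roles and query timeouts, but the division of labor is the same: the LLM proposes, and a constrained execution engine disposes.</p>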
<p>This study shines a spotlight on the often-overlooked challenges of applying AI in clinical data environments. While the hype around LLMs centers on their conversational fluency and general knowledge, the ability to perform precise numerical computations and filtered data retrieval within complex EHR systems requires a fundamentally different kind of model reliability. The researchers’ meticulous experimental design and real-world data usage offer a vital reality check for the healthcare sector’s ongoing AI ambitions.</p>
<p>Lastly, the authors note that their work did not receive any external funding, and no competing interests were declared. The open-access publication ensures that the full details, along with extensive methodological descriptions and results, remain available to researchers, clinicians, and AI developers aiming to advance safe and effective AI integration into hospital administration.</p>
<p>Overall, these findings urge healthcare providers and AI developers alike to calibrate expectations around LLMs’ current abilities in administrative contexts. They also highlight the potential of hybrid human-AI systems that combine natural language understanding with robust programming and execution frameworks. As digital healthcare continues to evolve, researchers and practitioners will need to navigate these trade-offs to harness AI’s benefits without compromising accuracy and trustworthiness.</p>
<p><strong>Web References:</strong><br />
<a href="https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001326">https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001326</a></p>
<hr />
<p><strong>Subject of Research</strong>: Not applicable<br />
<strong>Article Title</strong>: Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records<br />
<strong>News Publication Date</strong>: 7-May-2026<br />
<strong>References</strong>: Klang E, Sorin V, Korfiatis P, Sawant AS, Freeman R, Charney AW, et al. (2026) Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records. PLOS Digit Health 5(5): e0001326. DOI: 10.1371/journal.pdig.0001326</p>
<h4><strong>Keywords</strong></h4>
<p>Large Language Models, Electronic Health Records, Clinical Administration, Artificial Intelligence, GPT-4o, Tool-based AI, Chain-of-Thought Prompting, Healthcare Data Analytics, AI Reliability, Code Generation, Hospital Resource Management, Clinical Workflow Automation</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">157482</post-id>	</item>
	</channel>
</rss>
