Evaluating AI Nursing Care Plans: Readability, Reliability, Quality

SCIENMAG — Mon, 12 Jan 2026 14:56:10 +0000

Researchers Gokalp and Yucel have compared nursing care plans generated by three prominent AI models: ChatGPT, Gemini, and DeepSeek. Their study, titled “Comparative analysis of nursing care plans produced by artificial intelligence models (ChatGPT, Gemini, and DeepSeek) in terms of readability, reliability, and quality,” examines how AI can enhance, or potentially disrupt, traditional nursing practice. As artificial intelligence is woven into more areas of healthcare, the implications of this research extend beyond academic inquiry.

The study’s methodology is noteworthy. The researchers generated nursing care plans with each of the three AI models, aiming for documentation that adhered to clinical guidelines. By systematically assessing each model’s output, Gokalp and Yucel sought to identify strengths and weaknesses in readability, reliability, and overall quality. This approach highlights the capabilities of these AI models and underscores the need to evaluate their applications carefully in real-world clinical settings.

Readability is a critical factor in the adoption of nursing care plans by healthcare professionals. The researchers used several readability scoring formulas to quantify how easily a healthcare provider could comprehend the generated documents. Their findings indicate that while all three AI models produced text that met basic readability standards, differences emerged in the complexity and terminology of the text. For instance, ChatGPT tended to use more straightforward language, making it accessible to nursing staff across experience levels, while DeepSeek occasionally incorporated more technical jargon that might not be universally understood.
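To make the idea of a readability scoring formula concrete, the sketch below computes two widely used measures, Flesch Reading Ease and the Flesch-Kincaid Grade Level. The article does not say which formulas the study applied, so the choice of these two is an assumption. The function names and the vowel-group syllable counter are illustrative simplifications, not the researchers' tooling.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, dropping one for a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability_scores(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        # Higher score means easier text.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate school grade needed to follow the text.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    # Hypothetical care-plan snippets, written for this example only.
    plain = "Check the wound each day. Keep the skin clean and dry."
    jargon = "Evaluate periwound integumentary integrity diurnally."
    for label, text in (("plain", plain), ("jargon", jargon)):
        print(label, readability_scores(text))
```

Scoring the same patient scenario as written by each model would give a like-for-like comparison. Shorter sentences and shorter words raise the Reading Ease score and lower the grade level, which is the kind of difference the article describes between plainer and more jargon-heavy outputs.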

Reliability in nursing care plans is paramount, as these documents serve as cornerstones for patient care and decision-making processes. The researchers applied a robust framework for assessing reliability through expert reviews, where health professionals evaluated the clinical soundness of the AI-generated plans. This aspect of the study demonstrates that while each model produced reliable care plans, variances were observed. Gemini’s outputs, for example, received commendation for their thoroughness and adherence to best practices, indicating the model’s potential applicability in high-stakes healthcare environments where precision is crucial.

Quality, another crucial element in the evaluation framework, encompasses various factors such as comprehensiveness, contextual relevance, and alignment with patient-centered care principles. The study found that while each AI model demonstrated strengths in producing quality care plans, there were significant differences in how well each adhered to the principles of holistic nursing care. This is particularly important in nursing, which emphasizes not just biological aspects of care but also psychosocial and cultural factors that contribute to a patient’s well-being. The ability of AI to grasp and articulate these nuances is essential as the healthcare landscape evolves towards more integrated and personalized approaches.

This research also raises substantial questions about the role of AI in nursing practice. The benefits of greater efficiency and the potential for improved patient outcomes must be weighed against concerns about the depersonalization of care and over-reliance on technology. As sophisticated AI tools become more prevalent, nursing will need to balance technological support with the inherently human aspects of care. This balance will likely be a point of focus for nursing professionals and educators as they integrate AI into training curricula and clinical practice.

Interestingly, the study also delves into the ethical considerations surrounding AI-generated care plans. Questions arise about accountability when care plans produced by algorithms influence clinical decision-making. If a care plan generated by an AI model leads to a medical oversight or error, who bears the responsibility? This inquiry resonates deeply within the healthcare community, prompting dialogues about the ethical implications of integrating artificial intelligence into everyday clinical workflows. The need for a clear framework surrounding accountability and transparency in AI applications is critical as healthcare moves forward.

The findings from Gokalp and Yucel’s research are especially timely, resonating with current discourse on the adoption of technology in healthcare. As healthcare systems strive for efficiency and accuracy in patient care, the use of AI models like ChatGPT, Gemini, and DeepSeek could offer valuable resources, provided that their integration is approached with caution and thorough oversight. The role of policymakers will be vital in ensuring that clear regulations and standards are established to govern the use of AI in clinical settings.

Moreover, this research sheds light on the training and support required for nursing professionals to utilize AI-generated care plans effectively. Continuous professional development and education will be needed to equip nurses with the necessary skills to critically assess AI outputs. While AI can facilitate numerous aspects of care planning, the human touch remains irreplaceable. Ensuring that nurses are confident in leveraging these technological advancements while maintaining a patient-first approach will be essential for future healthcare models.

In conclusion, the comparative analysis conducted by Gokalp and Yucel serves as a significant milestone in understanding the potential and challenges of AI in nursing. By evaluating AI-generated care plans through lenses of readability, reliability, and quality, the researchers offer a comprehensive insight into how these tools can complement, rather than replace, the critical work that nurses perform. Achieving nursing excellence in the age of artificial intelligence demands an ongoing commitment to evaluation, adaptation, and ethical scrutiny. The landscape of healthcare is undoubtedly shifting, and studies like this pave the way for a more informed, thoughtful embrace of technology in nursing practice.

Subject of Research: The comparative analysis of nursing care plans produced by artificial intelligence models.

Article Title: Comparative analysis of nursing care plans produced by artificial intelligence models (ChatGPT, Gemini, and DeepSeek) in terms of readability, reliability, and quality.

Article References:

Gokalp, M.G., Yucel, S.C. Comparative analysis of nursing care plans produced by artificial intelligence models (ChatGPT, Gemini, and DeepSeek) in terms of readability, reliability, and quality. BMC Nurs (2026). https://doi.org/10.1186/s12912-026-04295-7

DOI: 10.1186/s12912-026-04295-7

Keywords: artificial intelligence, nursing care plans, readability, reliability, quality, healthcare, ChatGPT, Gemini, DeepSeek.

Study Reveals AI Chatbots Prone to Medical Misinformation, Underscoring Urgent Need for Enhanced Safeguards

SCIENMAG — Wed, 06 Aug 2025 14:13:50 +0000

A study conducted by scientists at the Icahn School of Medicine at Mount Sinai has exposed a significant vulnerability in widely used artificial intelligence (AI) chatbots, particularly within healthcare settings. The research shows that these language models are prone to repeating and expanding on false medical details when presented with inaccurate or fabricated information. The finding raises concerns about integrating AI into clinical decision-making without safeguards and points to the need for stringent protections before these tools can be reliably deployed in patient care.

The research team embarked on a meticulous evaluation of popular large language models (LLMs) by crafting controlled experimental scenarios featuring entirely fictional medical terms—fake diseases, symptoms, and diagnostic tests. These fabricated patient cases were designed to assess if AI chatbots would blindly accept and elaborate on erroneous information embedded within queries. Initial results were troubling: without intervention, the chatbots not only regurgitated the false data but often augmented it with confident, detailed explanations about these invented conditions, effectively demonstrating a hallmark behavior known as “hallucination,” where models produce convincing yet entirely fabricated content.

Crucially, the investigators identified a simple yet powerful mitigation strategy. By appending a concise cautionary prompt to the AI input—alerting the system that the provided information might be inaccurate—they observed a marked decrease in the frequency and severity of hallucinated responses. This modification effectively halved the incidence of erroneous elaborations, suggesting that prompt engineering and built-in safety warnings can serve as practical countermeasures to reduce misinformation propagation in AI-enabled healthcare applications.

Mahmud Omar, MD, the lead author and an independent consultant collaborating on this study, emphasized the ease with which these systems can be derailed. “Our experiments showed that AI chatbots are highly susceptible to being misled by false medical details—whether introduced accidentally or deliberately. The danger lies not only in parroting misinformation but also in crafting detailed, plausible narratives around these untrue facts,” he explained. Omar noted that the intervention of a simple one-line warning in the prompt dramatically diminished such hazardous behaviors, underscoring that even incremental design changes can have outsized impacts on safety.

The study’s methodology involved a two-pronged approach. In the first phase, chatbots were prompted with fabricated clinical vignettes devoid of any safety instructions, allowing researchers to observe natural responses. In the subsequent phase, a brief disclaimer was embedded within the prompt, cautioning the model about potential inaccuracies in the input data. The comparative analysis clearly demonstrated that the presence of the warning significantly curtailed hallucination rates, reinforcing the concept that proactive prompt design is an indispensable component of responsible AI deployment in medical contexts.
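The two-phase comparison described above can be sketched as a small evaluation harness. Everything named here is an assumption made for illustration: the article does not quote the study's actual warning text, so `SAFETY_WARNING` is hypothetical wording, `stub_model` stands in for a real chatbot call, and `simple_judge` is a crude placeholder for whatever review process the researchers used to decide that a reply elaborated on a fabricated term.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical wording; the study's real one-line warning is not given in the article.
SAFETY_WARNING = (
    "Note: this question may contain inaccurate or fabricated medical terms. "
    "Flag anything you cannot verify instead of explaining it."
)


def build_prompt(vignette: str, with_warning: bool) -> str:
    """Phase 1 sends the vignette alone; phase 2 appends the warning."""
    return f"{vignette}\n\n{SAFETY_WARNING}" if with_warning else vignette


def hallucination_rate(
    cases: Iterable[Tuple[str, str]],
    ask_model: Callable[[str], str],
    elaborates: Callable[[str, str], bool],
    with_warning: bool,
) -> float:
    """Fraction of (vignette, fake_term) cases where the reply elaborates on the fake term."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one case is required")
    hits = sum(
        elaborates(ask_model(build_prompt(vignette, with_warning)), fake_term)
        for vignette, fake_term in cases
    )
    return hits / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a chatbot: plays along with the prompt unless it sees the warning."""
    if SAFETY_WARNING in prompt:
        return "I cannot verify one of the terms in this question."
    return f"Here is how that condition is usually managed. (Echo: {prompt})"


def simple_judge(reply: str, fake_term: str) -> bool:
    """Placeholder judge: the reply uses the fake term and expresses no doubt."""
    text = reply.lower()
    return fake_term.lower() in text and "cannot verify" not in text


if __name__ == "__main__":
    # "Examplitis" is a made-up term invented for this sketch.
    cases = [("A patient reports symptoms of Examplitis. What is the treatment?", "Examplitis")]
    for flag in (False, True):
        rate = hallucination_rate(cases, stub_model, simple_judge, with_warning=flag)
        print(f"with_warning={flag}: rate={rate:.2f}")
```

The stub's behavior is deliberately all-or-nothing and says nothing about real models. The useful part is the structure: the same cases run under both conditions, so any difference in rate can be attributed to the warning alone.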

Eyal Klang, MD, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at Mount Sinai and co-corresponding senior author, highlighted the gravity of their findings. “Even a single fabricated term injected into a medical question can trigger the model to generate an authoritative-sounding but entirely fictional medical explanation. However, our results also provide a roadmap for safer AI use: carefully timed safety prompts can meaningfully mitigate those errors, pointing to a future where AI can augment clinical workflows without compromising accuracy,” he stated.

Beyond immediate practical implications, the researchers intend to extend their “fake-term” testing paradigm to real-world datasets, involving de-identified patient records. This next phase aims to stress-test AI systems against misinformation within authentic clinical contexts, allowing for further refinement of safety prompts and integration of retrieval-based tools that may cross-validate the chatbot’s outputs against trustworthy medical knowledge bases. Their iterative validation approach aspires to create robust mechanisms that prevent AI hallucinations from influencing patient care decisions.

Girish N. Nadkarni, MD, MPH, Chair of the Windreich Department of Artificial Intelligence and Human Health and co-corresponding senior author, underscored the broader significance of this research. “Our study shines a spotlight on a blind spot within current AI models—their inadequate handling of false medical information, which can generate dangerously misleading responses. The solution lies not in abandoning AI but in engineering systems designed to recognize questionable input, respond with appropriate caution, and preserve essential human oversight,” he reflected. Nadkarni emphasized that deliberate safety measures and thoughtful AI prompt design are critical levers to unlocking the potential of AI in healthcare while mitigating risks.

The pressing issues revealed by this investigation resonate widely across the rapidly expanding intersection of AI and medicine. As clinicians and patients increasingly adopt AI tools for decision support, the need for transparent, reliable, and safe systems is paramount. Hallucinations—defined as confident AI fabrications—could lead not only to diagnostic errors but also to erosion of trust in technology-assisted medicine. This study’s findings prompt a call for regulatory frameworks, stringent validation protocols, and responsible AI integration practices that prioritize patient safety above all.

Technically, this research also advances our understanding of the vulnerabilities inherent in large language models. These transformer-based architectures rely on patterns learned from large datasets rather than on verification against grounded medical facts, which leaves them prone to propagating misinformation when given deceptive input. The study’s finding that a minimal prompt addition can curb hallucination frequency suggests that the problem can be partially addressed at the interface between human input and AI generation, rather than requiring entirely new model architectures.

In addition to safety prompt engineering, the study hints at the importance of developing AI systems capable of uncertainty quantification and fact-checking. Future models may incorporate retrieval-augmented generation (RAG) techniques, linking generated responses to verified medical literature or electronic health records in real time to validate outputs. Such approaches, combined with real-time human intervention, could transform AI chatbots from mere language predictors into reliable clinical assistants supporting complex decision-making.
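As a rough illustration of cross-checking against a trusted source, the sketch below looks up each medical term in a small in-memory knowledge base and reports the terms that have no supporting passage. This is a plain keyword lookup standing in for a real retriever, not the Mount Sinai team's system. The function names and the two-passage knowledge base are assumptions made for the example.

```python
import re
from typing import Dict, List


def find_support(term: str, knowledge_base: List[str]) -> List[str]:
    """Return every trusted passage that mentions `term` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [passage for passage in knowledge_base if pattern.search(passage)]


def cross_check(terms: List[str], knowledge_base: List[str]) -> Dict[str, List[str]]:
    """Map each term to the passages that support it (empty list if none)."""
    return {term: find_support(term, knowledge_base) for term in terms}


def unsupported_terms(terms: List[str], knowledge_base: List[str]) -> List[str]:
    """Terms with no supporting passage; candidates for caution or human review."""
    return [t for t, passages in cross_check(terms, knowledge_base).items() if not passages]


if __name__ == "__main__":
    # Tiny stand-in for a vetted medical knowledge base.
    knowledge_base = [
        "Hypertension is persistently elevated blood pressure.",
        "Anemia is a reduced concentration of hemoglobin in the blood.",
    ]
    # "Examplitis" is a made-up term invented for this sketch.
    print(unsupported_terms(["hypertension", "Examplitis"], knowledge_base))
```

A production system would replace the keyword match with search over verified literature and would pass the retrieved passages to the model as context. The design point is the same either way: a term that cannot be tied to a trusted source is treated as questionable and routed to a human rather than explained with confidence.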

Mount Sinai’s Windreich Department of Artificial Intelligence and Human Health, under the leadership of Drs. Nadkarni and Klang, is at the forefront of pioneering responsible AI applications in biomedical contexts. This study not only exemplifies their commitment to ethical AI but also provides a foundational framework for other institutions seeking to evaluate and enhance the safety of AI-driven clinical decision tools. Their continued collaboration with the Hasso Plattner Institute for Digital Health underscores a multidisciplinary approach that bridges computational science, engineering, and clinical medicine.

As artificial intelligence continues to permeate healthcare, studies like this expose the critical challenges posed by AI hallucinations and misinformation. Nevertheless, the promising results around simple safety prompts reflect an optimistic path forward where AI tools can be refined and rigorously tested to meet the high standards essential for clinical reliability. The balance between innovation and caution articulated by the Mount Sinai team paves the way for transformative yet responsible AI advancements that ultimately benefit patient outcomes and the future of medicine.

Subject of Research: Evaluation of misinformation propagation and hallucination tendencies in AI chatbots for clinical decision support, and the effectiveness of safety prompt interventions.

Article Title: Large Language Models Demonstrate Widespread Hallucinations for Clinical Decision Support: A Multiple Model Assurance Analysis

News Publication Date: August 6, 2025

Web References:
https://dx.doi.org/10.1038/s43856-025-01021-3
https://ai.mssm.edu/

References:

Omar, M., Sorin, V., Collins, J.D., Reich, D., Freeman, R., Charney, A., Gavin, N., Stump, L., Bragazzi, N.L., Nadkarni, G.N., & Klang, E. (2025). Large Language Models Demonstrate Widespread Hallucinations for Clinical Decision Support: A Multiple Model Assurance Analysis. Communications Medicine, August 2, 2025.

Keywords:
Machine Learning, Artificial Intelligence, Large Language Models, AI Hallucinations, Clinical Decision Support, Medical Misinformation, Prompt Engineering, AI Safety, Digital Health, Biomedical Informatics