In the rapidly evolving landscape of artificial intelligence (AI), its deployment in environments where safety is paramount, such as hospitals and aviation, demands a nuanced understanding that goes beyond mere technological performance. Recent research led by engineers at The Ohio State University reveals that even highly accurate AI algorithms, paired with only minimal user training, cannot guarantee safe and effective operation in these high-stakes settings. Instead, evaluating AI systems must involve a joint assessment of both the technology and the human operators who rely on it, in order to fully grasp AI's impact on critical decision-making processes.
The study, published in npj Digital Medicine, underscores the importance of simultaneous evaluation frameworks that consider how humans interact with AI across a spectrum of algorithmic performance, from optimal accuracy to significant errors. This joint assessment approach moves away from traditional evaluations that isolate machine capability from user response, aiming instead to emulate complex real-world scenarios where AI recommendations may be inconsistent or flawed.
At the heart of the research lies an experimental study involving 450 nursing students and 12 licensed nurses. These participants engaged with AI-assisted remote patient-monitoring interfaces designed to simulate clinical decision-making related to urgent medical care needs. Over a sequence of ten patient cases, participants contended with varying experimental conditions, including the absence of AI help, presentation of AI-generated risk predictions, AI-annotated clinical data, and a combination of both predictions and annotations. The goal was to measure how AI performance influenced the participants’ ability to accurately assess the urgency of patient conditions.
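To make the experimental design concrete, the sketch below shows one way such a session could be represented in code. It is an illustration only: the condition names, the per-case condition assignment, and the prevalence and accuracy figures are assumptions made for the example, not details reported in the study.

```python
from dataclasses import dataclass
from enum import Enum, auto
import random

class AICondition(Enum):
    """The four experimental conditions described in the article."""
    NO_AI = auto()                      # no AI assistance
    PREDICTION_ONLY = auto()            # AI-generated risk prediction shown
    ANNOTATION_ONLY = auto()            # AI-annotated clinical data shown
    PREDICTION_AND_ANNOTATION = auto()  # both prediction and annotations shown

@dataclass
class PatientCase:
    case_id: int
    truly_urgent: bool  # ground-truth urgency of the simulated case
    ai_correct: bool    # whether the AI output agrees with that ground truth

def build_session(n_cases: int = 10, seed: int = 0) -> list[tuple[PatientCase, AICondition]]:
    """Assemble a ten-case session, pairing each case with one condition."""
    rng = random.Random(seed)
    session = []
    for i in range(n_cases):
        case = PatientCase(
            case_id=i,
            truly_urgent=rng.random() < 0.5,  # illustrative prevalence, not from the study
            ai_correct=rng.random() < 0.8,    # illustrative AI accuracy, not from the study
        )
        session.append((case, rng.choice(list(AICondition))))
    return session

if __name__ == "__main__":
    for case, condition in build_session():
        status = "urgent" if case.truly_urgent else "stable"
        ai = "AI correct" if case.ai_correct else "AI wrong"
        print(f"case {case.case_id}: {condition.name:<26} {status:<7} {ai}")
```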
Quantitatively, the findings revealed that when the AI algorithm correctly predicted an impending medical emergency, participant decision-making improved dramatically, by as much as 50 to 60 percent. The flip side was stark: when the AI made erroneous predictions, especially ones presented with misleadingly high confidence, human decision-making performance degraded by more than 100 percent, a swing larger than the entire gain seen when the algorithm was right. This severe degradation highlights a dangerous overreliance on AI outputs, even when those outputs are demonstrably incorrect.
This phenomenon draws attention to a cognitive bias known as automation bias: the tendency of human operators to favor computer-generated suggestions over their own judgment, particularly when under pressure or facing complex data. The study further illuminated the limitations of AI-generated explanatory annotations intended to justify the algorithm’s predictions. Despite being designed to provide interpretability and additional context, these explanations had surprisingly little influence on participant decisions, which were overwhelmingly dominated by the primary risk indicators presented as bold, conspicuous cues in the interface.
Dane Morey, a research scientist in Ohio State’s Department of Integrated Systems Engineering and the study’s lead author, emphasizes the critical insight that effective AI deployment in safety-critical settings transcends the quest for ever-better algorithms. “An AI algorithm can never be perfect,” Morey stated. “So if you want an AI algorithm that’s ready for safety-critical systems, that means something about the team, about the people and AI together, has to be able to cope with a poor-performing AI algorithm.”
This insight challenges conventional paradigms that focus predominantly on engineering flawless AI systems, redirecting attention to the development of resilient human-machine teams. The research team, including Mike Rayo and David Woods, both experts in integrated systems engineering, developed the Joint Activity Testing (JAT) program to further explore this interaction. JAT represents an innovative research framework designed to empirically test and refine the dynamic between humans and AI in environments where errors carry potentially fatal consequences.
Underpinning these efforts is a set of evolving evidence-based principles aimed at guiding the design of AI systems with joint human-machine activity in mind. Among the most striking recommendations is that AI-enabled systems must transparently communicate the domains and scenarios where their outputs are likely to be inaccurate or misaligned with reality—even when the AI itself is not fully aware of these shortcomings. This transparency is essential for fostering human vigilance and critical engagement rather than blind trust in automated recommendations.
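As a purely illustrative sketch of this principle (the field names and example blind spots below are invented for the illustration, not drawn from the study's system), a risk prediction could be bundled with a machine-readable statement of the scenarios in which the model is known to be unreliable, ready to be surfaced alongside the score in the interface:

```python
from dataclasses import dataclass, field

@dataclass
class RiskPrediction:
    """An AI risk output carrying an explicit statement of its known limits.
    All field names are hypothetical, not taken from the study's interface."""
    risk_score: float   # model-estimated probability of deterioration
    confidence: float   # the model's own confidence in that score
    known_blind_spots: list[str] = field(default_factory=list)  # scenarios where the model is unreliable

    def caveats_for_display(self) -> str:
        """Render the blind spots as a short, human-readable caution for the interface."""
        if not self.known_blind_spots:
            return "No documented blind spots for this case type."
        return "Use caution: this model is known to be unreliable when " + "; ".join(self.known_blind_spots) + "."

# Hypothetical example values, for illustration only.
prediction = RiskPrediction(
    risk_score=0.87,
    confidence=0.95,
    known_blind_spots=["vital signs are sparsely sampled", "the patient has an uncommon comorbidity"],
)
print(prediction.caveats_for_display())
```

The point of such a design is that the caution travels with every prediction, so the interface cannot present the score without also presenting where the score should not be trusted.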
Mike Rayo articulates the broader implications of these design principles: “Even if a technology does well on those heuristics, it probably still isn’t quite ready. We need to do some form of empirical evaluation because those are risk-mitigation steps, and our safety-critical industries deserve at least those two steps of measuring performance of people and AI together and examining a range of challenging cases.”
The scale and methodology of this study mark a notable advance in the field. With 462 participants, a cohort far larger than the fewer than 30 individuals typical of human-in-the-loop AI studies, the researchers achieved robust statistical confidence in their findings. Notably, the participants were drawn from nursing students enrolled in a clinical course and from practicing nurses, ensuring relevance to real-world medical environments where AI applications are increasingly common.
The AI-assisted interface presented participants with a rich visualization of patient data—demographics, vital signs, and laboratory results—complemented by AI-generated predictions and annotations. Participants rated concern for patient deterioration on a continuous scale, allowing researchers to assess the calibration of human trust relative to AI input. The results confirmed that neither AI nor human decision-making was universally superior, demonstrating a nuanced interplay where clinical experience modulated responses but could not entirely counteract misleading algorithmic signals.
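A minimal sketch of how such continuous ratings might be scored is given below. It assumes concern is normalized to a 0-to-1 scale and uses invented numbers; the study's actual scale, scoring rules, and calibration analysis may differ.

```python
import numpy as np

def concern_accuracy(concern: np.ndarray, truly_urgent: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of cases where the rated concern falls on the correct side of a threshold."""
    flagged = concern >= threshold
    return float(np.mean(flagged == truly_urgent))

def shift_toward_ai(concern: np.ndarray, ai_risk: np.ndarray, baseline_concern: np.ndarray) -> float:
    """Mean movement of concern toward the AI's risk output, relative to an unassisted baseline.
    Positive values indicate ratings drifting toward the AI signal (a possible overreliance marker)."""
    moved = np.abs(baseline_concern - ai_risk) - np.abs(concern - ai_risk)
    return float(np.mean(moved))

# Invented data for five cases on a 0-to-1 concern scale (not from the study).
concern  = np.array([0.9, 0.2, 0.7, 0.8, 0.1])           # ratings given with AI assistance
truth    = np.array([True, False, True, False, False])    # ground-truth urgency
ai_risk  = np.array([0.95, 0.10, 0.60, 0.90, 0.05])       # AI-predicted risk
baseline = np.array([0.70, 0.30, 0.60, 0.40, 0.20])       # ratings without AI assistance

print(f"accuracy against threshold: {concern_accuracy(concern, truth):.2f}")
print(f"mean shift toward the AI:   {shift_toward_ai(concern, ai_risk, baseline):.2f}")
```

The second function captures the overreliance pattern described above: the further assisted ratings drift toward the AI's signal and away from unassisted judgment, the larger the value, regardless of whether the AI was right.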
Despite the researchers’ anticipation that explanatory information accompanying AI predictions would moderate user trust and enhance decision accuracy, the data suggested otherwise. The dominant red indicator bar representing elevated AI-predicted risk effectively “swept away” subtle annotation cues, overpowering any mitigating effects those secondary explanations might have had. This finding points to the powerful design implications of user interface elements in shaping human reliance on AI outputs.
The study’s implications resonate well beyond healthcare. As AI systems increasingly permeate other safety-critical domains like aviation, nuclear power, and defense, the principle that human-AI teams must be evaluated jointly becomes essential to advancing responsible AI deployment. The Ohio State team’s publicly available experimental technology provides a valuable model and toolkit for industries pursuing such integrative evaluations.
Moreover, the researchers have disseminated their work and insights through platforms such as AI-frontiers.org, further advocating for a paradigm shift: from seeking the best AI performance in isolation toward cultivating optimal team performance. As Morey concluded, “What we’re advocating for is a way to help people better understand the variety of effects that may come about from technologies. Basically, the goal is not the best AI performance. It’s the best team performance.”
This research was supported by the American Nurses Foundation Reimagining Nursing Initiative, reflecting the growing recognition that AI’s promise to enhance healthcare outcomes must be tempered by rigorous attention to human factors and systemic resilience. As AI technologies become more embedded in life-critical decision pathways, this study’s joint activity evaluation framework offers a blueprint for safer, more reliable human-machine partnerships.
Subject of Research: People
Article Title: Empirically derived evaluation requirements for responsible deployments of AI in safety-critical settings
Web References:
- https://doi.org/10.1038/s41746-025-01784-y
- https://u.osu.edu/csel/joint-activity-testing-jat/
- https://human-machine.team/
- https://ai-frontiers.org/articles/how-ai-can-degrade-human-performance-in-high-stakes-settings
References:
Morey, D., Rayo, M., Woods, D. (2025). Empirically derived evaluation requirements for responsible deployments of AI in safety-critical settings. npj Digital Medicine. https://doi.org/10.1038/s41746-025-01784-y
Keywords: Artificial Intelligence, Safety-Critical Systems, Human-Machine Teaming, Automation Bias, AI Evaluation, Healthcare AI, Decision-Making, Joint Activity Testing