In a groundbreaking advancement for artificial intelligence in healthcare, a team of researchers has demonstrated that a collaborative approach involving a council of AI models can dramatically enhance the accuracy of medical knowledge assessments. The study reveals that a group of AI agents based on OpenAI’s GPT-4, working together through structured deliberation, outperforms individual AI models on the notoriously challenging United States Medical Licensing Examination (USMLE). The research, published in the open-access journal PLOS Digital Health, marks a significant step toward harnessing collective intelligence to tame the variability and occasional inaccuracy of AI-generated medical answers.
The USMLE is a rigorous, three-stage examination that assesses a physician’s ability to apply knowledge, concepts, and principles fundamental to the practice of medicine. The exam also poses a substantial challenge to AI, partly because a single large language model (LLM) often returns inconsistent answers when queried multiple times on the same material. These responses vary in quality and sometimes contain inaccuracies or hallucinations (fabricated information presented as fact), which undermines their trustworthiness in clinical settings.
To overcome these challenges, the researchers designed a council of AI agents. The ensemble consists of multiple instances of GPT-4, each providing an initial answer independently. What sets this approach apart is a novel facilitator algorithm tasked with mediating disagreements among the instances. When responses diverge, the facilitator orchestrates a deliberative dialogue that encourages the council to synthesize the varying answers. This iterative exchange yields a consensus response that, as the study shows, tends to be notably more accurate than any single model’s reply.
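To make the workflow concrete, here is a minimal sketch of such a deliberation loop in Python. The `ask` callable, the agent count, and the round limit are illustrative placeholders, not the study’s actual facilitator algorithm, which is specified in the paper itself.

```python
from collections import Counter

def council_answer(question: str, ask, n_agents: int = 5, max_rounds: int = 3) -> str:
    """Council deliberation sketch. `ask(prompt) -> str` queries one
    LLM instance (e.g., one GPT-4 chat session); it is supplied by the
    caller and stands in for whatever API the study actually used."""
    # Round 0: each agent answers the question independently.
    answers = [ask(question) for _ in range(n_agents)]

    for _ in range(max_rounds):
        tally = Counter(answers)
        if len(tally) == 1:  # unanimous: consensus reached
            break
        # Facilitator step: confront every agent with the spread of
        # answers and ask it to deliberate and revise.
        summary = "; ".join(f"'{a}' x{c}" for a, c in tally.items())
        prompt = (
            f"{question}\n\nCouncil answers so far: {summary}.\n"
            "Weigh the disagreement and state your revised final answer."
        )
        answers = [ask(prompt) for _ in range(n_agents)]

    # Final consensus: the (ideally unanimous) majority answer.
    return Counter(answers).most_common(1)[0][0]
```

In a loop of this shape, a unanimous first round ends immediately, while a split council triggers one or more facilitated rounds in which each agent sees its peers’ answers before revising its own.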
Testing this collaborative AI framework on a set of 325 publicly available USMLE questions spanning Step 1 (foundational biomedical sciences), Step 2 Clinical Knowledge (clinical diagnosis and management), and Step 3 (advanced clinical scenarios), the council achieved remarkable accuracy rates of 97%, 93%, and 94%, respectively. These figures represent a significant improvement over solitary GPT-4 models, demonstrating the potential of collective AI reasoning in complex domains. Even more strikingly, when the council initially failed to reach unanimous agreement, subsequent deliberations resulted in correct consensus answers 83% of the time, showcasing the power of structured dialogue for self-correction.
The implications of these findings extend far beyond standardized exams. In healthcare, where nuanced understanding and precision are paramount, AI tools must be not only powerful but reliable and trustworthy. By enabling multiple AI agents to “converse” and refine their outputs, this collaborative approach introduces a quality control mechanism inherently absent in single-model systems. The authors of the study argue that such collective intelligence may redefine how we evaluate AI’s effectiveness and reliability, especially in high-stakes environments like medical diagnosis and treatment planning.
Yahya Shaikh, the lead author from Baltimore, emphasizes the transformative potential of this methodology. “Our research establishes that when AI systems engage in a structured dialogue, they can surpass the accuracy of any one system, achieving unprecedented performance on complex medical licensing exams without specialized training on medical data,” Shaikh says. This insight underscores the viability of leveraging dialogue and diversity among AI models as a means of minimizing errors and harnessing the strengths of their varied reasoning pathways.
Importantly, the study also addresses fundamental misconceptions about AI reliability. Conventional wisdom favors consistency in AI outputs as a hallmark of quality, yet the work by Shaikh and colleagues reveals that variability among responses, when channeled appropriately, becomes an asset rather than a liability. Instead of expecting uniformity, the council model embraces diverse perspectives, allowing AI agents to weigh different interpretations and evidence before converging on a final answer. This dynamic mirrors human expert panel discussions and decision-making processes, lending credibility to the concept of AI teamwork.
Another intriguing aspect of the research involves the concept of semantic entropy, a measure used to quantify the diversity and uncertainty in AI-generated answers. Zainab Asiyah, a co-author of the study, notes that semantic entropy provides a narrative of the AI’s internal struggle and eventual resolution, akin to a human cognitive journey. This metric reveals how individual AI agents can influence one another’s viewpoints through conversation, sometimes even challenging incorrect answers and steering the group toward consensus.
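For readers unfamiliar with the metric, semantic entropy is commonly computed by grouping a model’s answers into clusters of equivalent meaning and taking the Shannon entropy over those clusters. The sketch below follows that common formulation; the `same_meaning` equivalence check is a placeholder (in practice often an entailment model or an LLM judge), and the paper’s exact definition may differ.

```python
import math

def semantic_entropy(answers: list[str], same_meaning) -> float:
    """Shannon entropy over clusters of semantically equivalent answers.
    `same_meaning(a, b) -> bool` is a caller-supplied equivalence check
    standing in for whatever method the study used."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # a new, semantically distinct answer

    n = len(answers)
    # Zero when every answer means the same thing; grows as the
    # council's answers spread across more distinct meanings.
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

Tracked across deliberation rounds, a council that starts with high entropy and converges toward zero traces exactly the journey from disagreement to resolution that Asiyah describes.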
While the results are promising, the authors caution that the council approach has yet to be validated in real-world clinical practice. Practical applications will require addressing challenges such as integrating AI councils into existing healthcare workflows, ensuring transparency of deliberation processes, and managing the ethical implications of AI-influenced decisions. Nonetheless, the demonstrated ability of AI systems to self-correct and improve through interaction opens a new horizon for AI deployment in medicine, education, and possibly other knowledge-intensive fields.
Zishan Siddiqui, another co-author, highlights the pragmatic nature of this work by emphasizing that the focus is not merely on showcasing AI’s test-taking abilities but on developing a method that leverages AI’s natural variation to enhance accuracy. “This system’s capability to ‘take a few tries, compare notes, and self-correct’ should be incorporated into future tools designed for education and healthcare, where correctness is non-negotiable,” Siddiqui notes. By fostering redundancy and collaborative reasoning, the council model could serve as a foundation for safer, more effective AI-based decision support systems.
This study stands as a compelling example of the evolving landscape of artificial intelligence research, where the focus shifts from individual model performance to synergistic interactions among multiple agents. Collaborative intelligence presents a promising avenue to overcome the current limitations of LLMs, offering robustness against errors and amplifying strengths through collective insight. As AI continues to integrate into critical domains, approaches that emulate human-like collaboration among AI systems are likely to shape the future trajectory of machine learning applications.
In summary, the research conducted by Yahya Shaikh and colleagues presents a transformative approach by demonstrating that a council of GPT-4-based AI models, working through iterative discussion and consensus-building, can significantly enhance accuracy on medical licensing exams. This work challenges preconceived notions of AI consistency and reliability, introducing a paradigm where variability and constructive debate are harnessed to achieve superior outcomes. As AI tools advance, embracing collaborative intelligence may hold the key to unlocking new levels of accuracy, trust, and utility in both medical and broader scientific contexts.
Subject of Research: Not applicable
Article Title: Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE
News Publication Date: October 9, 2025
Web References: http://dx.doi.org/10.1371/journal.pdig.0000787
References: Shaikh Y, Jeelani-Shaikh ZA, Jeelani MM, Javaid A, Mahmud T, Gaglani S, et al. (2025) Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE. PLOS Digit Health 4(10): e0000787.
Image Credits: Nguyen Dang Hoang Nhu, Unsplash (CC0)
Keywords: Artificial Intelligence, Medical Licensing Exam, USMLE, GPT-4, Collaborative Intelligence, Large Language Models, AI Deliberation, Medical AI Accuracy, Structured Dialogue, AI Self-Correction