In a groundbreaking observational study published recently in BMJ Open, researchers performed an extensive audit of the accuracy, referencing, and readability of medical information provided by five leading generative AI chatbots. These platforms, rapidly integrated across sectors such as research, education, business, marketing, and medicine, are increasingly used by the public for everyday health queries, often serving as substitutes for traditional search engines. The study raises alarming concerns about the reliability of the medical advice these chatbots dispense, revealing that half of their responses to clear, evidence-based medical questions were rated as somewhat or highly problematic.
The research specifically targeted five widely used chatbots available as of February 2025: Gemini by Google, DeepSeek by High-Flyer, Meta AI by Meta, OpenAI’s ChatGPT, and Grok from xAI. To assess their propensity for misinformation, the investigators crafted 50 tailored prompts covering five pivotal health topics—cancer, vaccines, stem cells, nutrition, and athletic performance. These prompts were carefully designed to mimic common health inquiries and incorporated known misinformation tropes to ‘stress test’ the chatbots’ behavioral vulnerabilities. This methodological approach is vital in understanding how AI processes and communicates complex health-related content under adversarial conditions, revealing critical weaknesses in the current generation of conversational AI.
The study used both open-ended and closed question formats. Closed prompts required the chatbots to select from predefined response options, with a single correct answer aligned with scientific consensus. Open-ended prompts, for which multiple responses were generated, encouraged more elaborate and informative answers. The analysis showed that open-ended questions frequently elicited highly problematic responses (around 40 in total) while also producing fewer non-problematic answers than closed prompts. This contrast highlights a significant challenge: the greater latitude given to AI models in free-form answering increases the risk of inaccurate or misleading medical advice, which could misdirect people seeking trustworthy health information.
Critically, the overall quality of responses did not differ significantly among most chatbots; however, Grok, the AI developed by xAI, generated a notably higher proportion of highly problematic answers, 58% of total responses, raising questions about its fitness for public-facing medical communication. Conversely, Google's Gemini showed the lowest rate of flawed responses and the highest proportion of safe, scientifically accurate information. These findings suggest that differences in architecture, training data, or fine-tuning methodology can substantially affect the quality of AI-generated medical advice.
Performance varied across medical domains: chatbots demonstrated relatively strong knowledge of and adherence to scientific consensus on subjects such as vaccines and cancer. This may reflect the well-established and extensively studied nature of these topics, for which abundant high-quality data support model training. By contrast, chatbot outputs on stem cells, athletic performance, and nutrition were less reliable, marked by more frequent misinformation or incomplete answers. These areas often involve emerging research, contradictory studies, and marketing hype, all of which challenge AI models still prone to conflating scientific evidence with pseudoscientific claims.
Compounding the issue, the study found that AI responses were invariably delivered with an unwavering tone of confidence and certainty, seldom accompanied by disclaimers or acknowledgments of uncertainty. Out of 250 total questions, chatbots declined to answer only twice, both refusals originating from Meta AI when queried about anabolic steroids and alternative cancer therapies. The uncritical confidence expressed by AI systems could dangerously mislead users who lack the expertise to discern the nuances or trustworthiness of the information, potentially fostering false security around dubious health interventions.
The audit further exposed profound deficiencies in reference quality accompanying chatbot responses. On average, the completeness of citation lists was a mere 40%, with no chatbot providing fully accurate, verifiable reference data. This is exacerbated by frequent AI hallucinations and fabricated citations—non-existent or misleading sources presented as credible evidence. Such hallucinatory behavior undermines the integrity of AI-mediated health communication and complicates users’ ability to verify information independently.
Another critical limitation concerns the readability of chatbot-generated content. The researchers applied the Flesch Reading Ease score, a standard metric for evaluating textual complexity, and found that outputs consistently fell within the 'difficult' range, equivalent to college-graduate reading proficiency. This elevated cognitive demand poses a barrier for many laypersons, who may struggle to interpret and apply health information correctly, potentially aggravating existing disparities in health literacy.
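For context, the article does not spell out the metric itself; as a background note (not a detail reported in the study), the standard Flesch Reading Ease score is computed as:

FRE = 206.835 − 1.015 × (total words ÷ total sentences) − 84.6 × (total syllables ÷ total words)

Lower scores indicate harder text, and the lowest bands of the scale are conventionally mapped to college or college-graduate reading levels, consistent with how the researchers characterize the chatbots' output.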
The study authors acknowledge limitations to their findings. The analysis covered only five chatbots, and because AI technology evolves rapidly, results may differ for newer model versions or other platforms. Moreover, the deliberate use of adversarial, misinformation-laden prompts to stress-test the models may have exaggerated the prevalence of inaccuracies, since not all real-world user queries adopt such antagonistic framing.
Nonetheless, the researchers emphasize the imperative need to critically re-evaluate how generative AI chatbots are deployed in public health and medical communication. They stress that by default, current models operate by inferring statistical patterns from their training data rather than reasoning or ethically weighing evidence. Consequently, these chatbots can generate responses that sound authoritative but may be factually flawed—a phenomenon rooted in intrinsic behavioral limitations of large language models.
Furthermore, these models rely on training data drawn from Q&A forums, social media, and open-access scientific publications, which represent only 30 to 50% of all published studies; this can enhance conversational fluency while compromising scientific rigor. Such partial visibility into the literature creates gaps and biases that, together with the absence of real-time data integration, further limit the models' capacity to provide timely and accurate medical guidance.
As generative AI becomes increasingly embedded in health information ecosystems, this study serves as a critical wake-up call. The researchers advocate for comprehensive public education to raise awareness about the limitations and potential risks of AI-generated advice. They also call for professional training to equip healthcare providers to navigate and counter misinformation propagated through AI chatbots. Above all, the study underscores the urgent necessity for robust regulatory oversight to ensure generative AI functions as a tool that supports, rather than undermines, public health.
These findings chart a crucial path forward, outlining the scientific community’s responsibility to harness AI’s transformative potential while safeguarding the accuracy, reliability, and accessibility of health communication. The convergence of technological innovation and healthcare mandates an interdisciplinary effort to refine AI systems, improve transparency, mitigate misinformation, and promote equitable dissemination of trustworthy medical knowledge. As AI continues its rapid evolution, only through vigilant stewardship can it become a true ally in advancing global public health outcomes.
Subject of Research: People
Article Title: Generative artificial intelligence-driven chatbots and medical misinformation: an accuracy, referencing and readability audit
News Publication Date: 14-Apr-2026
Web References: http://dx.doi.org/10.1136/bmjopen-2025-112695
References: BMJ Open
Keywords: Generative AI, Public health, Science communication, Science education

