A groundbreaking study published in the journal JMIR Mental Health has unveiled alarming evidence of frequent fabricated and erroneous citations generated by advanced Large Language Models (LLMs) such as GPT-4o in mental health research. The investigation, conducted by a team led by Jake Linardon, PhD, of Deakin University, exposes a critical vulnerability in how these increasingly popular AI tools produce academic content, casting serious doubt on the reliability of AI-generated bibliographies and challenging the integrity of scholarly communication in specialized domains.
The research is motivated by the accelerating integration of LLMs, particularly GPT-4o, into the workflows of researchers who harness these models to assist with literature reviews and knowledge synthesis. While LLMs demonstrate remarkable proficiency in text generation, this study highlights the concerning phenomenon of “hallucinated” references—citations that are outright fabricated and cannot be traced back to legitimate scientific sources. The scale of this issue is quantified with striking statistical clarity: 19.9% of all AI-generated citations were entirely fictitious, failing to correspond to any existing publication, and a remarkable 45.4% of those that appeared genuine contained substantial bibliographic inaccuracies, such as invalid or incorrect Digital Object Identifiers (DOIs).
These findings surface at a time when academic publishing is seeing a spike in submissions containing AI-generated content, a trend that increasingly tests the boundaries of peer review and editorial scrutiny. Fabricated citations are not a superficial formatting error; they fundamentally disrupt the chain of scientific verification. Such inaccuracies threaten to mislead readers, distort the scientific record, and ultimately undermine the cumulative foundation of knowledge upon which future research depends. The study argues emphatically that rigorous human verification is imperative for all AI-assisted academic outputs, especially in fields where nuanced expertise is needed to discern valid references.
An important dimension of the study involves the exploration of how the reliability of GPT-4o’s citations varies according to topic familiarity and prompt specificity. The researchers simulated literature reviews across three mental health topics with differing levels of public and scientific recognition: major depressive disorder, a well-studied and widely recognized condition; binge eating disorder, with moderate familiarity; and body dysmorphic disorder, a relatively obscure topic with limited research coverage. This stratification revealed a clear gradient in fabrication rates, with the least familiar topics suffering the highest incidence of false citations—peaking at nearly 29% for body dysmorphic disorder. In contrast, the well-established field of major depressive disorder recorded a much lower fabrication rate of around 6%.
Moreover, the study delved into the impact of prompt specificity on citation accuracy. When GPT-4o was given highly specialized review prompts, such as one focusing exclusively on digital interventions for binge eating disorder, the frequency of fabricated citations increased significantly compared with more general overview prompts. This suggests that the complexity and specificity of the requested information can exacerbate the model’s tendency to “hallucinate” references, compounding the risks to academic integrity. Thus, while LLMs can be valuable aids, the nature of the prompts and the subject matter substantially influence the trustworthiness of their bibliographic outputs.
Beyond simply cataloging these errors, the study offers a robust critique of current scholarly reliance on AI tools without adequate safeguards. It underscores that the reliability of AI-generated citations is neither static nor universally dependable but fluctuates with the domain knowledge embedded in the training data and the precision with which inquiries are framed. Academic institutions, journals, and editorial boards therefore need to recognize these shortcomings and institute proactive measures to detect and mitigate the risk of citation fabrication.
Given the persistence of these issues, the authors issue a clarion call for systematic human oversight. They advocate for mandatory verification protocols whereby researchers and students critically appraise every AI-generated citation to confirm its authenticity. Editorial workflows must be enhanced with technological solutions, such as automated detection systems designed to flag references that do not correspond to actual publications or bear suspicious metadata. These measures should be integrated alongside traditional peer review to maintain the scientific rigor and quality standards that underpin credible research.
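As one illustration of what such automated detection could look like, the sketch below checks an AI-generated citation against the public Crossref REST API: a DOI that cannot be resolved is flagged as a possible fabrication, and a registered title that diverges from the cited title is flagged as a possible bibliographic error. This is a minimal sketch rather than the tooling used in the study; the assumption that citations have already been parsed into title–DOI pairs, the use of the requests library, and the 0.8 similarity threshold are all illustrative choices.

```python
# Minimal sketch of an automated citation check, assuming citations have
# already been parsed into (title, doi) pairs. Uses the public Crossref
# REST API: an unregistered DOI suggests a possible fabrication, and a
# large title mismatch suggests a bibliographic error.
from difflib import SequenceMatcher

import requests


def check_citation(cited_title: str, doi: str) -> str:
    """Classify one citation as 'not found', 'metadata mismatch', or 'ok'."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        return "not found"  # DOI is not registered with Crossref
    resp.raise_for_status()
    registered_titles = resp.json()["message"].get("title") or [""]
    similarity = SequenceMatcher(
        None, cited_title.lower(), registered_titles[0].lower()
    ).ratio()
    # 0.8 is an illustrative threshold, not a validated cutoff.
    return "ok" if similarity >= 0.8 else "metadata mismatch"


if __name__ == "__main__":
    # Check the present study's own reference as a worked example.
    print(check_citation(
        "Influence of Topic Familiarity and Prompt Specificity on Citation "
        "Fabrication in Mental Health Research Using Large Language Models: "
        "Experimental Study",
        "10.2196/80371",
    ))
```

In an editorial workflow, a check like this would only triage references for human review; it cannot confirm that a real publication actually supports the claim for which it is cited.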
Training and policy development form another cornerstone of the recommendations. Institutions must equip scholars with the competencies required to engage critically with LLM-generated outputs—teaching them how to devise precise prompts that minimize hallucinations and how to interpret AI assistance with a discerning eye. Clear guidance and ethical frameworks should govern the use of AI in scholarly work, emphasizing transparency and accountability. Without these educational and procedural upgrades, the risk of injecting fabricated or misleading citations into the academic corpus will only grow.
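By way of illustration only, such training might contrast a loosely worded request with a more constrained template like the one below, which asks the model to supply DOIs and to state uncertainty rather than invent a source. The template is hypothetical, not drawn from the study, and its output would still require the verification steps described above.

```python
# Hypothetical prompt template for teaching purposes; not taken from the study.
# Constraining the request and demanding verifiable identifiers does not
# eliminate hallucinated references, but it makes them easier to audit.
REVIEW_PROMPT = (
    "Summarize the peer-reviewed literature on {topic}. "
    "Cite only sources you are confident exist, and give each full reference "
    "with journal, year, and DOI. If you are unsure whether a source exists, "
    "say so explicitly rather than citing it."
)

print(REVIEW_PROMPT.format(topic="digital interventions for binge eating disorder"))
```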
The implications of this study resonate broadly across the scientific communication ecosystem. Its urgent message is that the integration of sophisticated AI tools, although tremendously beneficial in accelerating research workflows, carries latent challenges that, if unaddressed, may degrade the trustworthiness of published knowledge. Researchers utilizing LLMs must, therefore, embrace a cautious and informed approach, viewing these models as supplements rather than replacements for meticulous scholarship.
In conclusion, Linardon and colleagues’ experimental study not only quantifies a troubling phenomenon but also galvanizes the academic community to adopt a vigilant posture when interfacing with AI-generated literature. The nuanced understanding of how topic familiarity and prompt specificity shape citation quality equips stakeholders with critical insights to refine AI usage strategies. This pioneering work marks a significant milestone in acknowledging and confronting the pitfalls of AI hallucination within scientific literature, reinforcing the essential role of human judgment in safeguarding research integrity.
As the landscape of academic publishing continues to evolve under the influence of AI technologies, collaborative efforts between researchers, publishers, and technologists will be crucial in developing robust frameworks and tools to ensure that innovation does not come at the cost of reliability. This study serves as an indispensable wake-up call—and a roadmap—for maintaining the sanctity of citations, the bedrock upon which credible science is founded.
Subject of Research: Not applicable
Article Title: Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study
News Publication Date: November 17, 2025
Web References: http://dx.doi.org/10.2196/80371
References:
Linardon J, Jarman H, McClure Z, Anderson C, Liu C, Messer M. Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study. JMIR Ment Health 2025;12:e80371
Image Credits: JMIR Publications
Keywords: Academic publishing, Academic ethics, Science communication, Scientific method, Retractions, Medical journals, Scientific journals, Academic journals, Citation analysis

