In a study conducted at the University of California San Diego, researchers report the first empirical evidence that modern artificial intelligence systems can pass the Turing Test, an evaluation that challenges machines to replicate human conversation so convincingly that people cannot reliably distinguish them from human interlocutors. The result shows that advanced large language models (LLMs) can hold a dialogue well enough that human observers frequently fail to identify them as machines.
The research, published in the Proceedings of the National Academy of Sciences, is the first rigorous application of the Turing Test methodology proposed in 1950 by Alan Turing, one of the foundational figures of computer science. It is also the first study to find that AI models were judged to be human as often as actual human participants under the strict conditions of the Turing Test framework.
Cameron Jones, the study’s corresponding author and a cognitive science doctoral candidate at UC San Diego at the time of the research, said that when given appropriate prompts, sophisticated LLMs can mimic human conversational traits, including tonal variation, humor, directness, and even fallibility, the very imperfections that characterize human dialogue. The finding indicates that, beyond retrieving factual knowledge, AI systems can now exhibit the nuanced social behaviors that natural human interaction depends on, which has significant implications for how AI is integrated into society.
In the test, an “interrogator” chatted simultaneously with two conversational partners, one a human and the other an AI language model, and had to identify the real human based solely on the text exchanges. This three-party setup follows the structure of Turing’s original imitation game, rather than the simpler two-party variants in which a judge talks to a single partner, and it allows a direct, symmetric comparison between human and machine.
To validate the results across populations, the study recruited two distinct participant cohorts: undergraduates from UC San Diego and a demographically diverse online group recruited via Prolific. These participants held short conversations of 5 or 15 minutes with four AI systems: the recent models GPT-4.5 and LLaMa-3.1-405B, plus GPT-4o and the classic 1960s chatbot ELIZA, which served as baselines for comparison.
Among these, GPT-4.5 was the most convincing: interrogators judged it to be the human in 73% of trials, meaning they picked it more often than the real human it was paired with. LLaMa-3.1-405B was judged human 56% of the time, a rate statistically indistinguishable from that of actual human conversation partners. By contrast, the baseline systems ELIZA and GPT-4o were mistaken for humans only around 21-23% of the time, which highlights how quickly the conversational proficiency of generative AI has advanced.
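One way to read these percentages: an interrogator who cannot tell the two witnesses apart would pick the AI about 50% of the time, so a win rate can be compared against that chance level. The sketch below does this with an exact two-sided binomial test. It is an illustration only. The article does not state which statistical test the authors used or how many trials each model had, so the trial count of 100 and the function name `binom_two_sided_p` are assumptions.

```python
from math import comb


def binom_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a chance rate p.

    Sums the probability of every outcome that is no more likely
    than the observed one (the usual "small p-values" method).
    """
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so floating-point ties are counted.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical trial count: the article reports percentages only.
ASSUMED_TRIALS = 100

p_high = binom_two_sided_p(73, ASSUMED_TRIALS)  # a 73% win rate
p_near = binom_two_sided_p(56, ASSUMED_TRIALS)  # a 56% win rate
print(f"73/100 vs chance: p = {p_high:.2g}")
print(f"56/100 vs chance: p = {p_near:.2g}")
```

Under these assumed counts, a 73% win rate is clearly above chance while a 56% win rate is not distinguishable from it. The real conclusion depends on the actual number of trials reported in the paper.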
The study invites the scientific community to reassess the traditional interpretation of the Turing Test. As Ben Bergen, a cognitive science professor at UC San Diego and co-author of the study, explains, the test was originally conceived as a challenge to replicate human intelligence. However, with AI systems now surpassing humans in knowledge access and speed, success on the test appears to measure “humanlikeness” in social and behavioral communication rather than raw cognitive power.
A key factor behind this result is the use of persona prompts, which instruct language models to adopt specific human characters and communication styles. The models succeed not by displaying perfect logic or knowledge but by emulating human idiosyncrasies and imperfections, including making mistakes, behaviors not generally associated with the problem-solving intelligence that Turing originally had in mind. This deliberate fallibility contributes to their perceived authenticity and changes what the test actually measures in conversational AI.
The researchers further stress the role that prompt instructions play in making AI systems humanlike. Without explicit persona directives, models like GPT-4.5 and LLaMa-3.1 showed dramatically reduced “win rates,” dropping to roughly 36-38%, while baseline models remained low. This contrast suggests that although LLMs have a latent capacity for human-like interaction, they need carefully crafted human guidance to display it reliably in dialogue.
In practical terms, these findings have serious implications for online trust and interaction safety. As Cameron Jones reflects, the ability of AI systems to convincingly impersonate humans over extended conversations calls for heightened vigilance in digital communication, because individuals can no longer be sure they are engaging with real people. This raises new risks in cybersecurity, misinformation, and social engineering, as AI systems become adept at what Jones calls “a game about lying.”
Ben Bergen echoes these concerns, emphasizing the tangible danger of counterfeit personas in digital environments. The difficulties in reliably distinguishing human from AI interlocutors enable malicious actors to exploit AI bots in manipulating behavior, extracting sensitive information, influencing political opinions, or promoting commercial interests covertly. The blurring boundaries between human and artificial agents thus prompt urgent calls for robust ethical frameworks and detection safeguards in AI deployment.
To conduct this landmark study, the researchers developed a custom online messaging interface designed to simulate familiar chat experiences, simplifying participant engagement. Interrogators utilized split-screen setups, allowing them to simultaneously converse with both the AI and human “witnesses,” with equal opportunity to probe each entity. Following conversations lasting 5 or 15 minutes, interrogators rendered their judgments on which correspondent was the genuine human, enabling precise statistical analysis of AI indistinguishability.
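The judgment step described above reduces to a simple tally: each trial yields one verdict, and a model's "win rate" is the share of trials in which the interrogator picked the AI witness as the human. A minimal sketch of that bookkeeping follows. The `Verdict` record, the `win_rates` function, and the model labels are hypothetical names chosen for illustration, and the sample verdicts are made up, not data from the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One three-party trial: which AI was the witness, and the outcome."""
    ai_model: str
    picked_ai_as_human: bool  # True if the interrogator chose the AI


def win_rates(verdicts: list[Verdict]) -> dict[str, float]:
    """Fraction of trials per model in which the AI was judged human."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for v in verdicts:
        totals[v.ai_model] += 1
        if v.picked_ai_as_human:
            wins[v.ai_model] += 1
    return {model: wins[model] / totals[model] for model in totals}


# Made-up verdicts for two placeholder models (not study data).
example = [
    Verdict("model-a", True),
    Verdict("model-a", True),
    Verdict("model-a", False),
    Verdict("model-a", True),
    Verdict("model-b", False),
    Verdict("model-b", False),
    Verdict("model-b", True),
    Verdict("model-b", False),
]

rates = win_rates(example)
for model, rate in sorted(rates.items()):
    print(f"{model}: judged human in {rate:.0%} of trials")
```

Because every trial pairs exactly one AI with one human, a win rate near 50% means interrogators were effectively guessing, and a rate above 50% means the AI was chosen more often than the human it was paired with.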
To check that the results generalize, the study drew on participant pools from both a local academic community and a broader online population, totaling nearly 500 individuals. Notably, the UC San Diego undergraduates were marginally better at detecting the AI, likely because shared cultural and contextual knowledge allowed them to probe their conversational partners more deeply. This difference illustrates how social context and shared experience shape the outcome of a Turing Test.
In closing, the full Turing Test experience as operationalized in this study is accessible via the public platform turingtest.live, inviting broader public engagement and transparency around AI conversational capabilities. This openness complements the research’s ambition not only to chart the evolving frontiers of AI but also to foster informed societal discourse around its profound implications.
Subject of Research: People
Article Title: Large Language Models Pass a Standard Three-Party Turing Test.
News Publication Date: 19-May-2026
Web References: https://www.pnas.org/doi/10.1073/pnas.2524472123, https://turingtest.live
References: Cameron Jones et al., Proceedings of the National Academy of Sciences, DOI: 10.1073/pnas.2524472123
Keywords: Artificial intelligence, Generative AI, Large language models, Turing Test, Human-computer interaction, Machine deception, Cognitive science
