In a groundbreaking endeavor that pushes the boundaries of scientific rigor in the social and behavioral sciences, the Systematizing Confidence in Open Research and Evidence (SCORE) program has unveiled a monumental body of findings. Produced by a collaborative network of 865 researchers worldwide, the findings have been published as a suite of papers in the prestigious journal Nature, accompanied by five supplementary preprints. The work represents one of the most ambitious examinations to date of the reproducibility, robustness, and replicability of research: crucial pillars that underpin scientific credibility but have remained challenging to assess systematically across disciplines.
The SCORE initiative, funded by the U.S. Defense Advanced Research Projects Agency (DARPA), undertook an unprecedented multi-method study that evaluated nearly 3,900 research claims drawn from 62 leading journals spanning a broad spectrum of the quantitative social and behavioral sciences, including criminology, economics, education, health, political science, psychology, and sociology, among other fields. By analyzing research published during the ten-year period from 2009 to 2018, SCORE offers sweeping insights into how reliably scientific findings hold up under rigorous scrutiny. This massive dataset and the associated analyses provide critical evidence to guide future research practices, policymaking, and public understanding.
At the heart of SCORE is the nuanced unpacking of scientific repeatability into three distinct but interrelated dimensions: reproducibility, robustness, and replicability. Reproducibility examines whether an independent team can recreate the exact same results by running the original analysis on the original data, thereby testing transparency and clarity of methods. Robustness probes whether varying analytical approaches on the same data yield consistent findings, assessing the stability of conclusions amid reasonable alternative methods. Replicability extends the challenge beyond the original dataset, asking whether similar results emerge when the same hypotheses are tested using new, independent data. The program’s development of precise definitions and frameworks for these concepts helps standardize dialogue around scientific trustworthiness.
The first major SCORE paper, authored by Miske et al. and involving 127 co-researchers, revealed sobering realities about data availability and transparency. Data necessary to assess reproducibility was accessible in only about one-quarter of examined papers, severely limiting the ability to verify original findings. Among papers where data was obtainable, 74% showed at least approximate reproducibility—but only 54% were reproduced precisely. Importantly, the probability of successful reproduction was strongly tied to sharing both data and code, highlighting the critical role of openness in research workflows. Conversely, when analysts had to reconstruct data from public sources independently, reproducibility rates plummeted, underscoring the challenges faced when original materials are not fully disclosed.
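To make the distinction between precise and approximate reproduction concrete, here is a minimal sketch in Python that classifies a recomputed statistic against the reported value. The function name, decimal precision, and tolerance below are illustrative assumptions for exposition, not the criteria defined by Miske et al.

```python
# Illustrative only: hypothetical thresholds, not the SCORE team's actual criteria.
def classify_reproduction(reported: float, recomputed: float,
                          exact_decimals: int = 2,
                          approx_tolerance: float = 0.05) -> str:
    """Label a re-run result as a precise match, an approximate match, or a failure."""
    if round(recomputed, exact_decimals) == round(reported, exact_decimals):
        return "precise"            # matches the published value to reporting precision
    if abs(recomputed - reported) <= approx_tolerance:
        return "approximate"        # close enough under a (hypothetical) tolerance
    return "not reproduced"

print(classify_reproduction(reported=0.42, recomputed=0.418))  # precise
print(classify_reproduction(reported=0.42, recomputed=0.46))   # approximate
print(classify_reproduction(reported=0.42, recomputed=0.60))   # not reproduced
```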
Building upon this, Aczel et al. conducted the largest-ever robustness evaluation of social science research, in which at least five independent analysts reanalyzed the data from each of 100 papers, each exercising their own judgment in selecting analytical methods. The findings were revealing: fewer than 35% of these reanalyses matched the original findings within a strict threshold of effect-size similarity. When the tolerance window was relaxed, agreement rose to 57%, indicating considerable variation in conclusions depending on analytical choices. Critically, while 74% of analyses reached the same overall conclusions as the initial papers, nearly a quarter produced inconclusive or null evidence, and a small fraction (2%) pointed in the opposite direction. These results illuminate the intrinsic uncertainty and methodological flexibility embedded in data interpretation.
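The strict-versus-relaxed tolerance comparison can be pictured with a small sketch: count how many independent reanalyses land within a window around the original effect size. All numbers below (the original estimate, the analysts' estimates, and both tolerance windows) are hypothetical stand-ins, not values from Aczel et al.

```python
# Hypothetical numbers throughout; illustrates the agreement-rate idea, not the study's data.
from typing import Sequence

def agreement_rate(original_effect: float,
                   reanalysis_effects: Sequence[float],
                   tolerance: float) -> float:
    """Fraction of reanalyses whose effect size falls within +/- tolerance of the original."""
    hits = sum(abs(e - original_effect) <= tolerance for e in reanalysis_effects)
    return hits / len(reanalysis_effects)

original = 0.30                               # original paper's (made-up) effect size
reanalyses = [0.28, 0.34, 0.18, 0.31, -0.02]  # five analysts, five analytic paths
print(agreement_rate(original, reanalyses, tolerance=0.05))  # strict window  -> 0.6
print(agreement_rate(original, reanalyses, tolerance=0.15))  # relaxed window -> 0.8
```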
Perhaps most strikingly, the replicability study, detailed by Tyner et al., confronted the formidable challenge of testing research hypotheses against entirely new datasets. Out of 164 replication attempts, only about half (49%) fulfilled the common criterion of successful replication: a statistically significant effect in the same direction as originally reported. Moreover, the observed effect sizes in replication studies were less than half those reported initially (a correlation of 0.10 compared with 0.25), signaling potential inflation of original findings or contextual dependencies. This stark outcome underscores the difficulty of generalizing scientific claims and emphasizes the need for cautious interpretation of novel results before building policy or theory on potentially unstable foundations.
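For readers unfamiliar with that criterion, the sketch below applies it to a hypothetical original/replication pair: the replication counts as successful only if the new estimate is statistically significant and points in the same direction as the original. The simulated data, seed, and alpha level are assumptions for illustration; this is not the Tyner et al. analysis code.

```python
# A toy check of the "significant and same direction" replication criterion.
# Everything here (data, alpha, the original r of 0.25) is hypothetical.
import numpy as np
from scipy import stats

def replication_outcome(original_r: float, x: np.ndarray, y: np.ndarray,
                        alpha: float = 0.05) -> dict:
    """Apply the common replication-success criterion to new data."""
    rep_r, p_value = stats.pearsonr(x, y)            # effect size and p-value in the new sample
    same_direction = np.sign(rep_r) == np.sign(original_r)
    return {"replication_r": round(float(rep_r), 3),
            "p_value": round(float(p_value), 4),
            "successful": bool(same_direction and p_value < alpha)}

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y = 0.10 * x + rng.normal(size=n)                    # a weaker true effect, echoing the attenuation SCORE reports
print(replication_outcome(original_r=0.25, x=x, y=y))
```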
Alongside these pivotal empirical contributions, the SCORE program critically interrogated the potential to forecast replicability. Two human-centered prediction methods—the repliCATS project and Replication Markets—demonstrated encouraging accuracy, with success rates exceeding 75%. These crowdsourcing approaches leverage expert judgment and collective intelligence to anticipate which findings are likely to stand the test of replication. In contrast, automated machine learning systems, including Synthetic Markets, MACROSCORE, and A+, showed inconsistent performance, suggesting that current algorithmic techniques require refinement before they can reliably supplant human intuition in replication forecasting.
Underpinning SCORE is a commitment to fostering a culture of transparency and open science. Its outputs, encompassing an extensive publicly accessible database, analytical code, and project materials, invite the broader scientific community to explore, validate, and extend these findings. This openness not only advances reproducibility but also accelerates innovation by enabling others to develop improved tools and indicators for assessing research credibility. As noted by project leaders such as Tim Errington and Fiona Fidler, the program’s immense scope and accumulated data spotlight the complexity and labor intensity of converting initial discoveries into robust knowledge.
The multidimensional nature of scientific credibility uncovered by SCORE resists simple reductionist metrics; credibility is shown to be multifaceted and context-dependent. No single measure suffices to capture whether a research claim is genuinely trustworthy. This raises pressing questions about how peer review, funding decisions, and public communication of science might evolve to better accommodate uncertainty and diverse forms of evidence validation. Encouragingly, the program identified fields such as economics and political science as exhibiting relatively greater data availability and reproducibility, a likely result of their evolving open-data mandates and transparency initiatives, which provide a blueprint for other disciplines seeking to enhance research robustness.
Importantly, SCORE’s dataset relates to research produced between 2009 and 2018, a period before many recent reforms aimed at improving research transparency had fully taken root. It is plausible that ongoing efforts, such as journal policies enforcing mandatory data and code sharing and integrating formal reproducibility checks into editorial workflows, will yield higher repeatability in subsequent cohorts of research. Continuous monitoring and follow-up studies building on SCORE’s foundation will be crucial for evaluating the effectiveness of these interventions and guiding best practices.
In reflecting on the program’s findings, it becomes evident that scientific progress is inherently iterative and challenging. Confirming the reliability and generalizability of findings demands sustained collaborative effort across disparate researcher communities, methodologies, and disciplines. The SCORE team’s work provides a candid exposition of the hurdles and opportunities in social and behavioral science research. At the same time, it offers a roadmap—replete with rich empirical evidence and open tools—to strengthen the scientific enterprise’s reliability, ultimately ensuring research findings more effectively inform societal decisions and innovation.
The SCORE initiative embodies a paradigm shift in how scientific confidence is measured and communicated. By rigorously differentiating reproducibility from robustness and replicability, it fosters a more sophisticated appreciation of what it means for research to be credible. Its unprecedented scale and methodological diversity, encompassing expert human judgment and emerging machine learning techniques, demonstrate both the promise and current limits of approaches to augment transparency and trust. As a living legacy, SCORE catalyzes ongoing inquiry into research evaluation, inspiring the next generation of open science reforms and technological aids in the quest to establish enduring scientific truths.
As shared by Sarah Rajtmajer, a leading figure in the program, the collaborative nature of SCORE highlights the power of large-scale cooperative research endeavors. Mobilizing nearly 900 contributors to interrogate scientific findings at this level of granularity sets a new standard for collective scholarly responsibility. The open release of SCORE’s data and methodologies invites global research communities to engage critically and constructively with these results, driving forward improvements in research quality and ultimately enriching the foundations of knowledge.
Subject of Research:
Scientific Credibility and Repeatability in Social and Behavioral Sciences
Article Title:
Systematizing Confidence in Open Research and Evidence (SCORE): Empirical Insights into Reproducibility, Robustness, and Replicability in the Social Sciences
News Publication Date:
April 1, 2026
Web References:
- SCORE Program Overview: http://cos.io/score
- Reproducibility Paper by Miske et al.: https://doi.org/10.1038/s41586-026-10203-5
- Robustness Paper by Aczel et al.: https://doi.org/10.1038/s41586-025-09844-9
- Replicability Paper by Tyner et al.: https://doi.org/10.1038/s41586-025-10078-y
- SCORE Preprints Collection: https://osf.io/preprints/metaarxiv/
References:
All cited papers are published or available as preprints linked above.
Keywords:
SCORE, reproducibility, robustness, replicability, open science, social sciences, behavioral sciences, research credibility, replication crisis, DARPA, transparency, data sharing, machine learning, scientific integrity

