In a study recently published in Nature, a research team led by Zhen-Xing Endowed Professor Jian Yang at Westlake University’s School of Life Sciences has introduced a genome assembly method designed for population-scale pangenomics. Using a cost-effective hybrid sequencing approach that combines long-read and short-read data, the team constructed a Chinese pangenome of more than 1,100 diploid genomes. The work overcomes the sample-size constraints that have limited the utility of earlier pangenomes and sets a new benchmark for biomedical and population genetics research infrastructure.
Since the landmark completion of the Human Genome Project, the biomedical field has relied primarily on single linear reference genomes, such as the widely used GRCh38, as the standard template for genetic studies. While invaluable, these singular consensus sequences are inherently limited by their inability to capture the breadth of human genetic diversity. Human populations exhibit immense variability, with complex genetic variations including structural variants (SVs) and tandem repeats (TRs) often escaping detection within traditional single-reference frameworks. Recognizing these limitations has driven researchers toward the conceptual framework of a pangenome—a comprehensive genomic repository that embodies the collective genetic landscape of a population or species, thereby encompassing variants that are rare or absent in standard references.
Despite advances in long-read sequencing technologies that enable high-fidelity diploid genome assemblies, the prohibitive costs have historically restricted pangenomic sampling to narrow cohorts numbering only a few dozen individuals. Such small datasets lack sufficient statistical power to accurately discern variant frequencies or resolve variants existing at low population frequencies and within highly complex genomic regions. The urgent need for scalable and cost-efficient sequencing methodologies has become a critical bottleneck in the large-scale exploration and functional characterization of genomic diversity.
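The link between cohort size and the ability to observe low-frequency variants can be illustrated with a short calculation. The Python sketch below is only an illustration: it assumes haplotypes are sampled independently, uses a hypothetical cohort of 40 individuals to stand in for "a few dozen", and the function name and allele frequencies are chosen for the example rather than taken from the study.

```python
def prob_variant_observed(allele_freq: float, n_individuals: int) -> float:
    """Probability that at least one copy of an allele is present among
    the 2 * n_individuals haplotypes of a diploid cohort, assuming each
    haplotype carries the allele independently with probability allele_freq."""
    n_haplotypes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_haplotypes


# 40 is a hypothetical "few dozen" cohort; 1116 is the cohort size reported in the article.
for n in (40, 1116):
    for freq in (0.01, 0.001):
        p = prob_variant_observed(freq, n)
        print(f"n={n:5d}  allele frequency={freq:.3f}  P(observed at least once)={p:.3f}")
```

Under these assumptions, a variant at 0.1% frequency has roughly an 8% chance of appearing at all in a 40-person cohort, compared with about 89% in 1,116 individuals. Observing a variant once is still far from estimating its frequency accurately, so the sketch understates the benefit of the larger sample.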
To address this challenge, Jian Yang’s research group, known for its methodological work in statistical genetics, computational genomics, and big data analytics for human complex traits, developed the Pangenome-Informed Genome Assembly (PIGA) workflow. Unlike traditional de novo assembly methods, which depend solely on sequencing data from each individual genome, PIGA uses a pangenome-guided framework that integrates sequencing information across the entire cohort. The approach relies on a hybrid sequencing strategy that combines moderate-coverage Illumina short reads with PacBio long reads, reducing sequencing costs without compromising assembly quality. This makes it possible to assemble high-quality genomes from modest-coverage data and offers a practical, scalable path for future population-scale hybrid sequencing projects.
Applying the PIGA workflow, the researchers constructed the largest human pangenome published to date, the 1,000 Chinese Pangenome, comprising 1,116 diploid assemblies with a mean quality value (QV) of 46. The collection yielded 405.3 million base pairs of non-reference sequence absent from current leading human references such as GRCh38 and the newer CHM13 assembly. Within this non-reference fraction, the team annotated 26.2 million base pairs as functional genic elements and predicted regulatory regions. This expansion into non-reference sequence substantially extends the view of genomic complexity beyond canonical references.
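To put these figures in perspective, the short Python sketch below converts the reported QV into an approximate per-base error rate and computes the annotated share of the non-reference sequence. It assumes QV follows the usual Phred convention, QV = -10 * log10(error probability), and it treats the cohort's mean QV as if it were the QV of a single assembly, which is a simplification. The function and variable names are illustrative.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


mean_qv = 46            # mean QV reported in the article
non_ref_bp = 405.3e6    # non-reference sequence reported, in base pairs
annotated_bp = 26.2e6   # portion annotated as genic or predicted regulatory

error_rate = qv_to_error_rate(mean_qv)
bases_per_error = 1 / error_rate
annotated_fraction = annotated_bp / non_ref_bp

print(f"Per-base error rate at QV {mean_qv}: {error_rate:.2e}")
print(f"Roughly one error every {bases_per_error:,.0f} bases")
print(f"Annotated share of non-reference sequence: {annotated_fraction:.1%}")
```

Under the Phred assumption, QV 46 corresponds to about 2.5 errors per 100,000 bases, or roughly one error per 40,000 bases, and the annotated portion comes to about 6.5% of the non-reference sequence.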
Harnessing such a comprehensive and high-quality dataset, the team generated an exhaustive catalog of genomic variation spanning multiple variant types and scales. This included more than 35 million small variants alongside a vast repertoire of complex structural variants (110,530 SVs), tandem repeats (485,575 TRs), and nearly 860,000 nested variants embedded within non-reference sequences. This rich catalog provides an invaluable resource for dissecting genomic variation with greater resolution than previously possible, capturing the multifaceted nature of human genomic architecture.
Of particular clinical importance, the researchers mapped medically relevant variations at diverse scales, illustrating how the 1,000 Chinese Pangenome (1KCP) variant catalog serves as a critical reference for clinical genetics. Noteworthy findings included gene-altering structural variants that disrupt coding sequences, expansions of pathogenic tandem repeats implicated in disease, variable gene clusters, and highly polymorphic human leukocyte antigen (HLA) gene haplotypes that affect immune function. Collectively, these insights herald significant improvements in the screening and understanding of pathogenic mutations across populations.
Complementing the genomic identification of variants, the study integrated extensive gene expression data to perform pan-variant expression quantitative trait loci (eQTL) mapping. This innovative analysis uncovered 3,256 eQTLs linked to complex variant types—specifically structural variants, tandem repeats, and nested variants—highlighting the elaborate regulatory interplay between diverse forms of genetic variation and gene expression. These findings underscore the complexity embedded within regulatory landscapes sculpted by intricate genomic architectures.
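For readers unfamiliar with eQTL mapping, the core idea can be sketched as a regression of a gene's expression level on genotype dosage at a variant. The Python example below is a minimal illustration on simulated data, not the study's pipeline: the allele frequency, effect size, sample size, and function name are all assumptions made for the example, and a real analysis would also adjust for covariates and correct for multiple testing.

```python
import math
import random


def eqtl_association(dosages, expression):
    """Fit expression = intercept + beta * dosage by ordinary least squares.

    Returns (beta, t_statistic) for the slope.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares, used to estimate the standard error of the slope.
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, expression))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta / standard_error


# Simulated cohort (all values hypothetical): each person carries 0, 1, or 2
# copies of a structural-variant allele that shifts expression by true_effect.
rng = random.Random(0)
n_samples = 500
allele_freq = 0.3
true_effect = 0.5
sv_dosage = [sum(rng.random() < allele_freq for _ in range(2)) for _ in range(n_samples)]
expression = [true_effect * d + rng.gauss(0.0, 1.0) for d in sv_dosage]

beta, t_stat = eqtl_association(sv_dosage, expression)
print(f"estimated effect per allele copy = {beta:.3f}, t statistic = {t_stat:.2f}")
```

The same regression applies whether the dosage counts copies of a small variant, a structural variant, or a tandem-repeat allele; what changes in practice is how reliably each variant type can be genotyped.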
The implications of this work extend broadly beyond the immediate findings. By pioneering a cost-effective hybrid sequencing approach married to a cohort-wide pangenome-guided framework, this study charts a transformative course for future genomic investigations across human populations and other species alike. The ability to characterize complex genomic variation comprehensively—and to functionally elucidate its biological impact—opens new frontiers in precision medicine, evolutionary biology, and genetic epidemiology.
Ph.D. student Yifei Wang and Research Assistant Professor Zhongqu Duan share co-first authorship on the study, with Professor Jian Yang serving as the senior author. The ambitious project received financial support from key national funding bodies, including the National Natural Science Foundation of China, the National Key R&D Program, and Zhejiang’s “Pioneer & Leading Goose” Program, alongside the New Cornerstone Science Foundation. High-performance computation for the analyses was facilitated by the Westlake University High-Performance Computing Center.
Professor Jian Yang’s research group continues to push the boundaries of statistical genetics and bioinformatics method development, probing the genetic underpinnings of complex diseases through large-scale multi-omic population datasets. Their mission is to unravel the intricate genetic architecture governing disease traits, developing translational approaches that advance diagnostic accuracy, therapeutic target discovery, and tailored precision medicine strategies for improved human health.
This publication sets a new standard for human pangenome assembly and demonstrates that combining cost-effective hybrid sequencing with innovative computational frameworks can capture the genomic diversity needed for biomedical discovery.
Article Title: The 1000 Chinese Pangenome empowers medical and population genetics
News Publication Date: 1-Apr-2026
DOI: 10.1038/s41586-026-10315-y
Image Credits: Jian Yang Lab at Westlake University
Keywords: Computational biology, genome assembly, pangenome, structural variants, tandem repeats, population genetics, hybrid sequencing, genomic diversity, precision medicine

