Unraveling the Intricacies of Human Genomic Variation with Near-Complete Long-Read Assemblies
Researchers have used long-read genome assemblies to dissect the most complex forms of structural variation in the human genome. These alterations, termed complex structural variants (CSVs), are single genomic events composed of simpler structural variants that span multiple repair junctions, often within highly repetitive DNA. The approach exposes a previously hidden layer of genomic complexity, with implications for understanding human evolution, disease susceptibility, and genomic instability.
CSVs have long been hard to identify because they tend to arise in genomic regions dense with segmental duplications (SDs) and mobile element insertions (MEIs). These repetitive sequences confound traditional sequencing and mapping, masking the true architecture of the variants. The research team upgraded the existing PAV variant-calling tool to increase its sensitivity for CSVs embedded in these large, complex repeats. Applying the enhanced method against the telomere-to-telomere CHM13 reference genome, they detected an average of 72 CSVs per genome, amounting to 1,247 distinct CSV events with 128 unique complex reference signatures across human populations.
Further analysis showed that a substantial fraction of these CSVs include local sequence duplications and inversions: approximately 27% exhibited duplications and 38% contained inversions. Many of the variants are mediated by SDs and have elaborate architectures, such as INVDUP-INV-DEL, DEL-INV-DEL, and INVDUP-INV-INVDUP, which combine deletions, inversions, and duplications in a single event. One notable example involves the NOTCH2NL and NBPF gene families, loci tied to the expansion of the human brain over its evolutionary history.
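To make the signature notation concrete, the short Python sketch below splits a signature such as INVDUP-INV-DEL into its component variant types and checks whether it contains a duplication or an inversion. The helper names, and the rule that INVDUP counts toward both categories, are illustrative assumptions rather than the conventions used by PAV or by the study.

```python
# Illustrative sketch only: the helper names and the classification rule
# below are assumptions, not PAV's actual implementation.

def parse_signature(signature: str) -> list[str]:
    """Split a hyphen-delimited CSV signature into its component types."""
    return [part for part in signature.upper().split("-") if part]


def has_duplication(signature: str) -> bool:
    """True if any component is a duplication (assumed: DUP or INVDUP)."""
    return any(part.endswith("DUP") for part in parse_signature(signature))


def has_inversion(signature: str) -> bool:
    """True if any component is inverted (assumed: INV or INVDUP)."""
    return any(part.startswith("INV") for part in parse_signature(signature))


# The three example architectures named in the text.
signatures = ["INVDUP-INV-DEL", "DEL-INV-DEL", "INVDUP-INV-INVDUP"]
for sig in signatures:
    print(sig, parse_signature(sig), has_duplication(sig), has_inversion(sig))
```

Under these assumed rules, all three example signatures contain an inversion, and only DEL-INV-DEL lacks a duplication.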
CSVs at the NOTCH2NL and NBPF loci could not previously be resolved with conventional methods such as optical mapping. The assemblies now reveal at least three distinct haplotypes: a reference haplotype with a 13.7% allele frequency; a 930-kilobase inversion-deletion variant, found at 35.9% frequency, that affects NBPF8 and deletes NOTCH2NLR and NBPF26; and a 513-kilobase variant, seen in over half of sampled genomes, in which a distal template switch replaces NBPF8 with NBPF9. These findings underscore the variability of human haplotypes and provide precise molecular characterization of loci previously obscured by genomic complexity.
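As a quick consistency check on the reported frequencies, the sketch below computes the share left over once the two stated percentages are subtracted. It assumes that all frequencies use the same denominator and sum to 100%, which the text does not state explicitly.

```python
# Frequencies as reported in the text (percent).
reference_pct = 13.7
inv_del_pct = 35.9

# Share left for the third haplotype plus any rarer ones.
# Assumption: the frequencies share one denominator and sum to 100%.
remaining_pct = round(100.0 - reference_pct - inv_del_pct, 1)
print(remaining_pct)  # 50.4
```

The remainder of 50.4% is consistent with the third variant being seen in "over half" of the sample, provided no further haplotypes account for part of that share.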
The study’s scope extends beyond CSVs to structurally challenging gene regions implicated in disease. One prime example is the SMN locus, comprising SMN1 and SMN2 gene copies, central to spinal muscular atrophy pathogenesis and therapeutic targeting. These genes reside within an approximately 1.5 megabase segmental duplication hotspot, historically resistant to full sequence resolution. By successfully assembling and validating 101 complete haplotypes, the team achieved comprehensive characterization of SMN1/2 copy number and structure, alongside related genes such as SERF1A/B, NAIP, and GTF2H2/C.
Nearly half of these haplotypes carry exactly two copies of SMN1/2 and of the associated genes in the cluster, reflecting a conserved genomic architecture. Deviations from this copy number illustrate the extent of structural diversity at the locus and may have clinical ramifications. Comparison with the short-read genotyping tools Parascopy and SMNCopyNumberCaller confirmed the copy number calls derived from the long-read assemblies, supporting their reliability. The analysis also uncovered rare haplotypes lacking SMN1, which may represent genomic configurations that predispose individuals to disease through mechanisms such as interlocus gene conversion.
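A comparison of this kind can be sketched as a simple concordance calculation. The example below assumes that assembly-derived copy numbers are available per haplotype and that the short-read tools report one total per sample. The sample names and values are placeholders, not data from the study, and neither helper is part of Parascopy or SMNCopyNumberCaller.

```python
# Illustrative sketch: sample names and copy numbers are placeholders,
# not data from the study, and both helpers are hypothetical.

def diploid_copy_number(hap_copies: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Sum the two haplotype-level copy numbers for each sample."""
    return {sample: h1 + h2 for sample, (h1, h2) in hap_copies.items()}


def concordance(assembly_calls: dict[str, int], caller_calls: dict[str, int]) -> float:
    """Fraction of shared samples where both methods report the same count."""
    shared = sorted(set(assembly_calls) & set(caller_calls))
    if not shared:
        raise ValueError("no samples in common")
    matches = sum(assembly_calls[s] == caller_calls[s] for s in shared)
    return matches / len(shared)


# Placeholder SMN1 copies per haplotype (assembly) and per sample (short reads).
assembly_haplotypes = {
    "sample_A": (1, 1),
    "sample_B": (2, 1),
    "sample_C": (0, 1),
    "sample_D": (1, 1),
}
short_read_calls = {"sample_A": 2, "sample_B": 3, "sample_C": 2, "sample_D": 2}

assembly_calls = diploid_copy_number(assembly_haplotypes)
print(concordance(assembly_calls, short_read_calls))  # 0.75
```

Summing the two haplotypes first puts the assembly calls on the same per-sample scale that a diploid short-read caller would report.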
Turning to other complex, multi-copy genes, the team examined the amylase locus on chromosome 1, which contains the genes AMY1A, AMY1B, AMY1C, AMY2A, and AMY2B. This locus spans over 200 kilobases and shows high structural variability that has been linked to dietary adaptation and metabolic phenotypes. Analysis of 65 fully resolved genome assemblies yielded 39 distinct amylase haplotypes, covering a significant majority of the population’s haplotype diversity. Haplotype lengths range from ~111 kb to over 580 kb, reflecting the expansion and contraction events that have shaped this locus.
Four of these haplotypes dominate, together making up over half of all observed haplotypes, which shows that modern humans carry a mix of common and rare structural configurations. The study also fully resolves the largest known amylase haplotype, which contains eleven tandem AMY1 gene copies and had previously been only partially characterized by optical genome mapping. Resolving such intricate haplotypes clarifies the evolutionary history and functional implications of regions that were previously inaccessible.
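The "top four make up over half" observation is a cumulative-share calculation over haplotype counts. The sketch below shows one way to compute it. The labels H1 to H9 and their counts are invented placeholders, not the study's haplotype assignments.

```python
from collections import Counter


def top_k_share(haplotype_labels: list[str], k: int) -> float:
    """Fraction of all observed haplotypes belonging to the k most common."""
    counts = Counter(haplotype_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no haplotypes provided")
    return sum(n for _, n in counts.most_common(k)) / total


# Placeholder labels: four common structures plus five singletons.
labels = ["H1"] * 4 + ["H2"] * 3 + ["H3"] * 2 + ["H4"] * 2
labels += ["H5", "H6", "H7", "H8", "H9"]

print(top_k_share(labels, 4))  # 0.6875
```

In this toy data set the four most common structures account for 11 of 16 haplotypes, while the five singletons stand in for the long tail of rare configurations.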
The collective impact of this research lies in bridging the gap between reference-quality genome assemblies and the nuanced, individualized structural variation defining human genomic diversity. By applying refined computational tools to ultra-long read data, the study surfaces a wealth of complex variation concealed within difficult genomic terrain. The resulting catalogs not only expand our understanding of structural genomic diversity but also provide invaluable resources for future studies dissecting genotype-to-phenotype relationships, disease mechanisms, and evolutionary history.
Furthermore, the delineation of distinct haplotype structures across populations paves the way for population-scale assessments of genetic risk, improving the resolution of genetic diagnostics and personalized medicine. For disorders like spinal muscular atrophy, where gene copy number and arrangement are crucial for prognosis and therapy, such precision genomics can revolutionize patient care. Moreover, the ability to phase and characterize complex loci across thousands of genomes opens novel vistas in evolutionary biology, functional genomics, and bioinformatics.
By overcoming the limitations of short-read sequencing and the ambiguity of optical mapping, this integrative approach represents a paradigm shift in human genomics. It underscores the critical need for high-fidelity, contiguous genome assemblies capturing the full spectrum of structural variation. These insights reveal the inherent plasticity of the human genome and its capacity for generating complexity that shapes both health and disease.
As the field marches toward comprehensive population-scale long-read sequencing efforts, tools like the enhanced PAV and tailored assembly pipelines will be indispensable. They empower researchers to not only detect but precisely delineate the architecture of complex variants. Ultimately, this progress enriches the foundational knowledge of human genetic variation and drives forward the promise of truly personalized genomics.
Subject of Research: Complex structural variants and genomic diversity in near-complete human genome assemblies
Article Title: Complex genetic variation in nearly complete human genomes
Article References:
Logsdon, G.A., Ebert, P., Audano, P.A. et al. Complex genetic variation in nearly complete human genomes. Nature (2025). https://doi.org/10.1038/s41586-025-09140-6