A study published in Nature reports that the human proteome includes a large set of microproteins and peptideins originating from non-canonical open reading frames (ncORFs). The biological significance of ncORFs and their protein products has long been difficult to assess, primarily because their patterns of evolutionary conservation are elusive. The work, led by Deutsch, Kok, Mudge, and colleagues, challenges traditional assumptions by introducing a computational approach, ORBL, that captures evolutionary constraints specific to ORF structure and so offers a new way to assess protein-coding potential in the human genome.
At the core of this research is the concept of ‘ORFness’: an evolutionary signature defined by the conserved presence of an initiation codon, a termination codon, and an uninterrupted reading frame across species, independent of amino acid sequence conservation. Using multispecies whole-genome alignments, ORBL quantifies this signature to identify ORFs whose structure has been preserved through evolution. The method addresses a longstanding gap: conventional conservation metrics such as PhyloCSF often miss constraints that are not evident at the amino acid level, especially in short or recently emerged ncORFs.
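To make the ORFness criterion concrete, here is a minimal Python sketch of the per-species check it implies: does an aligned sequence still begin with a start codon, end with a stop codon, and read through in frame with no premature stop? It is an illustration under simplifying assumptions, not ORBL's implementation. The function name `has_intact_orf`, the ATG-only start codon, the gap handling, and the toy sequences are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_intact_orf(aligned_seq, start_codons=("ATG",)):
    """Return True if the sequence keeps a start codon, a terminal stop
    codon, and an uninterrupted reading frame between them.

    Assumption: `aligned_seq` is one species' row of an alignment over the
    human ORF, with '-' marking gaps. Gaps are simply removed, so a net
    insertion or deletion that is not a multiple of 3 counts as a frameshift.
    """
    seq = aligned_seq.replace("-", "").upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False  # too short, or the reading frame is broken
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in start_codons:
        return False  # initiation codon lost
    if codons[-1] not in STOP_CODONS:
        return False  # termination codon lost
    # No premature stop codon between the start and the final stop.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


# Toy alignment rows (invented for illustration, not real sequences).
toy_alignment = {
    "human": "ATGGCTAAATGA",  # intact
    "mouse": "ATGTCCCGGTAA",  # different amino acids, ORF still intact
    "dog": "ATGGCTTAATGA",    # premature stop codon
    "cow": "ATGGC-AAATGA",    # one-base deletion shifts the frame
}

for species, row in toy_alignment.items():
    print(species, has_intact_orf(row))
```

The human and mouse rows encode different amino acids, yet both pass the check; that is the sense in which ORFness is independent of amino acid sequence conservation.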
The ORBL tool outputs two metrics. ORBLv expresses the branch length spanned by species that maintain an intact ORF relative to the overall phylogenetic breadth. ORBLq is a constraint score that normalizes ORBLv by comparison against untranslated ORFs of similar biotype and length. Together, the two scores help distinguish genuine ORF conservation from incidental retention of nucleotide sequence, which is particularly common in short ORFs. The authors applied this framework to human protein-coding sequences alongside GENCODE-annotated ncORFs, revealing patterns that challenge previous assumptions about the evolutionary dynamics of microproteins.
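The two scores can be sketched in a few lines of Python. This is one plausible reading of the description above, not the published algorithm. It assumes that ORBLv is the branch length of the subtree connecting the species with an intact ORF, divided by the total branch length of the tree, and that ORBLq is the fraction of matched untranslated ORFs whose ORBLv falls below the candidate's. The function names, the toy tree, and the background values are hypothetical.

```python
def leaves_below(parent_of):
    """Map every non-root node to the set of leaf names in its subtree."""
    nodes = set(parent_of)
    internal = {parent for parent, _ in parent_of.values()}
    below = {node: set() for node in nodes}
    for leaf in nodes - internal:
        node = leaf
        while node in parent_of:  # walk up until the root is reached
            below[node].add(leaf)
            node = parent_of[node][0]
    return below


def orblv(parent_of, intact_species):
    """Branch length connecting the intact species, over total branch length.

    `parent_of` maps each non-root node to (parent, branch_length).
    A branch lies on the connecting subtree when intact species occur
    both below it and elsewhere in the tree.
    """
    below = leaves_below(parent_of)
    intact = set(intact_species)
    total = sum(length for _, length in parent_of.values())
    kept = sum(
        length
        for node, (_, length) in parent_of.items()
        if below[node] & intact and intact - below[node]
    )
    return kept / total


def orblq(candidate_orblv, background_orblv):
    """Fraction of matched untranslated ORFs scoring below the candidate."""
    lower = sum(1 for value in background_orblv if value < candidate_orblv)
    return lower / len(background_orblv)


# Toy tree with a total branch length of 1.0 (invented values).
TOY_TREE = {
    "human": ("primates", 0.1),
    "chimp": ("primates", 0.1),
    "primates": ("root", 0.2),
    "mouse": ("root", 0.6),
}
# Invented ORBLv values for untranslated ORFs of similar biotype and length.
BACKGROUND = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

print(orblv(TOY_TREE, {"human", "chimp"}))
print(orblq(0.48, BACKGROUND))
```

In the toy tree, an ORF intact only in human and chimp spans 0.2 of the total branch length. Under the quantile reading of ORBLq, about 10% of untranslated ORFs would exceed a score of 0.9 by construction, which is the baseline against which enrichment among ncORFs can be judged.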
Canonical protein-coding sequences (CDSs), notably shorter ones, displayed elevated ORBLv scores, and comparisons restricted to the primate clade yielded higher conservation than analyses over the broader placental mammal clade. This observation points to evolutionary pressures operating at different phylogenetic depths. In addition, ncORFs that overlap established CDSs, namely upstream overlapping ORFs (uoORFs), internal ORFs (intORFs), and downstream overlapping ORFs (doORFs), exhibited comparatively high ORBLv scores, likely reflecting constraint to preserve the integrity of the primary CDS.
Even so, conserved nucleotide sequence alone does not establish functional protein-coding potential. Applying ORBLq showed that a large proportion of ncORFs carry evolutionary signatures beyond chance expectation. Over 30% of analyzed ncORFs, including 45.8% of upstream ORFs (uORFs), attained ORBLq scores exceeding 0.9. This contrasts with the 10% expected for untranslated ORFs at that cutoff and suggests extensive purifying selection preserving ORF structure. The findings point to a substantial set of conserved regulatory uORFs that may contribute to proteomic diversity and functional complexity.
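As a rough worked comparison, the reported fractions can be expressed as fold enrichment over the 10% baseline for untranslated ORFs. The symbol E is introduced here for illustration only, and the first value is a lower bound because the source gives "over 30%" rather than an exact figure.

```latex
\[
E_{\text{ncORF}} > \frac{0.30}{0.10} = 3,
\qquad
E_{\text{uORF}} = \frac{0.458}{0.10} = 4.58
\]
```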
By contrast, traditional amino acid conservation metrics flagged few of these ORFs. Only 2.0% of ncORFs, including 2.5% of uORFs, surpassed a PhyloCSF score threshold indicative of strong protein-coding potential. This disparity underscores the sensitivity of ORBL’s ORFness-based analysis to evolutionary constraints that amino acid-centric approaches miss. It supports the proposition that many ncORFs may encode biologically relevant microproteins despite lacking pronounced amino acid sequence conservation.
The study further examined the functional relevance of these evolutionary signals by integrating immunopeptidomics data targeting HLA-I peptides, a proxy for microprotein detectability. Remarkably, ncORFs with microprotein products detected via immunopeptidomics possessed significantly higher ORBLq scores compared to undetected counterparts. This trend was particularly strong within the uORF and intORF categories, reinforcing the premise that evolutionary conservation of ORFness aligns closely with translational output and peptide presentation. Statistical analyses confirmed the robustness of these associations even after correcting for multiple testing.
Two examples illustrate this dynamic. The ncORF c8riboseqorf102 has a high ORBLq score (0.98), indicating that its ORF structure is conserved throughout placental mammals, yet it has a negative PhyloCSF score (-30), highlighting the disconnect between ORF-level and amino acid-level constraint. Its encoded microprotein was detected through immunopeptidomics, consistent with ORBL’s predictive capacity. Conversely, c11norep1, a recently emerged ncORF in the BET1L gene, lacks significant evolutionary conservation by ORBL measures but still produces detectable HLA-I peptides, suggesting a spectrum of proteomic emergence and functional adoption.
These data collectively argue for refining genomic annotations by incorporating evolutionary signatures of ORFness to reveal otherwise cryptic protein-coding elements. Microproteins derived from ncORFs emerge not as incidental translation noise but as evolutionarily selected features with potential regulatory and immunological roles. ORBL thus establishes a critical axis for deciphering the complexity of the human proteome beyond canonical gene models, with significant implications spanning molecular biology, immunology, and evolutionary genomics.
The integration of high-resolution comparative genomics with functional proteomics exemplifies a powerful approach to decoding the cryptic yet expansive layer of microprotein biology. This work importantly tempers the skepticism around ncORFs by providing quantitative evidence that a sizeable fraction is under purifying selection specifically against loss of ORF integrity. Such constraints likely reflect biologically meaningful roles, including in translational regulation, peptide-mediated signaling, or immune surveillance, inviting future research to unravel their mechanistic underpinnings.
Moreover, ORBL’s framework and its associated datasets open avenues to explore species-specific innovations and adaptive short ORF repertoires. The discovery of recently emerged ncORFs with detectable peptide products hints at ongoing genomic experimentation with novel peptide functions, potentially linked to lineage-specific traits or pathological states. Continued refinement and application of ORBL across diverse taxa promise to shed light on the evolution, function, and disease relevance of this emerging proteomic frontier.
In conclusion, Deutsch et al. make a compelling case that evolutionary conservation of ORF structure is an informative signal in microprotein biology. Their ORBL tool fills a methodological gap and broadens the conceptual framework for interpreting ncORFs, highlighting their role in expanding the functional human proteome. The research points toward a view of genomics and proteomics in which small yet evolutionarily preserved open reading frames are recognized as contributors to biological complexity and diversity.
Subject of Research:
Non-canonical open reading frames (ncORFs), microproteins, and their evolutionary conservation in the human genome.
Article Title:
Expanding the human proteome with microproteins and peptideins.
Article References:
Deutsch, E.W., Kok, L.W., Mudge, J.M. et al. Expanding the human proteome with microproteins and peptideins. Nature (2026). https://doi.org/10.1038/s41586-026-10459-x
