Known natural proteins occupy only a minute fraction of the vast space of possible protein sequences, a fact that reshapes how scientists approach protein design and evolutionary biology. This disparity between the observable protein world and the theoretical sequence space underpins a study recently published in the Proceedings of the National Academy of Sciences. An international collaboration led by researchers from the Okinawa Institute of Science and Technology (OIST), the Institute of Science and Technology Austria (ISTA), the University of Vienna, and the Centro de Astrobiología (CAB) has produced a computational model of the evolutionary forces and constraints that shape protein diversity and limit the exploration of sequence space.
Proteins are chains of amino acids, and the number of possible chains is astronomically large. However, a sequence is functionally viable only if it folds into a specific three-dimensional structure that mediates a precise biological activity. These requirements of foldability and function confine viable proteins to a small part of the broader sequence space. By mathematically formalizing the region of sequence space occupied by extant natural proteins, the researchers sought to understand how biological evolution constrains diversification and to test whether existing proteins adequately represent the possible universe of functional proteins.
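To give a sense of scale, the size of sequence space can be counted directly. The sketch below assumes the 20 standard amino acids and an arbitrarily chosen chain length of 100 residues; neither figure comes from the study, and the function name `sequence_space_size` is purely illustrative.

```python
# Illustrative count of possible protein sequences (not taken from the study).
# Assumptions: 20 standard amino acids; the chain length is chosen arbitrarily.
AMINO_ACIDS = 20


def sequence_space_size(length: int) -> int:
    """Return the number of distinct amino acid chains of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return AMINO_ACIDS ** length


if __name__ == "__main__":
    size = sequence_space_size(100)
    # Report the order of magnitude rather than the full integer.
    print(f"Chains of length 100: about 10^{len(str(size)) - 1}")
```

Under these assumptions, a chain of only 100 residues already admits roughly 10^130 possible sequences, which is why natural proteins can sample only a sliver of the space.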
The study’s premise challenges assumptions underlying many state-of-the-art artificial intelligence (AI) methods in protein engineering. Recent advances, notably AlphaFold, have revolutionized the prediction of protein structure from sequence, but these AI models are trained predominantly on the existing library of natural proteins. The question is whether this training data, inherently limited and biased by evolutionary history, can support the generation of truly novel and diverse proteins, or whether the models are intrinsically constrained by the depth and breadth of known protein sequences.
Central to the findings is the concept of “point-of-origin” effects, which profoundly influence the limits of protein diversification. Through sophisticated evolutionary simulations, the team demonstrated that the starting conditions of protein evolution—the ancestral sequences from which present proteins descend—impose far greater restrictions on exploring sequence space than traditionally appreciated evolutionary forces such as natural selection and epistasis. This implies that the evolutionary trajectory is heavily biased by historical contingencies, curtailing the diversification and exploration of sequence space.
Contrary to longstanding views that emphasize the dominant role of selection pressure and of epistasis (interactions between mutations), the model indicated that these factors play surprisingly minor roles in limiting the diversity of protein sequences. Instead, the heritage encoded in ancestral proteins emerges as the chief factor restricting exploration. This insight reorients how evolutionary biologists interpret the pace and pattern of molecular evolution across the tree of life.
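The intuition that a starting sequence can limit exploration even without selection can be illustrated with a toy simulation. The sketch below is not the authors' model: it applies random substitutions to a random ancestor with no selection or epistasis at all, and the names `mutate` and `identity`, the sequence length, and the mutation counts are all hypothetical choices made for illustration.

```python
import random

# One-letter codes for the 20 standard amino acids.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence: str, n_substitutions: int, rng: random.Random) -> str:
    """Apply n random single-site substitutions (no selection, no epistasis)."""
    residues = list(sequence)
    for _ in range(n_substitutions):
        site = rng.randrange(len(residues))
        # Choose a replacement residue that differs from the current one.
        residues[site] = rng.choice([a for a in ALPHABET if a != residues[site]])
    return "".join(residues)


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same result
    ancestor = "".join(rng.choice(ALPHABET) for _ in range(200))
    for n in (0, 10, 100, 1000):
        descendant = mutate(ancestor, n, rng)
        print(n, round(identity(ancestor, descendant), 3))
```

Selection and epistasis are left out on purpose, so the only thing tying a descendant to a region of sequence space is where it started. A descendant that has accumulated few substitutions relative to its length necessarily remains close to its ancestor.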
The implications extend beyond evolutionary theory into the origins of life itself. The simulations suggest that the emergence of the very first proteins in the last universal common ancestor (LUCA) could not have been a simple matter of sequential mutations stemming from a single primordial sequence. Instead, early protein formation likely resulted from an intricate process involving the recombination and rearrangement of smaller DNA fragments, creating novel sequences more rapidly than mutation alone could allow. This reinforces theories of DNA recombination as an evolutionary driver fundamental to the diversification of life’s molecular machinery.
The study also amounts to a call to action for experimentalists in synthetic biology and protein engineering. Present AI approaches for functional prediction are tied closely to existing data, which limits their ability to extrapolate far beyond current knowledge. An enormous expanse of sequence space remains unexplored by natural evolution. Charting it demands extensive experimental efforts to generate new protein sequences and functional data that can broaden the reach of AI algorithms.
This collaborative investigation shows how computational modeling and evolutionary biology can inform each other, providing insights into the protein fitness landscape that will help refine both theoretical frameworks and practical protein design. The findings also bear on biotechnology sectors including therapeutics, enzymatic catalysis, and biomaterials, where tailored proteins with novel functions are actively sought.
As our understanding of protein sequence space deepens, future research must account for the interplay between chance ancestral sequences, evolutionary constraints, and the capacity for innovation. The recognition that protein evolution is constrained fundamentally by historical lineage, rather than solely by natural selection or epistasis, marks a paradigm shift. It recalibrates expectations for novel protein design methodologies that do not account for these ingrained limitations.
Moreover, these results prompt reconsideration of how machine learning datasets are constructed, so that models can anticipate protein functionality more realistically. By expanding experimental data beyond the narrow confines of evolutionarily sampled sequences, researchers can guide AI-driven models toward uncovering truly novel functional sequences. This, in turn, could accelerate the rational design of proteins tailored for unprecedented applications.
Fundamentally, the research bridges the gap between abstract sequence theory and tangible biological reality, illuminating how descent from a common ancestor dramatically restricts exploration in sequence space. It underscores the complex tapestry of evolutionary pressures coupled with historical constraints that shape molecular evolution’s landscape—a compelling narrative that redefines our grasp of life’s molecular diversity.
This pioneering study was supported by the Japan Science and Technology Agency’s Adopting Sustainable Partnerships for Innovative Research Ecosystem (ASPIRE) initiative, fostering international collaborations to propel scientific progress. As protein science stands on the cusp of revolutionary advancements, such integrative efforts are vital to decoding the limits and possibilities embedded in the protein universe.
Subject of Research: Protein evolution and sequence space exploration
Article Title: Descent from a common ancestor restricts exploration of protein sequence space
News Publication Date: 30-Mar-2026
Web References: http://dx.doi.org/10.1073/pnas.2532018123
References: Proceedings of the National Academy of Sciences
Image Credits: Andrew Scott/OIST
Keywords: protein evolution, sequence space, computational modeling, artificial intelligence, protein design, natural selection, epistasis, origins of life, DNA recombination, protein diversity, evolutionary constraints, synthetic biology

