In an era where biodiversity research is more critical than ever, a persistent challenge continues to hamper the full utilization of natural history collections: the painstaking process of georeferencing specimen records. Traditionally labor-intensive and time-consuming, georeferencing assigns geographic coordinates to the collection localities recorded for biological specimens, a step that is fundamental to many ecological and conservation studies. Now, an approach emerging from the intersection of artificial intelligence and natural history promises to change how scientists recover the spatial context held within vast museum archives.
Recent research by Xie, Park, Sinnott-Armstrong, and colleagues, published in Nature Plants, demonstrates how large language models (LLMs) – the very AI architectures powering today’s most advanced text generators – can be harnessed to overcome the georeferencing bottleneck that constrains natural history collections globally. The approach offers an innovative, scalable solution that intelligently automates the extraction and interpretation of locality descriptions from specimen labels and converts these into accurate, machine-readable geographic coordinates. This leap forward has significant implications for conservation biology, ecology, and a swath of related disciplines that depend on precise spatial datasets.
Specimen labels housed in natural history collections often contain hand-written or typed locality descriptions that are ambiguous, inconsistent, and sometimes incomplete. This variability presents a formidable challenge for traditional georeferencing methods, which rely heavily on human expertise and manual cross-referencing against geographic gazetteers and historical maps. The manual tasks are not just slow; they require specialized knowledge, making it impossible to efficiently process millions of specimens cataloged over centuries. The costs associated with these efforts can be prohibitive, resulting in many valuable collections remaining underutilized.
The study explores how LLMs, trained on extensive corpora encompassing diverse forms of natural language, can decipher complex and context-dependent descriptions from specimen labels. Through natural language understanding and contextual inference, these models can disambiguate place names, infer missing geographic components, and resolve inconsistencies more effectively than prior automation attempts. The researchers fine-tuned various LLM architectures to translate narrative locality records into structured geospatial data, a task that leverages the AI’s prowess in pattern recognition and semantic interpretation.
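To make "structured geospatial data" concrete, the following is a minimal sketch of how a model's reply might be turned into a validated record. It assumes the model is asked to answer in JSON; the field names (`latitude`, `longitude`, `uncertainty_m`), the `Georeference` class, and `parse_model_reply` are hypothetical illustrations, not the format or code used in the study.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Georeference:
    locality: str                           # verbatim locality text from the label
    latitude: float                         # decimal degrees (WGS84 assumed)
    longitude: float                        # decimal degrees (WGS84 assumed)
    uncertainty_m: Optional[float] = None   # uncertainty radius in metres, if supplied


def parse_model_reply(locality: str, reply: str) -> Optional[Georeference]:
    """Convert a hypothetical JSON reply into a record; return None if it is unusable."""
    try:
        data = json.loads(reply)
        lat = float(data["latitude"])
        lon = float(data["longitude"])
        raw_unc = data.get("uncertainty_m")
        unc = float(raw_unc) if raw_unc is not None else None
    except (ValueError, KeyError, TypeError, AttributeError):
        # Malformed JSON, missing fields, or non-numeric values.
        return None
    # Reject coordinates that fall outside the valid range (this also rejects NaN).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return Georeference(locality, lat, lon, unc)


# Illustrative input only: the coordinates below are made up for the example.
record = parse_model_reply(
    "3 mi N of Flagstaff",
    '{"latitude": 35.24, "longitude": -111.65, "uncertainty_m": 2000}',
)
```

Validating the reply before it enters a database matters because a language model can return free text or out-of-range numbers; in this sketch such replies yield `None` and can be routed to a human.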
This novel application of LLMs transcends prior georeferencing efforts primarily based on rule-based algorithms or keyword matching, which have often struggled to handle the nuances and ambiguities characteristic of historical specimen data. By instead leveraging the depth of understanding embedded in large language models, the process becomes less dependent on rigid heuristics and more adaptive to the idiosyncrasies found across datasets. Additionally, such models can be refined over time, with accuracy improving as they are exposed to more collections and incorporate human-in-the-loop corrections.
Significantly, the approach incorporates a probabilistic framework that quantifies the uncertainty in each georeferencing prediction. This transparency allows curators and researchers to prioritize records for manual review where confidence is low and to trust automated annotations when confidence is high. Such a feedback mechanism aligns perfectly with the needs of scientific work, ensuring higher data integrity without sacrificing throughput.
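The review workflow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's framework: it assumes each prediction carries a `confidence` score between 0 and 1, and the 0.8 threshold, the `triage` function, and the record layout are all hypothetical.

```python
from typing import Dict, List, Tuple

Prediction = Dict[str, object]


def triage(
    predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into auto-accepted records and a manual-review queue."""
    accepted: List[Prediction] = []
    review: List[Prediction] = []
    for pred in predictions:
        # Assumed: "confidence" is a float in [0, 1] attached to each prediction.
        if float(pred["confidence"]) >= threshold:
            accepted.append(pred)
        else:
            review.append(pred)
    # Put the least confident records first so curators see them earliest.
    review.sort(key=lambda p: float(p["confidence"]))
    return accepted, review


# Made-up records for illustration.
sample = [
    {"id": "A1", "confidence": 0.95},
    {"id": "B2", "confidence": 0.40},
    {"id": "C3", "confidence": 0.72},
]
accepted, review = triage(sample)
```

In practice the threshold would be chosen by each collection according to how much error it can tolerate and how much curator time is available.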
The implications of this breakthrough extend far beyond inventory management of botanical or zoological specimens. High-quality georeferenced data is integral to modeling species distributions under climate change scenarios, identifying biodiversity hotspots for conservation priority, and tracking invasive species spread. Yet, the inability to rapidly georeference millions of archived specimens has historically limited these endeavors to localized or well-studied taxa. Automating georeferencing with AI effectively democratizes access to millions—if not billions—of specimen data points for global scientific usage.
Moreover, the use of LLMs introduces a new paradigm in how we may approach the digitization and computational analysis of natural history collections. Rather than simply scanning and archiving specimen images and label texts, institutions can now unlock semantic meaning at scale, bridging the gap between historical human annotation and modern computational capacities. Because LLMs are conversational by design, this also opens the door for future integrations where virtual assistants could interactively assist curators and researchers in annotating, verifying, and enriching collection metadata in real time.
Despite the promise, the study acknowledges the need for ongoing evaluation across different taxonomic groups and geographic regions, given that locality descriptions vary widely in language, historical context, and detail. The authors call for the development of standardized benchmarks and collaborative community efforts to refine the algorithms and datasets, ensuring that error propagation is minimized in downstream analyses that rely heavily on geospatial accuracy.
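One simple way such a benchmark could score predictions is by the great-circle distance between each predicted point and a trusted reference point. The sketch below uses the standard haversine formula; the summary statistics chosen here (median error and the share of records within 10 km) and the function names are assumptions for illustration, not metrics taken from the study.

```python
import math
from statistics import median
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def summarize_errors(
    pairs: List[Tuple[Point, Point]], within_km: float = 10.0
) -> Dict[str, float]:
    """Summarize errors for a non-empty list of (predicted, reference) pairs."""
    errors = [haversine_km(pred, ref) for pred, ref in pairs]
    return {
        "median_km": median(errors),
        "fraction_within": sum(e <= within_km for e in errors) / len(errors),
    }
```

Reporting these figures separately for each region or taxonomic group would show where a model performs unevenly, which is the kind of variation the authors flag as needing evaluation.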
Furthermore, the research highlights the ethical dimensions of deploying AI in natural history. Transparency in algorithmic decision-making, documentation of training data, and openness to critique are essential to uphold scientific rigor and accountability. There is also a recognition that human expertise will remain invaluable, particularly in resolving edge cases and guiding improvements, illustrating that AI is not a replacement but a powerful augmentation tool.
The work by Xie et al. thus stands at the vanguard of a growing movement to harness AI for enhancing biodiversity science. By leveraging the state-of-the-art in natural language processing, they provide a blueprint for overcoming one of the thorniest technical challenges in digitizing and activating centuries of biological knowledge. The emergent synergy between human knowledge and artificial intelligence exemplifies a promising future where technological innovation accelerates our understanding and stewardship of Earth’s natural heritage.
As biodiversity crises intensify worldwide, the ability to rapidly deploy vast natural history datasets in research and policy will be indispensable. The integration of large language models into georeferencing workflows offers a tangible, scalable pathway to unlock this potential. It is a vivid reminder that the best solutions often arise at the intersection of disciplines—in this case, computer science, linguistics, and biology—working hand in hand to decode the stories embedded in specimens and ensure their relevance for generations to come.
The advent of AI-driven georeferencing tools also raises new questions around data curation and stewardship. What standards will collections adopt to preserve data quality as automation scales? How will collaborative efforts between museums, AI researchers, and domain scientists evolve to continuously improve both the tools and the underlying data? Addressing these issues will be central to sustaining the momentum of this work and maximizing its societal impact.
In conclusion, the introduction of large language models into the georeferencing process fundamentally reconfigures the landscape of natural history collection management. By transforming unstructured, often cryptic locality descriptions into actionable geographic data, these AI-driven methods not only speed up processes but also enhance the breadth and depth of research possibilities. This paradigm shift underscores an exhilarating chapter in biodiversity informatics, poised to empower scientists and conservationists with unprecedented data-driven insights into the complexity and richness of life on our planet.
Article References:
Xie, Y., Park, D.S., Sinnott-Armstrong, M.A. et al. Using large language models to address the bottleneck of georeferencing natural history collections. Nat. Plants (2025). https://doi.org/10.1038/s41477-025-02162-y

