TaxonoPy
Reproducibly Aligned Biological Taxonomies
TaxonoPy (taxon-o-py) is a command-line tool for creating reproducibly aligned biological taxonomies using the Global Names Verifier (gnverifier).
Package Purpose
TaxonoPy aligns data to a single, internally consistent 7-rank Linnaean taxonomic hierarchy across large biodiversity datasets assembled from multiple providers, each of which may use overlapping but nonuniform taxonomies. The goal is AI-ready biodiversity data with clean, aligned taxonomy.
Its development has been driven by its application in the TreeOfLife-200M dataset. This dataset contains over 200 million labeled images of organisms from four core data providers:
- The Global Biodiversity Information Facility (GBIF)
- BIOSCAN-5M
- FathomNet
- The Encyclopedia of Life (EOL)
Across these resources, taxon names and classifications often conflict. TaxonoPy resolves those differences into a coherent, standardized taxonomy for the combined dataset.
Challenges
Taxonomic information comes from each data provider and its original sources, and the resulting classification can be:
- Inconsistent — between and within sources (e.g., kingdom Metazoa vs. Animalia)
- Incomplete — missing ranks or containing "holes"
- Incorrect — spelling errors, nonstandard terms, or outdated classifications
- Ambiguous — homonyms, synonyms, and terms with multiple interpretations
Taxonomic authorities exist to standardize classification, but:
- There are multiple authorities
- They may disagree
- A given organism may be missing from some
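The "inconsistent" and "incomplete" cases listed in this section can be made concrete with a short sketch. This is an illustration only, not TaxonoPy's implementation: the function names `normalize_kingdom` and `find_holes`, the dictionary layout of a lineage, and the synonym table are all hypothetical. Only the Metazoa/Animalia pair and the 7-rank hierarchy come from the text.

```python
# The seven Linnaean ranks, from most general to most specific.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical synonym table; only the Metazoa/Animalia pair comes from the text.
KINGDOM_SYNONYMS = {"Metazoa": "Animalia"}


def normalize_kingdom(lineage: dict) -> dict:
    """Return a copy of the lineage with a known kingdom synonym replaced."""
    fixed = dict(lineage)
    kingdom = fixed.get("kingdom")
    if kingdom in KINGDOM_SYNONYMS:
        fixed["kingdom"] = KINGDOM_SYNONYMS[kingdom]
    return fixed


def find_holes(lineage: dict) -> list[str]:
    """List ranks that are empty even though a more specific rank is filled."""
    filled = [bool(lineage.get(rank)) for rank in RANKS]
    holes = []
    for i, rank in enumerate(RANKS):
        # A rank counts as a "hole" only if something below it is present.
        if not filled[i] and any(filled[i + 1:]):
            holes.append(rank)
    return holes


record = {"kingdom": "Metazoa", "phylum": "Chordata", "family": "Felidae", "genus": "Felis"}
print(normalize_kingdom(record)["kingdom"])
print(find_holes(record))
```

In this sketch a missing species is not reported as a hole, because nothing more specific sits below it; only gaps inside an otherwise populated lineage are flagged.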
Solution
TaxonoPy uses the taxonomic lineages provided by diverse sources to submit batched queries to GNVerifier and resolve each sample in the dataset to a standardized classification path. It is currently configured to prioritize alignment to the GBIF Backbone Taxonomy. Where GBIF has no match, the Catalogue of Life and the Open Tree of Life (OTOL) Reference Taxonomy serve as backup sources.
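The batching and source-priority ideas described above can be sketched as follows. This is a minimal illustration under stated assumptions, not TaxonoPy's code and not the GNVerifier API: the helper names `batched` and `pick_classification`, the candidate dictionary shape, and the exact source labels are hypothetical, and no network request is made.

```python
from typing import Iterator, Optional

# Assumed priority order, taken from the prose: GBIF first, then the backups.
SOURCE_PRIORITY = ["GBIF Backbone Taxonomy", "Catalogue of Life", "Open Tree of Life"]


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pick_classification(candidates: list[dict]) -> Optional[dict]:
    """Return the candidate from the highest-priority source, or None."""
    by_source = {}
    for candidate in candidates:
        # Keep the first candidate seen for each source.
        by_source.setdefault(candidate["source"], candidate)
    for source in SOURCE_PRIORITY:
        if source in by_source:
            return by_source[source]
    return None


# Hypothetical verifier results for one name that GBIF did not match.
candidates = [
    {"source": "Open Tree of Life", "path": "Animalia|Chordata"},
    {"source": "Catalogue of Life", "path": "Animalia|Chordata|Mammalia"},
]
print(pick_classification(candidates)["source"])
print(list(batched(["a", "b", "c", "d", "e"], 2)))
```

Keeping the priority order in a single list means the preferred reference taxonomy could be changed in one place, which matches the description of the priority as a current configuration rather than a fixed rule.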
Getting Started
To get started with TaxonoPy, see the Quick Reference guide.
Warning
Taxonomic classifications are human-constructed models of biological diversity, not direct representations of biological reality. Names and ranks reflect taxonomic concepts that may vary between authorities, evolve over time, and differ in scope or interpretation.
TaxonoPy aims to produce a consistent, transparent, and fit-for-purpose classification suitable for large-scale data integration and AI workflows. It prioritizes internal coherence and interoperability across datasets and providers by aligning source data to a selected reference taxonomy.
TaxonoPy is an ongoing effort to improve taxonomic alignment in an evolving landscape. If you have suggestions or encounter bugs, please see the Contributing page.