Quick Reference

Install

pip install taxonopy

For detailed setup instructions including GNVerifier and troubleshooting, see Installation.

Sample Input

Download the sample dataset in either format (sample.csv or sample.parquet) and place it in examples/input/.

Sample input: Note the divergence in kingdoms (Metazoa vs. Animalia), the missing interior ranks, and the fully null entry.

uuid | kingdom | phylum | class | order | family | genus | species | scientific_name
bc2a3f9f-c1f9-48df-9b01-d045475b9d5f | Metazoa | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | Homo sapiens
21ed76d8-9a3b-406e-a1a3-ef244422bf8e | Plantae | Tracheophyta | null | Fagales | Fagaceae | Quercus | Quercus alba | Quercus alba
4d166a61-b6e5-4709-91ba-b623111014e9 | Animalia | null | null | Hymenoptera | Apidae | Apis | Apis mellifera | Apis mellifera
85b96dc2-70ab-446e-afb5-6a4b92b0a450 | null | null | null | null | null | null | Amanita muscaria | null
38327554-ffbf-4180-b4cf-63c311a26f4e | Animalia | null | null | null | null | null | Laelia rosea | null
8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 | Plantae | null | null | null | null | null | Laelia rosea | null
a95f3e29-ed48-41f4-9577-64d4243a0396 | null | null | null | null | null | null | null | null

The final entry has no taxonomic data at all. This can happen in large datasets where a record has a corresponding image but incomplete annotation.
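To see which rows fall into that fully null category before running a resolution, a small pre-check can help. The sketch below is not part of TaxonoPy; it assumes a CSV input in which missing values appear either as empty strings or as the literal text "null", and the helper name has_no_taxonomy is hypothetical.

```python
import csv
import io

# Taxonomic columns from the sample input (uuid is excluded).
RANK_COLUMNS = [
    "kingdom", "phylum", "class", "order",
    "family", "genus", "species", "scientific_name",
]

# Assumption: missing values are empty strings or the literal text "null".
MISSING = {"", "null"}


def has_no_taxonomy(row):
    """Return True when every taxonomic column in the row is missing."""
    return all(
        (row.get(col) or "").strip().lower() in MISSING
        for col in RANK_COLUMNS
    )


# Two rows from the sample input, written inline so the example runs as-is.
sample = io.StringIO(
    "uuid,kingdom,phylum,class,order,family,genus,species,scientific_name\n"
    "85b96dc2-70ab-446e-afb5-6a4b92b0a450,,,,,,,Amanita muscaria,\n"
    "a95f3e29-ed48-41f4-9577-64d4243a0396,,,,,,,,\n"
)

empty_rows = [row["uuid"] for row in csv.DictReader(sample) if has_no_taxonomy(row)]
print(empty_rows)  # ['a95f3e29-ed48-41f4-9577-64d4243a0396']
```

The Amanita muscaria row is kept because its species column is populated, even though every other rank is missing.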

Execute a Basic Resolution

taxonopy resolve --input examples/input --output-dir examples/output

Input values

There are three kinds of values you can pass to --input:

  • A single file path (CSV or Parquet).
  • A flat directory of partitioned files (TaxonoPy will glob everything inside).
  • A directory tree (TaxonoPy will glob recursively and preserve the folder structure in the output).

In all three cases, the output keeps the original base filename(s) and adds .resolved or .unsolved before the extension.
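The naming convention can be sketched in a few lines of Python. This only illustrates the rule described here for a single input file and is not TaxonoPy's own code; the function name expected_outputs is hypothetical, the default extension reflects the Parquet default noted on this page, and folder-structure preservation for directory trees is not modeled.

```python
from pathlib import Path


def expected_outputs(input_file, output_dir, output_format="parquet"):
    """Return the (resolved, unsolved) paths implied by the naming convention."""
    stem = Path(input_file).stem  # "sample.csv" -> "sample"
    out = Path(output_dir)
    return (
        out / f"{stem}.resolved.{output_format}",
        out / f"{stem}.unsolved.{output_format}",
    )


resolved, unsolved = expected_outputs("examples/input/sample.csv", "examples/output")
print(resolved.as_posix())  # examples/output/sample.resolved.parquet
print(unsolved.as_posix())  # examples/output/sample.unsolved.parquet
```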

If you download both sample.csv and sample.parquet into examples/input/, resolve will fail due to mixed input formats; keep only one format per input directory.
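A quick check before calling resolve can catch this early. The following is a minimal sketch, assuming formats can be told apart by file extension (.csv and .parquet); how TaxonoPy itself detects the format is not stated here, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumption: input formats are identified by these file extensions.
SUPPORTED = {".csv", ".parquet"}


def input_formats(input_dir):
    """Return the set of supported extensions found anywhere under input_dir."""
    return {
        p.suffix.lower()
        for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    }


def check_single_format(input_dir):
    """Raise ValueError if the directory mixes CSV and Parquet files."""
    formats = input_formats(input_dir)
    if len(formats) > 1:
        raise ValueError(f"Mixed input formats in {input_dir}: {sorted(formats)}")
    return formats


# Only runs the check if the example directory exists locally.
if Path("examples/input").is_dir():
    print(check_single_format("examples/input"))
```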

The resolve command reads the sample data from examples/input/, executes resolution, and writes the results to examples/output/.

By default, outputs are written in Parquet format, whether the input is CSV or Parquet. To write CSV instead, use the --output-format csv flag.
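When driving the CLI from a Python script, the same invocation can be assembled as an argument list. This sketch uses only the flags shown on this page (--input, --output-dir, --output-format); the function name build_resolve_command is hypothetical, and the block only prints the command rather than running it.

```python
import shlex


def build_resolve_command(input_path, output_dir, output_format=None):
    """Build the argument list for `taxonopy resolve` from the documented flags."""
    cmd = [
        "taxonopy", "resolve",
        "--input", str(input_path),
        "--output-dir", str(output_dir),
    ]
    if output_format is not None:
        cmd += ["--output-format", output_format]
    return cmd


cmd = build_resolve_command("examples/input", "examples/output", output_format="csv")
print(shlex.join(cmd))
# To execute it, assuming taxonopy is installed and on PATH:
#   import subprocess
#   subprocess.run(cmd, check=True)
```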

The output files consist of:

  • sample.resolved.parquet
  • sample.unsolved.parquet
  • resolution_stats.json

The sample.resolved.parquet file contains all the entries where some resolution strategy was applied. In this example, it contains:

Sample resolved output (selected columns): Green highlights show values added during resolution. Yellow highlights indicate values that changed from the input.

uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | common_name
bc2a3f9f-c1f9-48df-9b01-d045475b9d5f | Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | Homo sapiens | null
21ed76d8-9a3b-406e-a1a3-ef244422bf8e | Plantae | Tracheophyta | Magnoliopsida | Fagales | Fagaceae | Quercus | Quercus alba | Quercus alba | null
4d166a61-b6e5-4709-91ba-b623111014e9 | Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis | Apis mellifera | Apis mellifera | null
85b96dc2-70ab-446e-afb5-6a4b92b0a450 | Fungi | Basidiomycota | Agaricomycetes | Agaricales | Amanitaceae | Amanita | Amanita muscaria | null | null
38327554-ffbf-4180-b4cf-63c311a26f4e | Animalia | Arthropoda | Insecta | Lepidoptera | Erebidae | Laelia | Laelia rosea | null | null
8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 | Plantae | Tracheophyta | Liliopsida | Asparagales | Orchidaceae | Laelia | Laelia rosea | null | null

The sample.unsolved.parquet file contains entries that could not be resolved (for example, rows with no usable taxonomy information). In this example, it contains:

Sample unsolved output: Sequestered entries with no usable taxonomy information.

uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | common_name
a95f3e29-ed48-41f4-9577-64d4243a0396 | null | null | null | null | null | null | null | null | null

The resolution_stats.json file reports how many input entries fell into each final status across the resolved and unsolved files.

TaxonoPy also writes cache data to disk (default: ~/.cache/taxonopy) so it can trace provenance and avoid reprocessing. Use --show-cache-path, --cache-stats, or --clear-cache if you want to inspect or manage it, or see the Cache guide for details.

Trace an Entry

You can trace how a single UUID was resolved. For example, let's trace one of the Laelia rosea entries:

taxonopy trace entry --uuid 8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 --from-input examples/input/sample.csv

TaxonoPy uses whatever rank context you provide (even if sparse) to disambiguate identical names. Laelia rosea is a hemihomonym: it resolves to a moth (family Erebidae) under Animalia and to an orchid (family Orchidaceae) under Plantae. If the higher ranks had been missing, TaxonoPy would not have been able to disambiguate.
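The idea of rank-context disambiguation can be illustrated with a toy filter. This is a simplified sketch and not TaxonoPy's actual strategy: it takes the two Laelia rosea classifications from the resolved output above as candidates and keeps those consistent with whatever ranks the input supplies. The function name disambiguate is hypothetical.

```python
# The two classifications for "Laelia rosea" shown in the resolved output.
CANDIDATES = [
    {"kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
     "order": "Lepidoptera", "family": "Erebidae",
     "genus": "Laelia", "species": "Laelia rosea"},
    {"kingdom": "Plantae", "phylum": "Tracheophyta", "class": "Liliopsida",
     "order": "Asparagales", "family": "Orchidaceae",
     "genus": "Laelia", "species": "Laelia rosea"},
]


def disambiguate(candidates, known_ranks):
    """Return the single candidate that agrees with every known rank, else None."""
    matches = [
        c for c in candidates
        if all(c.get(rank) == value for rank, value in known_ranks.items())
    ]
    return matches[0] if len(matches) == 1 else None


# With a kingdom, exactly one candidate survives.
print(disambiguate(CANDIDATES, {"kingdom": "Plantae"})["family"])  # Orchidaceae

# With no rank context, both candidates match and the name stays ambiguous.
print(disambiguate(CANDIDATES, {}))  # None
```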

Excerpt (incomplete) from the trace output:

{
  "query_plan": {
    "term": "Laelia rosea",
    "rank": "species",
    "source_id": 11
  },
  "resolution_attempts": [
    {
      "status": "EXACT_MATCH_PRIMARY_SOURCE_ACCEPTED_INNER_RANK_DISAMBIGUATION",
      "resolution_strategy_name": "ExactMatchPrimarySourceAcceptedInnerRankDisambiguation",
      "resolved_classification": {
        "kingdom": "Plantae",
        "phylum": "Tracheophyta",
        "class_": "Liliopsida",
        "order": "Asparagales",
        "family": "Orchidaceae",
        "genus": "Laelia",
        "species": "Laelia rosea"
      }
    }
  ]
}
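If the trace output is saved as JSON, the resolved lineage can be pulled out with the standard library. The sketch below works on a trimmed copy of the excerpt above, so it relies only on the keys shown there (note the trailing underscore in class_). Treating the last item in resolution_attempts as the final result is an assumption, and the full trace contains more fields than this excerpt.

```python
import json

# Trimmed copy of the excerpt above; the real trace output has more fields.
trace_excerpt = """
{
  "query_plan": {"term": "Laelia rosea", "rank": "species", "source_id": 11},
  "resolution_attempts": [
    {
      "status": "EXACT_MATCH_PRIMARY_SOURCE_ACCEPTED_INNER_RANK_DISAMBIGUATION",
      "resolved_classification": {
        "kingdom": "Plantae",
        "phylum": "Tracheophyta",
        "class_": "Liliopsida",
        "order": "Asparagales",
        "family": "Orchidaceae",
        "genus": "Laelia",
        "species": "Laelia rosea"
      }
    }
  ]
}
"""

trace = json.loads(trace_excerpt)

# Assumption: the last attempt in the list holds the final result.
last_attempt = trace["resolution_attempts"][-1]
classification = last_attempt["resolved_classification"]

print(trace["query_plan"]["term"])
print(last_attempt["status"])
print(" > ".join(classification.values()))
```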