Quick Reference

Install

pip install taxonopy

For detailed setup instructions including GNVerifier and troubleshooting, see Installation.

Sample Input

Download the sample dataset in either CSV or Parquet format and place it in examples/input/.

Sample input: Note the divergence in kingdoms (Metazoa vs Animalia), the missing interior ranks, and the fully null entry.

uuid	kingdom	phylum	class	order	family	genus	species	scientific_name
bc2a3f9f-c1f9-48df-9b01-d045475b9d5f	Metazoa	Chordata	Mammalia	Primates	Hominidae	Homo	Homo sapiens	Homo sapiens
21ed76d8-9a3b-406e-a1a3-ef244422bf8e	Plantae	Tracheophyta	`null`	Fagales	Fagaceae	Quercus	Quercus alba	Quercus alba
4d166a61-b6e5-4709-91ba-b623111014e9	Animalia	`null`	`null`	Hymenoptera	Apidae	Apis	Apis mellifera	Apis mellifera
85b96dc2-70ab-446e-afb5-6a4b92b0a450	`null`	`null`	`null`	`null`	`null`	`null`	Amanita muscaria	`null`
38327554-ffbf-4180-b4cf-63c311a26f4e	Animalia	`null`	`null`	`null`	`null`	`null`	Laelia rosea	`null`
8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3	Plantae	`null`	`null`	`null`	`null`	`null`	Laelia rosea	`null`
a95f3e29-ed48-41f4-9577-64d4243a0396	`null`	`null`	`null`	`null`	`null`	`null`	`null`	`null`

In the final example entry, there is no available taxonomic data, which can happen in large datasets where there may be a corresponding image but incomplete annotation.
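The two kinds of gaps in the sample can be told apart mechanically. The following is a minimal sketch, not part of TaxonoPy: the helper names are hypothetical, only the seven rank columns are considered (scientific_name is left out), and `None` stands in for null.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Three rows copied from the sample table above; None stands in for null.
SAMPLE = {
    "21ed76d8-9a3b-406e-a1a3-ef244422bf8e": [
        "Plantae", "Tracheophyta", None, "Fagales", "Fagaceae", "Quercus", "Quercus alba",
    ],
    "4d166a61-b6e5-4709-91ba-b623111014e9": [
        "Animalia", None, None, "Hymenoptera", "Apidae", "Apis", "Apis mellifera",
    ],
    "a95f3e29-ed48-41f4-9577-64d4243a0396": [None] * len(RANKS),
}


def missing_interior_ranks(values):
    """Return ranks that are null even though a lower rank is populated."""
    # Index of the most specific populated rank, or -1 if every rank is null.
    lowest = max((i for i, v in enumerate(values) if v is not None), default=-1)
    return [RANKS[i] for i in range(lowest) if values[i] is None]


def is_fully_null(values):
    """True when no rank carries any value at all."""
    return all(v is None for v in values)


for uuid, values in SAMPLE.items():
    print(uuid, missing_interior_ranks(values), is_fully_null(values))
```

For the Quercus alba row this reports `class` as the only interior gap, while the last row has no interior gaps because it has no populated rank to anchor them.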

Execute a Basic Resolution

taxonopy resolve --input examples/input --output-dir examples/output

Input values

There are three kinds of values you can pass to --input:

A single file path (CSV or Parquet).
A flat directory of partitioned files (TaxonoPy will glob everything inside).
A directory tree (TaxonoPy will glob recursively and preserve the folder structure in the output).

In all three cases, the output keeps the original base filename(s) and inserts .resolved or .unsolved before the extension.
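As an illustration of the naming convention, here is a small sketch that predicts the output filenames for a given input file. `expected_output_names` is a hypothetical helper, not a TaxonoPy function, and it assumes the output extension simply follows the chosen output format.

```python
from pathlib import Path


def expected_output_names(input_file, output_format="parquet"):
    """Predict the resolved/unsolved filenames for one input file.

    Assumption: the base filename is kept and the extension matches
    the output format (Parquet unless stated otherwise).
    """
    stem = Path(input_file).stem  # "sample.csv" -> "sample"
    return tuple(
        f"{stem}.{status}.{output_format}" for status in ("resolved", "unsolved")
    )


print(expected_output_names("examples/input/sample.csv"))
```

This only covers filenames; for a directory tree, the relative folder structure would be preserved as well, which the sketch does not model.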

If you download both sample.csv and sample.parquet into examples/input/, resolve will fail due to mixed input formats; keep only one format per input directory.
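To catch the mixed-format case before running resolve, a pre-flight check along these lines may help. It is a sketch under assumptions: `input_formats` is a hypothetical helper, and it treats only .csv and .parquet as data files, which may not match exactly what TaxonoPy globs.

```python
import tempfile
from pathlib import Path

DATA_SUFFIXES = {".csv", ".parquet"}  # assumed set of input extensions


def input_formats(input_dir):
    """Return the set of data-file extensions found under input_dir (recursive)."""
    return {
        p.suffix.lower()
        for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in DATA_SUFFIXES
    }


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample.csv").touch()
    (Path(tmp) / "sample.parquet").touch()
    mixed = input_formats(tmp)      # both formats present
    (Path(tmp) / "sample.parquet").unlink()
    single = input_formats(tmp)     # only CSV left

print(sorted(mixed))   # ['.csv', '.parquet']
print(sorted(single))  # ['.csv']
```

A directory is safe to pass to --input when the returned set has exactly one element.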

The command above will read in the sample data from examples/input/, execute resolution, and write the results to examples/output/.

By default, outputs are written in Parquet format regardless of whether the input is CSV or Parquet. To write CSV instead, use the --output-format csv flag.

The output files consist of:

sample.resolved.parquet
sample.unsolved.parquet
resolution_stats.json

The sample.resolved.parquet file contains all the entries where some resolution strategy was applied. In this example, it contains:

Sample resolved output (selected columns): Green highlights show values added during resolution. Yellow highlights indicate values that changed from the input.

uuid	kingdom	phylum	class	order	family	genus	species	scientific_name	common_name
bc2a3f9f-c1f9-48df-9b01-d045475b9d5f	Animalia	Chordata	Mammalia	Primates	Hominidae	Homo	Homo sapiens	Homo sapiens	`null`
21ed76d8-9a3b-406e-a1a3-ef244422bf8e	Plantae	Tracheophyta	Magnoliopsida	Fagales	Fagaceae	Quercus	Quercus alba	Quercus alba	`null`
4d166a61-b6e5-4709-91ba-b623111014e9	Animalia	Arthropoda	Insecta	Hymenoptera	Apidae	Apis	Apis mellifera	Apis mellifera	`null`
85b96dc2-70ab-446e-afb5-6a4b92b0a450	Fungi	Basidiomycota	Agaricomycetes	Agaricales	Amanitaceae	Amanita	Amanita muscaria	`null`	`null`
38327554-ffbf-4180-b4cf-63c311a26f4e	Animalia	Arthropoda	Insecta	Lepidoptera	Erebidae	Laelia	Laelia rosea	`null`	`null`
8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3	Plantae	Tracheophyta	Liliopsida	Asparagales	Orchidaceae	Laelia	Laelia rosea	`null`	`null`

The sample.unsolved.parquet file contains entries that could not be resolved (for example, rows with no usable taxonomy information). In this example, it contains:

Sample unsolved output: Sequestered entries with no usable taxonomy information.

uuid	kingdom	phylum	class	order	family	genus	species	scientific_name	common_name
a95f3e29-ed48-41f4-9577-64d4243a0396	`null`	`null`	`null`	`null`	`null`	`null`	`null`	`null`	`null`

The resolution_stats.json file reports how many input entries ended in each final status across the resolved and unsolved files.

TaxonoPy also writes cache data to disk (default: ~/.cache/taxonopy) so it can trace provenance and avoid reprocessing. Use --show-cache-path, --cache-stats, or --clear-cache if you want to inspect or manage it, or see the Cache guide for details.

Trace an Entry

You can trace how a single UUID was resolved. For example, let's trace one of the Laelia rosea entries:

taxonopy trace entry --uuid 8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 --from-input examples/input/sample.csv

TaxonoPy uses whatever rank context you provide (even if sparse) to disambiguate identical names. Laelia rosea is a hemihomonym: the same name resolves to different taxa under Animalia and under Plantae. If the higher ranks were missing, TaxonoPy would not be able to disambiguate.
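The idea of using sparse rank context to choose between homonyms can be sketched as a simple consistency filter. This is an illustration only, not TaxonoPy's actual resolution strategy; `consistent_candidates` is a hypothetical helper, and the two candidate classifications are copied from the resolved sample output above.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# The two classifications that share the name "Laelia rosea".
CANDIDATES = [
    {"kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
     "order": "Lepidoptera", "family": "Erebidae", "genus": "Laelia",
     "species": "Laelia rosea"},
    {"kingdom": "Plantae", "phylum": "Tracheophyta", "class": "Liliopsida",
     "order": "Asparagales", "family": "Orchidaceae", "genus": "Laelia",
     "species": "Laelia rosea"},
]


def consistent_candidates(query, candidates):
    """Keep candidates that agree with every non-null rank in the query."""
    return [
        c for c in candidates
        if all(query.get(r) is None or query[r] == c[r] for r in RANKS)
    ]


plant = consistent_candidates({"kingdom": "Plantae", "species": "Laelia rosea"}, CANDIDATES)
bare = consistent_candidates({"species": "Laelia rosea"}, CANDIDATES)

print([c["family"] for c in plant])  # ['Orchidaceae']
print(len(bare))                     # 2
```

With only the species name, both candidates survive the filter, which mirrors why a kingdom (or any other distinguishing rank) is needed to settle a hemihomonym.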

Excerpt (incomplete) from the trace output:

{
  "query_plan": {
    "term": "Laelia rosea",
    "rank": "species",
    "source_id": 11
  },
  "resolution_attempts": [
    {
      "status": "EXACT_MATCH_PRIMARY_SOURCE_ACCEPTED_INNER_RANK_DISAMBIGUATION",
      "resolution_strategy_name": "ExactMatchPrimarySourceAcceptedInnerRankDisambiguation",
      "resolved_classification": {
        "kingdom": "Plantae",
        "phylum": "Tracheophyta",
        "class_": "Liliopsida",
        "order": "Asparagales",
        "family": "Orchidaceae",
        "genus": "Laelia",
        "species": "Laelia rosea"
      }
    }
  ]
}
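If you save the trace output as JSON, the resolved classification can be pulled out with the standard library. The sketch below works on the excerpt shown above only; the full trace contains more fields, and picking the first attempt that carries a classification is an assumption about how to read it, not documented behavior. `resolved_classification` is a hypothetical helper.

```python
import json

# The excerpt from above, embedded as a string for a self-contained example.
TRACE_EXCERPT = """
{
  "query_plan": {"term": "Laelia rosea", "rank": "species", "source_id": 11},
  "resolution_attempts": [
    {
      "status": "EXACT_MATCH_PRIMARY_SOURCE_ACCEPTED_INNER_RANK_DISAMBIGUATION",
      "resolution_strategy_name": "ExactMatchPrimarySourceAcceptedInnerRankDisambiguation",
      "resolved_classification": {
        "kingdom": "Plantae",
        "phylum": "Tracheophyta",
        "class_": "Liliopsida",
        "order": "Asparagales",
        "family": "Orchidaceae",
        "genus": "Laelia",
        "species": "Laelia rosea"
      }
    }
  ]
}
"""


def resolved_classification(trace):
    """Return the classification of the first attempt that has one, else None."""
    for attempt in trace.get("resolution_attempts", []):
        classification = attempt.get("resolved_classification")
        if classification:
            return classification
    return None


trace = json.loads(TRACE_EXCERPT)
result = resolved_classification(trace)
print(result["kingdom"], result["family"])  # Plantae Orchidaceae
```

Note that the class rank appears under the key `class_` (with a trailing underscore) in the trace, unlike the `class` column in the tabular output.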