BioCLIP Software

Tools and applications for working with BioCLIP models and TreeOfLife data, from simple Python packages to large-scale data processing pipelines.

Developer Tool

pybioclip

A user-friendly Python package and command-line tool designed to make BioCLIP models accessible to everyone, regardless of ML expertise. pybioclip provides easy-to-use interfaces for taxonomic classification, custom label prediction, and image embedding generation with your choice of BioCLIP model. All three are available through either a command-line interface (CLI) or a Python API for custom scripts.

Key Features:

  • Taxonomic label prediction for images across ranks in the Linnaean hierarchy (tunable from kingdom to species).
  • Custom label predictions from user-supplied classification categories.
  • Image embedding generation in a text-aligned feature space.
  • Batch image processing with performance optimizations.
  • Containers provided to simplify use in computational pipelines.
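
To illustrate what custom label prediction in a text-aligned feature space means, here is a minimal sketch: an image embedding is compared against one text embedding per label by cosine similarity, and the scores are turned into probabilities with a softmax. This is a conceptual example only, not pybioclip's implementation or API. The function names, the toy 3-dimensional vectors, and the `scale` value are all assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_labels(image_embedding, label_embeddings, scale=100.0):
    """Return (label, probability) pairs sorted from most to least likely.

    `scale` is an assumed temperature applied to the cosine similarities
    before the softmax; it is not a value taken from any BioCLIP model.
    """
    img = normalize(image_embedding)
    logits = {}
    for label, emb in label_embeddings.items():
        txt = normalize(emb)
        logits[label] = scale * sum(a * b for a, b in zip(img, txt))
    # Subtract the maximum logit for a numerically stable softmax.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    pairs = [(label, e / total) for label, e in exps.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical); real models produce much longer vectors.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "duck": [1.0, 0.0, 0.1],
    "fish": [0.0, 1.0, 0.0],
    "bear": [0.1, 0.2, 1.0],
}
predictions = rank_labels(image_embedding, label_embeddings)
for label, probability in predictions:
    print(f"{label}: {probability:.4f}")
```

Because the image and the labels share one feature space, changing the set of categories only requires embedding new label text; the image embedding can be reused.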

Python CLI
Data Processing

TreeOfLife-Toolbox

Source-specific tools for processing data (images) downloaded using distributed-downloader. The post-download processing steps (e.g., exact duplicate filtering and extracting FathomNet detections) are packaged and use MPI. All other image content-based tools used in creating the TreeOfLife-200M dataset are included as scripted pipelines, listed below.

Key Features:

  • Packaged tools for processing GBIF, EOL, FathomNet, and LILA BC data downloaded through distributed-downloader.
  • Human-face identification pipeline for image filtering.
  • Museum specimen image filtering pipeline: remove non-specimen images (e.g., maps, folders, text documents).
  • Citizen science image filtering pipeline: remove near-duplicate images in single observations.
  • Camera trap image filtering pipeline: remove images with no focal species.
  • PDQ hash-based image deduplication pipeline: remove duplicate images not identified through traditional hashes.
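
As a rough illustration of perceptual-hash deduplication, the sketch below compares precomputed hashes by Hamming distance and keeps an image only if it is not within a chosen distance of one already kept. It assumes each hash is a 256-bit value stored as a 64-character hex string. The function names, the sample hashes, and the threshold of 31 are illustrative assumptions, not settings or code taken from the toolbox.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def deduplicate(hashes: dict, max_distance: int) -> list:
    """Greedily keep images whose hash is farther than `max_distance`
    from every image already kept. Runs in O(n^2) comparisons, so it is
    only suitable for small batches."""
    kept = []
    for image_id, image_hash in hashes.items():
        is_new = all(
            hamming_distance(image_hash, hashes[other]) > max_distance
            for other in kept
        )
        if is_new:
            kept.append(image_id)
    return kept


# Made-up hashes for illustration; these are not real PDQ outputs.
hashes = {
    "a.jpg": "f" * 64,
    "b.jpg": "f" * 63 + "e",  # differs from a.jpg in a single bit
    "c.jpg": "0" * 64,        # differs from a.jpg in all 256 bits
}
unique_images = deduplicate(hashes, max_distance=31)
print(unique_images)
```

Unlike a cryptographic hash, a perceptual hash changes only slightly when an image is resized or re-encoded, which is why a distance threshold can catch duplicates that exact hash matching misses.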

Data Tools
Taxonomy

TaxonoPy

A Python package for efficiently aligning organismal taxonomic hierarchies using the Global Names Verifier. TaxonoPy resolves massive taxonomic datasets, aligning them to a single source taxonomy for uniformity. The motivation for this package is to create an internally consistent and standardized classification set for organisms in large biodiversity datasets compiled from different data providers.

Key Features:

  • Automatically resolve inconsistencies between source taxonomies (e.g., kingdom Metazoa vs. Animalia).
  • Complete records missing one or more ranks where the chosen authority defines them.
  • Resolve ambiguities: homonyms, synonyms, and other terms that can be interpreted in multiple ways unless handled systematically.
  • Inspect resolution decisions (and failures).
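
The features above can be pictured with a toy resolver that rewrites a known synonym, fills empty ranks from an authority lineage, and records every decision so it can be inspected. This is a sketch under stated assumptions and not TaxonoPy's code. The `SYNONYMS` and `AUTHORITY` tables and the `resolve` function are hypothetical stand-ins; the real package queries the Global Names Verifier rather than a local dictionary.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")

# Hypothetical lookup tables standing in for a single source taxonomy.
SYNONYMS = {("kingdom", "Metazoa"): "Animalia"}
AUTHORITY = {
    "Panthera": {
        "kingdom": "Animalia",
        "phylum": "Chordata",
        "class": "Mammalia",
        "order": "Carnivora",
        "family": "Felidae",
        "genus": "Panthera",
    },
}


def resolve(record):
    """Return (resolved_record, decisions) for one taxonomic record."""
    resolved = dict(record)
    decisions = []
    # Step 1: replace names that the authority lists under another name.
    for rank in RANKS:
        name = resolved.get(rank)
        preferred = SYNONYMS.get((rank, name))
        if preferred:
            resolved[rank] = preferred
            decisions.append(f"{rank}: {name} -> {preferred}")
    # Step 2: fill empty ranks from the authority's lineage for the genus.
    lineage = AUTHORITY.get(resolved.get("genus"))
    if lineage is None:
        decisions.append("unresolved: genus not found in authority")
        return resolved, decisions
    for rank in RANKS:
        if not resolved.get(rank):
            resolved[rank] = lineage[rank]
            decisions.append(f"{rank}: filled with {lineage[rank]}")
    return resolved, decisions


record = {
    "kingdom": "Metazoa",
    "phylum": "Chordata",
    "class": None,
    "order": "Carnivora",
    "family": "",
    "genus": "Panthera",
}
resolved, decisions = resolve(record)
print(resolved)
print(decisions)
```

Returning the decision log alongside the record keeps failures visible: a genus missing from the authority is reported as unresolved instead of being silently passed through.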

Data Access

Distributed Downloader

An MPI-based distributed downloading tool for retrieving data from diverse domains. It was initially developed to handle the massive scale of downloading all images from the monthly Global Biodiversity Information Facility (GBIF) occurrence snapshot, and was used for the May 2024 snapshot, which contained approximately 200 million images distributed across 545 servers.

Key Features:

  • Server-friendly operation: Implements sophisticated rate limiting to avoid overloading source servers.
  • Enhanced control: Provides fine-grained control over dataset construction and metadata management.
  • Scalability: Handles massive datasets that exceed the capabilities of single-machine solutions.
  • Fault tolerance: Robust checkpoint and recovery system for long-running downloads.
  • Flexibility: Supports diverse output formats and custom processing pipelines.
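
Two of these ideas, per-server rate limiting and checkpoint-based recovery, can be sketched in a few lines. The example below is a single-process illustration under stated assumptions, not the tool's MPI implementation. The `HostRateLimiter` class, the `pending_urls` helper, the JSON checkpoint layout, and the 2-second interval are all hypothetical.

```python
import json
from urllib.parse import urlparse


class HostRateLimiter:
    """Enforce a minimum delay between requests sent to the same host."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url: str, now: float) -> float:
        """Reserve the next slot for this URL's host and return how long
        the caller should wait before sending the request."""
        host = urlparse(url).netloc
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_interval_s
        return start - now


def pending_urls(urls, checkpoint_json: str):
    """Skip URLs already recorded as completed in a checkpoint."""
    completed = set(json.loads(checkpoint_json)["completed"])
    return [url for url in urls if url not in completed]


urls = [
    "https://a.example.org/1.jpg",
    "https://a.example.org/2.jpg",
    "https://b.example.org/1.jpg",
]

# Rate limiting: the second request to the same host is delayed.
limiter = HostRateLimiter(min_interval_s=2.0)
waits = [limiter.wait_time(url, now=0.0) for url in urls]
print(waits)

# Recovery: after a restart, only unfinished URLs are scheduled again.
checkpoint = json.dumps({"completed": ["https://a.example.org/1.jpg"]})
todo = pending_urls(urls, checkpoint)
print(todo)
```

Tracking limits per host rather than globally lets requests to different servers proceed in parallel while each individual server still sees a bounded request rate.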

MPI-based FAIR Downloader