Tools and applications for working with BioCLIP models and TreeOfLife data, from simple Python packages to large-scale data processing pipelines.
pybioclip is a user-friendly Python package and command-line tool designed to make BioCLIP models accessible to everyone, regardless of ML expertise. It provides easy-to-use interfaces for taxonomic classification, custom label prediction, and image embedding generation with your choice of BioCLIP model. Each of these tasks can be run either from the command-line interface (CLI) or through the Python API in custom scripts.
Source-specific tools for processing data (images) downloaded using the distributed downloader. The post-download processing steps (e.g., exact duplicate filtering and extraction of FathomNet detections) are packaged and use MPI. All other image content-based tools used in creating the TreeOfLife-200M dataset are included as scripted pipelines listed below.
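To make the idea of exact duplicate filtering concrete, here is a minimal single-process sketch. It assumes that "exact duplicate" means byte-identical files and that a SHA-256 content hash is an acceptable key; the function names `file_digest` and `filter_exact_duplicates` are hypothetical and are not taken from the toolbox, which runs its processing with MPI and may define duplicates differently.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large images never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def filter_exact_duplicates(paths):
    """Keep the first path seen for each distinct file content (assumed: byte-identical)."""
    seen = set()
    unique = []
    for path in paths:
        path = Path(path)
        key = file_digest(path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

In a distributed setting, one plausible design is for each worker to hash its own shard of files and for the digests to be merged afterwards, since comparing digests is far cheaper than comparing image bytes.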
TaxonoPy is a Python package for efficiently aligning organismal taxonomic hierarchies using the Global Names Verifier. It resolves massive taxonomic datasets, aligning them to a single source taxonomy for uniformity. The motivation for this package is to create an internally consistent, standardized classification set for organisms in large biodiversity datasets assembled from different data providers.
An MPI-based distributed downloading tool for retrieving data from diverse domains. It was initially developed to handle the massive scale of downloading all images from the monthly Global Biodiversity Information Facility (GBIF) occurrence snapshot. It was used for the May 2024 snapshot, which contained approximately 200 million images distributed across 545 servers.
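When images are spread across hundreds of servers, one natural way to organize the work is to group URLs by host and spread the hosts across workers. The sketch below illustrates that idea only; it is not the tool's actual scheduler, which is not described here. The helper names `group_by_server` and `assign_servers` and the greedy balancing rule are assumptions made for illustration, and the MPI layer is left out so the example stays self-contained.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_server(urls):
    """Group image URLs by the host that serves them."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)
    return dict(groups)


def assign_servers(groups, num_workers):
    """Greedily give the largest remaining host to the least-loaded worker.

    Returns one list of hosts per worker (hypothetical balancing rule).
    """
    loads = [0] * num_workers
    plan = [[] for _ in range(num_workers)]
    # Sort by descending URL count, then by host name for a stable order.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for host, items in ordered:
        worker = loads.index(min(loads))
        plan[worker].append(host)
        loads[worker] += len(items)
    return plan
```

Keeping each host with a single worker makes it straightforward to apply per-server rate limits, at the cost of less even load when a few servers hold most of the images.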