Vision Foundation Models
for the Tree of Life

Bridging computer vision and biology. BioCLIP models learn hierarchical representations of the natural world, enabling advanced species classification, trait prediction, and more.

Which BioCLIP model do you need?

🚀

State-of-the-Art (Huge)

I need the latest model with the highest accuracy (ViT-H) and emergent biological capabilities. Speed and size are not an issue.

Go to BioCLIP 2.5 Huge →
🚀

State-of-the-Art (Large)

I have inference speed or size constraints, but need the latest model with the highest accuracy (ViT-L) and emergent biological capabilities.

Go to BioCLIP 2 →
📃

Trained with Captions

I need a ViT-B/16 model for inference speed or direct comparison, and I'm looking for a base model trained with captions and taxonomic labels.

Go to BioCAP →
📜

Original

I need the original BioCLIP ViT-B model described in the 2024 CVPR paper.

Go to BioCLIP →
💾

Raw Models

I want to download weights or try demos.

Go to HF Collection →
State-of-the-Art

BioCLIP 2.5 Huge

The largest model in the BioCLIP collection, BioCLIP 2.5 Huge was trained on 233 million images spanning more than 950,000 taxa. Training was accelerated through an updated version of the BioCLIP 2 repository (v2.0.0). BioCLIP 2.5 exhibits emergent properties that extend beyond simple classification: it distinguishes between life stages and sexes and aligns embeddings with ecological traits like beak size.

The model achieves new state-of-the-art performance on both species classification and broader biological visual tasks, surpassing BioCLIP 2 by 5.7% and 3.5%, respectively. On FishNet in particular, which requires the model to distinguish different habitats, BioCLIP 2.5 Huge demonstrates an 8.7% improvement over BioCLIP 2.

  • Architecture: ViT-H/14 (Huge)
  • Training Data: 233 million images (TreeOfLife-200M)
  • Best for: High-accuracy tasks, fine-grained trait analysis, and zero-shot learning when there are no inference-speed or memory constraints.
Visit BioCLIP 2 Site Model Card Read Paper
BioCLIP 2 model visualization showing the model architecture, a clustered embedding plot with organism thumbnails, showing the separation by age and sex orthogonal to the species axis
State-of-the-Art

BioCLIP 2

The next-generation model, BioCLIP 2, was trained on 214 million images spanning more than 950,000 taxa. It outperforms BioCLIP by 18.0% and provides a 30.1% improvement over the CLIP (ViT-L/14) model used as weight initialization.

Beyond simple classification, it exhibits emergent properties, such as distinguishing between life stages and sexes and aligning embeddings with ecological traits like beak size.

  • Architecture: ViT-L/14 (Large)
  • Training Data: 214 million images (TreeOfLife-200M, Revision a8f38b4)
  • Best for: High-accuracy tasks, fine-grained trait analysis, and zero-shot learning when size or inference speed is constrained.
Visit BioCLIP 2 Site Model Card Read Paper
BioCLIP 2 model visualization showing the model architecture, a clustered embedding plot with organism thumbnails, showing the separation by age and sex orthogonal to the species axis
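As a sketch of how embeddings from a model like BioCLIP 2 might be used, the snippet below loads the checkpoint through the `open_clip` library and ranks a gallery of image embeddings against a query by cosine similarity. The checkpoint id `hf-hub:imageomics/bioclip-2` is an assumption; confirm the exact identifier on the model card before running.

```python
# Sketch: extracting image embeddings with BioCLIP 2 via open_clip
# (assumed checkpoint id "hf-hub:imageomics/bioclip-2" -- check the
# model card) and ranking them by cosine similarity.
import torch


def embed_images(image_paths):
    """Return L2-normalised image embeddings for a list of file paths.

    Downloads model weights on first call.
    """
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "hf-hub:imageomics/bioclip-2"
    )
    model.eval()
    batch = torch.stack([preprocess(Image.open(p)) for p in image_paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)


def cosine_rank(query, gallery):
    """Rank gallery rows by cosine similarity to a 1-D query embedding."""
    query = query / query.norm()
    gallery = gallery / gallery.norm(dim=-1, keepdim=True)
    sims = gallery @ query
    return sims.argsort(descending=True)
```

A nearest-neighbour lookup over such normalised embeddings is one way to exploit the trait-aligned structure of the embedding space described above.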
Trained with Captions

BioCAP

BioCAP introduces descriptive captions into the original BioCLIP training regimen as complementary supervision, aligning visual and textual representations within the latent morphospace of species. This added context improves performance by 8.8% on classification and 21.3% on retrieval, demonstrating that descriptive language enriches biological foundation models beyond labels alone.

BioCAP was trained on the TreeOfLife-10M dataset paired with TreeOfLife-10M Captions, a set of synthetic, trait-rich captions generated from morphological context and taxon-specific examples. It covers over 390,000 taxa and learns a hierarchical representation that aligns with the biological taxonomy.

  • Architecture: ViT-B/16 (Base)
  • Training Data: 10 million images (TreeOfLife-10M) with associated synthetic captions based on taxonomic and morphological context (TreeOfLife-10M Captions)
  • Best for: General-purpose baselines where a ViT-B/16 is needed for inference speed or direct comparison, caption-generation pipelines, and reproducing paper results.
Visit BioCAP Site Model Card Read Paper
BioCAP model visualization showing the caption generation process, a clustered embedding plot with organism thumbnails, showing the separation by sex orthogonal to the 'fly' axis
Original Model

BioCLIP (Original)

The original foundation model presented in "BioCLIP: A Vision Foundation Model for the Tree of Life". It established the standard for using CLIP architectures in organismal biology.

Trained on the TreeOfLife-10M dataset, it covers over 450,000 taxa and learns a hierarchical representation that aligns with the biological taxonomy.

  • Architecture: ViT-B/16 (Base)
  • Training Data: 10 million images (TreeOfLife-10M)
  • Best for: Reproducing original paper results.
Visit BioCLIP Site Model Card Read Paper
BioCLIP model visualization showing the model architecture
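For reproducing zero-shot results, a common pattern is to score an image against text prompts built from taxonomic names. The sketch below assumes the "a photo of &lt;full taxonomic name&gt;" prompt pattern and the checkpoint id `hf-hub:imageomics/bioclip`; both should be verified against the paper and model card.

```python
# Sketch: zero-shot species classification with the original BioCLIP.
# Assumes the "a photo of <taxonomic name>" prompt pattern and the
# checkpoint id "hf-hub:imageomics/bioclip" -- verify on the model card.

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def make_taxonomic_prompt(taxonomy):
    """Join whichever ranks are present into a single text prompt."""
    names = [taxonomy[r] for r in RANKS if r in taxonomy]
    return "a photo of " + " ".join(names)


def zero_shot_scores(image, prompts):
    """Cosine scores of one PIL image against candidate text prompts.

    Downloads model weights on first call.
    """
    import open_clip
    import torch

    model, _, preprocess = open_clip.create_model_and_transforms(
        "hf-hub:imageomics/bioclip"
    )
    tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/bioclip")
    model.eval()
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer(prompts))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)
```

The highest-scoring prompt gives the predicted taxon; because the prompts carry the full taxonomic hierarchy, partial matches degrade gracefully along the tree.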
Repository

Hugging Face Collection

The central warehouse for all BioCLIP assets. This collection aggregates all versions of the models, the training datasets, benchmarks, and interactive demos.

Use this if you need direct access to raw model weights (SafeTensors/PyTorch), want to access the TreeOfLife or benchmark datasets, or want quick per-image predictions via the demos.

  • Models: BioCLIP, BioCLIP 2, BioCAP.
  • Datasets: TreeOfLife-200M, TreeOfLife-10M, TreeOfLife-10M Captions.
  • Benchmarks: Rare Species, IDLE-OO Camera Traps.
  • Demos: Interactive Gradio apps for zero-shot and open-ended classification.
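For raw weight access, one option is `huggingface_hub`. The repo id `imageomics/bioclip-2` and the file name below are assumptions inferred from the model names on this page; browse the collection for the exact repo and file names.

```python
# Sketch: pulling raw model files from the Hugging Face Hub.
# The repo id "imageomics/bioclip-2" and file name are assumptions;
# check the collection for exact names.


def hub_file_url(repo_id, filename):
    """Direct-download URL for one file in a Hub model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def fetch_weights(repo_id="imageomics/bioclip-2", local_dir=None):
    """Download an entire model repo (weights, configs) to local disk."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

`snapshot_download` caches files locally, so repeated calls are cheap; `hub_file_url` is handy when a single file is all you need.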
Browse Collection