BioCLIP Ecosystem | Foundation Models for the Tree of Life

Which BioCLIP model do you need?

🚀

State-of-the-Art (Huge)

I need the latest model with the highest accuracy (ViT-H) and emergent biological capabilities. Speed and size are not an issue.

Go to BioCLIP 2.5 Huge →

🚀

State-of-the-Art (Large)

I have inference speed or size constraints, but still need the most accurate ViT-L model with emergent biological capabilities.

Go to BioCLIP 2 →

📃

Trained with Captions

I need a ViT-B/16 model for inference speed or direct comparison, and I want a base model trained with both captions and taxonomic labels.

Go to BioCAP →

📜

Original

I need the original BioCLIP ViT-B model described in the 2024 CVPR paper.

Go to BioCLIP →

💾

Raw Models

I want to download weights or try demos.

Go to HF Collection →
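
The selection guide above can be read as a simple decision rule. The following is a minimal Python sketch of that rule, not part of any BioCLIP package: the function name, its parameters, and the order in which the conditions are checked are assumptions made for illustration, while the model names and their descriptions come from the cards above.

```python
def recommend_model(reproduce_original=False,
                    trained_with_captions=False,
                    speed_or_size_constrained=False):
    """Map the needs listed in the guide to a model name.

    The priority order of the checks is an assumption of this sketch;
    the guide itself does not rank the options.
    """
    if reproduce_original:
        # Original ViT-B model from the 2024 CVPR paper
        return "BioCLIP"
    if trained_with_captions:
        # ViT-B/16 trained with captions and taxonomic labels
        return "BioCAP"
    if speed_or_size_constrained:
        # ViT-L model for constrained inference speed or size
        return "BioCLIP 2"
    # No constraints: the largest model (ViT-H)
    return "BioCLIP 2.5 Huge"


print(recommend_model())
print(recommend_model(speed_or_size_constrained=True))
```

Calling the function with no arguments corresponds to the "speed and size are not an issue" case, so it falls through to the Huge model.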

State-of-the-Art

BioCLIP 2.5 Huge

BioCLIP 2.5 Huge is the largest model in the BioCLIP collection, trained on 233 million images across more than 950,000 taxa. Training was accelerated through an updated version of the BioCLIP 2 repository (v2.0.0). BioCLIP 2.5 exhibits emergent properties that extend beyond simple classification: it distinguishes between life stages and sexes, and its embeddings align with ecological traits such as beak size.

The model achieves new state-of-the-art performance on both species classification and broader biological visual tasks, surpassing BioCLIP 2 by 5.7% and 3.5%, respectively. On FishNet in particular, which requires the model to distinguish different habitats, BioCLIP 2.5 Huge improves on BioCLIP 2 by 8.7%.

Architecture: ViT-H/14 (Huge)
Training Data: 233 Million images (TreeOfLife-200M)
Best for: High-accuracy tasks, fine-grained trait analysis, and zero-shot learning when there are no inference speed or memory constraints.

Visit BioCLIP 2 Site | Model Card | Read Paper

State-of-the-Art

BioCLIP 2

The next-generation model, BioCLIP 2, was trained on 214 million images across more than 950,000 taxa. It outperforms BioCLIP by 18.0% and provides a 30.1% improvement over the CLIP (ViT-L/14) model used as weight initialization.

Beyond simple classification, it exhibits emergent properties: it distinguishes between life stages and sexes, and its embeddings align with ecological traits such as beak size.

Architecture: ViT-L/14 (Large)
Training Data: 214 Million images (TreeOfLife-200M, Revision a8f38b4)
Best for: High-accuracy tasks, fine-grained trait analysis, and zero-shot learning when size or inference speed is constrained.

Visit BioCLIP 2 Site | Model Card | Read Paper
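
Zero-shot learning with a CLIP-style model generally means comparing an image embedding against text embeddings of candidate labels and picking the closest one. The sketch below illustrates only that comparison step in plain Python. It does not load any BioCLIP model: the 3-dimensional vectors are toy stand-ins for real encoder outputs, the function names are hypothetical, and the temperature of 100 is an assumed value rather than a documented BioCLIP setting.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, temperature=100.0):
    """Return a probability per label from scaled cosine similarities.

    temperature is an assumed logit scale, not a value taken from BioCLIP.
    """
    image = l2_normalize(image_embedding)
    names = list(label_embeddings)
    logits = []
    for name in names:
        text = l2_normalize(label_embeddings[name])
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(temperature * cosine)
    return dict(zip(names, softmax(logits)))


# Toy embeddings (illustrative only; real ones come from the image and text encoders)
toy_image = [0.9, 0.1, 0.2]
toy_labels = {
    "Cardinalis cardinalis": [1.0, 0.0, 0.1],
    "Cyanocitta cristata": [0.0, 1.0, 0.2],
}

probabilities = zero_shot_classify(toy_image, toy_labels)
best_label = max(probabilities, key=probabilities.get)
print(best_label)
```

In practice the embeddings would be produced by the model's image and text encoders, and the label set can be changed at inference time without retraining, which is what makes the approach zero-shot.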

Trained with Captions

BioCAP

BioCAP introduces descriptive captions to the original BioCLIP training regimen as complementary supervision, aligning visual and textual representations within the latent morphospace of species. This added context improves performance by +8.8% on classification and +21.3% on retrieval, demonstrating that descriptive language enriches biological foundation models beyond labels.

BioCAP is trained on the TreeOfLife-10M dataset paired with TreeOfLife-10M Captions, which provides synthetic, trait-rich captions generated from morphological context and taxon-specific examples. It covers over 390,000 taxa and learns a hierarchical representation that aligns with the biological taxonomy.

Architecture: ViT-B/16 (Base)
Training Data: 10 Million images (TreeOfLife-10M) with associated taxonomic and morphological context-based synthetic captions (TreeOfLife-10M Captions)
Best for: General-purpose baselines (when a ViT-B/16 is needed for inference speed or direct comparison), caption generation pipelines, and reproducing paper results.

Visit BioCAP Site | Model Card | Read Paper

Figure: BioCAP model visualization showing the caption generation process and a clustered embedding plot with organism thumbnails, showing the separation by sex orthogonal to the 'fly' axis.

Original Model

BioCLIP (Original)

The original foundation model presented in "BioCLIP: A Vision Foundation Model for the Tree of Life". It established the standard for using CLIP architectures in organismal biology.

Trained on the TreeOfLife-10M dataset, it covers over 450,000 taxa and learns a hierarchical representation that aligns with the biological taxonomy.

Architecture: ViT-B/16 (Base)
Training Data: 10 Million images (TreeOfLife-10M)
Best for: Reproducing original paper results.

Visit BioCLIP Site | Model Card | Read Paper

Figure: BioCLIP model visualization showing the model architecture.

Vision Foundation Models for the Tree of Life