Previous Version: BioCLIP · And More Research

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

1Jianyang Gu, 1Samuel Stevens, 1Elizabeth G Campolongo, 1Matthew J Thompson, 1Net Zhang, 1Jiaman (Lisa) Wu, 1Andrei Kopanev, 1Zheda Mai, 2Alexander E. White, 3James Balhoff, 4Wasila M Dahdul, 5Daniel Rubenstein, 6Hilmar Lapp, 1Tanya Berger-Wolf, 1Wei-Lun (Harry) Chao, 1Yu Su

1The Ohio State University, 2Smithsonian Institution, 3University of North Carolina at Chapel Hill, 4University of California, Irvine, 5Princeton University, 6Duke University

gu.1220@osu.edu, su.809@osu.edu

Data · Model · Paper

BioCLIP 2 Code (coming soon) · TreeOfLife-toolbox (coming soon) · TaxonoPy · distributed-downloader

Figure 1: While BioCLIP 2 is trained to distinguish between species, it demonstrates emergent properties from scaling hierarchical contrastive learning. Left: The embeddings of Darwin's finches arrange themselves by beak size from left to right. The embedding distribution aligns with ecological relationships between species. Right: The intra-species variations are preserved in subspaces orthogonal to the inter-species variations (the black lines point from the mean embedding of one variant to that of the other variant).

BioCLIP 2

Foundation models trained at scale exhibit emergent properties beyond their initial training objectives. In this work, we turn to biological imagery and ask a simple but intriguing question: what properties emerge if we scale up the hierarchical contrastive training used in BioCLIP?

We curate and release TreeOfLife-200M, the largest and most diverse available dataset of biology images, and evaluate the new BioCLIP 2 model on a diverse set of biological tasks. Through training at scale, BioCLIP 2 improves species classification by 18.1% over BioCLIP. More importantly, we demonstrate that BioCLIP 2 generalizes to diverse biological questions beyond species classification, despite being trained solely with species-level supervision. Further analysis reveals that BioCLIP 2 acquires two emergent properties through scaling up hierarchical contrastive learning: inter-species ecological alignment and intra-species variation separation.
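Concretely, BioCLIP-style hierarchical contrastive learning pairs each image with a caption assembled from its full taxonomic name, kingdom down to species, and optimizes a standard CLIP-style contrastive objective over image-text pairs. The sketch below illustrates the idea only; the released training code (linked above) is the authoritative reference, and the rank list and loss details here are illustrative assumptions.

import torch
import torch.nn.functional as F

def taxonomic_caption(taxon):
    """Flatten a taxonomic record (dict of rank -> name) into one caption string."""
    ranks = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
    return " ".join(taxon[r] for r in ranks if r in taxon)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/caption pairs (the standard CLIP loss)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2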

Demo

Coming soon!

Experiments

We first evaluate BioCLIP 2 on species classification tasks, comparing against CLIP (ViT-L/14, pre-trained on LAION-2B), SigLIP (ViT-L/16, 256px), BioTrove-CLIP (ViT-B/16), and BioCLIP (ViT-B/16) on zero-shot classification. An asterisk marks the best performance for each task. Check out the paper for one-shot and five-shot results.

On zero-shot classification, BioCLIP 2 outperforms BioCLIP by 18.0% and provides a 30.1% improvement over the CLIP model used as its weight initialization.


(NABirds through Camera Trap are animal tasks; PlantNet through Med. Leaf cover plants & fungi.)

Model             NABirds  Plankton  Insects  Insects 2  Camera Trap  PlantNet  Fungi  PlantVillage  Med. Leaf  Rare Species  Mean
CLIP (ViT-L/14)      66.5       1.3      9.0       11.7         29.5      61.7    7.6           6.5       25.6          35.2   25.5
SigLIP               61.7       2.4     27.3       20.7         33.7      81.8   36.9          28.5*      54.5          47.6   39.5
BioTrove-CLIP        39.4       1.0     20.5       15.7         10.7      64.4   38.2          15.7       31.6          24.6   26.2
BioCLIP              58.8       6.1*    34.9       20.5         31.7      88.2   40.9          19.0       38.5          37.1   37.6
BioCLIP 2            74.9*      3.9     55.3*      27.7*        53.9*     96.8*  83.8*         25.1       57.8*         76.8*  55.6*
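Zero-shot classification of this kind works by prompting the model with candidate taxon names. Below is a minimal usage sketch that loads the released checkpoint through open_clip's hf-hub interface, mirroring how the original BioCLIP is distributed; the exact model identifier and prompt template are assumptions to verify against the model card.

import torch
import open_clip
from PIL import Image

# Assumed identifier; confirm on https://huggingface.co/imageomics/bioclip-2
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/bioclip-2")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/bioclip-2")
model.eval()

labels = ["Geospiza fortis", "Geospiza magnirostris"]  # candidate species names
image = preprocess(Image.open("finch.jpg")).unsqueeze(0)
text = tokenizer([f"a photo of {name}" for name in labels])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))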

Biology's organization extends beyond species taxonomy. We assemble five visual benchmarks that push past species ID and evaluate models on them: FishNet (habitat classification), NeWT (trait prediction), AwA2 (trait prediction), Herbarium19 (new-species identification), and PlantDoc (agricultural disease detection).

Although none of this task information is explicitly provided during training, BioCLIP 2 opens a 10.2% performance gap over DINOv2, a model commonly used for diverse visual tasks.

(FishNet, NeWT, and AwA2 are animal tasks; Herbarium19 and PlantDoc are plant tasks.)

Model              FishNet   NeWT   AwA2  Herbarium19  PlantDoc   Mean
CLIP (ViT-L/14)       27.9   83.4   61.6         18.2      22.3   42.7
SigLIP                31.9   83.2   67.3         18.6      28.2   45.8
Supervised-IN21K      29.4   75.8   52.7         14.9      25.1   39.6
DINOv2                37.4   83.7   48.6         28.1      38.6   47.3
BioTrove-CLIP         22.1   82.5   45.7         20.4      37.7   41.7
BioCLIP               30.1   82.7   65.9         26.8      39.5   49.0
BioCLIP 2             39.8*  89.1*  69.5*        48.6*     40.4*  57.5*
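For tasks like these, a common evaluation protocol is to freeze the encoder and fit a lightweight classifier on its embeddings. The sketch below follows that generic recipe; the paper's exact setup (splits, classifier, hyperparameters) may differ, and the helper names are ours.

import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def embed_images(model, preprocess, pil_images):
    """Embed a list of PIL images with the frozen vision encoder."""
    batch = torch.stack([preprocess(im) for im in pil_images])
    feats = model.encode_image(batch)
    return F.normalize(feats, dim=-1).cpu().numpy()

# Hypothetical usage with a labeled train/test split from one of the benchmarks:
# X_train = embed_images(model, preprocess, train_images)
# probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# X_test = embed_images(model, preprocess, test_images)
# print("accuracy:", probe.score(X_test, test_labels))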

Emergent Properties

Why does BioCLIP 2 work well when none of this information is explicitly provided during training? We look into the embedding space of BioCLIP 2 and identify two emergent properties that generalize beyond species classification as training scales up.

The embeddings of different species align with their ecological and functional relationships. In the following figure, we show t-SNE plots of the FishNet test set at different training scales. Larger training scales progressively separate the species that can live in freshwater from those that cannot.

Figure 2: As the training data scales, freshwater fish become more distinct from other fish, despite no explicit supervision being provided during training.
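Plots like Figure 2 can be reproduced along these lines: embed the FishNet test images with the frozen model, project to 2D with t-SNE, and color each point by the freshwater attribute. The plotting choices below are our own, not the paper's exact script.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_freshwater_tsne(embeddings, is_freshwater):
    """embeddings: (n, d) array; is_freshwater: (n,) boolean array."""
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    plt.scatter(xy[is_freshwater, 0], xy[is_freshwater, 1], s=4, label="freshwater")
    plt.scatter(xy[~is_freshwater, 0], xy[~is_freshwater, 1], s=4, label="other")
    plt.legend()
    plt.axis("off")
    plt.show()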

The intra-species variations are preserved and separated. In the following figure, we show the embeddings of three species from NeWT that exhibit life-stage variation. In the top row (2D plots), the intra-species variations are preserved and better separated than in the baseline CLIP model. The bottom row (3D plots) further reveals that the intra-species variations lie in subspaces orthogonal to the span of species embeddings (gray planes). The orthogonality improves with data scale.

Figure 3: As training scales, intra-species variations are preserved and better separated in subspaces orthogonal to the inter-species differences. The 3D space is constructed by the first two principal components of embeddings and an additional orthogonal dimension. Straight lines connect the mean embeddings of two variants of the same species. Check out our paper for more visualizations.
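One way to quantify this orthogonality is to measure how much of the vector between two variants' mean embeddings falls outside the plane spanned by the first two principal components (the gray plane in Figure 3). The sketch below implements that measurement; the paper's exact construction of the 3D space may differ.

import numpy as np

def residual_ratio(variant_a, variant_b, all_embeddings):
    """Fraction of the variant-difference vector outside the top-2 PC plane.

    variant_a, variant_b: (n_a, d), (n_b, d) embeddings of two variants of a species;
    all_embeddings: (n, d) embeddings used to fit the principal components.
    Returns a value in [0, 1]; 1.0 means fully orthogonal to the plane.
    """
    centered = all_embeddings - all_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    plane = vt[:2]                               # top-2 principal directions, (2, d)
    diff = variant_a.mean(axis=0) - variant_b.mean(axis=0)
    in_plane = plane.T @ (plane @ diff)          # projection onto the plane
    return np.linalg.norm(diff - in_plane) / np.linalg.norm(diff)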

Dataset

TreeOfLife-200M is the largest and most diverse available dataset of organismal biology images. We combine images from four providers, the Global Biodiversity Information Facility (GBIF), BIOSCAN-5M, the Encyclopedia of Life (EOL, accessed August 2024), and FathomNet, to create a dataset of 214M images spanning 952K taxa. We train BioCLIP 2 on TreeOfLife-200M and release the weights for public use.
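For programmatic access, something along the following lines should work; how the release is packaged (image payloads vs. metadata plus the download tooling linked above) is an assumption to confirm on the dataset card.

from datasets import load_dataset

# Streaming avoids downloading all 214M records up front; the split name and
# record schema are assumptions to verify against the dataset card.
ds = load_dataset("imageomics/TreeOfLife-200M", split="train", streaming=True)
for record in ds.take(3):
    print(record)  # inspect the schema before building a training pipeline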

Reference

Please cite our paper and the associated artifact(s) if you use our code, data, models, or results.

@article{gu2025bioclip,
  title = {{B}io{CLIP} 2: Emergent Properties from Scaling Hierarchical Contrastive Learning},
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  year = {2025},
  eprint = {2505.23883},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2505.23883}
}
@dataset{treeoflife_200m,
  title = {{T}ree{O}f{L}ife-200{M}},
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  year = {2025},
  url = {https://huggingface.co/datasets/imageomics/TreeOfLife-200M},
  publisher = {Hugging Face}
}
@software{Gu_BioCLIP_2,
  author = {Gu, Jianyang and Stevens, Samuel and Campolongo, Elizabeth G. and Thompson, Matthew J. and Zhang, Net and Wu, Jiaman and Mai, Zheda},
  license = {MIT},
  title = {{BioCLIP 2}},
  url = {https://github.com/Imageomics/bioclip-2},
  version = {1.0.0}
}
@software{Gu_BioCLIP_2_model,
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  license = {MIT},
  title = {{BioCLIP 2}},
  url = {https://huggingface.co/imageomics/bioclip-2},
  version = {1.0.0},
  publisher = {Hugging Face},
  year = {2025}
}

Also consider citing OpenCLIP, GBIF, BIOSCAN-5M, EOL, and FathomNet:

@software{ilharco_gabriel_2021_5143773,
  author = {Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
  title = {OpenCLIP},
  year = {2021},
  doi = {10.5281/zenodo.5143773}
}
@misc{GBIF,
  title = {{GBIF} Occurrence Download},
  author = {GBIF.org},
  doi = {10.15468/DL.BFV433},
  url = {https://doi.org/10.15468/dl.bfv433},
  keywords = {GBIF, biodiversity, species occurrences},
  publisher = {The Global Biodiversity Information Facility},
  month = {May},
  year = {2024},
  copyright = {Creative Commons Attribution Non Commercial 4.0 International}
}
@inproceedings{gharaee2024bioscan5m,
  title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
  author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias
      and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum
      and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor
      and Paul Fieguth and Angel X. Chang
  },
  booktitle={NeurIPS},
  pages={36285--36313},
  year={2024},
  volume={37}
}
@misc{eol,
  title = {Encyclopedia of Life},
  author = {{Encyclopedia of Life (EOL)}},
  url = {https://eol.org},
  note = {Accessed August 2024}
}
@article{katija_fathomnet_2022,
  title = {{FathomNet}: {A} global image database for enabling artificial intelligence in the ocean},
  author = {Katija, Kakani and Orenstein, Eric and Schlining, Brian and Lundsten, Lonny and Barnard, Kevin and Sainz, Giovanna and Boulais, Oceane and Cromwell, Megan and Butler, Erin and Woodward, Benjamin and Bell, Katherine L. C.},
  journal = {Scientific Reports},
  volume = {12},
  number = {1},
  pages = {15914},
  issn = {2045-2322},
  url = {https://www.nature.com/articles/s41598-022-19939-2},
  doi = {10.1038/s41598-022-19939-2},
  month = sep,
  year = {2022},
}

Acknowledgements

We would like to thank Zhiyuan Tao, Shuheng Wang, Ziheng Zhang, Zhongwei Wang, and Leanna House for their help with the TreeOfLife-200M dataset, and Charles (Chuck) Stewart, Sara Beery, and other Imageomics Team members for their constructive feedback. We also thank Sergiu Sanielevici, Tom Maiden, and TJ Olesky for their dedicated assistance with arranging the necessary computational resources.

We are grateful to Kakani Katija and Dirk Steinke for helpful conversations regarding the use and integration of FathomNet and BIOSCAN-5M, respectively, and to Steven Formel and Markus Döring regarding GBIF. We thank Marie Grosjean for comparative methods for filtering citizen science images and Dylan Verheul for assistance with acquiring images from observation.org through GBIF. We thank Suren Byna for a helpful conversation on early dataset design decisions. We thank Doug Johnson for his collaboration in hosting this large dataset on the Ohio Supercomputer Center research storage file system.

Our research is supported by NSF OAC 2118240 and resources from the Ohio Supercomputer Center. This work used the Bridges-2 system, which is supported by NSF award number OAC-1928147 at the Pittsburgh Supercomputing Center (PSC), under the auspices of the NAIRR Pilot program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.