More Research

BioCLIP: A Vision Foundation Model for the Tree of Life

1Samuel Stevens*, 1Jiaman (Lisa) Wu*, 1Matthew J Thompson, 1Elizabeth G Campolongo, 1Chan Hee (Luke) Song, 1David Edward Carlyn, 2Li Dong, 3Wasila M Dahdul, 4Charles Stewart, 1Tanya Berger-Wolf, 1Wei-Lun (Harry) Chao, 1Yu Su,

1The Ohio State University, 2Microsoft Research 3University of California, Irvine 4Rensselaer Polytechnic Institute

*Sam and Lisa are co-first authors and contributed equally to BioCLIP.,

Data Models Code Paper

Figure 1: We use the CLIP objective (c) to train a ViT-B/16 on over 450K different class labels, all of which are taxonomic labels from the Tree of Life (a). Because the text encoder is an autoregressive language model, the order representation can only depend on higher ranks like class, phlyum and kingdom (b). This naturally leads to hierarchical representations for labels, helping the vision encoder learn image representations that are more aligned to the tree of life.


Images of the natural world are a super-abundant source of biological information. There are many computational methods and tools, particularly computer vision, for extracting information from images. But, existing methods are bespoke models for a specific task and are not adaptable or extendable to new questions, contexts, and datasets.

We develop the first large-scale multimodal model, BioCLIP, for general biology questions on images. We leverage the unique properties of biology as the the application domain for computer vision:

  1. The abundance and variety of images about plants, animals, and fungi.
  2. The availability of rich structured biological knowledge.
We curate and release TreeOfLife-10M (the largest and most diverse available dataset of biology images), train BioCLIP, rigorously benchmark our approach on diverse fine-grained biology classification tasks, and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation further reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability.



We evaluate BioCLIP and three baselines (CLIP, OpenCLIP, and an iNat-only model that uses the same procedure as BioCLIP but trained only on iNat21) on a diverse set of biology-related classification tasks. We do zero-shot classification with all models and report accuracy on the validation sets. Bold indicates the best performance for each task.

BioCLIP outperforms both general-domain baselines and our new iNat-only ViT model.

Check out the paper for one-shot and five-shot results.

Scroll to see all results.

Model Animals Plants & Fungi Rare Species Mean
Birds 525 Plankton Insects Insects 2 PlantNet Fungi PlantVillage Med. Leaf PlantDoc
CLIP 49.9 3.2 9.1 9.8 58.5 10.2 5.4 15.9 26.1 26.6 21.4
OpenCLIP 54.7 2.2 6.5 9.6 50.2 5.7 8.0 12.4 25.8 31.0 20.6
BioCLIP 72.1 6.1 34.8 20.4 91.4 40.7 24.4 38.6 28.4 37.8 39.4
iNat21 Only 56.1 2.6 30.7 11.5 88.2 43.0 18.4 25.6 20.5 19.4 31.6

Evaluation Examples

BioCLIP correctly labels this Clitocybe fragrans from the Fungi task.
BioCLIP correctly labels this Jasmine leaf (Jasminum) from the Medicinal Leaf task.
BioCLIP correctly labels this Rhizosolenia from the Plankton task. BioCLIP performs well on a variety of image sources, like this microscope image.
CLIP mislabels this Russula ochroleuca as "the prince" (Agaricus augustus), which isn't even in the same family as Russula ochroleuca, while BioCLIP correctly labels it.
CLIP mislabels this Roxburg fig (Ficus auriculata) as a Peepal tree (Ficus Religiosa), while BioCLIP correctly labels it.
CLIP mislabels this Strombidium concicum as a ciliate mix, while BioCLIP correctly labels it.

Intrinsic Evaluation

Why does BioCLIP work so well? We conduct an intrinsic evaluation to understand the representations learned by BioCLIP. We visualize BioCLIP and CLIP's representations for the 100K unseen images in the iNat21 validation set, using T-SNE to plot them in two-dimensions, coloring the points based on their class. In the figure below, (B) means BioCLIP and (O) means OpenAI's CLIP.

At higher ranks like phylum, both CLIP and BioCLIP have good separation, but you can see that BioCLIP's representation is more fine-grained and contains a richer clustering structure. At lower ranks, BioCLIP produces far more separable features, while CLIP's features tend to be cluttered and lack a clear structure. This shows that BioCLIP has learned a rich feature representation following the hierarchical structure of the taxonomy, which helps explain its strong generalization across the tree of life.

We show that (B)ioCLIP's representations are more fine-grained and contain a richer clustering structure than (O)penAI's CLIP. Click on the image to see the full resolution PDF, or check out our paper for more details.


TreeOfLife-10M is the largest and most diverse available dataset of biology images. We combine images from three sources, iNaturalist, BIOSCAN-1M, and Encyclopedia of Life (EOL, accessed 29 July 2023), to create a dataset of 10M images, spanning 450K+ species. We train BioCLIP on TreeOfLife-10M and release the weights for public use.


Please cite our paper if you use our code, data, model or results.

      title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life}, 
      author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year = {2024}

Also consider citing OpenCLIP, iNat21 and BIOSCAN-1M:

  author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
  author={Van Horn, Grant and Mac Aodha, Oisin},
  title={iNat Challenge 2021 - FGVC8},
  author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I. and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y. and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S. and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.},
  title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
  booktitle={Advances in Neural Information Processing Systems ({NeurIPS}) Datasets \& Benchmarks Track},


The authors would like to thank Josef Uyeda, Jim Balhoff, Dan Rubenstein, Hank Bart, Hilmar Lapp, Sara Beery, and colleagues from the Imageomics Institute and the OSU NLP group for their valuable feedback. We also thank the BIOSCAN-1M team and the iNaturalist team for making their data available and easy to use, and Jennifer Hammack at EOL for her invaluable help in accessing EOL’s images.

This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.