BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang¹, Xinyue Ma¹, Arpita Chowdhury¹, Elizabeth G Campolongo¹, Matthew J Thompson¹, Net Zhang¹, Samuel Stevens¹, Hilmar Lapp², Tanya Berger-Wolf¹, Yu Su¹, Wei-Lun (Harry) Chao¹, Jianyang Gu¹
¹The Ohio State University, ²Duke University
zhang.13617@osu.edu, gu.1220@osu.edu
BioCAP
Foundation models in biology have relied mainly on taxonomic labels. BioCAP introduces descriptive captions as complementary supervision, aligning visual and textual representations within the latent morphospace of species.
We curate and release TreeOfLife-10M-Captions, a large-scale collection of synthetic, trait-rich captions generated by multimodal LLMs guided by Wikipedia context and taxon-specific examples. These captions provide accurate, instance-level descriptions at scale. Evaluated on species classification and text-image retrieval, BioCAP trained with these captions improves over BioCLIP by +8.8% on classification and +21.3% on retrieval, demonstrating that descriptive language enriches biological foundation models beyond labels.
Further analysis shows that BioCAP learns a more structured and interpretable representation space. In the embedding space, BioCAP clearly separates species, sexes, and behavioral variants, while Grad-CAM visualizations reveal attention aligned with biologically meaningful traits, demonstrating that descriptive captions enhance both semantic structure and visual grounding.
Demo
Coming Soon
Experiments
We first evaluate BioCAP on species classification. We compare against CLIP (ViT-B/16, pre-trained by OpenAI), SigLIP (ViT-B/16, 224px), BioTrove-CLIP (ViT-B/16), FG-CLIP (ViT-B/16), and BioCLIP (ViT-B/16) under the zero-shot setting. Bold indicates the best performance for each task.
BioCAP outperforms BioCLIP by 8.8% and provides a 21.6% improvement over the CLIP model used as weight initialization.
The first five benchmark columns (NABirds through Camera Trap) cover animals; the next four (PlantNet through Med. Leaf) cover plants and fungi.

| Model | NABirds | Plankton | Insects | Insects 2 | Camera Trap | PlantNet | Fungi | PlantVillage | Med. Leaf | Rare Species | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-B/16) | 39.0 | 3.3 | 7.4 | 9.3 | 28.1 | 52.5 | 8.6 | 5.1 | 15.0 | 25.7 | 19.4 |
| SigLIP | 50.2 | 3.7 | 17.6 | 9.6 | 26.7 | 76.3 | 28.3 | 26.1 | 45.4 | 30.7 | 32.3 |
| FG-CLIP | 48.3 | 1.9 | 6.9 | 9.3 | 26.4 | 55.6 | 7.3 | 5.9 | 15.7 | 29.4 | 20.7 |
| BioTrove-CLIP | 39.4 | 1.0 | 20.5 | 15.7 | 10.7 | 64.4 | 38.2 | 15.7 | 31.6 | 24.6 | 26.2 |
| BioCLIP | 58.8 | 6.1 | 34.9 | 20.5 | 31.7 | 88.2 | 40.9 | 19.0 | 38.5 | 37.1 | 37.6 |
| BioCAP (Ours) | 67.6 | 7.2 | 41.9 | 23.7 | 37.4 | 93.6 | 64.4 | 33.0 | 51.4 | 44.2 | 46.4 |
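As a concrete illustration of the zero-shot protocol above, the sketch below scores an image against species-name prompts with OpenCLIP. It assumes the BioCAP checkpoint can be loaded from the Hugging Face Hub in OpenCLIP-compatible format (as BioCLIP is); the prompt template, class names, and image path are placeholders rather than the exact evaluation setup.

```python
# Minimal zero-shot classification sketch using OpenCLIP (cited below).
# The class names, prompt template, and image path are placeholders;
# loading "hf-hub:imageomics/biocap" assumes an OpenCLIP-compatible repo.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/biocap")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/biocap")
model.eval()

# Candidate labels, e.g. scientific names of the species in a benchmark.
class_names = ["Agelaius phoeniceus", "Molothrus ater", "Sturnella magna"]
prompts = [f"a photo of {name}." for name in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(tokenizer(prompts))
    # Cosine similarity between the image and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()], probs.max().item())
```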
Beyond species classification, we evaluate BioCAP on a series of biological retrieval tasks. These include INQUIRE-Rerank, as well as Cornell Bird and PlantID, two image-text retrieval benchmarks that we curated from paired biological observations. Together, these datasets assess a model’s ability to retrieve and organize biologically relevant images based on descriptive queries.
BioCAP achieves a +21.3% improvement in overall retrieval performance over BioCLIP and outperforms SigLIP by +13.1%, demonstrating stronger visual–language alignment.
| Model | INQUIRE-Rerank Appear. | INQUIRE-Rerank Behav. | INQUIRE-Rerank Context | INQUIRE-Rerank Species | Cornell Bird I2T | Cornell Bird T2I | PlantID I2T | PlantID T2I | Mean |
|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-B/16) | 30.8 | 32.9 | 37.2 | 37.1 | 33.8 | 29.1 | 25.0 | 22.1 | 31.0 |
| SigLIP | 34.6 | 37.2 | 41.4 | 36.2 | 47.7 | 50.2 | 42.1 | 38.1 | 40.9 |
| FG-CLIP | 28.8 | 31.1 | 32.5 | 41.0 | 49.4 | 48.1 | 28.7 | 27.4 | 35.9 |
| BioTrove-CLIP | 28.5 | 22.2 | 30.5 | 39.5 | 16.5 | 13.8 | 47.4 | 50.1 | 31.1 |
| BioCLIP | 27.4 | 27.2 | 30.8 | 41.1 | 15.1 | 16.2 | 47.8 | 45.0 | 31.3 |
| BioCAP (Ours) | 37.1 | 33.6 | 37.0 | 43.0 | 54.0 | 52.0 | 81.4 | 83.0 | 52.6 |
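For the paired Cornell Bird and PlantID benchmarks, I2T and T2I scoring amounts to retrieving the matching item across modalities. The snippet below is a generic recall@1 sketch over precomputed, paired embeddings; the official benchmarks may use different metrics and candidate pools, so treat this only as an illustration.

```python
# Sketch of image-to-text (I2T) and text-to-image (T2I) retrieval scoring
# from precomputed, L2-normalized embeddings. The random embeddings are
# stand-ins; real usage would encode the benchmark images and captions.
import torch

def recall_at_1(query_emb: torch.Tensor, gallery_emb: torch.Tensor) -> float:
    """query_emb[i] is paired with gallery_emb[i]; both are (N, D) and normalized."""
    sim = query_emb @ gallery_emb.T                     # (N, N) cosine similarities
    top1 = sim.argmax(dim=1)                            # best gallery index per query
    correct = (top1 == torch.arange(len(query_emb))).float()
    return correct.mean().item()

# image_emb, text_emb: (N, D) embeddings of N paired observations.
image_emb = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
text_emb = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)

print("I2T R@1:", recall_at_1(image_emb, text_emb))
print("T2I R@1:", recall_at_1(text_emb, image_emb))
```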
Representation Analysis
Beyond instance-level understanding, we examine how BioCAP organizes relationships among individuals by visualizing the t-SNE embeddings of three bird species, annotated with both behaviors (perch, fly, stand) and sex (male, female/immature). General-purpose models such as CLIP and DINOv3 form loose species clusters and conflate sex distinctions, often aligning female or immature red-winged blackbirds with brown-headed cowbirds.
While BioCLIP learns to separate species, it fails to distinguish behavior variations. In contrast, BioCAP produces compact, well-structured clusters and clearly separates biological semantics across sex and behavior. These results highlight how descriptive captions enhance the model’s understanding of fine-grained biological concepts.
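The sketch below shows the kind of t-SNE projection used for this analysis: embed the images, reduce to two dimensions with scikit-learn, and color points by an annotation such as behavior or sex. The random embeddings and labels here are stand-ins for real model features.

```python
# t-SNE visualization sketch: project image embeddings to 2-D and color
# by a per-image annotation. Embeddings and labels are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(300, 512)                     # (N, D) image features
labels = np.random.choice(["perch", "fly", "stand"], 300)  # per-image annotation

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

for label in np.unique(labels):
    mask = labels == label
    plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=label)
plt.legend()
plt.title("t-SNE of image embeddings")
plt.savefig("tsne.png", dpi=200)
```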
Why are captions helpful for classification? To understand how BioCAP benefits from descriptive supervision, we visualize model attention with Grad-CAM, prompted with species names and with high-frequency biological traits mentioned in their captions. The visualizations show that BioCAP localizes biologically meaningful regions and associates them with the correct species.
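A rough Grad-CAM-style sketch for a CLIP-like ViT encoder is given below, using the image-text similarity to a trait phrase as the score to attribute. The module path, token layout, and prompt are assumptions that depend on the OpenCLIP version; the paper's own visualization pipeline may differ.

```python
# Grad-CAM-style attribution sketch for a CLIP-like ViT image encoder.
# The "score" is the similarity between the image embedding and a text
# prompt (species name or trait phrase). The module path
# model.visual.transformer.resblocks[-1] and the (seq, batch, dim) token
# layout are assumptions; adjust them to your OpenCLIP version.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/biocap")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/biocap")
model.eval()

activations, gradients = {}, {}

def save_tokens(_module, _inputs, output):
    # Keep the block's output tokens and register a hook for their gradient.
    activations["tokens"] = output
    output.register_hook(lambda grad: gradients.update(tokens=grad))

model.visual.transformer.resblocks[-1].register_forward_hook(save_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
text = tokenizer(["red shoulder patch"])                    # trait phrase from a caption

image_feat = model.encode_image(image)
text_feat = model.encode_text(text)
score = torch.nn.functional.cosine_similarity(image_feat, text_feat)
model.zero_grad()
score.backward()

tokens = activations["tokens"]               # assumed shape: (seq, batch, dim)
grads = gradients["tokens"]
weights = grads.mean(dim=0, keepdim=True)    # Grad-CAM channel weights, pooled over tokens
cam = (weights * tokens).sum(dim=-1)[1:, 0]  # per-patch score, CLS token dropped
side = int(cam.numel() ** 0.5)
heatmap = torch.relu(cam).reshape(side, side).detach()  # upsample onto the image to visualize
```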
Dataset
TreeOfLife-10M-Captions extends the TreeOfLife-10M dataset by providing a descriptive caption for every image. Using multimodal large language models guided by Wikipedia-derived visual information and taxon-specific examples, we generate instance-level, trait-rich captions that accurately describe each organism’s visible characteristics. In addition to the captions, we also include the corresponding Wikipedia descriptions for all available taxa.
We train BioCAP on the TreeOfLife-10M dataset with our new TreeOfLife-10M-Captions to align biological images with textual descriptions, and release the pretrained weights publicly for downstream use in biological vision and multimodal learning research.
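To make the alignment objective concrete, the sketch below implements the standard CLIP-style symmetric contrastive loss between image and caption embeddings. BioCAP's actual training objective and hyperparameters may differ in details, so this only illustrates the alignment idea.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss between
# image and caption embeddings; an illustration, not BioCAP's exact recipe.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb[i] and text_emb[i] come from the same image-caption pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(image_emb))         # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 paired embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```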
Reference
Please cite our paper and associated artifact(s) if you use our code, data, model or results.
@article{zhang2025biocap,
  title = {{B}io{CAP}: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models},
  author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun (Harry) Chao and Jianyang Gu},
  year = {2025},
  eprint = {2510.20095},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2510.20095},
}
@dataset{treeoflife_10m_captions,
title = {{TreeOfLife-10M Captions (Revision c048cd2)}},
author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun (Harry) Chao and Jianyang Gu},
year = {2025},
url = {https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions},
doi = {10.57967/hf/6801},
publisher = {Hugging Face}
}
@software{Zhang_BioCAP_model,
author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
license = {MIT},
title = {{BioCAP (Revision af8db7a)}},
url = {https://huggingface.co/imageomics/biocap},
version = {1.0.0},
doi = {10.57967/hf/6799},
publisher = {Hugging Face},
year = {2025}
}
Also consider citing OpenCLIP, PlantID, Cornell Bird, INQUIRE, iNat21, and BIOSCAN-1M:
@software{ilharco_gabriel_2021_5143773,
author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
title={OpenCLIP},
year={2021},
doi={10.5281/zenodo.5143773},
}
@misc{plantid,
author = {{Bruce Homer-Smith and contributors to PlantID.net}},
title = {PlantID -- Online Plant Identification Resource},
year = {2025},
url = {https://plantid.net/},
note = {Content licensed under CC BY-NC 3.0. Developed and produced by Bruce Homer-Smith with contributions from Dave Long, Doreen Smith, Kristin Jakob, John Malpas, and others.}
}
@misc{macaulay2025,
author = {{Macaulay Library, Cornell Lab of Ornithology}},
title = {Macaulay Library: Multimedia Resources for Birds and Other Animals},
year = {2025},
url = {https://www.macaulaylibrary.org},
}
@article{vendrow2024inquire,
title={INQUIRE: A Natural World Text-to-Image Retrieval Benchmark},
author={Vendrow, Edward and Pantazis, Omiros and Shepard, Alexander and Brostow, Gabriel and Jones, Kate E and Mac Aodha, Oisin and Beery, Sara and Van Horn, Grant},
journal={NeurIPS},
year={2024},
}
@misc{inat2021,
author={Van Horn, Grant and Mac Aodha, Oisin},
title={iNat Challenge 2021 - FGVC8},
publisher={Kaggle},
year={2021},
url={https://kaggle.com/competitions/inaturalist-2021}
}
@inproceedings{gharaee2023step,
author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I. and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y. and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S. and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.},
title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
booktitle={Advances in Neural Information Processing Systems ({NeurIPS}) Datasets \& Benchmarks Track},
year={2023},
}
Acknowledgements
We would like to thank Wasila Dahdul, Zhiyuan Tao, Yifan Liu, Fangxun Liu, Shuheng Wang, Ziqi Li, David Carlyn, Quang-Huy Nguyen, Yintie Lei, and Junke Yang for their help with the human evaluation, and the Imageomics Team members for their constructive feedback.
We sincerely thank PlantID and its contributors, as well as the Cornell Lab of Ornithology for providing access to their biological media collections. The paired image–text data from PlantID and the Cornell Bird Macaulay Library made our retrieval evaluation possible.
Our research is supported by NSF OAC 2118240 and resources from the Ohio Supercomputer Center. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.