BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang¹, Xinyue Ma¹, Arpita Chowdhury¹, Elizabeth G Campolongo¹, Matthew J Thompson¹, Net Zhang¹, Samuel Stevens¹, Hilmar Lapp², Tanya Berger-Wolf¹, Yu Su¹, Wei-Lun (Harry) Chao¹, Jianyang Gu¹
¹The Ohio State University, ²Duke University
zhang.13617@osu.edu, gu.1220@osu.edu
BioCAP
Foundation models in biology have relied mainly on taxonomic labels. BioCAP introduces descriptive captions as complementary supervision, aligning visual and textual representations within the latent morphospace of species.
We curate and release TreeOfLife-10M-Captions, a large-scale collection of synthetic, trait-rich captions generated by multimodal LLMs guided by Wikipedia context and taxon-specific examples. These captions provide accurate, instance-level descriptions at scale. Evaluated on species classification and text-image retrieval, BioCAP trained with these captions improves over BioCLIP by +8.8% on classification and +21.3% on retrieval, demonstrating that descriptive language enriches biological foundation models beyond labels.
Further analysis shows that BioCAP learns a more structured and interpretable representation space. In the embedding space, BioCAP clearly separates species, sexes, and behavioral variants, while Grad-CAM visualizations reveal attention aligned with biologically meaningful traits, demonstrating that descriptive captions enhance both semantic structure and visual grounding.
Demo
Coming Soon
Experiments
We first evaluate BioCAP on species classification. We compare against CLIP (ViT-B/16, pre-trained by OpenAI), SigLIP (ViT-B/16, 224px), BioTrove-CLIP (ViT-B/16), FG-CLIP (ViT-B/16), and BioCLIP (ViT-B/16) under the zero-shot setting. Bold indicates the best performance for each task.
BioCAP outperforms BioCLIP by 8.8% and provides a 21.6% improvement over the CLIP model used as weight initialization.
The first five benchmark columns (NABirds through Camera Trap) cover animals; the next four (PlantNet through Med. Leaf) cover plants and fungi.

| Model | NABirds | Plankton | Insects | Insects 2 | Camera Trap | PlantNet | Fungi | PlantVillage | Med. Leaf | Rare Species | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-B/16) | 39.0 | 3.3 | 7.4 | 9.3 | 28.1 | 52.5 | 8.6 | 5.1 | 15.0 | 25.7 | 19.4 |
| SigLIP | 50.2 | 3.7 | 17.6 | 9.6 | 26.7 | 76.3 | 28.3 | 26.1 | 45.4 | 30.7 | 32.3 |
| FG-CLIP | 48.3 | 1.9 | 6.9 | 9.3 | 26.4 | 55.6 | 7.3 | 5.9 | 15.7 | 29.4 | 20.7 |
| BioTrove-CLIP | 39.4 | 1.0 | 20.5 | 15.7 | 10.7 | 64.4 | 38.2 | 15.7 | 31.6 | 24.6 | 26.2 |
| BioCLIP | 58.8 | 6.1 | 34.9 | 20.5 | 31.7 | 88.2 | 40.9 | 19.0 | 38.5 | 37.1 | 37.6 |
| BioCAP (Ours) | 67.6 | 7.2 | 41.9 | 23.7 | 37.4 | 93.6 | 64.4 | 33.0 | 51.4 | 44.2 | 46.4 |
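As a concrete illustration of the zero-shot protocol above, the sketch below scores an image against species-name prompts with OpenCLIP. It assumes the BioCAP checkpoint can be loaded from the Hugging Face Hub in OpenCLIP-compatible format (as BioCLIP is); the prompt template, class names, and image path are placeholders rather than the exact evaluation setup.

```python
# Minimal zero-shot classification sketch using OpenCLIP (cited below).
# The class names, prompt template, and image path are placeholders;
# loading "hf-hub:imageomics/biocap" assumes an OpenCLIP-compatible repo.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/biocap")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/biocap")
model.eval()

# Candidate labels, e.g. scientific names of the species in a benchmark.
class_names = ["Agelaius phoeniceus", "Molothrus ater", "Sturnella magna"]
prompts = [f"a photo of {name}." for name in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(tokenizer(prompts))
    # Cosine similarity between the image and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()], probs.max().item())
```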
Beyond species classification, we evaluate BioCAP on a series of biological retrieval tasks. These include INQUIRE-Rerank, as well as Cornell Bird and PlantID, two image-text retrieval benchmarks that we curated from paired biological observations. Together, these datasets assess a model’s ability to retrieve and organize biologically relevant images based on descriptive queries.
BioCAP achieves a +21.3% improvement in overall retrieval performance over BioCLIP and outperforms SigLIP by +13.1%, demonstrating stronger visual–language alignment.
| Model | INQUIRE-Rerank Appear. | INQUIRE-Rerank Behav. | INQUIRE-Rerank Context | INQUIRE-Rerank Species | Cornell Bird I2T | Cornell Bird T2I | PlantID I2T | PlantID T2I | Mean |
|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-B/16) | 30.8 | 32.9 | 37.2 | 37.1 | 33.8 | 29.1 | 25.0 | 22.1 | 31.0 |
| SigLIP | 34.6 | 37.2 | 41.4 | 36.2 | 47.7 | 50.2 | 42.1 | 38.1 | 40.9 |
| FG-CLIP | 28.8 | 31.1 | 32.5 | 41.0 | 49.4 | 48.1 | 28.7 | 27.4 | 35.9 |
| BioTrove-CLIP | 28.5 | 22.2 | 30.5 | 39.5 | 16.5 | 13.8 | 47.4 | 50.1 | 31.1 |
| BioCLIP | 27.4 | 27.2 | 30.8 | 41.1 | 15.1 | 16.2 | 47.8 | 45.0 | 31.3 |
| BioCAP (Ours) | 37.1 | 33.6 | 37.0 | 43.0 | 54.0 | 52.0 | 81.4 | 83.0 | 52.6 |
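For the paired Cornell Bird and PlantID benchmarks, I2T and T2I scoring amounts to retrieving the matching item across modalities. The snippet below is a generic recall@1 sketch over precomputed, paired embeddings; the official benchmarks may use different metrics and candidate pools, so treat this only as an illustration.

```python
# Sketch of image-to-text (I2T) and text-to-image (T2I) retrieval scoring
# from precomputed, L2-normalized embeddings. The random embeddings are
# stand-ins; real usage would encode the benchmark images and captions.
import torch

def recall_at_1(query_emb: torch.Tensor, gallery_emb: torch.Tensor) -> float:
    """query_emb[i] is paired with gallery_emb[i]; both are (N, D) and normalized."""
    sim = query_emb @ gallery_emb.T                     # (N, N) cosine similarities
    top1 = sim.argmax(dim=1)                            # best gallery index per query
    correct = (top1 == torch.arange(len(query_emb))).float()
    return correct.mean().item()

# image_emb, text_emb: (N, D) embeddings of N paired observations.
image_emb = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
text_emb = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)

print("I2T R@1:", recall_at_1(image_emb, text_emb))
print("T2I R@1:", recall_at_1(text_emb, image_emb))
```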
Representation Analysis
Beyond instance-level understanding, we examine how BioCAP organizes relationships among individuals by visualizing the t-SNE embeddings of three bird species, annotated with both behaviors (perch, fly, stand) and sex (male, female/immature). General-purpose models such as CLIP and DINOv3 form loose species clusters and conflate sex distinctions, often aligning female or immature red-winged blackbirds with brown-headed cowbirds.
While BioCLIP learns to separate species, it fails to distinguish behavior variations. In contrast, BioCAP produces compact, well-structured clusters and clearly separates biological semantics across sex and behavior. These results highlight how descriptive captions enhance the model’s understanding of fine-grained biological concepts.
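The sketch below shows the kind of t-SNE projection used for this analysis: embed the images, reduce to two dimensions with scikit-learn, and color points by an annotation such as behavior or sex. The random embeddings and labels here are stand-ins for real model features.

```python
# t-SNE visualization sketch: project image embeddings to 2-D and color
# by a per-image annotation. Embeddings and labels are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(300, 512)                     # (N, D) image features
labels = np.random.choice(["perch", "fly", "stand"], 300)  # per-image annotation

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

for label in np.unique(labels):
    mask = labels == label
    plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=label)
plt.legend()
plt.title("t-SNE of image embeddings")
plt.savefig("tsne.png", dpi=200)
```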
Why are captions helpful for classification? To understand how BioCAP benefits from descriptive supervision, we visualize model attention with Grad-CAM, prompted with species names and with high-frequency biological traits mentioned in their captions. The visualizations show that BioCAP localizes biologically meaningful regions and associates them with the correct species.
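A rough Grad-CAM-style sketch for a CLIP-like ViT encoder is given below, using the image-text similarity to a trait phrase as the score to attribute. The module path, token layout, and prompt are assumptions that depend on the OpenCLIP version; the paper's own visualization pipeline may differ.

```python
# Grad-CAM-style attribution sketch for a CLIP-like ViT image encoder.
# The "score" is the similarity between the image embedding and a text
# prompt (species name or trait phrase). The module path
# model.visual.transformer.resblocks[-1] and the (seq, batch, dim) token
# layout are assumptions; adjust them to your OpenCLIP version.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/biocap")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/biocap")
model.eval()

activations, gradients = {}, {}

def save_tokens(_module, _inputs, output):
    # Keep the block's output tokens and register a hook for their gradient.
    activations["tokens"] = output
    output.register_hook(lambda grad: gradients.update(tokens=grad))

model.visual.transformer.resblocks[-1].register_forward_hook(save_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
text = tokenizer(["red shoulder patch"])                    # trait phrase from a caption

image_feat = model.encode_image(image)
text_feat = model.encode_text(text)
score = torch.nn.functional.cosine_similarity(image_feat, text_feat)
model.zero_grad()
score.backward()

tokens = activations["tokens"]               # assumed shape: (seq, batch, dim)
grads = gradients["tokens"]
weights = grads.mean(dim=0, keepdim=True)    # Grad-CAM channel weights, pooled over tokens
cam = (weights * tokens).sum(dim=-1)[1:, 0]  # per-patch score, CLS token dropped
side = int(cam.numel() ** 0.5)
heatmap = torch.relu(cam).reshape(side, side).detach()  # upsample onto the image to visualize
```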
Dataset
TreeOfLife-10M-Captions extends the TreeOfLife-10M dataset by providing a descriptive caption for every image. Using multimodal large language models guided by Wikipedia-derived visual information and taxon-specific examples, we generate instance-level, trait-rich captions that accurately describe each organism’s visible characteristics. In addition to the captions, we also include the corresponding Wikipedia descriptions for all available taxa.
We train BioCAP on the TreeOfLife-10M dataset with our new TreeOfLife-10M-Captions to align biological images with textual descriptions, and release the pretrained weights publicly for downstream use in biological vision and multimodal learning research.
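To make the alignment objective concrete, the sketch below implements the standard CLIP-style symmetric contrastive loss between image and caption embeddings. BioCAP's actual training objective and hyperparameters may differ in details, so this only illustrates the alignment idea.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss between
# image and caption embeddings; an illustration, not BioCAP's exact recipe.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb[i] and text_emb[i] come from the same image-caption pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(image_emb))         # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 paired embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```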
Reference
Please cite our paper and associated artifact(s) if you use our code, data, model or results.
@article{zhang2025biocap,
  title = {{B}io{CAP}: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models},
  author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun (Harry) Chao and Jianyang Gu},
  year = {2025},
  eprint = {2510.20095},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2510.20095},
}
@dataset{treeoflife_10m_captions,
title = {{TreeOfLife-10M Captions (Revision c048cd2)}},
author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun (Harry) Chao and Jianyang Gu},
year = {2025},
url = {https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions},
doi = {10.57967/hf/6801},
publisher = {Hugging Face}
}
@software{Zhang_BioCAP_model,
author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
license = {MIT},
title = {{BioCAP (Revision af8db7a)}},
url = {https://huggingface.co/imageomics/biocap},
version = {1.0.0},
doi = {10.57967/hf/6799},
publisher = {Hugging Face},
year = {2025}
}
Also consider citing OpenCLIP, PlantID, Cornell Bird, INQUIRE, iNat21, and BIOSCAN-1M:
@software{ilharco_gabriel_2021_5143773,
author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
title={OpenCLIP},
year={2021},
doi={10.5281/zenodo.5143773},
}
@misc{plantid,
author = {{Bruce Homer-Smith and contributors to PlantID.net}},
title = {PlantID -- Online Plant Identification Resource},
year = {2025},
url = {https://plantid.net/},
note = {Content licensed under CC BY-NC 3.0. Developed and produced by Bruce Homer-Smith with contributions from Dave Long, Doreen Smith, Kristin Jakob, John Malpas, and others.}
}
@misc{macaulay2025,
author = {{Macaulay Library, Cornell Lab of Ornithology}},
title = {Macaulay Library: Multimedia Resources for Birds and Other Animals},
year = {2025},
url = {https://www.macaulaylibrary.org},
}
@article{vendrow2024inquire,
title={INQUIRE: A Natural World Text-to-Image Retrieval Benchmark},
author={Vendrow, Edward and Pantazis, Omiros and Shepard, Alexander and Brostow, Gabriel and Jones, Kate E and Mac Aodha, Oisin and Beery, Sara and Van Horn, Grant},
journal={NeurIPS},
year={2024},
}
@misc{inat2021,
author={Van Horn, Grant and Mac Aodha, Oisin},
title={iNat Challenge 2021 - FGVC8},
publisher={Kaggle},
year={2021},
url={https://kaggle.com/competitions/inaturalist-2021}
}
@inproceedings{gharaee2023step,
author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I. and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y. and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S. and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.},
title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
booktitle={Advances in Neural Information Processing Systems ({NeurIPS}) Datasets \& Benchmarks Track},
year={2023},
}
Acknowledgements
We would like to thank Wasila Dahdul, Zhiyuan Tao, Yifan Liu, Fangxun Liu, Shuheng Wang, Ziqi Li, David Carlyn, Quang-Huy Nguyen, Yintie Lei, and Junke Yang for their help with the human evaluation, and the Imageomics Team members for their constructive feedback.
We sincerely thank PlantID and its contributors, as well as the Cornell Lab of Ornithology for providing access to their biological media collections. The paired image–text data from PlantID and the Cornell Bird Macaulay Library made our retrieval evaluation possible.
Our research is supported by NSF OAC 2118240 and resources from the Ohio Supercomputer Center. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.