TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

1 Virginia Tech 2 The Ohio State University 3 MIT
4 Boston University 5 Texas A&M University


TL;DR TaxaAdapter injects Vision Taxonomy Model embeddings into a frozen text-to-image diffusion model to improve species-identity fidelity while preserving flexible text control over pose, style, and background.

TaxaAdapter Teaser

We propose TaxaAdapter, a lightweight adapter that injects Vision Taxonomy Model (VTM) embeddings into a frozen diffusion model to make fine-grained synthesis taxonomy-aware, enabling: (a) fine-grained species synthesis with accurate morphological details, (b) diverse and high-fidelity trait synthesis, (c) free-form text control, and (d) strong out-of-distribution generalization.

Abstract

Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To address this, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a metric based on a multimodal large language model (MLLM) that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

Method

TaxaAdapter Framework

TaxaAdapter pipeline overview. Given a taxonomic name (e.g., the full Kingdom → … → Species lineage), we extract taxonomy-image aligned embeddings using a pre-trained vision taxonomy model (e.g., BioCLIP, BioTrove-CLIP, or TaxaBind) and obtain complementary text features from the frozen CLIP text encoder. The dual conditioning streams are fused through a decoupled cross-attention mechanism, where the taxonomy branch captures species-level traits and the text branch retains free-form control over contextual cues such as style, background, or pose. During training, we update only the projection and cross-attention layers, while the diffusion backbone remains frozen for efficient and stable adaptation.
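The decoupled cross-attention fusion described above can be sketched as follows. This is a minimal single-head NumPy sketch under stated assumptions, not the released implementation: the token counts, dimensions, `lam` fusion weight, and additive combination are illustrative, and the learned projection layers applied to the VTM embeddings are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the key axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # single-head scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(img_q, text_tokens, taxa_tokens, lam=0.5):
    # text branch: cross-attention over frozen CLIP text tokens
    out_text = cross_attention(img_q, text_tokens, text_tokens)
    # taxonomy branch: a separate cross-attention over projected VTM tokens
    out_taxa = cross_attention(img_q, taxa_tokens, taxa_tokens)
    # the two streams are fused additively; lam scales the taxonomy route,
    # so lam = 0 falls back to pure text conditioning
    return out_text + lam * out_taxa

rng = np.random.default_rng(0)
img_q = rng.normal(size=(4, 8))        # 4 latent-image query tokens
text_tokens = rng.normal(size=(6, 8))  # 6 CLIP text tokens
taxa_tokens = rng.normal(size=(2, 8))  # 2 projected taxonomy tokens
fused = decoupled_cross_attention(img_q, text_tokens, taxa_tokens, lam=0.5)
```

Because the backbone's original text cross-attention is untouched, only the taxonomy branch (and the projection feeding it) needs gradients, which matches the frozen-backbone training recipe.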

Trait Fidelity Evaluation


Caption-based trait fidelity evaluation. We leverage an MLLM to generate trait captions for real and generated images, summarize each set into a species-level trait description with an LLM, and compute text similarity between the two summaries. Our metric provides an interpretable, trait-level measure of morphological fidelity that complements standard image metrics.
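As a toy illustration of the final similarity step, the sketch below compares two trait summaries with a bag-of-words cosine. This is only a stand-in: the actual metric relies on an MLLM captioner, an LLM summarizer, and a learned text encoder, none of which are reproduced here, and the example summaries are invented.

```python
import math
from collections import Counter

def trait_similarity(summary_a: str, summary_b: str) -> float:
    # toy bag-of-words cosine between two species-level trait summaries;
    # a real pipeline would embed the summaries with a sentence encoder
    ca, cb = Counter(summary_a.lower().split()), Counter(summary_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical summaries distilled from real vs. generated image captions
real_summary = "white wing spots, short curved beak, brown mottled back"
gen_summary = "white wing spots, short curved beak, grey uniform back"
score = trait_similarity(real_summary, gen_summary)
```

A high score indicates the generated set reproduces the taxonomy-defining traits; disagreements (here, back coloration) lower it in an interpretable, trait-localizable way.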

Qualitative Results

Each row shows a different species spanning birds, mammals, insects, and reptiles. TaxaAdapter generates morphology-faithful images that align with taxonomy-defining traits (e.g., texture patterns, body shape, coloration) while maintaining realistic textures and backgrounds. Notably, in the Tree-of-Life-1M results, the first two rows show two extremely similar species within the same genus, where subtle differences in the white spotting patterns on the birds' wings (highlighted with yellow circles) are correctly captured.

Generalization Capabilities of TaxaAdapter

  1. Free-form text prompting. Our dual-conditioning design allows flexible control over contextual attributes while preserving species-level morphology.
  2. Attention maps across different taxonomic tokens. TaxaAdapter gradually focuses on fine-grained information, from body structure to head at the Genus level, and to eyes and beak at the Species level.
  3. Results with different weighting factors (λ) during inference. Setting λ = 0 relies solely on the CLIP text branch, while increasing λ strengthens the taxonomy-conditioned route. Intermediate values (for example, λ = 0.5) balance both conditions and yield morphology-faithful generations.
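Assuming the additive, adapter-style fusion of the two conditioning branches (a sketch, not necessarily the paper's exact formula), the λ weighting at inference can be written as:

```python
import numpy as np

def fuse(attn_text, attn_taxa, lam):
    # inference-time fusion: text-branch output plus lam times the
    # taxonomy-branch output (additive combination assumed)
    return attn_text + lam * attn_taxa

text_out = np.array([1.0, 2.0])   # toy text cross-attention output
taxa_out = np.array([0.5, -1.0])  # toy taxonomy cross-attention output

only_text = fuse(text_out, taxa_out, 0.0)  # lam = 0: CLIP text branch alone
balanced = fuse(text_out, taxa_out, 0.5)   # intermediate lam balances both
```

Under this form, λ = 0 exactly recovers the text-only model, and raising λ monotonically amplifies the taxonomy-conditioned signal.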

BibTeX

coming soon...