DiVE-k QWEN2.5-7B (CUB)

Overview

DiVE-k QWEN2.5-7B-CUB is a vision-language model fine-tuned using DiVE-k (Differential Visual Reasoning using Top-k Generations) on a fine-grained visual recognition domain (e.g., CUB).

DiVE-k reformulates fine-grained image classification as a differential reasoning problem. Instead of training the model to predict a single label, it leverages the model’s own top-k predictions to construct a multiple-choice reasoning task. The model is then trained using reinforcement learning to select the correct answer among visually similar candidates, encouraging deeper visual discrimination and reasoning.

This approach improves zero-shot and base-to-novel generalization performance by teaching the model to compare subtle visual differences between competing hypotheses.

The training framework, data construction, and evaluation pipeline are described in detail in the DiVE-k repository.

👉 Source code: https://github.com/raja-kumar/DiVE-k


Example Usage

Please refer to the official DiVE-k GitHub repository for:

  • Model loading
  • Inference pipelines
  • Fine-grained classification setup
  • Training and evaluation scripts

👉 https://github.com/raja-kumar/DiVE-k


Citation

If you use this model or the DiVE-k framework in your research, please cite:

@misc{kumar2025divekdifferentialvisualreasoning,
  title={DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition},
  author={Raja Kumar and Arka Sadhu and Ram Nevatia},
  year={2025},
  eprint={2511.18305},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Downloads last month
24
Safetensors
Model size
8B params
Tensor type
BF16
·
Video Preview
loading

Collection including raja-kumar/DiVE-k-QWEN2.5-7B-CUB

Paper for raja-kumar/DiVE-k-QWEN2.5-7B-CUB