DiVE-k QWEN2.5-7B (CUB)
Overview
DiVE-k Qwen2.5-7B-CUB is a vision-language model fine-tuned with DiVE-k (Differential Visual Reasoning using Top-k Generations) on CUB, a fine-grained bird species recognition benchmark.
DiVE-k reformulates fine-grained image classification as a differential reasoning problem. Instead of training the model to predict a single label, it leverages the model’s own top-k predictions to construct a multiple-choice reasoning task. The model is then trained using reinforcement learning to select the correct answer among visually similar candidates, encouraging deeper visual discrimination and reasoning.
This approach improves zero-shot and base-to-novel generalization performance by teaching the model to compare subtle visual differences between competing hypotheses.
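To make the reformulation concrete, the snippet below is an illustrative sketch (not the repository's actual data-construction code) of how a model's top-k label predictions can be turned into a shuffled multiple-choice question, with the correct option serving as the target during reinforcement-learning fine-tuning. Function and variable names are hypothetical.

```python
import random

def build_differential_prompt(topk_labels, ground_truth, seed=0):
    """Illustrative sketch: turn top-k candidate labels into a multiple-choice question."""
    options = list(dict.fromkeys(topk_labels))      # de-duplicate, preserve order
    if ground_truth not in options:
        options[-1] = ground_truth                  # guarantee the answer appears among the options
    random.Random(seed).shuffle(options)            # answer position carries no signal
    letters = "ABCDEFGH"[: len(options)]
    choice_lines = [f"{letter}. {name}" for letter, name in zip(letters, options)]
    question = (
        "The image shows a bird. Compare the fine-grained visual details of the "
        "candidates and answer with the letter of the correct species:\n"
        + "\n".join(choice_lines)
    )
    answer = letters[options.index(ground_truth)]
    return question, answer

question, answer = build_differential_prompt(
    ["Orchard Oriole", "Baltimore Oriole", "Hooded Oriole", "Scott Oriole"],
    ground_truth="Baltimore Oriole",
)
print(question)
print("expected answer:", answer)
```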
The training framework, data construction, and evaluation pipeline are described in detail in the DiVE-k repository.
👉 Source code: https://github.com/raja-kumar/DiVE-k
Example Usage
Please refer to the official DiVE-k GitHub repository for:
- Model loading
- Inference pipelines
- Fine-grained classification setup
- Training and evaluation scripts
👉 https://github.com/raja-kumar/DiVE-k
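The repository is the authoritative reference. As a rough starting point, the sketch below assumes the released checkpoint follows the standard Qwen2.5-VL format and can be loaded with Hugging Face transformers; the repo id, image path, and candidate list are placeholders, so substitute the actual values from the DiVE-k repository.

```python
# Minimal, illustrative sketch only. It assumes the checkpoint is a standard
# Qwen2.5-VL-style model loadable via transformers; MODEL_ID is a placeholder,
# not the real repo id -- see the DiVE-k repository for the actual model id
# and any custom inference code.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "path/or/repo-id-of-DiVE-k-Qwen2.5-7B-CUB"  # placeholder

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("bird.jpg")  # any CUB-style bird photo

# DiVE-k-style differential prompt: ask the model to choose among visually
# similar candidate species rather than produce a free-form label.
candidates = ["Baltimore Oriole", "Orchard Oriole", "Hooded Oriole", "Scott Oriole"]
question = (
    "Which species is shown in the image? Compare the visual evidence against "
    "each candidate and answer with exactly one of: " + ", ".join(candidates)
)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
generated = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only the new tokens
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```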
Citation
If you use this model or the DiVE-k framework in your research, please cite:
@misc{kumar2025divekdifferentialvisualreasoning,
      title={DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition},
      author={Raja Kumar and Arka Sadhu and Ram Nevatia},
      year={2025},
      eprint={2511.18305},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}