
+---
+license: apache-2.0
+datasets:
+- vidore/colpali_train_set
+- openbmb/VisRAG-Ret-Train-Synthetic-data
+- openbmb/VisRAG-Ret-Train-In-domain-data
+language:
+- en
+base_model:
+- vidore/colqwen2.5-base
+pipeline_tag: visual-document-retrieval
+---
+# Model Card for EvoQwen2.5-VL-Retriever-3B-v1
+EvoQwen2.5-VL-Retriever-3B-v1 is a high-performance multimodal retrieval model built on the Qwen2.5-VL-3B-Instruct backbone that uses multi-vector late interaction. It is fine-tuned with an evolutionary training framework (Evo-Retriever), which enables accurate retrieval of complex visual documents.
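
To make "multi-vector late interaction" concrete, here is a minimal pure-Python sketch of the ColBERT-style MaxSim operator that this family of retrievers is generally understood to use. It is an illustration under that assumption, not the model's actual implementation (in practice scoring is done by `processor.score_multi_vector`); the helper names and the toy 2-dimensional embeddings are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    For each query token vector, take its best-matching page vector,
    then sum those maxima over all query token vectors.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5]]

print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
```

Because every query token is matched against every page vector, a page only needs one well-matching patch per query token to score highly, which is what lets this style of retriever pick up fine-grained details in a document image.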
+## Version Specificity
+- Model: ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1
+- Parameter Size: 3 billion (3B)
+- Features: This is the smaller model in the series. It outperforms other models of similar size on evaluation benchmarks, delivering higher retrieval accuracy in resource-constrained scenarios.
+## Performance
+<table border="1">
+  <tr>
+    <th>Model</th>
+    <th>ViDoRe V2 (nDCG@5)</th>
+    <th>MMEB VisDoc (ndcg_linear@5)</th>
+  </tr>
+  <tr>
+    <td>ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1</td>
+    <td>63.00</td>
+    <td>75.96</td>
+  </tr>
+  <tr>
+    <td>ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1</td>
+    <td>65.24</td>
+    <td>77.10</td>
+  </tr>
+</table>
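
For readers unfamiliar with the metric, the sketch below shows one common way to compute nDCG@5. It assumes linear gain (as the `ndcg_linear@5` column name suggests) and a log2 rank discount; the benchmarks' own evaluation code may differ in details, and the function names here are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain over the top-k results."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 5) -> float:
    """nDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking.

    `ranked_relevances` holds the relevance label of every candidate,
    in the order the retriever ranked them.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# The single relevant page is ranked first: perfect score.
print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0
# The single relevant page is ranked second: score drops to 1 / log2(3).
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))  # 0.6309
```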
+## Usage
+Make sure that you have installed Transformers, Torch, Pillow, and colpali-engine.
+<pre><code class="language-python">
+import torch
+from PIL import Image
+from transformers.utils.import_utils import is_flash_attn_2_available
+from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor
+
+model_name = "ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1"
+
+# Load the model in bfloat16; use FlashAttention 2 when it is installed
+model = ColQwen2_5.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda:0",  # or "mps" if on Apple Silicon
+    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
+).eval()
+processor = ColQwen2_5_Processor.from_pretrained(model_name)
+
+# Your inputs
+images = [
+    Image.new("RGB", (128, 128), color="white"),
+    Image.new("RGB", (64, 32), color="black"),
+]
+queries = [
+    "Is attention really all you need?",
+    "What is the amount of bananas farmed in Salvador?",
+]
+
+# Process the inputs
+batch_images = processor.process_images(images).to(model.device)
+batch_queries = processor.process_queries(queries).to(model.device)
+
+# Forward pass
+with torch.no_grad():
+    image_embeddings = model(**batch_images)
+    query_embeddings = model(**batch_queries)
+
+# Late-interaction similarity between every query and every image
+scores = processor.score_multi_vector(query_embeddings, image_embeddings)
+print(scores)
+</code></pre>
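
Once scores are available, the usual next step is to rank pages per query. The sketch below assumes `scores` is a 2-D tensor with one row per query and one column per image, so that `scores.tolist()` yields nested lists; the helper `top_k_pages` and the example numbers are hypothetical and only illustrate the idea.

```python
from typing import List, Tuple


def top_k_pages(score_rows: List[List[float]], k: int = 1) -> List[List[Tuple[int, float]]]:
    """For each query row, return the k best (page_index, score) pairs, best first."""
    results = []
    for row in score_rows:
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:k])
    return results


# Hypothetical scores for 2 queries x 3 pages; with the usage snippet you
# would pass `scores.tolist()` instead.
example_scores = [[12.5, 7.0, 9.1], [3.2, 15.8, 4.4]]
print(top_k_pages(example_scores, k=2))
# [[(0, 12.5), (2, 9.1)], [(1, 15.8), (2, 4.4)]]
```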
+## Parameters
+All models are fine-tuned with the Evo-Retriever paradigm using a two-stage training schedule (one epoch per stage). Unless otherwise noted, parameter-efficient fine-tuning uses low-rank adapters (LoRA) with a rank of 32 for both the 3B and 7B models. Training runs in bfloat16 precision with the paged_adamw_8bit optimizer on an 8-GPU H20 server, using data parallelism, a learning rate of 2e-5 with cosine decay, warm-up over the first 2% of steps, and a batch size of 32.
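
As a rough illustration of the learning-rate schedule described above, the sketch below computes the rate at a given step. It assumes a linear warm-up over the first 2% of steps and a cosine decay to zero afterwards; the exact warm-up shape, the final learning rate, and the total step count are not stated in this card, so `learning_rate` and `total_steps=1000` are hypothetical.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float = 2e-5,
                  warmup_ratio: float = 0.02) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: progress runs from 0 (end of warm-up) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# With a hypothetical 1000 total steps, warm-up covers the first 20 steps.
for step in (0, 10, 20, 510, 1000):
    print(step, learning_rate(step, total_steps=1000))
```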