---
license: apache-2.0
datasets:
- vidore/colpali_train_set
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
language:
- en
base_model:
- vidore/colqwen2.5-base
pipeline_tag: visual-document-retrieval
---
# Model Card for EvoQwen2.5-VL-Retriever-3B-v1

EvoQwen2.5-VL-Retriever-3B-v1 is a high-performance multimodal retrieval model built on the Qwen2.5-VL-3B-Instruct backbone that uses multi-vector late interaction. It is fine-tuned with the Evo-Retriever evolutionary training framework, enabling accurate retrieval over complex visual documents.
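As a rough illustration of what multi-vector late interaction means here, below is a minimal, conceptual sketch of ColBERT-style MaxSim scoring over per-token embeddings. The function name and dimensions are illustrative, and this is not the model's internal implementation; in practice the scoring is handled by colpali-engine, as shown in the Usage section below.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: for each query token, take the similarity of its
    best-matching document token, then sum over query tokens.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim),
    both assumed L2-normalized.
    """
    sim = query_emb @ doc_emb.T             # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()     # scalar relevance score

# Toy example with random embeddings (sizes are illustrative only).
q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(700, 128), dim=-1)
print(late_interaction_score(q, d))
```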
## Version Specificity

- Model: ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1
- Parameter count: 3 billion (3B)
- Features: as the smaller model in this series, it outperforms other models of similar size on evaluation benchmarks, delivering high retrieval accuracy in resource-constrained scenarios.

## Performance

| Model | ViDoRe V2 (nDCG@5) | MMEB VisDoc (ndcg_linear@5) |
| --- | --- | --- |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 63.00 | 75.96 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 65.24 | 77.10 |

## Usage

Make sure you have installed Transformers, PyTorch, Pillow, and colpali-engine (for example via `pip install transformers torch pillow colpali-engine`).

```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```

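`processor.score_multi_vector` returns a tensor of late-interaction scores with one row per query and one column per image. A minimal follow-up sketch for ranking the candidate images per query (assuming the variables from the snippet above are in scope; `top_k` is illustrative):

```python
# Rank candidate images for each query by score, highest first.
top_k = 2
rankings = scores.argsort(dim=-1, descending=True)[:, :top_k]
for q_idx, query in enumerate(queries):
    ranked = [(int(i), round(float(scores[q_idx, i]), 2)) for i in rankings[q_idx]]
    print(f"{query!r} -> top (image_index, score): {ranked}")
```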
## Parameters

All models are fine-tuned with the Evo-Retriever paradigm on a two-stage training schedule (one epoch per stage). Unless otherwise noted, parameter-efficient fine-tuning uses low-rank adapters (LoRA) with rank 32 for both the 3B and 7B models. Training runs in bfloat16 precision with the paged_adamw_8bit optimizer on an 8-GPU H20 server under a data-parallel strategy, with a learning rate of 2e-5, cosine decay, 2% warm-up steps, and a batch size of 32.
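For reference, here is a minimal sketch of how these hyperparameters might map onto a `peft` + `transformers` training setup. This is not the authors' training code: the LoRA target modules, alpha/dropout values, per-device batch split, and output path are assumptions for illustration only.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings: only the rank (32) comes from the card;
# alpha, dropout, and target modules are assumed for illustration.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)

# One stage of the two-stage schedule (one epoch per stage, run twice).
training_args = TrainingArguments(
    output_dir="evo-retriever-stage1",   # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=4,       # assumed split: 8 GPUs x 4 = batch size 32
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,                   # 2% warm-up steps
    bf16=True,
    optim="paged_adamw_8bit",
)
```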