OrlandoHugBot committed
Commit 7bc35c2 · verified · 1 Parent(s): 9a33d91

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -9,7 +9,7 @@
 
  ## 1. Introduction
 
- We introduce Skywork-R1V, a multimodal reasoning model that extends the R1-series text models to visual modalities through a near-lossless transfer method. Using a lightweight visual projector, Skywork-R1V enables seamless multimodal adaptation without requiring retraining of either the base language model or vision encoder. To enhance visual-text alignment, we developed a hybrid optimization strategy combining Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving cross-modal integration. Additionally, we created an adaptive-length Chain-of-Thought distillation approach for generating reasoning data, which dynamically optimizes reasoning chain lengths to improve inference efficiency and prevent overthinking. The model achieves state-of-the-art performance on key multimodal reasoning benchmarks, scoring 68.1 on MMMU and 71.0 on MathVista, comparable to leading closed-source models like Gemini 2.0 and Kimi-k1.5. It also maintains strong textual reasoning capabilities, achieving impressive scores of 72.6 on AIME and 94.3 on MATH500.
+ We introduce Skywork-R1V, a multimodal reasoning model that extends the R1-series text models to visual modalities through a near-lossless transfer method. Using a lightweight visual projector, Skywork-R1V enables seamless multimodal adaptation without requiring retraining of either the base language model or vision encoder. To enhance visual-text alignment, we developed a hybrid optimization strategy combining Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving cross-modal integration. Additionally, we created an adaptive-length Chain-of-Thought distillation approach for generating reasoning data, which dynamically optimizes reasoning chain lengths to improve inference efficiency and prevent overthinking. The model achieves state-of-the-art performance on key multimodal reasoning benchmarks, scoring 69 on MMMU and 67.5 on MathVista, comparable to leading closed-source models like Gemini 2.0 and Kimi-k1.5. It also maintains strong textual reasoning capabilities, achieving impressive scores of 72.6 on AIME and 94.3 on MATH500.
 
 
  ## 2. Model Summary
@@ -177,9 +177,9 @@ The model follows a connection pattern of Vision Encoder → MLP Adapter → Lan
  <td align="center">94.0</td>
  <td align="center">72.0</td>
  <td align="center">61.6</td>
- <td align="center">71.0</td>
- <td align="center">68.1</td>
- <td align="center">XXX</td>
+ <td align="center">67.5</td>
+ <td align="center">68.5</td>
+ <td align="center">-</td>
  </tr>
  </tbody>
  </table>