Skywork/Skywork-R1V-38B

Image-Text-to-Text


OrlandoHugBot committed on Mar 17, 2025

Commit 7bc35c2 · verified · 1 Parent(s): 9a33d91

Update README.md

Files changed (1)

README.md +4 -4


@@ -9,7 +9,7 @@
 ## 1. Introduction
-We introduce Skywork-R1V, a multimodal reasoning model that extends the R1-series text models to visual modalities through a near-lossless transfer method. Using a lightweight visual projector, Skywork-R1V enables seamless multimodal adaptation without requiring retraining of either the base language model or vision encoder. To enhance visual-text alignment, we developed a hybrid optimization strategy combining Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving cross-modal integration. Additionally, we created an adaptive-length Chain-of-Thought distillation approach for generating reasoning data, which dynamically optimizes reasoning chain lengths to improve inference efficiency and prevent overthinking. The model achieves state-of-the-art performance on key multimodal reasoning benchmarks, scoring 68.1 on MMMU and 71.0 on MathVista, comparable to leading closed-source models like Gemini 2.0 and Kimi-k1.5. It also maintains strong textual reasoning capabilities, achieving impressive scores of 72.6 on AIME and 94.3 on MATH500.
+We introduce Skywork-R1V, a multimodal reasoning model that extends the R1-series text models to visual modalities through a near-lossless transfer method. Using a lightweight visual projector, Skywork-R1V enables seamless multimodal adaptation without requiring retraining of either the base language model or vision encoder. To enhance visual-text alignment, we developed a hybrid optimization strategy combining Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving cross-modal integration. Additionally, we created an adaptive-length Chain-of-Thought distillation approach for generating reasoning data, which dynamically optimizes reasoning chain lengths to improve inference efficiency and prevent overthinking. The model achieves state-of-the-art performance on key multimodal reasoning benchmarks, scoring 69 on MMMU and 67.5 on MathVista, comparable to leading closed-source models like Gemini 2.0 and Kimi-k1.5. It also maintains strong textual reasoning capabilities, achieving impressive scores of 72.6 on AIME and 94.3 on MATH500.
 ## 2. Model Summary
@@ -177,9 +177,9 @@ The model follows a connection pattern of Vision Encoder → MLP Adapter → Lan
       <td align="center">94.0</td>
       <td align="center">72.0</td>
       <td align="center">61.6</td>
-      <td align="center">71.0</td>
-      <td align="center">68.1</td>
-      <td align="center">XXX</td>
+      <td align="center">67.5</td>
+      <td align="center">68.5</td>
+      <td align="center">-</td>
     </tr>
   </tbody>
 </table>