# Skywork-R1V

<div align="center">
<img src="logo.jpeg" alt="Introduction Image" width="400" height="400">
<br>
<a href="README.md">English</a> | <a href="https://github.com/SkyworkAI/Skywork-R1V">📂 GitHub</a>
</div>

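## 1. Introduction

We introduce Skywork-R1V, a multimodal reasoning model that extends the R1-series text models to the visual modality through a near-lossless transfer method. Skywork-R1V uses a lightweight visual projector to achieve seamless multimodal adaptation without retraining either the base language model or the vision encoder. To improve vision-text alignment, we developed a hybrid optimization strategy that combines iterative supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves cross-modal integration. In addition, we devised an adaptive-length Chain-of-Thought distillation method for generating reasoning data; it dynamically optimizes the length of the reasoning chain to improve inference efficiency and avoid overthinking. The model performs competitively on major multimodal reasoning benchmarks, scoring 68.1 on MMMU and 71.0 on MathVista, comparable to leading closed-source models such as Gemini 2.0 and Kimi-k1.5. It also retains strong text reasoning ability, scoring 72.0 on AIME and 94.0 on MATH500.
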
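## 2. Model Overview

**Architecture:**

Skywork-R1V uses a modular architecture that combines vision and language capabilities:

- **Vision encoder:** A Vision Transformer (ViT) serves as the visual backbone that processes image inputs.
- **Visual projector:** A lightweight MLP adapter that bridges the vision and language components.
- **Language model:** DeepSeek-R1-Distill-Qwen-32B serves as the reasoning-capable language backbone.

The components are connected as vision encoder → MLP adapter → language model, where the MLP adapter aligns the output space of the vision encoder with the input space of the language model. This design transfers text reasoning ability to the multimodal domain efficiently, without large-scale retraining of the vision encoder or the language model.
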
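
The data flow through the adapter can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the two-layer shape, the GELU activation, the toy dimensions, the choice to prepend image tokens to the text embeddings, and the names `MLPProjector` and `build_llm_inputs` are all hypothetical. The README only says that a lightweight MLP maps vision-encoder outputs into the language model's input space.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one token vector (weight is out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU via the Gaussian error function (activation is an assumption)."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias for a toy linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MLPProjector:
    """Hypothetical two-layer adapter: vision_dim -> hidden_dim -> llm_dim."""

    def __init__(self, vision_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, llm_dim, rng)

    def __call__(self, vision_tokens):
        # Each vision token is projected independently into the LLM embedding space.
        return [
            linear(gelu(linear(tok, self.w1, self.b1)), self.w2, self.b2)
            for tok in vision_tokens
        ]


def build_llm_inputs(vision_tokens, text_embeddings, projector):
    """Project image tokens and prepend them to the text embeddings (assumed layout)."""
    return projector(vision_tokens) + text_embeddings


# Toy dimensions, far smaller than any real model.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12
rng = random.Random(1)
vision_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

projector = MLPProjector(VISION_DIM, HIDDEN_DIM, LLM_DIM)
llm_inputs = build_llm_inputs(vision_tokens, text_embeddings, projector)
print(len(llm_inputs), len(llm_inputs[0]))  # 7 12
```

In this sketch only the projector has parameters of its own; the vision encoder and the language model would sit on either side of it, which is what makes an adapter-only transfer possible.
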
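**Key Designs**

- **Advanced multimodal reasoning**
  Handles complex reasoning across text and visual modalities.
- **Iterative training strategy**
  Uses iterative supervised fine-tuning and GRPO to refine model alignment and performance.
- **Adaptive-length Chain-of-Thought**
  Dynamically adjusts reasoning length to improve inference efficiency and accuracy.
- **Scalable performance**
  Performs on par with proprietary models on mathematics, coding, and multimodal tasks.
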
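
As a rough illustration of the "group relative" part of GRPO mentioned above: several responses are sampled for the same prompt, each receives a scalar reward, and each response's advantage is its reward normalized by the group's mean and standard deviation. The sketch below shows only that normalization step. The reward values, the use of the population standard deviation, the `eps` term, and the function name are assumptions for illustration; this README does not describe the reward design or training loop.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption in this sketch
    # eps keeps the division defined when every sample has the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Because the baseline comes from the group itself, no separate value model is needed; when all samples in a group receive the same reward, every advantage is zero and that group contributes no learning signal.
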
## 3. Evaluation

<div align="center">
<img src="eval.jpeg" width="600" height="200" alt="skywork_r1v_eval" />
</div>
<div align="center">
<b>Evaluation results of state-of-the-art LLMs and VLMs</b>
</div>
<table>
<thead>
<tr>
<th></th>
<th align="center"><strong>Vision</strong></th>
<th align="center" colspan="3"><strong>Reasoning</strong></th>
<th align="center" colspan="3"><strong>Vision</strong></th>
</tr>
<tr>
<th></th>
<th></th>
<th align="center"><strong>MATH-500</strong></th>
<th align="center"><strong>AIME 2024</strong></th>
<th align="center"><strong>GPQA</strong></th>
<th align="center"><strong>MathVista(mini)</strong></th>
<th align="center"><strong>MMMU(Val)</strong></th>
<th align="center"><strong>CSVQA</strong></th>
</tr>
<tr>
<th></th>
<th></th>
<th align="center">pass@1</th>
<th align="center">pass@1</th>
<th align="center">pass@1</th>
<th align="center">pass@1</th>
<th align="center">pass@1</th>
<th align="center">pass@1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen2.5-72B-Instruct</td>
<td align="center">❌</td>
<td align="center">82.6</td>
<td align="center">23.3</td>
<td align="center">49.0</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td>Deepseek V3</td>
<td align="center">❌</td>
<td align="center">90.2</td>
<td align="center">39.2</td>
<td align="center">59.1</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td>Deepseek R1</td>
<td align="center">❌</td>
<td align="center">97.3</td>
<td align="center">79.8</td>
<td align="center">71.5</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td>Claude 3.5 Sonnet</td>
<td align="center">✅</td>
<td align="center">78.3</td>
<td align="center">16.0</td>
<td align="center">65.0</td>
<td align="center">67.7</td>
<td align="center">68.3</td>
<td align="center">-</td>
</tr>
<tr>
<td>GPT-4o</td>
<td align="center">✅</td>
<td align="center">76.6</td>
<td align="center">9.3</td>
<td align="center">53.6</td>
<td align="center">63.8</td>
<td align="center">69.1</td>
<td align="center">-</td>
</tr>
<tr>
<td>Kimi k1.5</td>
<td align="center">✅</td>
<td align="center">96.2</td>
<td align="center">77.5</td>
<td align="center">-</td>
<td align="center">74.9</td>
<td align="center">70.0</td>
<td align="center">-</td>
</tr>
<tr>
<td>Qwen2.5-VL-72B-Instruct</td>
<td align="center">✅</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">74.8</td>
<td align="center">70.2</td>
<td align="center">-</td>
</tr>
<tr>
<td>LLaVA-Onevision-72B</td>
<td align="center">✅</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">67.5</td>
<td align="center">56.8</td>
<td align="center">-</td>
</tr>
<tr>
<td>InternVL2-Llama3-76B</td>
<td align="center">✅</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">65.5</td>
<td align="center">58.3</td>
<td align="center">-</td>
</tr>
<tr>
<td>InternVL2.5-78B</td>
<td align="center">✅</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">72.3</td>
<td align="center">70.1</td>
<td align="center">-</td>
</tr>
<tr>
<td>Skywork-R1V-38B</td>
<td align="center">✅</td>
<td align="center">94.0</td>
<td align="center">72.0</td>
<td align="center">61.6</td>
<td align="center">71.0</td>
<td align="center">68.1</td>
<td align="center">XXX</td>
</tr>
</tbody>
</table>
<div align="center">
<b>Comparison with Larger-Scale Open-Source and Closed-Source Models</b>
</div>
<table align="center">
<thead>
<tr>
<th></th>
<th align="center"><strong>Benchmark</strong></th>
<th align="center"><strong>LLM</strong></th>
<th align="center" colspan="4"><strong>VLM</strong></th>
</tr>
<tr>
<th></th>
<th></th>
<th align="center"><strong>QwQ-32B-Preview</strong></th>
<th align="center"><strong>InternVL-2.5-38B</strong></th>
<th align="center"><strong>VILA 1.5-40B</strong></th>
<th align="center"><strong>InternVL2-40B</strong></th>
<th align="center"><strong>Skywork-R1V-38B</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">Reasoning</td>
<td>MATH-500</td>
<td align="center">90.6</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center"><strong>94.0</strong></td>
</tr>
<tr>
<td>AIME 2024</td>
<td align="center">50.0</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center"><strong>72.0</strong></td>
</tr>
<tr>
<td>GPQA</td>
<td align="center">65.2</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">61.6</td>
</tr>
<tr>
<td rowspan="3">Vision</td>
<td>MathVista(mini)</td>
<td align="center">-</td>
<td align="center">71.9</td>
<td align="center">49.5</td>
<td align="center">63.7</td>
<td align="center">71.0</td>
</tr>
<tr>
<td>MMMU(Val)</td>
<td align="center">-</td>
<td align="center">63.9</td>
<td align="center">55.1</td>
<td align="center">55.2</td>
<td align="center">68.1</td>
</tr>
<tr>
<td>CSVQA</td>
<td align="center">-</td>
<td align="center"></td>
<td align="center"></td>
<td align="center"></td>
<td align="center"></td>
</tr>
</tbody>
</table>
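
The tables above report pass@1. For readers unfamiliar with the metric, the sketch below uses the commonly used unbiased pass@k estimator: for a problem with `n` sampled answers of which `c` are correct, pass@k is `1 - C(n-c, k) / C(n, k)`, which reduces to `c / n` when `k = 1`. This README does not state how many samples were drawn per problem or how answers were scored, so the sample counts and function names here are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems, as a percentage. results holds (n, c) pairs."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical benchmark with three problems and four samples each.
results = [(4, 4), (4, 2), (4, 0)]
print(benchmark_pass_at_k(results, k=1))  # 50.0
```

With a single greedy sample per problem (`n = 1`), pass@1 is simply the fraction of problems answered correctly.
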

## 4. Skywork-R1V Family

| Model Name | Vision Encoder | Language Model | HF Link |
| ---------------------- | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ------------ |
| Skywork-R1V-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | [🤗 Link](#) |
| Skywork-R1V-38B-qwq | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | - |

---

## 5. Quick Start

**Example steps:**

1. **Clone the GitHub repository**
   ```bash
   git clone https://github.com/SkyworkAI/Skywork-R1V.git
   ```
2. **Install dependencies**
   ```bash
   cd Skywork-R1V
   pip install -r requirements.txt
   ```
3. **Run the example code**
   ```bash
   python demo.py
   ```

---

## 6. Additional Resources

- [📂 GitHub repository](https://github.com/SkyworkAI/Skywork-R1V)
- [🗨️ Chat Demo](#)
- [🚀 Quick Start](#5-quick-start)
- [📖 Full documentation](#)

---

## 7. Citation

If you use Skywork-R1V in your research, please cite:

```
@article{skywork2025r1v,
  title = {Skywork-R1V: Bridging Vision and Language for Advanced Multimodal Reasoning},
  author = {SkyworkVL Team},
  year = {2025},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  url = {https://github.com/SkyworkAI/Skywork-R1V}
}
```

*This project is released under an open-source license.*