Improve model card: Add metadata, paper/code links, and comprehensive details
This PR significantly enhances the model card for `SeeNav-Agent` by:
- Adding `library_name: transformers` metadata, which enables the "How to use" widget on the model page, given the model's compatibility with the Transformers library.
- Adding `pipeline_tag: image-text-to-text` metadata, improving the model's discoverability for relevant tasks on the Hugging Face Hub.
- Updating the paper link to the official Hugging Face paper page: [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631).
- Including a direct link to the official [GitHub repository](https://github.com/WzcTHU/SeeNav-Agent).
- Providing a comprehensive content section with an overview, highlights, detailed summary, results, checkpoint information, and usage instructions (evaluation and training scripts) directly from the GitHub README.
- Ensuring proper citation information is included.
These changes make the model card more informative and user-friendly for the community.

---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

This repository contains the official implementation for the paper [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631).

<div align="center">
<a href="https://github.com/WzcTHU/SeeNav-Agent"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code"></a>
<a href="https://huggingface.co/wangzc9865/SeeNav-Agent"><img src="https://img.shields.io/badge/🤗-HuggingFace-blue" alt="Hugging Face Model"></a>
</div>

## Overview

We propose **SeeNav-Agent**, a novel LVLM-based embodied navigation framework that combines a zero-shot dual-view visual prompt technique on the input side with an efficient reinforcement fine-tuning (RFT) algorithm, SRGPO, for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors; SeeNav-Agent mitigates these through the two techniques above.

## Highlights

* **Zero-shot visual prompt:** improves performance with no extra training.
* **Efficient step-level advantage calculation:** step-level groups are randomly sampled from the entire batch.
* **Significant gains:** +20.0pp (GPT-4.1 + VP) and +5.6pp (Qwen2.5-VL-3B + VP + SRGPO) on EmbodiedBench-Navigation.

## Summary

<div style="text-align: center;">
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/framework.png" width="100%">
</div>

* **Dual-view visual prompt:** we apply visual prompt techniques directly to the input dual-view image to reduce visual hallucination.
* **Step Reward Group Policy Optimization (SRGPO):** by defining a state-independent, verifiable process reward function, we enable efficient step-level random grouping and advantage estimation (see the sketch after this list).
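
To make the step-level grouping concrete, here is a rough sketch of the idea in the SRGPO bullet above. This is our illustration, not the authors' implementation: the function name and group size are invented, and the within-group standardization follows the common GRPO-style recipe.

```python
import random
import numpy as np

def srgpo_advantages(step_rewards, group_size=8, eps=1e-6):
    """Illustrative sketch (hypothetical API): shuffle every step in the
    batch into random groups, then standardize each step's verifiable
    process reward within its group to get a step-level advantage."""
    idx = list(range(len(step_rewards)))
    random.shuffle(idx)  # groups are sampled from the entire batch, not per trajectory
    adv = np.zeros(len(step_rewards))
    for start in range(0, len(idx), group_size):
        group = idx[start:start + group_size]
        rewards = np.array([step_rewards[i] for i in group], dtype=np.float64)
        # GRPO-style normalization: rewards centered and scaled within the group
        adv[group] = (rewards - rewards.mean()) / (rewards.std() + eps)
    return adv
```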

## Results on EmbodiedBench-Navigation

### Main Results
<div align="center">
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/results.png" width="50%"/>
</div>

### Training Curves for RFT
<div align="center">
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/training_curves.png" width="50%"/>
</div>

### Testing Curves for OOD Scenes
<div align="center">
<img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/ood_val.png" width="50%"/>
</div>

### Checkpoint

| base model | env | 🤗 link |
| :--: | :--: | :--: |
| Qwen2.5-VL-3B-Instruct-SRGPO | EmbodiedBench-Nav | [Qwen2.5-VL-3B-Instruct-SRGPO](https://huggingface.co/wangzc9865/SeeNav-Agent) |
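
Because the card sets `library_name: transformers`, the checkpoint should be loadable with the standard Qwen2.5-VL classes. Below is a minimal inference sketch, assuming the checkpoint keeps the base model's architecture and chat template; the image path and prompt are placeholders:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "wangzc9865/SeeNav-Agent", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("wangzc9865/SeeNav-Agent")

# Placeholder observation and instruction; replace with your own inputs.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/observation.png"},
        {"type": "text", "text": "What is the next navigation action?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```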

## Usage

### Setup

1. Set up a separate evaluation environment following [EmbodiedBench-Nav](https://github.com/EmbodiedBench/EmbodiedBench) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct.

2. Set up a separate training environment following [verl-agent](https://github.com/langfengQ/verl-agent) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct.

### Evaluation

Use the following command to evaluate the model on EmbodiedBench:

```bash
conda activate <your_env_for_eval>
cd SeeNav
python testEBNav.py
```

Hint: you first need to set your endpoint, API key, and api_version in [`SeeNav/planner/models/remote_model.py`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/SeeNav/planner/models/remote_model.py).
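
The endpoint / API key / `api_version` triple matches an Azure-OpenAI-style client, so the configuration presumably looks something like the sketch below; the actual variable names in `remote_model.py` may differ.

```python
# Illustrative only — check remote_model.py for the real configuration fields.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # your endpoint
    api_key="<your-api-key>",                                    # your API key
    api_version="2024-02-01",                                    # your api_version
)
```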

### Training

[`verl-agent/examples/srgpo_trainer`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer) contains example scripts for SRGPO-based training on EmbodiedBench-Navigation.

1. Modify [`run_ebnav.sh`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer/run_ebnav.sh) according to your setup.

2. Run the following command:

```bash
conda activate <your_env_for_train>
cd verl-agent
bash examples/srgpo_trainer/run_ebnav.sh
```

## Citation

If you find this work helpful in your research, please consider citing:

```bibtex
@article{wang2025seenav,
  title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization},
  author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye},
  journal={arXiv preprint arXiv:2512.02631},
  year={2025}
}
```