nielsr HF Staff committed
Commit 56665ae · verified · 1 Parent(s): 20a757d

Improve model card: Add metadata, paper/code links, and comprehensive details


This PR significantly enhances the model card for `SeeNav-Agent` by:

- Adding `library_name: transformers` metadata, which enables the "How to use" widget on the model page, given the model's compatibility with the Transformers library.
- Adding `pipeline_tag: image-text-to-text` metadata, improving the model's discoverability for relevant tasks on the Hugging Face Hub.
- Updating the paper link to the official Hugging Face paper page: [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631).
- Including a direct link to the official [GitHub repository](https://github.com/WzcTHU/SeeNav-Agent).
- Providing a comprehensive content section with an overview, highlights, detailed summary, results, checkpoint information, and usage instructions (evaluation and training scripts) directly from the GitHub README.
- Ensuring proper citation information is included.

These changes make the model card more informative and user-friendly for the community.

Files changed (1)
  1. README.md +98 -1
README.md CHANGED
@@ -1,5 +1,102 @@
  ---
  base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
- arxiv.org/abs/2512.02631
+
+ # SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
+
+ This repository contains the official implementation for the paper [SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization](https://huggingface.co/papers/2512.02631).
+
+ <div align="center">
+ <a href="https://github.com/WzcTHU/SeeNav-Agent"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code"></a>
+ <a href="https://huggingface.co/wangzc9865/SeeNav-Agent"><img src="https://img.shields.io/badge/🤗&nbsp;-HuggingFace-blue" alt="Hugging Face Model"></a>
+ </div>
+
+ ## Overview
+ We propose **SeeNav-Agent**, a novel LVLM-based embodied navigation framework that combines a zero-shot dual-view visual prompt technique on the input side with an efficient reinforcement fine-tuning (RFT) algorithm, SRGPO, for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors, which SeeNav-Agent aims to mitigate through these two techniques.
+
+ ## 🚀 Highlights
+
+ * 🚫 **Zero-Shot Visual Prompt:** The visual prompt improves performance without any additional training.
+ * 🗲 **Efficient Step-Level Advantage Calculation:** Step-level groups are randomly sampled from the entire batch.
+ * 📈 **Significant Gains:** +20.0pp (GPT4.1+VP) and +5.6pp (Qwen2.5-VL-3B+VP+SRGPO) improvements on EmbodiedBench-Navigation.
+
+ ## 📖 Summary
+ <div style="text-align: center;">
+ <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/framework.png" width="100%">
+ </div>
+
+ * 🎨 **Dual-View Visual Prompt:** We apply visual prompt techniques directly to the input dual-view image to reduce visual hallucination.
+ * 🔍 **Step Reward Group Policy Optimization (SRGPO):** By defining a state-independent, verifiable process reward function, we achieve efficient step-level random grouping and advantage estimation; a conceptual sketch of the grouping idea is shown below.
+
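The step-level grouping described above can be illustrated with a short sketch. This is an editorial illustration, not code from the SeeNav-Agent repository or the paper's exact formulation: the function name, reward values, and group size are made up, and it only shows group-normalized advantages computed over steps that are randomly regrouped across a batch.

```python
# Conceptual sketch only: step rewards from an entire batch are shuffled into
# random groups, and each step's advantage is its reward normalized within its
# group. Names and values are illustrative, not taken from the repository.
import random
import statistics

def step_level_advantages(step_rewards, group_size=8, seed=0):
    indices = list(range(len(step_rewards)))
    random.Random(seed).shuffle(indices)          # random grouping across the batch
    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        rewards = [step_rewards[i] for i in group]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
        for i in group:                           # group-normalized advantage
            advantages[i] = (step_rewards[i] - mean) / std
    return advantages

# Example: eight per-step verifiable process rewards from one training batch.
print(step_level_advantages([1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5], group_size=4))
```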
+ ## 📋 Results on EmbodiedBench-Navigation
+
+ ### 📝 Main Results
+ <div align="center">
+ <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/results.png" width="50%"/>
+ </div>
+
+ ### 🖌️ Training Curves for RFT
+ <div align="center">
+ <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/training_curves.png" width="50%"/>
+ </div>
+
+ ### 🖍️ Testing Curves for OOD-Scenes
+ <div align="center">
+ <img src="https://github.com/WzcTHU/SeeNav-Agent/raw/main/SeeNav/figs/ood_val.png" width="50%"/>
+ </div>
+
+ ### 📦 Checkpoint
+
+ | checkpoint | env | 🤗 link |
+ | :--: | :--: | :--: |
+ | Qwen2.5-VL-3B-Instruct-SRGPO | EmbodiedBench-Nav | [Qwen2.5-VL-3B-Instruct-SRGPO](https://huggingface.co/wangzc9865/SeeNav-Agent) |
+
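Since the card declares `library_name: transformers` and the checkpoint is built on Qwen2.5-VL-3B-Instruct, it should be loadable with the standard Qwen2.5-VL classes in Transformers. The snippet below is a minimal sketch added for illustration, not part of the original README: the repository ID comes from the table above, while the image path and prompt text are placeholders.

```python
# Minimal sketch (not from the original README): load the checkpoint with
# Transformers, assuming it keeps the Qwen2.5-VL architecture and processor.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "wangzc9865/SeeNav-Agent"  # repository from the checkpoint table
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("observation.png")  # placeholder navigation observation
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the next navigation action?"},  # placeholder prompt
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

For the full agent loop (dual-view visual prompt construction and EmbodiedBench interaction), use the evaluation scripts described in the Usage section below rather than this standalone snippet.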
+ ## 🛠️ Usage
+
+ ### Setup
+
+ 1. Set up a separate environment for evaluation according to [EmbodiedBench-Nav](https://github.com/EmbodiedBench/EmbodiedBench) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct.
+
+ 2. Set up a separate training environment according to [verl-agent](https://github.com/langfengQ/verl-agent) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) to support Qwen2.5-VL-3B-Instruct.
+
+ ### Evaluation
+
+ Use the following command to evaluate the model on EmbodiedBench:
+
+ ```bash
+ conda activate <your_env_for_eval>
+ cd SeeNav
+ python testEBNav.py
+ ```
+
+ Hint: first set your endpoint, API key, and api_version in [`SeeNav/planner/models/remote_model.py`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/SeeNav/planner/models/remote_model.py).
+
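The hint above refers to three values. If the remote model is reached through an Azure OpenAI-style endpoint (the mention of `api_version` suggests this, but the file's actual contents were not checked), they would look roughly like the following. Everything in this snippet is an illustrative assumption, not the structure of `remote_model.py`.

```python
# Hypothetical illustration of the three values referenced in the hint above,
# shown with the openai package's Azure client; not the repository's actual code.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # endpoint
    api_key="<your-api-key>",                                    # API key
    api_version="2024-02-01",                                    # api_version
)
```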
+ ### Training
+
+ [`verl-agent/examples/srgpo_trainer`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer) contains example scripts for SRGPO-based training on EmbodiedBench-Navigation.
+
+ 1. Modify [`run_ebnav.sh`](https://github.com/WzcTHU/SeeNav-Agent/blob/main/verl-agent/examples/srgpo_trainer/run_ebnav.sh) according to your setup.
+
+ 2. Run the following command:
+
+ ```bash
+ conda activate <your_env_for_train>
+ cd verl-agent
+ bash examples/srgpo_trainer/run_ebnav.sh
+ ```
+
+ ## 📚 Citation
+
+ If you find this work helpful in your research, please consider citing:
+
+ ```bibtex
+ @article{wang2025seenav,
+   title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization},
+   author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye},
+   journal={arXiv preprint arXiv:2512.02631},
+   year={2025}
+ }
+ ```