---
license: apache-2.0
tags:
- vision-language-action
- mobile-robot
- kosmos-2b
- robotics
- obstacle-avoidance
datasets:
- mobile-vla-dataset
language:
- en
- ko
metrics:
- mae
- r2_score
library_name: transformers
pipeline_tag: robotics
---

# 🚀 Mobile VLA: Vision-Language-Action Model for Mobile Robots

## 📋 Model Description

Mobile VLA is a Vision-Language-Action model for mobile robots built on Kosmos-2B. It predicts continuous 3D actions in obstacle-avoidance scenarios.

### 🎯 Key Features
- **Vision-Language-Action**: predicts robot actions from an image and a text instruction
- **Continuous 3D control**: continuous action space of the form `[linear_x, linear_y, angular_z]`
- **Obstacle avoidance**: learns left/right avoidance strategies in 1-box and 2-box scenarios
- **Real-time operation**: fast inference through efficient vision-only processing

### 🔧 Technical Specifications
- **Backbone model**: microsoft/kosmos-2-patch14-224
- **Input**: RGB image (224x224) + text instruction
- **Output**: 3D continuous action vector
- **Training**: Huber-loss regression
- **Data**: 72 real-robot episodes

## 📊 Performance

### Overall
- **Overall MAE**: 0.285
- **Threshold accuracy (0.1)**: 37.5%

### Per-Action Performance

| Action | MAE | R² Score | Notes |
|--------|-----|----------|-------|
| linear_x | 0.243 | 0.354 | forward/backward (good) |
| linear_y | 0.550 | 0.293 | lateral motion (fair) |
| angular_z | 0.062 | 0.000 | rotation (low) |

### Per-Scenario Performance

| Scenario | MAE | Grade | Notes |
|----------|-----|-------|-------|
| 1box_right_vertical | 0.217 | B+ | good |
| 1box_left_horizontal | 0.303 | B | fair |
| 2box_left_vertical | 0.322 | B | fair |
| 1box_left_vertical | 0.337 | B- | average |

## 🚀 Usage

### Installation

```bash
pip install transformers torch pillow numpy
```

### Basic Usage

```python
from mobile_vla import MobileVLAModel
from PIL import Image
import torch

# Load the model
model = MobileVLAModel.from_pretrained("minuum/mobile-vla")

# Prepare an image and a task instruction
image = Image.open("robot_camera.jpg")
task = "Navigate around obstacles to track the target cup"

# Predict
with torch.no_grad():
    actions = model.predict(image, task)
    print(f"Predicted actions: {actions}")  # Output: [linear_x, linear_y, angular_z]
```

### Advanced Usage

```python
# Batch processing
images = [Image.open(f"frame_{i}.jpg") for i in range(8)]
actions = model.predict_sequence(images, task)

# Real-time control
for frame in camera_stream:
    action = model.predict(frame, task)
    robot.execute(action)
```

## 🏗️ Model Architecture

```
[RGB Images] → [Kosmos-2B Vision] → [Action Head] → [3D Actions]
     ↓                ↓                   ↓              ↓
  224x224       Image Features       Regression     [x, y, θ]
```

### Core Components
1. **Kosmos-2B Vision Model**: image feature extraction
2. **Action Head**: 3D regression head (512 → 3 * chunk_size)
3. **Window/Chunk**: 8-frame observation → 2-frame prediction

## 📈 Comparison with RoboVLMs

| Aspect | RoboVLMs | Mobile VLA |
|--------|----------|------------|
| **Data requirement** | millions of demos | 72 episodes |
| **Action space** | 7-DOF discrete | 3D continuous |
| **Inference speed** | complex | fast |
| **Specialization** | general manipulation | mobile robots |
| **Evaluation** | success rate | multi-dimensional regression metrics |

## 🎯 Key Improvements
- **Data efficiency**: practical performance with roughly 1000x less data
- **Real-time performance**: fast inference through vision-only processing
- **Continuous control**: precise 3D action prediction
- **Scenario specialization**: optimized specifically for obstacle avoidance

## 📚 Training Data
- **Episodes**: 72
- **Scenarios**: 1box/2box × left/right × vertical/horizontal
- **Actions**: continuous `[linear_x, linear_y, angular_z]` values
- **Images**: real robot camera RGB (224x224)

## 🔬 Research Background

This model keeps the Window/Chunk mechanism of RoboVLMs while adding mobile-robot-specific capabilities:

1. **Window/Chunk retained**: 8-frame observation → 2-frame prediction structure
2. **Kosmos-2B integration**: builds on a vision-language backbone
3. **Continuous control**: switches from a discrete to a continuous action space
4. **Real robot data**: data collected on a real robot, stored in HDF5 format

## 📄 Citation

```bibtex
@misc{mobile_vla_2024,
  title={Mobile VLA: Vision-Language-Action Model for Mobile Robot Navigation},
  author={Mobile VLA Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/minuum/mobile-vla}
}
```

## 🤝 Contributing

This model was developed on top of the RoboVLMs framework and is released openly to support the mobile-robot community.

## 📞 Contact
- **Issues**: [GitHub Issues](https://github.com/minuum/vla/issues)
- **Discussions**: [HuggingFace Discussions](https://huggingface.co/minuum/mobile-vla/discussions)

---

*Generated on 2025-08-21*