|
|
--- |
|
|
datasets: |
|
|
- GetSoloTech/FoodStack |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- lerobot/smolvla_base |
|
|
library_name: transformers |
|
|
tags: |
|
|
- Robotics |
|
|
- Lerobot |
|
|
- Food |
|
|
- PickPlace |
|
|
- VLA |
|
|
- SmolVLA |
|
|
- PhysicalAI |
|
|
--- |
|
|
|
|
|
### SmolVLA Fine-Tuned for Food Stacking |
|
|
|
|
|
**Summary**: This model fine-tunes `lerobot/smolvla_base` for stacking food objects (e.g., burgers, sandwiches). It was trained on the `GetSoloTech/FoodStack` dataset using the LeRobot framework. |
|
|
|
|
|
### Model details |
|
|
- **Base model**: `lerobot/smolvla_base` |
|
|
- **Task**: Vision-Language-Action control for manipulation (stacking) |
|
|
- **Domain**: Food item stacking (burger, sandwich, etc.) |
|
|
- **Params**: ~450M (SmolVLA) |
|
|
- **Library**: LeRobot (`lerobot`) |
|
|
|
|
|
### Quick start |
|
|
Install LeRobot with SmolVLA extras: |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/huggingface/lerobot.git |
|
|
cd lerobot |
|
|
pip install -e ".[smolvla]" |
|
|
``` |
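
As a quick sanity check that the SmolVLA extras resolved, you can try importing the policy class (using the same import path as the example below; this path may shift between LeRobot versions):

```bash
python -c "from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy; print('smolvla ok')"
```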
|
|
|
|
|
Load the policy from this repo and run inference: |
|
|
|
|
|
```python
import torch

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Replace with your actual model ID on the Hub
model_id = "GetSoloTech/SmolVLA-FoodStack"

policy = SmolVLAPolicy.from_pretrained(model_id)
policy.eval()

# Build an observation batch. These keys are placeholders: the exact camera
# and state keys (and tensor shapes) must match the features this policy was
# trained with -- check the policy/dataset config for your setup.
observation = {
    "observation.images.top": ...,  # e.g. (1, C, H, W) float image tensor
    "observation.state": ...,       # e.g. (1, state_dim) proprioceptive state
    # Language instruction; some LeRobot versions expect a list of strings
    "task": "Stack the burger: bun, patty, cheese, lettuce, bun.",
}

# Reset the policy at the start of each episode, then query one action per
# control step inside your loop.
policy.reset()
with torch.no_grad():
    action = policy.select_action(observation)

# Send the action to your robot controller
# send_action_to_robot(action)
```
|
|
|
|
|
For end-to-end examples (policy loops, camera/robot IO), see the LeRobot docs and examples. |
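
As a rough sketch of what such a loop can look like (assuming the `policy` loaded above; `get_camera_frame`, `get_robot_state`, and `send_action_to_robot` are hypothetical stand-ins for your own camera/robot IO):

```python
import torch

# Hypothetical IO helpers for illustration -- replace with your own
# camera driver and robot controller interfaces.
def get_camera_frame(): ...
def get_robot_state(): ...
def send_action_to_robot(action): ...

policy.reset()  # clear the policy's internal state at episode start
for step in range(500):  # max control steps per episode (illustrative)
    observation = {
        "observation.images.top": get_camera_frame(),
        "observation.state": get_robot_state(),
        "task": "Stack the burger: bun, patty, cheese, lettuce, bun.",
    }
    with torch.no_grad():
        action = policy.select_action(observation)
    send_action_to_robot(action)
```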
|
|
|
|
|
|
|
|
Notes: |
|
|
- Tune batch size/steps and augmentation to your hardware and dataset split; a hedged training-command sketch follows below. |
|
|
- Ensure your observation preprocessing at train-time matches inference. |
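
For reference, a fine-tuning run of this kind is typically launched with LeRobot's training script along these lines. This is a sketch: flag names vary between LeRobot versions, and the batch size and step count here are illustrative, not the settings used for this checkpoint.

```bash
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=GetSoloTech/FoodStack \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/smolvla_foodstack
```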
|
|
|
|
|
|
|
|
### Limitations |
|
|
- Specializes in food stacking; may not generalize to unseen objects/layouts. |
|
|
- Sensitive to perception domain shift (lighting, textures, camera intrinsics). |
|
|
- Requires correct observation normalization consistent with training. |
|
|
|
|
|
### Dataset |
|
|
- **Training data**: `GetSoloTech/FoodStack` |
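
To inspect the training data locally, it can be loaded with LeRobot's dataset class (a minimal sketch; the import path may differ across LeRobot versions):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Downloads/streams the dataset from the Hub on first use
dataset = LeRobotDataset("GetSoloTech/FoodStack")

print(len(dataset))       # number of frames
print(dataset[0].keys())  # per-frame features, e.g. camera images and state
```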
|
|
|
|
|
### Resources and references |
|
|
- SmolVLA base: `https://huggingface.co/lerobot/smolvla_base` |
|
|
- SmolVLA overview: `https://smolvla.net/index_en.html` |
|
|
- LeRobot: `https://github.com/huggingface/lerobot` |