|
|
--- |
|
|
datasets: |
|
|
- GetSoloTech/FoodStack |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- lerobot/smolvla_base |
|
|
library_name: transformers |
|
|
tags: |
|
|
- Robotics |
|
|
- Lerobot |
|
|
- Food |
|
|
- PickPlace |
|
|
- VLA |
|
|
- SmolVLA |
|
|
- PhysicalAI |
|
|
--- |
|
|
|
|
|
### SmolVLA Fine-Tuned for Food Stacking |
|
|
|
|
|
**Summary**: This model fine-tunes `lerobot/smolvla_base` for stacking food objects (e.g., burgers, sandwiches). It was trained on the `GetSoloTech/FoodStack` dataset using the LeRobot framework. |
|
|
|
|
|
### Model details |
|
|
- **Base model**: `lerobot/smolvla_base` |
|
|
- **Task**: Vision-Language-Action control for manipulation (stacking) |
|
|
- **Domain**: Food item stacking (burger, sandwich, etc.) |
|
|
- **Params**: ~450M (SmolVLA) |
|
|
- **Library**: LeRobot (`lerobot`) |
|
|
|
|
|
### Quick start |
|
|
Install LeRobot with SmolVLA extras: |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/huggingface/lerobot.git |
|
|
cd lerobot |
|
|
pip install -e ".[smolvla]" |
|
|
``` |
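
As a quick sanity check that the SmolVLA extras resolved, you can try importing the policy class (using the same import path as the example below; this path may shift between LeRobot versions):

```bash
python -c "from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy; print('smolvla ok')"
```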
|
|
|
|
|
Load the policy from this repo and run inference: |
|
|
|
|
|
```python
import torch

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Replace with your actual model ID on the Hub
model_id = "GetSoloTech/SmolVLA-FoodStack"

policy = SmolVLAPolicy.from_pretrained(model_id)
policy.eval()

# Build an observation batch. These keys are placeholders: the exact camera
# and state keys (and tensor shapes) must match the features this policy was
# trained with -- check the policy/dataset config for your setup.
observation = {
    "observation.images.top": ...,  # e.g. (1, C, H, W) float image tensor
    "observation.state": ...,       # e.g. (1, state_dim) proprioceptive state
    # Language instruction; some LeRobot versions expect a list of strings
    "task": "Stack the burger: bun, patty, cheese, lettuce, bun.",
}

# Reset the policy at the start of each episode, then query one action per
# control step inside your loop.
policy.reset()
with torch.no_grad():
    action = policy.select_action(observation)

# Send the action to your robot controller
# send_action_to_robot(action)
```
|
|
|
|
|
For end-to-end examples (policy loops, camera/robot IO), see the LeRobot docs and examples. |
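
As a rough sketch of what such a loop can look like (assuming the `policy` loaded above; `get_camera_frame`, `get_robot_state`, and `send_action_to_robot` are hypothetical stand-ins for your own camera/robot IO):

```python
import torch

# Hypothetical IO helpers for illustration -- replace with your own
# camera driver and robot controller interfaces.
def get_camera_frame(): ...
def get_robot_state(): ...
def send_action_to_robot(action): ...

policy.reset()  # clear the policy's internal state at episode start
for step in range(500):  # max control steps per episode (illustrative)
    observation = {
        "observation.images.top": get_camera_frame(),
        "observation.state": get_robot_state(),
        "task": "Stack the burger: bun, patty, cheese, lettuce, bun.",
    }
    with torch.no_grad():
        action = policy.select_action(observation)
    send_action_to_robot(action)
```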
|
|
|
|
|
|
|
|
Notes: |
|
|
- Tune batch size/steps and augmentation to your hardware and dataset split; a hedged training-command sketch follows below. |
|
|
- Ensure your observation preprocessing at train-time matches inference. |
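
For reference, a fine-tuning run of this kind is typically launched with LeRobot's training script along these lines. This is a sketch: flag names vary between LeRobot versions, and the batch size and step count here are illustrative, not the settings used for this checkpoint.

```bash
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=GetSoloTech/FoodStack \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/smolvla_foodstack
```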
|
|
|
|
|
|
|
|
### Limitations |
|
|
- Specializes in food stacking; may not generalize to unseen objects/layouts. |
|
|
- Sensitive to perception domain shift (lighting, textures, camera intrinsics). |
|
|
- Requires correct observation normalization consistent with training. |
|
|
|
|
|
### Dataset |
|
|
- **Training data**: `GetSoloTech/FoodStack` |
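
To inspect the training data locally, it can be loaded with LeRobot's dataset class (a minimal sketch; the import path may differ across LeRobot versions):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Downloads/streams the dataset from the Hub on first use
dataset = LeRobotDataset("GetSoloTech/FoodStack")

print(len(dataset))       # number of frames
print(dataset[0].keys())  # per-frame features, e.g. camera images and state
```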
|
|
|
|
|
### Resources and references |
|
|
- SmolVLA base: `https://huggingface.co/lerobot/smolvla_base` |
|
|
- SmolVLA overview: `https://smolvla.net/index_en.html` |
|
|
- LeRobot: `https://github.com/huggingface/lerobot` |