Update README.md
README.md
CHANGED
@@ -217,4 +217,26 @@ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
 print(repr(text_outputs))
 ```
 
+## Citation
+
+If you find our work helpful, we would appreciate it if you cite the following papers.
+
+```bibtex
+@misc{feng2025vica2,
+title={Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts},
+author={Feng, Qi},
+publisher={arXiv:2505.12363},
+year={2025},
+}
+```
+
+```bibtex
+@misc{feng2025vica,
+title={Visuospatial Cognitive Assistant},
+author={Feng, Qi},
+publisher={arXiv:2505.12312},
+year={2025},
+}
+```
+
 ---