The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding Paper • 2512.19693 • Published 4 days ago • 60
Scaling Spatial Intelligence with Multimodal Foundation Models Paper • 2511.13719 • Published Nov 17 • 46
NEO1_0 Collection From Pixels to Words -- Towards Native Vision-Language Primitives at Scale • 7 items • Updated Oct 17 • 4
SenseNova-SI Collection Scaling Spatial Intelligence with Multimodal Foundation Models • 8 items • Updated 19 days ago • 14
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling Paper • 2511.11793 • Published Nov 14 • 164
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning Paper • 2510.11027 • Published Oct 13 • 21
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning Paper • 2510.10518 • Published Oct 12 • 18
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published Oct 16 • 66
CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving Paper • 2510.07944 • Published Oct 9 • 24
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue Paper • 2510.13747 • Published Oct 15 • 29
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints Paper • 2510.08565 • Published Oct 9 • 19
Paper2Video: Automatic Video Generation from Scientific Papers Paper • 2510.05096 • Published Oct 6 • 118
A Survey of Reinforcement Learning for Large Reasoning Models Paper • 2509.08827 • Published Sep 10 • 190
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data Paper • 2509.15221 • Published Sep 18 • 111