Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark Paper • 2510.13759 • Published Oct 15 • 9
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark Paper • 2509.24897 • Published Sep 29 • 46
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding Paper • 2507.15028 • Published Jul 20 • 21
Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras Paper • 2507.17664 • Published Jul 23 • 1
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models Paper • 2506.21356 • Published Jun 26 • 22
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning Paper • 2507.05920 • Published Jul 8 • 11
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Paper • 2506.13654 • Published Jun 16 • 43
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17, 2024 • 35
🔥🔥 Introducing Ola! A state-of-the-art omni-modal understanding model with an advanced progressive modality alignment strategy! Ola ranks #1 on the OpenCompass Leaderboard (<10B).
📜 Paper: https://arxiv.org/abs/2502.04328
🛠️ Code: https://github.com/Ola-Omni/Ola
🚀 We have fully released our video & audio training data and intermediate image & video models at THUdyh/ola-67b8220eb93406ec87aeec37. Try building your own powerful omni-modal model with our data and models!
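If you want to grab the released checkpoints locally, a minimal sketch with huggingface_hub looks like the following. Note that the repo id below is an assumption for illustration — the link above points to a collection, so browse it for the exact model name — and inference itself goes through the scripts in the Ola GitHub repo.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id — check the THUdyh collection linked above
# for the exact model name before running this.
repo_id = "THUdyh/Ola-7b"

# Download the full checkpoint to the local HF cache; the returned path
# can then be passed to the inference scripts in the Ola GitHub repo.
local_dir = snapshot_download(repo_id=repo_id)
print(f"Model downloaded to: {local_dir}")
```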
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Paper • 2502.04328 • Published Feb 6 • 30
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Paper • 2501.04003 • Published Jan 7 • 27
🚀🚀🚀 Introducing Insight-V! An early attempt towards o1-like multi-modal reasoning. We offer a structured long-chain visual reasoning data generation pipeline and a multi-agent system to unleash the reasoning potential of MLLMs.
📜 Paper: https://arxiv.org/abs/2411.14432
🛠️ GitHub: https://github.com/dongyh20/Insight-V
💼 Model weights: THUdyh/insight-v-673f5e1dd8ab5f2d8d332035
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper • 2411.14432 • Published Nov 21, 2024 • 25
🔥🔥🔥 Introducing Oryx-1.5! A series of unified MLLMs with much stronger performance across image, video, and 3D benchmarks 😍
🛠️ GitHub: https://github.com/Oryx-mllm/Oryx
🚀 Models: THUdyh/oryx-15-6718c60763845525c2bba71d
🎨 Demo: THUdyh/Oryx
👋 Try the top-tier MLLM yourself!
👀 Stay tuned for more explorations on MLLMs!
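The hosted demo can also be queried programmatically with the gradio_client library. Since each Space exposes its own endpoint names, this sketch only connects and lists the callable API rather than assuming a specific endpoint.

```python
from gradio_client import Client

# Connect to the hosted Oryx demo Space on Hugging Face.
client = Client("THUdyh/Oryx")

# Print the Space's exposed endpoints and their expected inputs/outputs;
# use these names to construct an actual client.predict(...) call.
client.view_api()
```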