Benchmark
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Paper
• 2505.13227
• Published
• 45
facebook/natural_reasoning
Viewer
• Updated
• 1.15M rows • 1.52k downloads
• 554 likes
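The entry above names a dataset hosted on the Hugging Face Hub. A minimal loading sketch with the `datasets` library (the `train` split name is an assumption; check the dataset card if it differs):

```python
from datasets import load_dataset

# Load facebook/natural_reasoning from the Hugging Face Hub.
# Assumption: the dataset exposes a standard "train" split.
ds = load_dataset("facebook/natural_reasoning", split="train")

print(ds)     # features and row count
print(ds[0])  # first example
```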
Search Arena: Analyzing Search-Augmented LLMs
Paper
• 2506.05334
• Published
• 18
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Paper
• 2506.07977
• Published
• 41
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Paper
• 2506.11928
• Published
• 25
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Paper
• 2506.15569
• Published
• 12
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Paper
• 2506.14028
• Published
• 93
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper
• 2506.11763
• Published
• 74
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement
Learning
Paper
• 2506.09049
• Published
• 37
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Paper
• 2507.02694
• Published
• 19
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Paper
• 2507.10541
• Published
• 30
HuggingFaceTB/SmolLM3-3B-Base
Text Generation
• Updated
• 91.7k downloads
• 150 likes
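For the model card above, a minimal text-generation sketch with `transformers` (assuming a release recent enough to include the SmolLM3 architecture; the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B-Base"  # base (non-instruct) checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Greedy continuation of a short prompt.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```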
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Paper
• 2507.08616
• Published
• 15
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
Paper
• 2507.13302
• Published
• 5
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Paper
• 2507.13300
• Published
• 20
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
Paper
• 2507.11527
• Published
• 35
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Paper
• 2507.10787
• Published
• 13
WideSearch: Benchmarking Agentic Broad Info-Seeking
Paper
• 2508.07999
• Published
• 111
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
• 2508.13186
• Published
• 19
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
Paper
• 2508.16402
• Published
• 14
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Paper
• 2508.14704
• Published
• 43
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Paper
• 2508.14444
• Published
• 46
UQ: Assessing Language Models on Unsolved Questions
Paper
• 2508.17580
• Published
• 15
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Paper
• 2508.17472
• Published
• 26
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Paper
• 2508.15804
• Published
• 15
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Paper
• 2508.20453
• Published
• 63
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Paper
• 2509.01396
• Published
• 58
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Paper
• 2509.04013
• Published
• 4
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Paper
• 2509.07968
• Published
• 14
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Paper
• 2509.09614
• Published
• 7
GenExam: A Multidisciplinary Text-to-Image Exam
Paper
• 2509.14232
• Published
• 21
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Paper
• 2501.01290
• Published
• 1
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Paper
• 2509.24002
• Published
• 179
OceanGym: A Benchmark Environment for Underwater Embodied Agents
Paper
• 2509.26536
• Published
• 36
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Paper
• 2510.09507
• Published
• 11
PICABench: How Far Are We from Physically Realistic Image Editing?
Paper
• 2510.17681
• Published
• 65
LiveTradeBench: Seeking Real-World Alpha with Large Language Models
Paper
• 2511.03628
• Published
• 13
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 78
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Paper
• 2511.17729
• Published
• 17
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Paper
• 2511.20561
• Published
• 33
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Paper
• 2511.22173
• Published
• 15
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Paper
• 2512.04324
• Published
• 157
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Paper
• 2512.12730
• Published
• 50
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published
• 47
MMGR: Multi-Modal Generative Reasoning
Paper
• 2512.14691
• Published
• 121
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
Paper
• 2512.19432
• Published
• 13
FrontierCS: Evolving Challenges for Evolving Intelligence
Paper
• 2512.15699
• Published
• 5
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Paper
• 2512.15560
• Published
• 25
Benchmark^2: Systematic Evaluation of LLM Benchmarks
Paper
• 2601.03986
• Published
• 34
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
Paper
• 2602.11964
• Published
• 12
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Paper
• 2602.15112
• Published
• 21
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Paper
• 2602.12670
• Published
• 55
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Paper
• 2410.07095
• Published
• 8
RubricBench: Aligning Model-Generated Rubrics with Human Standards
Paper
• 2603.01562
• Published
• 57
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Paper
• 2603.03823
• Published
• 6
AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
Paper
• 2602.23166
• Published
• 42
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Paper
• 2603.12180
• Published
• 59