Benchmark
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Paper
• 2505.13227
• Published
• 45
facebook/natural_reasoning
Viewer
• Updated
• 1.15M rows • 1.52k downloads
• 554 likes
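The entry above names a dataset hosted on the Hugging Face Hub. A minimal loading sketch with the `datasets` library (the `train` split name is an assumption; check the dataset card if it differs):

```python
from datasets import load_dataset

# Load facebook/natural_reasoning from the Hugging Face Hub.
# Assumption: the dataset exposes a standard "train" split.
ds = load_dataset("facebook/natural_reasoning", split="train")

print(ds)     # features and row count
print(ds[0])  # first example
```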
Search Arena: Analyzing Search-Augmented LLMs
Paper
• 2506.05334
• Published
• 18
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Paper
• 2506.07977
• Published
• 41
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Paper
• 2506.11928
• Published
• 25
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Paper
• 2506.15569
• Published
• 12
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Paper
• 2506.14028
• Published
• 93
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper
• 2506.11763
• Published
• 74
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement
Learning
Paper
• 2506.09049
• Published
• 37
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Paper
• 2507.02694
• Published
• 19
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Paper
• 2507.10541
• Published
• 30
HuggingFaceTB/SmolLM3-3B-Base
Text Generation
• Updated
• 91.7k downloads
• 150 likes
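For the model card above, a minimal text-generation sketch with `transformers` (assuming a release recent enough to include the SmolLM3 architecture; the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B-Base"  # base (non-instruct) checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Greedy continuation of a short prompt.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```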
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Paper
• 2507.08616
• Published
• 15
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
Paper
• 2507.13302
• Published
• 5
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Paper
• 2507.13300
• Published
• 20
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
Paper
• 2507.11527
• Published
• 35
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Paper
• 2507.10787
• Published
• 13
WideSearch: Benchmarking Agentic Broad Info-Seeking
Paper
• 2508.07999
• Published
• 111
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
• 2508.13186
• Published
• 19
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
Paper
• 2508.16402
• Published
• 14
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Paper
• 2508.14704
• Published
• 43
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Paper
• 2508.14444
• Published
• 46
UQ: Assessing Language Models on Unsolved Questions
Paper
• 2508.17580
• Published
• 15
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Paper
• 2508.17472
• Published
• 26
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Paper
• 2508.15804
• Published
• 15
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Paper
• 2508.20453
• Published
• 63
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Paper
• 2509.01396
• Published
• 58
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Paper
• 2509.04013
• Published
• 4
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Paper
• 2509.07968
• Published
• 14
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Paper
• 2509.09614
• Published
• 7
GenExam: A Multidisciplinary Text-to-Image Exam
Paper
• 2509.14232
• Published
• 21
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Paper
• 2501.01290
• Published
• 1
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Paper
• 2509.24002
• Published
• 179
OceanGym: A Benchmark Environment for Underwater Embodied Agents
Paper
• 2509.26536
• Published
• 36
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Paper
• 2510.09507
• Published
• 11
PICABench: How Far Are We from Physically Realistic Image Editing?
Paper
• 2510.17681
• Published
• 65
LiveTradeBench: Seeking Real-World Alpha with Large Language Models
Paper
• 2511.03628
• Published
• 13
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 78
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Paper
• 2511.17729
• Published
• 17
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Paper
• 2511.20561
• Published
• 33
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Paper
• 2511.22173
• Published
• 15
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Paper
• 2512.04324
• Published
• 157
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Paper
• 2512.12730
• Published
• 50
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published
• 47
MMGR: Multi-Modal Generative Reasoning
Paper
• 2512.14691
• Published
• 121
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
Paper
• 2512.19432
• Published
• 13
FrontierCS: Evolving Challenges for Evolving Intelligence
Paper
• 2512.15699
• Published
• 5
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Paper
• 2512.15560
• Published
• 25
Benchmark^2: Systematic Evaluation of LLM Benchmarks
Paper
• 2601.03986
• Published
• 34
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
Paper
• 2602.11964
• Published
• 12
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Paper
• 2602.15112
• Published
• 21
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Paper
• 2602.12670
• Published
• 55
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Paper
• 2410.07095
• Published
• 8
RubricBench: Aligning Model-Generated Rubrics with Human Standards
Paper
• 2603.01562
• Published
• 57
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Paper
• 2603.03823
• Published
• 6
AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
Paper
• 2602.23166
• Published
• 42
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Paper
• 2603.12180
• Published
• 59