Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads Paper • 2511.06209 • Published Nov 9, 2025 • 18
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 174
Can Large Language Models Capture Human Annotator Disagreements? Paper • 2506.19467 • Published Jun 24, 2025 • 18
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Paper • 2502.11962 • Published Feb 17, 2025 • 38
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper • 2506.02096 • Published Jun 2, 2025 • 52
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16, 2025 • 60
Efficient Inference for Large Reasoning Models: A Survey Paper • 2503.23077 • Published Mar 29, 2025 • 46
GuardReasoner: Towards Reasoning-based LLM Safeguards Paper • 2501.18492 • Published Jan 30, 2025 • 88
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures Paper • 2410.13754 • Published Oct 17, 2024 • 75