D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models • Paper 2509.17938 • Published Sep 22
Evaluating the Critical Risks of Amazon's Nova Premier under the Frontier Model Safety Framework • Paper 2507.06260 • Published Jul 7
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation • Paper 2505.21784 • Published May 27