COSMIC: Generalized Refusal Direction Identification in LLM Activations Paper • 2506.00085 • Published May 30
RepIt: Representing Isolated Targets to Steer Language Models Paper • 2509.13281 • Published Sep 16
SteeringControl: Holistic Evaluation of Alignment Steering in LLMs Paper • 2509.13450 • Published Sep 16
JudgeBench: A Benchmark for Evaluating LLM-based Judges Paper • 2410.12784 • Published Oct 16, 2024