Request for Benchmark Evaluation
Please enter the name of the model you would like us to evaluate.
LiquidAI/LFM2.5-1.2B-Thinking
LiquidAI/LFM2.5-1.2B-Instruct
LiquidAI/LFM2.5-1.2B-JP
LiquidAI/LFM2.5-1.2B-Base
And
Qwen/Qwen3.5-27B
Qwen/Qwen3.5-9B
Qwen/Qwen3.5-9B-Base
Qwen/Qwen3.5-4B
Qwen/Qwen3.5-4B-Base
Qwen/Qwen3.5-2B
Qwen/Qwen3.5-2B-Base
Qwen/Qwen3.5-0.8B
Qwen/Qwen3.5-0.8B-Base
Thanks! Or would it be possible to release the eval code so we can run the experiments ourselves?
Can you please evaluate the following model:
Alibaba-Apsara/DASD-4B-Thinking
Thank you both for the suggestions!
We've added all requested models to our evaluation queue:
- LiquidAI/LFM2.5-1.2B (Instruct & Thinking)
- Qwen/Qwen3.5 series (0.8B, 2B, 4B)
- Alibaba-Apsara/DASD-4B-Thinking
Priority models will be evaluated and added to the leaderboard in upcoming updates.
On eval code: we're working on a public evaluation pipeline that cleanly separates the grading logic from the answer keys. We'll share it when it's ready.
Hi! Can you evaluate the GRM family?
OrionLLM/GRM-7b
OrionLLM/GRM-1.5b