saurabh5 committed
Commit 5608d10 · verified · 1 parent: e0f48d4

Update README.md

Files changed (1)
  1. README.md +19 -10
README.md CHANGED
@@ -127,16 +127,25 @@ Moo Moo the cow would certinaly win.
 
 ## Evaluation
 
-| **Model** | **Math** | | | **Reasoning** | | | **Coding** | | | **IF** | | **QA** | | |
-|----------|----------|----------|----------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
-| | AIME '24 | AIME '25 | OMEGA | BBH | Zebra Logic | AGI Eval | Human Eval+ | MBPP+ | LCB v3 | IFEval | IFBench | MMLU | PopQA | GPQA |
-| **Olmo-3-7B-Think** | 72.5 | 64.8 | **44.1** | 86.6 | 65.0 | 81.5 | **90.3** | 64.6 | 75.1 | 86.0 | 41.6 | 79.0 | 24.0 | 49.0 |
-| **Nemotron–Nano–9B–v2** | 72.1 | 58.9 | 42.4 | 86.2 | 60.8 | 83.1 | 89.7 | 66.1 | 83.4 | 86.0 | 34.6 | 84.3 | 17.9 | 56.2 |
-| **OpenThinker3–7B** | 67.7 | 57.2 | 38.4 | 77.1 | 34.9 | 78.6 | 87.4 | 61.4 | 68.2 | 51.7 | 23.0 | 77.4 | 18.0 | 48.0 |
-| **DeepSeek–R1–Qwen–7B** | 54.9 | 40.2 | 28.5 | 73.5 | 26.1 | 69.5 | 83.0 | 63.5 | 58.8 | 59.6 | 16.7 | 67.9 | 12.8 | 53.2 |
-| **Qwen 3 8B (w/ reasoning)** | **74.0** | **67.8** | 43.4 | 84.4 | 85.2 | 87.0 | 80.2 | **69.1** | **86.2** | **87.4** | 37.1 | 85.4 | 24.3 | 57.7 |
-| **Qwen 3 VL 8B Thinker** | 70.9 | 61.5 | 37.9 | **86.8** | **91.2** | **90.1** | 83.7 | 63.0 | 85.5 | 85.5 | 40.4 | **86.5** | **29.3** | **62.4** |
-| **OpenReasoning Nemo 7B** | 77.0 | 73.1 | 43.2 | 81.3 | 22.4 | 81.4 | 89.7 | 61.2 | 82.3 | 42.5 | — | 80.7 | 14.5 | 60.8 |
+| Skill | Benchmark | Olmo 3 7B Think SFT | Olmo 3 7B Think DPO | Olmo 3 7B Think | OpenThinker3-7B | Nemotron-Nano-9B-v2 | DeepSeek-R1-Distill-Qwen-7B | Qwen 3 8B (reasoning) | Qwen 3 VL 8B Thinker | OpenReasoning Nemotron 7B |
+|-------|-----------|------------------|------------------|--------------|------------------|-----------------------|------------------------------|-------------------------|---------------------------|-----------------------------|
+| **Math** | MATH | 94.4 | 92.4 | 95.1 | 94.5 | 94.4 | 87.9 | 95.1 | 95.2 | 94.6 |
+| | AIME 2024 | 69.6 | 74.6 | 71.6 | 67.7 | 72.1 | 54.9 | 74.0 | 70.9 | 77.0 |
+| | AIME 2025 | 57.6 | 62.7 | 64.6 | 57.2 | 58.9 | 40.2 | 67.8 | 61.5 | 73.1 |
+| | OMEGA | 45.0 | 40.5 | 37.8 | 38.4 | 42.4 | 28.5 | 43.4 | 38.1 | 43.2 |
+| **Reasoning** | BBH | 84.1 | 83.7 | 86.6 | 77.1 | 86.2 | 73.5 | 84.4 | 86.8 | 81.3 |
+| | ZebraLogic | 57.9 | 60.6 | 66.5 | 34.9 | 60.8 | 26.1 | 85.2 | 91.2 | 22.4 |
+| | AGI Eval | 77.2 | 79.1 | 81.5 | 78.6 | 83.1 | 69.5 | 87.0 | 90.1 | 81.4 |
+| **Coding** | HumanEval+ | 88.2 | 91.4 | 89.9 | 87.4 | 89.7 | 83.0 | 80.2 | 83.7 | 89.7 |
+| | MBPP+ | 63.2 | 63.0 | 64.7 | 61.4 | 66.1 | 63.5 | 69.1 | 63.0 | 61.2 |
+| | LCB v3 | 67.8 | 75.1 | 75.2 | 68.0 | 83.4 | 58.8 | 86.2 | 85.5 | 82.3 |
+| **IF** | IFEval | 77.9 | 75.9 | 88.2 | 51.7 | 86.0 | 59.6 | 87.4 | 85.5 | 42.5 |
+| | IFBench | 30.0 | 28.3 | 41.6 | 23.0 | 34.6 | 16.7 | 37.1 | 40.4 | 23.4 |
+| **Knowledge** | MMLU | 74.9 | 74.8 | 77.8 | 77.4 | 84.3 | 67.9 | 85.4 | 86.5 | 80.7 |
+| **QA** | PopQA | 20.8 | 24.7 | 23.7 | 18.0 | 17.9 | 12.8 | 24.3 | 29.3 | 14.5 |
+| | GPQA | 45.8 | 48.6 | 46.2 | 47.6 | 56.2 | 54.4 | 57.7 | 61.5 | 56.6 |
+| **Chat** | AE 2 | 43.9 | 50.6 | 52.1 | 24.0 | 58.0 | 7.7 | 60.5 | 73.5 | 8.6 |
+| **Safety** | | 65.8 | 67.7 | 70.7 | 31.3 | 72.1 | 54.0 | 68.3 | 82.9 | 30.3 |
 
 ## Model Details
 