lbourdois committed
Commit ee05fcd · verified · 1 Parent(s): 259e5b8

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +175 -161
README.md CHANGED
@@ -1,162 +1,176 @@
- ---
- library_name: transformers
- license: apache-2.0
- base_model: Qwen/Qwen2.5-7B
- datasets:
- - allenai/tulu-3-sft-mixture
- ---
-
- # Teleut 7b
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/UqIi8eztdptvt52Mak_1K.png)
-
- A replication attempt of Tulu 3 on the Qwen 2.5 base models.
-
- ## Evals (so far)
- | | Teleut 7B (measured) | Tülu 3 SFT 8B (reported) | Qwen 2.5 7B Instruct (reported) | Ministral 8B (reported) | Mistral 7B v0.3 (reported)
- |-------------------------|----------------------|--------------------------|---------------------------------|-------------------------|---------------------------
- |BBH (3 shot, CoT) |*64.4%* |**67.9%** |21.7% |56.2% |47.0%<sup>NLL</sup>
- |GSM8K (8 shot, CoT) |78.5% |76.2% |**83.8%** |*80.0%* |xx.x%
- |IFEval (prompt loose) |66.3% |*72.8%* |**74.7%** |56.4% |53.0%
- |MMLU (0 shot, CoT) |*73.2%* |65.9% |**76.6%** |68.5% |30.7%<sup>5-shot</sup>
- |MMLU Pro (0 shot, CoT) |*48.3%* |44.3% |**56.3%**<sup>Unknown</sup> |32.9%<sup>5-shot</sup> |30.7%<sup>5-shot</sup>
- |PopQA (15 shot) |18.9% |**29.3%** |18.1% |*20.2%* |xx.x%
- |TruthfulQA |47.2% |46.8% |**63.1%** |*55.5%* |xx.x%
-
- ## Credits
- Big thanks to Retis Labs for providing my 8xH100 polycule used to train and test this model!
- Another big thanks to AllenAI for publishing the Tülu 3 data and model series (as well as the paper and details on training), and to Alibaba for training the original Qwen 2.5 base model series!
-
- ```
- @article{lambert2024tulu3,
- title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
- author = {
- Nathan Lambert and
- Jacob Morrison and
- Valentina Pyatkin and
- Shengyi Huang and
- Hamish Ivison and
- Faeze Brahman and
- Lester James V. Miranda and
- Alisa Liu and
- Nouha Dziri and
- Shane Lyu and
- Yuling Gu and
- Saumya Malik and
- Victoria Graf and
- Jena D. Hwang and
- Jiangjiang Yang and
- Ronan Le Bras and
- Oyvind Tafjord and
- Chris Wilhelm and
- Luca Soldaini and
- Noah A. Smith and
- Yizhong Wang and
- Pradeep Dasigi and
- Hannaneh Hajishirzi
- },
- year = {2024},
- email = {tulu@allenai.org}
- }
- ```
-
- ## Training procedure
-
- [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 3.5e-06
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 128 (8 micro-batch x 2 accumulation steps x 8 GPUs)
- - total_eval_batch_size: 64
- - optimizer: paged_ademamix_8bit (no additional optimizer arguments)
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 370
- - num_epochs: 1
-
- ### Framework versions
-
- - Transformers 4.46.3
- - Pytorch 2.5.1+cu124
- - Datasets 3.1.0
- - Tokenizers 0.20.3
-
- ### Configuration
- <details><summary>See axolotl config</summary>
-
- axolotl version: `0.5.2`
- ```yaml
- base_model: Qwen/Qwen2.5-7B
-
- plugins:
-   - axolotl.integrations.liger.LigerPlugin
- liger_rope: true
- liger_rms_norm: true
- liger_glu_activation: true
- liger_fused_linear_cross_entropy: true
-
- strict: false
-
- chat_template: chatml
- datasets:
-   - path: allenai/tulu-3-sft-mixture
-     type: chat_template
-     split: train
-     field_messages: messages
-
- dataset_prepared_path: last_run_prepared
- #val_set_size: 0.02
- output_dir: ./ckpts
-
- sequence_len: 8192
- #sample_packing: true
- pad_to_sequence_len: true
-
- wandb_project: qwen-2.5-7b-sft
- wandb_entity:
- wandb_watch:
- wandb_name:
- wandb_log_model:
-
- gradient_accumulation_steps: 2
- micro_batch_size: 8
- num_epochs: 1
- optimizer: paged_ademamix_8bit
- lr_scheduler: cosine
- learning_rate: 3.5e-6
-
- train_on_inputs: false
- group_by_length: false
- bf16: auto
- fp16:
- tf32: false
-
- gradient_checkpointing: true
- gradient_checkpointing_kwargs:
-   use_reentrant: false
- early_stopping_patience:
- resume_from_checkpoint:
- logging_steps: 1
- xformers_attention:
- flash_attention: true
-
- deepspeed: deepspeed_configs/zero3_bf16.json
-
- warmup_steps: 370
- #evals_per_epoch: 4
- eval_table_size:
- saves_per_epoch: 2
- debug:
- weight_decay: 0.0
-
- ```
-
  </details><br>
 
+ ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-7B
+ datasets:
+ - allenai/tulu-3-sft-mixture
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---
+
+ # Teleut 7b
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/UqIi8eztdptvt52Mak_1K.png)
+
+ A replication attempt of Tulu 3 on the Qwen 2.5 base models.
+
+ ## Evals (so far)
+ | | Teleut 7B (measured) | Tülu 3 SFT 8B (reported) | Qwen 2.5 7B Instruct (reported) | Ministral 8B (reported) | Mistral 7B v0.3 (reported)
+ |-------------------------|----------------------|--------------------------|---------------------------------|-------------------------|---------------------------
+ |BBH (3 shot, CoT) |*64.4%* |**67.9%** |21.7% |56.2% |47.0%<sup>NLL</sup>
+ |GSM8K (8 shot, CoT) |78.5% |76.2% |**83.8%** |*80.0%* |xx.x%
+ |IFEval (prompt loose) |66.3% |*72.8%* |**74.7%** |56.4% |53.0%
+ |MMLU (0 shot, CoT) |*73.2%* |65.9% |**76.6%** |68.5% |30.7%<sup>5-shot</sup>
+ |MMLU Pro (0 shot, CoT) |*48.3%* |44.3% |**56.3%**<sup>Unknown</sup> |32.9%<sup>5-shot</sup> |30.7%<sup>5-shot</sup>
+ |PopQA (15 shot) |18.9% |**29.3%** |18.1% |*20.2%* |xx.x%
+ |TruthfulQA |47.2% |46.8% |**63.1%** |*55.5%* |xx.x%
+
+ ## Credits
+ Big thanks to Retis Labs for providing my 8xH100 polycule used to train and test this model!
+ Another big thanks to AllenAI for publishing the Tülu 3 data and model series (as well as the paper and details on training), and to Alibaba for training the original Qwen 2.5 base model series!
+
+ ```
+ @article{lambert2024tulu3,
+ title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
+ author = {
+ Nathan Lambert and
+ Jacob Morrison and
+ Valentina Pyatkin and
+ Shengyi Huang and
+ Hamish Ivison and
+ Faeze Brahman and
+ Lester James V. Miranda and
+ Alisa Liu and
+ Nouha Dziri and
+ Shane Lyu and
+ Yuling Gu and
+ Saumya Malik and
+ Victoria Graf and
+ Jena D. Hwang and
+ Jiangjiang Yang and
+ Ronan Le Bras and
+ Oyvind Tafjord and
+ Chris Wilhelm and
+ Luca Soldaini and
+ Noah A. Smith and
+ Yizhong Wang and
+ Pradeep Dasigi and
+ Hannaneh Hajishirzi
+ },
+ year = {2024},
+ email = {tulu@allenai.org}
+ }
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 3.5e-06
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 8
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 128 (8 micro-batch x 2 accumulation steps x 8 GPUs)
+ - total_eval_batch_size: 64
+ - optimizer: paged_ademamix_8bit (no additional optimizer arguments)
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 370
+ - num_epochs: 1
+
+ ### Framework versions
+
+ - Transformers 4.46.3
+ - Pytorch 2.5.1+cu124
+ - Datasets 3.1.0
+ - Tokenizers 0.20.3
+
+ ### Configuration
+ <details><summary>See axolotl config</summary>
+
+ axolotl version: `0.5.2`
+ ```yaml
+ base_model: Qwen/Qwen2.5-7B
+
+ plugins:
+   - axolotl.integrations.liger.LigerPlugin
+ liger_rope: true
+ liger_rms_norm: true
+ liger_glu_activation: true
+ liger_fused_linear_cross_entropy: true
+
+ strict: false
+
+ chat_template: chatml
+ datasets:
+   - path: allenai/tulu-3-sft-mixture
+     type: chat_template
+     split: train
+     field_messages: messages
+
+ dataset_prepared_path: last_run_prepared
+ #val_set_size: 0.02
+ output_dir: ./ckpts
+
+ sequence_len: 8192
+ #sample_packing: true
+ pad_to_sequence_len: true
+
+ wandb_project: qwen-2.5-7b-sft
+ wandb_entity:
+ wandb_watch:
+ wandb_name:
+ wandb_log_model:
+
+ gradient_accumulation_steps: 2
+ micro_batch_size: 8
+ num_epochs: 1
+ optimizer: paged_ademamix_8bit
+ lr_scheduler: cosine
+ learning_rate: 3.5e-6
+
+ train_on_inputs: false
+ group_by_length: false
+ bf16: auto
+ fp16:
+ tf32: false
+
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+   use_reentrant: false
+ early_stopping_patience:
+ resume_from_checkpoint:
+ logging_steps: 1
+ xformers_attention:
+ flash_attention: true
+
+ deepspeed: deepspeed_configs/zero3_bf16.json
+
+ warmup_steps: 370
+ #evals_per_epoch: 4
+ eval_table_size:
+ saves_per_epoch: 2
+ debug:
+ weight_decay: 0.0
+
+ ```
+
  </details><br>
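The axolotl config in this diff sets `chat_template: chatml` for the Tulu SFT messages. As a rough illustration of what that template produces (the `to_chatml` helper below is hypothetical, not part of axolotl or this repo; it assumes the standard `<|im_start|>`/`<|im_end|>` ChatML markers):

```python
# Minimal sketch of ChatML formatting. Assumption: standard ChatML markers;
# this helper is illustrative only and does not come from the Teleut code.
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts as a ChatML prompt string."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        out.append("<|im_start|>assistant\n")
    return "".join(out)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

In practice, transformers applies the model's own template via `tokenizer.apply_chat_template`; this sketch only shows the wire format the config selects.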