---
license: llama3
---

* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
* **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/

This preference model is trained from [LLaMA3-8B-it](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).

The training dataset is RLHFlow/pair_preference_model_dataset. The model achieves 98.6 on Chat, 65.8 on Chat-Hard, 89.6 on Safety, and 94.9 on Reasoning in RewardBench.
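
For a quick look at the training data, the dataset can be loaded with the `datasets` library. This is a minimal sketch: the `train` split name and the field layout are assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch for inspecting RLHFlow/pair_preference_model_dataset.
# Assumption: the dataset exposes a "train" split; field names may differ,
# so check the dataset card on the Hub for the actual schema.
from datasets import load_dataset

ds = load_dataset("RLHFlow/pair_preference_model_dataset", split="train")
print(ds)      # row count and column names
print(ds[0])   # one raw example
```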

## Serving the RM

Here is an example of using the preference model to rank a pair of responses. For n > 2 responses, it is recommended to use a tournament-style ranking strategy to select the best response, which keeps the number of comparisons linear in n (see the sketch below).
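
Below is a minimal sketch of one way to query a pairwise preference model and run tournament-style selection. The Hub model ID, the prompt template, and the convention of reading the preference from the next-token logits for "A" versus "B" are assumptions for illustration, not this card's official snippet; the exact template and inference code are in the linked RLHF-Reward-Modeling repository.

```python
# Hedged sketch of pairwise ranking with a pair-preference model.
# Assumptions (not taken from this card): the model ID, the prompt template,
# and that the model expresses its preference via the next token "A" or "B".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/pair-preference-model-LLaMA3-8B"  # assumed Hub ID
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# Placeholder template: the prompt, then the two candidates labeled A and B.
TEMPLATE = (
    "[CONTEXT] {context}\n"
    "[RESPONSE A] {response_a}\n"
    "[RESPONSE B] {response_b}\n"
    "Which response is better? Answer A or B:"
)

@torch.no_grad()
def prefers_a(context: str, response_a: str, response_b: str) -> bool:
    """Return True if the model prefers response A over response B."""
    prompt = TEMPLATE.format(context=context, response_a=response_a, response_b=response_b)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    logits = model(**inputs).logits[0, -1]  # next-token logits at the last position
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    return logits[id_a] > logits[id_b]

def tournament_best(context: str, responses: list[str]) -> str:
    """Tournament-style selection: keep the winner of each pairwise comparison."""
    best = responses[0]
    for challenger in responses[1:]:
        if not prefers_a(context, best, challenger):
            best = challenger
    return best
```

The tournament loop compares the current best response against each remaining candidate, so ranking n responses costs n - 1 preference-model calls, e.g. `tournament_best(prompt, [resp1, resp2, resp3])`.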