GPT-NeoX now supports preference learning (SFT, DPO, KTO)! For more information on this joint effort between EleutherAI and SynthLabs, view our associated blog posts:
- SynthLabs: https://www.synthlabs.ai/blog/rlhf-and-rlaif-in-gpt-neox
- EleutherAI: https://www.eleuther.ai/rlhf-and-rlaif-in-gpt-neox
This is a direct preference optimization (DPO) model produced by:
- Taking the UltraChat SFT checkpoint from https://huggingface.co/HuggingFaceH4/mistral-7b-sft-beta
- Loading that checkpoint into the GPT-NeoX library and running DPO with the Zephyr 7B recipe (the UltraFeedback preference dataset); an illustrative sketch of the DPO objective is shown below this list. Example usage for running post-training in GPT-NeoX is available at: https://github.com/EleutherAI/gpt-neox/tree/main/post-training
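
For intuition, the quantity optimized in the DPO step can be sketched in a few lines of PyTorch. This is a minimal illustration of the standard DPO loss, not the GPT-NeoX implementation; the argument names and the `beta=0.1` default are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO loss over summed per-sequence log-probabilities.

    Each argument is a 1-D tensor with one entry per preference pair,
    holding the log-probability of the chosen or rejected response under
    the policy being trained or under the frozen reference (SFT) model.
    """
    # Log-ratios of policy vs. reference for the preferred and rejected responses.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps

    # DPO pushes the chosen/rejected margin apart, scaled by beta.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```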
Benchmark results (all values are percentages; higher is better):

| Model | GSM8K 5-shot (flexible-extract) | MMLU 5-shot (acc) | ARC Challenge 25-shot (acc_norm) | HellaSwag 10-shot (acc_norm) | Winogrande 5-shot (acc) | TruthfulQA 0-shot (mc2) |
|---|---|---|---|---|---|---|
| NeoX DPO from Zephyr-SFT | 64.1 | 41.8 | 60 | 63.2 | 85.2 | 79.2 |
| Zephyr-7B-beta | 62.5 | 34.3 | 59.8 | 63.6 | 84.4 | 77.6 |
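
The metric names above (flexible-extract, acc_norm, mc2) correspond to EleutherAI's lm-evaluation-harness. Assuming that harness was used and the checkpoint has been exported to Hugging Face format, a run along these lines should reproduce the table; the repo id and dtype below are placeholders to be replaced with the real values.

```python
# Sketch of re-running the evaluations with lm-evaluation-harness (pip install lm_eval).
import lm_eval
from lm_eval.utils import make_table

# Placeholder model arguments; substitute the actual Hugging Face model id.
MODEL_ARGS = "pretrained=your-org/neox-dpo-from-zephyr-sft,dtype=bfloat16"

# (task name, few-shot count) pairs matching the columns of the table above.
TASKS = [
    ("gsm8k", 5),
    ("mmlu", 5),
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("winogrande", 5),
    ("truthfulqa_mc2", 0),
]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,  # adjust to available GPU memory
    )
    print(make_table(results))
```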