---
license: apache-2.0
base_model: thomasjhuang/qwen2-sft-warmup
tags:
- reinforcement-learning
- rloo
- countdown-math
- qwen2
language:
- en
pipeline_tag: text-generation
---

# Qwen2 RLOO Countdown (Step 150)

This model is a Qwen2-based language model fine-tuned with RLOO (REINFORCE Leave-One-Out) on Countdown arithmetic problems: given a set of numbers, produce an equation that equals a target value.

## Training Details

- **Base Model**: thomasjhuang/qwen2-sft-warmup
- **Method**: RLOO (REINFORCE Leave-One-Out)
- **Dataset**: Jiayi-Pan/Countdown-Tasks-3to4
- **Training Steps**: 150 optimizer steps
- **Learning Rate**: 3e-6
- **Temperature**: 0.1
- **Batch Size**: 2
- **K Samples**: 8 completions per prompt

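The leave-one-out baseline behind RLOO is simple to state: with K sampled completions per prompt, each completion's advantage is its reward minus the mean reward of the other K-1 samples. A minimal sketch (illustrative, not the actual training code):

```python
def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out: the baseline for sample i is the
    mean reward of the other K-1 samples for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# K = 8 samples, two of which solved the problem (reward 1.0):
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
advantages = rloo_advantages(rewards)
```

Because the baseline is built from the other samples only, the advantages for each prompt sum to zero, which keeps the policy-gradient estimate unbiased without training a separate value model.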
## Key Fixes Applied

1. **Prompt Format**: Updated to match the SFT evaluation format, with detailed instructions
2. **Token Length**: Increased to 250 generated tokens so reasoning is not truncated
3. **Temperature**: Reduced to 0.1 for more deterministic generation
4. **Extraction**: Fixed answer extraction to work with vLLM outputs

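The extraction fix itself isn't reproduced in this card, but the answer-parsing and reward logic for this task can be sketched as: pull the expression from the last `<answer>` block, check it uses only the allowed numbers (each at most once), and evaluate it against the target. Function names below are illustrative, not the training code:

```python
import re

def extract_answer(text):
    """Return the expression inside the last <answer>...</answer> pair, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def countdown_reward(text, numbers, target):
    """1.0 if the extracted equation uses only allowed numbers (each at
    most once) and evaluates to the target, else 0.0."""
    expr = extract_answer(text)
    # Reject missing answers and anything beyond digits, operators, parens.
    if expr is None or not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return 0.0
    pool = list(numbers)
    for n in (int(m) for m in re.findall(r"\d+", expr)):
        if n in pool:
            pool.remove(n)
        else:
            return 0.0  # used a number not in the pool, or used one twice
    try:
        return 1.0 if abs(eval(expr) - target) < 1e-6 else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0

output = "<think>80 - 8 = 72</think><answer>80 - 8</answer>"
score = countdown_reward(output, [8, 16, 80], 72)
```

The character whitelist before `eval` is what makes this sketch safe to run on model output; a production reward function would more likely use a small expression parser instead of `eval`.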
## Performance

During training, at step 150 the model achieved:

- Average rewards ranging from 0.05 to 0.50 across batches
- Successful generation of well-formed `<think>` and `<answer>` tags
- Correct solutions to a range of Countdown problems

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step150")
model = AutoModelForCausalLM.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step150")

prompt = '''Using the numbers [8, 16, 80], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.'''

inputs = tokenizer(prompt, return_tensors="pt")
# Sample up to 250 new tokens; do_sample=True is required for temperature to take effect.
outputs = model.generate(**inputs, max_new_tokens=250, do_sample=True, temperature=0.1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Training Progress

This checkpoint represents an intermediate state of RLOO training in which:

- The model learned to follow the expected prompt format
- Success rates improved from 0% to 10-50%, depending on the problem
- The model generates structured reasoning inside `<think>` tags
- Solutions are properly formatted inside `<answer>` tags

For the latest checkpoint, see thomasjhuang/qwen2-rloo-countdown-final.