---
license: apache-2.0
base_model: thomasjhuang/qwen2-sft-warmup
tags:
- reinforcement-learning
- rloo
- countdown-math
- qwen2
language:
- en
pipeline_tag: text-generation
---

# Qwen2 RLOO Countdown (Step 250)

This model is a Qwen2-based language model fine-tuned with RLOO (REINFORCE Leave-One-Out) on countdown math problems.

## Training Details

- **Base Model**: thomasjhuang/qwen2-sft-warmup
- **Method**: RLOO (REINFORCE Leave-One-Out)
- **Dataset**: Jiayi-Pan/Countdown-Tasks-3to4
- **Training Steps**: 250 optimizer steps
- **Learning Rate**: 3e-6
- **Temperature**: 0.1
- **Batch Size**: 2
- **K Samples**: 8

## Key Fixes Applied

1. **Prompt Format**: Updated to match the SFT evaluation format, with detailed instructions
2. **Token Length**: Increased to 250 tokens to allow complete reasoning
3. **Temperature**: Reduced to 0.1 for more deterministic generation
4. **Extraction**: Fixed answer extraction to work with vLLM outputs

## Performance

At step 250 of training, the model achieved:

- Average rewards ranging from 0.05 to 0.50 across batches
- Successful generation of properly paired `<think>` and `<answer>` tags
- Correct solutions to a variety of countdown math problems

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step250")
model = AutoModelForCausalLM.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step250")

prompt = '''Using the numbers [8, 16, 80], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags.
And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=250, do_sample=True, temperature=0.1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Training Progress

This checkpoint represents an intermediate state in RLOO training in which:

- The model learned to follow the correct prompt format
- Success rates improved from 0% to 10-50% across problems
- The model generates structured reasoning in `<think>` tags
- Solutions are properly formatted in `<answer>` tags

For the latest checkpoint, see: thomasjhuang/qwen2-rloo-countdown-final
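For reference, RLOO scores each of the K sampled completions against the mean reward of the other K-1 samples for the same prompt (a leave-one-out baseline). A minimal sketch of that advantage computation, for illustration only and not the training code used for this checkpoint:

```python
# Illustrative sketch of the RLOO (REINFORCE Leave-One-Out) advantage.
# For each prompt, k completions are sampled; each sample's baseline is
# the mean reward of the other k - 1 samples for the same prompt.

def rloo_advantages(rewards):
    """rewards: list of k scalar rewards for one prompt's k samples."""
    k = len(rewards)
    total = sum(rewards)
    # advantage_i = r_i - mean(rewards excluding r_i)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example with k = 4 sampled completions (training here used k = 8):
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because each baseline excludes only the sample being scored, the advantages for a prompt always sum to zero, which keeps the policy-gradient update unbiased without a learned value function.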
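The answer extraction and reward described above can be approximated as follows. This is a hypothetical re-implementation, assuming answers appear in `<answer>` tags and that the reward is 1.0 when the equation uses each given number exactly once and evaluates to the target, else 0.0; the actual reward shaping used in training may differ:

```python
import re

def extract_answer(text):
    """Pull the expression out of the last <answer>...</answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return matches[-1].strip() if matches else None

def countdown_reward(text, numbers, target):
    """1.0 if the extracted equation uses each number once and hits the target."""
    expr = extract_answer(text)
    # Reject missing answers and anything outside plain arithmetic characters.
    if expr is None or not re.fullmatch(r"[\d+\-*/(). ]+", expr):
        return 0.0
    # Each provided number must be used exactly once.
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0
    try:
        value = eval(expr)  # character set restricted by the check above
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

completion = "<think>80 - 16 = 64, 64 + 8 = 72</think><answer>80 - 16 + 8</answer>"
print(countdown_reward(completion, [8, 16, 80], 72))
```

Taking the last `<answer>` block rather than the first is deliberate: batched vLLM outputs can echo the few-shot example from the prompt, and only the model's final answer should be scored.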