---
license: apache-2.0
base_model: thomasjhuang/qwen2-sft-warmup
tags:
- reinforcement-learning
- rloo
- countdown-math
- qwen2
language:
- en
pipeline_tag: text-generation
---

# Qwen2 RLOO Countdown (Step 250)

This model is a Qwen2-based language model fine-tuned with RLOO (REINFORCE Leave-One-Out) on countdown math problems.

## Training Details

- **Base Model**: thomasjhuang/qwen2-sft-warmup
- **Method**: RLOO (REINFORCE Leave-One-Out)
- **Dataset**: Jiayi-Pan/Countdown-Tasks-3to4
- **Training Steps**: 250 optimizer steps
- **Learning Rate**: 3e-6
- **Temperature**: 0.1
- **Batch Size**: 2
- **K Samples**: 8

## Key Fixes Applied

1. **Prompt Format**: Updated to match the SFT evaluation format, with detailed instructions
2. **Token Length**: Increased to 250 tokens to allow complete reasoning
3. **Temperature**: Reduced to 0.1 for more deterministic generation
4. **Extraction**: Fixed answer extraction to work with vLLM outputs

## Performance

At step 250 of training, the model achieved:

- Average rewards ranging from 0.05 to 0.50 across batches
- Successful generation of properly paired `<think>` and `<answer>` tags
- Correct solutions to a variety of countdown math problems

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step250")
model = AutoModelForCausalLM.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step250")

prompt = '''Using the numbers [8, 16, 80], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags.
And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=250, do_sample=True, temperature=0.1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Training Progress

This checkpoint represents an intermediate state in RLOO training in which:

- The model learned to follow the correct prompt format
- Success rates improved from 0% to 10-50% across problems
- The model generates structured reasoning in `<think>` tags
- Solutions are properly formatted in `<answer>` tags

For the latest checkpoint, see: thomasjhuang/qwen2-rloo-countdown-final
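For reference, RLOO scores each of the K sampled completions against the mean reward of the other K-1 samples for the same prompt (a leave-one-out baseline). A minimal sketch of that advantage computation, for illustration only and not the training code used for this checkpoint:

```python
# Illustrative sketch of the RLOO (REINFORCE Leave-One-Out) advantage.
# For each prompt, k completions are sampled; each sample's baseline is
# the mean reward of the other k - 1 samples for the same prompt.

def rloo_advantages(rewards):
    """rewards: list of k scalar rewards for one prompt's k samples."""
    k = len(rewards)
    total = sum(rewards)
    # advantage_i = r_i - mean(rewards excluding r_i)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example with k = 4 sampled completions (training here used k = 8):
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because each baseline excludes only the sample being scored, the advantages for a prompt always sum to zero, which keeps the policy-gradient update unbiased without a learned value function.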
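The answer extraction and reward described above can be approximated as follows. This is a hypothetical re-implementation, assuming answers appear in `<answer>` tags and that the reward is 1.0 when the equation uses each given number exactly once and evaluates to the target, else 0.0; the actual reward shaping used in training may differ:

```python
import re

def extract_answer(text):
    """Pull the expression out of the last <answer>...</answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return matches[-1].strip() if matches else None

def countdown_reward(text, numbers, target):
    """1.0 if the extracted equation uses each number once and hits the target."""
    expr = extract_answer(text)
    # Reject missing answers and anything outside plain arithmetic characters.
    if expr is None or not re.fullmatch(r"[\d+\-*/(). ]+", expr):
        return 0.0
    # Each provided number must be used exactly once.
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0
    try:
        value = eval(expr)  # character set restricted by the check above
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

completion = "<think>80 - 16 = 64, 64 + 8 = 72</think><answer>80 - 16 + 8</answer>"
print(countdown_reward(completion, [8, 16, 80], 72))
```

Taking the last `<answer>` block rather than the first is deliberate: batched vLLM outputs can echo the few-shot example from the prompt, and only the model's final answer should be scored.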