Granite 3.1 8B - Basic Fantasy RPG (OSFT v2)
A domain-expert language model fine-tuned on Basic Fantasy Role-Playing Game source material using OSFT (Orthogonal Subspace Fine-Tuning), a continual learning method that preserves the base model's general knowledge while injecting new domain expertise.
This is the v2 release, trained on significantly improved synthetic data that closed 68% of v1's performance gap to the base model.
Model Details
| Field | Value |
|---|---|
| Base Model | ibm-granite/granite-3.1-8b-instruct |
| Fine-tuning Method | OSFT (Orthogonal Subspace Fine-Tuning) |
| Parameters | 8B |
| Precision | bfloat16 |
| License | Apache 2.0 |
| Domain | Basic Fantasy RPG rules, monsters, equipment, spells, and gameplay |
| Source Code | github.com/RobbieJ/frank |
Evaluation Results
Evaluated on 210 questions across 16 BFRPG categories, scored 0-10 by an automated LLM judge (qwen3-14b) against reference answers.
Overall Performance
| Metric | OSFT v2 Fine-tuned | Base | Delta |
|---|---|---|---|
| Mean Score | 6.98 | 7.12 | -0.14 |
| Median Score | 10.0 | 10.0 | 0.0 |
| Std Deviation | 3.65 | 3.71 | -0.06 |
Head-to-Head: OSFT v2 wins 23, Base wins 34, Ties 153. Win rate (excl. ties): 40.4%
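The win rate excluding ties follows directly from the head-to-head counts:

```python
# Win rate excluding ties: FT wins / (FT wins + Base wins)
ft_wins, base_wins, ties = 23, 34, 153
assert ft_wins + base_wins + ties == 210  # matches the 210 evaluation questions
win_rate = ft_wins / (ft_wins + base_wins)
print(f"{win_rate:.1%}")  # 40.4%
```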
Performance Across Training Iterations
| Iteration | Method | Training Data | Judge | FT Score | Base Score | Delta |
|---|---|---|---|---|---|---|
| v1 | LoRA SFT | 7,600 (low-quality) | granite-3-2-8b | 4.12 | 5.54 | -1.41 |
| v2 | OSFT | 7,600 (low-quality) | granite-3-2-8b | 4.01 | 5.52 | -1.51 |
| v3 (this) | OSFT | 5,717 (high-quality) | qwen3-14b | 6.98 | 7.12 | -0.14 |
Results by Category
| Category | Questions | OSFT v2 | Base | Delta | OSFT Wins | Base Wins | Ties |
|---|---|---|---|---|---|---|---|
| Core Monsters | 10 | 9.1 | 7.5 | +1.6 | 2 | 0 | 8 |
| Weapons | 15 | 7.7 | 6.7 | +1.0 | 3 | 1 | 11 |
| Rules | 8 | 8.6 | 7.9 | +0.7 | 2 | 1 | 5 |
| Armor | 10 | 8.1 | 7.7 | +0.4 | 1 | 0 | 9 |
| Combat | 12 | 5.8 | 5.4 | +0.4 | 3 | 3 | 6 |
| Thief Abilities | 8 | 9.3 | 9.2 | 0.0 | 1 | 1 | 6 |
| Field Guide | 30 | 8.2 | 8.2 | 0.0 | 3 | 2 | 25 |
| Spells | 14 | 5.8 | 5.8 | 0.0 | 2 | 1 | 11 |
| Gear | 15 | 6.8 | 6.7 | 0.0 | 1 | 2 | 12 |
| Beginner's Guide | 12 | 4.0 | 4.3 | -0.3 | 2 | 2 | 8 |
| Character Creation | 12 | 5.4 | 5.8 | -0.4 | 1 | 1 | 10 |
| Animals & Vehicles | 15 | 7.9 | 8.4 | -0.5 | 1 | 1 | 13 |
| Classes | 14 | 5.7 | 6.5 | -0.8 | 0 | 6 | 8 |
| Movement | 10 | 7.0 | 8.3 | -1.3 | 1 | 1 | 8 |
| Races | 10 | 5.1 | 7.2 | -2.1 | 0 | 4 | 6 |
| Monster Index | 15 | 5.8 | 8.3 | -2.5 | 0 | 8 | 7 |
Key Improvement: Training Data Quality
The biggest change in v2 was not the algorithm but the training data:
| Metric | v1 Data | v2 Data |
|---|---|---|
| Mean response length | 192 chars | 386 chars |
| Has numeric content | 22.1% | 71.2% |
| Source coverage | 200/476 chunks | 466/466 chunks |
| Teacher model | granite-3-2-8b (8B) | qwen3-14b (14B) |
| Meta-reference artifacts | 12.2% | <1.3% |
Why OSFT?
Traditional fine-tuning (SFT/LoRA) on domain data caused significant knowledge degradation: the base model scored 5.54 vs the LoRA fine-tuned model's 4.12 (granite judge). OSFT addresses this by:
- Computing SVD of each weight matrix to identify critical dimensions
- Freezing the most critical 85% of weight dimensions
- Only updating the least critical 15% (`unfreeze_rank_ratio=0.15`)
This constrains updates to the directions least important to the base model's existing capabilities, sharply reducing the risk that new domain knowledge overwrites them.
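A minimal sketch of the idea (not the Training Hub implementation): take the SVD of a weight matrix, treat the top singular directions as frozen, and project any update onto the remaining low-importance subspace.

```python
import numpy as np

def osft_project_update(W, grad, unfreeze_rank_ratio=0.15):
    """Project a gradient update onto the least-critical singular
    directions of W, leaving the top directions untouched.
    Illustrative only; the real OSFT implementation differs in detail."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = len(S)
    k = int(r * (1 - unfreeze_rank_ratio))   # top-k directions stay frozen
    U_low = U[:, k:]                          # least-critical left singular vectors
    P = U_low @ U_low.T                       # projector onto their span
    return P @ grad                           # update lives only in that subspace

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
g = rng.standard_normal((64, 64))
g_proj = osft_project_update(W, g)

# The projected update has no component along the frozen top directions
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = int(len(S) * 0.85)
print(np.abs(U[:, :k].T @ g_proj).max())  # numerically ~0
```

Because the columns of `U` are orthonormal, the frozen top directions are exactly orthogonal to the projected update.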
Training Details
Data
- Source: 5 Basic Fantasy RPG PDFs (Core Rules r142, Beginner's Essentials r18, Field Guide Omnibus r4, Monster Index r7, Equipment Emporium r33)
- Pipeline: PDF → Markdown (Docling) → Semantic chunking (466 chunks) → Q&A generation (qwen3-14b teacher) → Quality filtering → Merged dataset
- Training examples: 5,717 instruction-response pairs (3,678 v2 high-quality + 2,039 best of v1)
- Coverage: All 466 document chunks, 16 categories
- Format: ChatML with system prompt, user question, and assistant answer
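A training record in this format might look like the following; the question and answer here are illustrative placeholders, not drawn from the actual dataset:

```python
import json

# Hypothetical instruction-response pair in chat format
example = {
    "messages": [
        {"role": "system",
         "content": "You are an expert on Basic Fantasy Role-Playing Game (BFRPG)."},
        {"role": "user",
         "content": "What die do Fighters use for hit points?"},
        {"role": "assistant",
         "content": "Fighters roll a d8 for hit points at each level."},
    ]
}
print(json.dumps(example, indent=2))
```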
Hyperparameters
| Parameter | Value |
|---|---|
| Algorithm | OSFT |
| Unfreeze rank ratio | 0.15 |
| Epochs | 3 |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Effective batch size | 16 |
| Max tokens per GPU | 256 |
| Max sequence length | 1024 |
| Train dtype | float32 |
| Save dtype | bfloat16 |
| Total training steps | 1,074 |
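The reported step count is consistent with the dataset size and effective batch size: 5,717 examples at a batch of 16 give 358 optimizer steps per epoch, or 1,074 over 3 epochs.

```python
import math

examples, batch, epochs = 5717, 16, 3
steps_per_epoch = math.ceil(examples / batch)
print(steps_per_epoch, steps_per_epoch * epochs)  # 358 1074
```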
Training Loss
| Milestone | Loss |
|---|---|
| Step 1 | 1.97 |
| End of Epoch 1 (step 358) | ~1.05 |
| End of Epoch 2 (step 716) | ~0.90 |
| End of Epoch 3 (step 1074) | 0.85 |
Hardware
- GPU: NVIDIA DGX Spark (GB10 Grace Blackwell Superchip)
- Memory: 128 GB unified CPU+GPU
- Peak memory usage: 78.4 GB
- Training throughput: ~75 tokens/sec
- Total training time: ~10 hours (including ~35 min SVD initialization)
Toolchain
- Training Hub v0.4.0 (OSFT implementation)
- Docling (PDF to markdown conversion)
- Custom SDG script with qwen3-14b teacher model
- PyTorch 2.9.0+cu130
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RobbieJ/granite-3.1-8b-bfrpg-osft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are an expert on Basic Fantasy Role-Playing Game (BFRPG). You provide accurate, helpful answers about BFRPG rules, character creation, combat, spells, monsters, equipment, and gameplay."},
    {"role": "user", "content": "What is the Armor Class and cost of plate mail in BFRPG?"},
]

# add_generation_prompt=True appends the assistant turn header so the
# model generates a reply instead of continuing the user message
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
Limitations
- Remaining gap: The fine-tuned model scores 0.14 points below the base model overall, and categories such as Monster Index (-2.5) and Races (-2.1) show persistent regressions.
- Response length: Generates shorter responses than the base model, which may affect perceived completeness.
- Numeric precision: Some specific values (prices, percentages) may be plausible but incorrect; the model learns patterns better than exact data points.
- BFRPG-specific: Tuned for Basic Fantasy RPG only. May not generalize to other RPG systems.
- Inherited limitations: All limitations of the base Granite 3.1 8B model apply.
Intended Use
- BFRPG game masters looking for rules lookup assistance
- Players needing character creation, spell, or equipment reference help
- RPG content creators working with BFRPG material
- Research into domain-specific fine-tuning with knowledge preservation (OSFT)
- Educational reference for LLM fine-tuning pipelines
Ethical Considerations
This model is fine-tuned on open-source RPG content (Basic Fantasy RPG is released under the Open Game License). It generates fictional game content and should not be used for real-world decision-making.