Granite 3.1 8B - Basic Fantasy RPG (OSFT v2)
A domain-expert language model fine-tuned on Basic Fantasy Role-Playing Game source material using OSFT (Orthogonal Subspace Fine-Tuning), a continual learning method that preserves the base model's general knowledge while injecting new domain expertise.
This is the v2 release, trained on significantly improved synthetic data that closed 68% of v1's performance gap to the base model.
Model Details
| Field | Value |
|---|---|
| Base Model | ibm-granite/granite-3.1-8b-instruct |
| Fine-tuning Method | OSFT (Orthogonal Subspace Fine-Tuning) |
| Parameters | 8B |
| Precision | bfloat16 |
| License | Apache 2.0 |
| Domain | Basic Fantasy RPG rules, monsters, equipment, spells, and gameplay |
| Source Code | github.com/RobbieJ/frank |
Evaluation Results
Evaluated on 210 questions across 16 BFRPG categories, scored 0-10 by an automated LLM judge (qwen3-14b) against reference answers.
Overall Performance
| Metric | OSFT v2 Fine-tuned | Base | Delta |
|---|---|---|---|
| Mean Score | 6.98 | 7.12 | -0.14 |
| Median Score | 10.0 | 10.0 | 0.0 |
| Std Deviation | 3.65 | 3.71 | -0.06 |
Head-to-Head: OSFT v2 wins 23, Base wins 34, Ties 153. Win rate (excl. ties): 40.4%
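The win rate excluding ties follows directly from the head-to-head counts:

```python
# Win rate excluding ties: FT wins / (FT wins + Base wins)
ft_wins, base_wins, ties = 23, 34, 153
assert ft_wins + base_wins + ties == 210  # matches the 210 evaluation questions
win_rate = ft_wins / (ft_wins + base_wins)
print(f"{win_rate:.1%}")  # 40.4%
```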
Performance Across Training Iterations
| Iteration | Method | Training Data | Judge | FT Score | Base Score | Delta |
|---|---|---|---|---|---|---|
| v1 | LoRA SFT | 7,600 (low-quality) | granite-3-2-8b | 4.12 | 5.54 | -1.41 |
| v2 | OSFT | 7,600 (low-quality) | granite-3-2-8b | 4.01 | 5.52 | -1.51 |
| v3 (this) | OSFT | 5,717 (high-quality) | qwen3-14b | 6.98 | 7.12 | -0.14 |
Results by Category
| Category | Questions | OSFT v2 | Base | Delta | OSFT Wins | Base Wins | Ties |
|---|---|---|---|---|---|---|---|
| Core Monsters | 10 | 9.1 | 7.5 | +1.6 | 2 | 0 | 8 |
| Weapons | 15 | 7.7 | 6.7 | +1.0 | 3 | 1 | 11 |
| Rules | 8 | 8.6 | 7.9 | +0.7 | 2 | 1 | 5 |
| Armor | 10 | 8.1 | 7.7 | +0.4 | 1 | 0 | 9 |
| Combat | 12 | 5.8 | 5.4 | +0.4 | 3 | 3 | 6 |
| Thief Abilities | 8 | 9.3 | 9.2 | 0.0 | 1 | 1 | 6 |
| Field Guide | 30 | 8.2 | 8.2 | 0.0 | 3 | 2 | 25 |
| Spells | 14 | 5.8 | 5.8 | 0.0 | 2 | 1 | 11 |
| Gear | 15 | 6.8 | 6.7 | 0.0 | 1 | 2 | 12 |
| Beginner's Guide | 12 | 4.0 | 4.3 | -0.3 | 2 | 2 | 8 |
| Character Creation | 12 | 5.4 | 5.8 | -0.4 | 1 | 1 | 10 |
| Animals & Vehicles | 15 | 7.9 | 8.4 | -0.5 | 1 | 1 | 13 |
| Classes | 14 | 5.7 | 6.5 | -0.8 | 0 | 6 | 8 |
| Movement | 10 | 7.0 | 8.3 | -1.3 | 1 | 1 | 8 |
| Races | 10 | 5.1 | 7.2 | -2.1 | 0 | 4 | 6 |
| Monster Index | 15 | 5.8 | 8.3 | -2.5 | 0 | 8 | 7 |
Key Improvement: Training Data Quality
The biggest change in v2 was not the algorithm but the training data:
| Metric | v1 Data | v2 Data |
|---|---|---|
| Mean response length | 192 chars | 386 chars |
| Has numeric content | 22.1% | 71.2% |
| Source coverage | 200/476 chunks | 466/466 chunks |
| Teacher model | granite-3-2-8b (8B) | qwen3-14b (14B) |
| Meta-reference artifacts | 12.2% | <1.3% |
Why OSFT?
Traditional fine-tuning (SFT/LoRA) on domain data caused significant knowledge degradation: the base model scored 5.54 vs the LoRA fine-tuned model's 4.12 (granite judge). OSFT addresses this by:
- Computing SVD of each weight matrix to identify critical dimensions
- Freezing the most critical 85% of weight dimensions
- Only updating the least critical 15% (`unfreeze_rank_ratio=0.15`)
This constrains updates to the directions least important to the base model's existing capabilities, sharply reducing the risk that new domain knowledge overwrites them.
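A minimal sketch of the idea (not the Training Hub implementation): take the SVD of a weight matrix, treat the top singular directions as frozen, and project any update onto the remaining low-importance subspace.

```python
import numpy as np

def osft_project_update(W, grad, unfreeze_rank_ratio=0.15):
    """Project a gradient update onto the least-critical singular
    directions of W, leaving the top directions untouched.
    Illustrative only; the real OSFT implementation differs in detail."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = len(S)
    k = int(r * (1 - unfreeze_rank_ratio))   # top-k directions stay frozen
    U_low = U[:, k:]                          # least-critical left singular vectors
    P = U_low @ U_low.T                       # projector onto their span
    return P @ grad                           # update lives only in that subspace

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
g = rng.standard_normal((64, 64))
g_proj = osft_project_update(W, g)

# The projected update has no component along the frozen top directions
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = int(len(S) * 0.85)
print(np.abs(U[:, :k].T @ g_proj).max())  # numerically ~0
```

Because the columns of `U` are orthonormal, the frozen top directions are exactly orthogonal to the projected update.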
Training Details
Data
- Source: 5 Basic Fantasy RPG PDFs (Core Rules r142, Beginner's Essentials r18, Field Guide Omnibus r4, Monster Index r7, Equipment Emporium r33)
- Pipeline: PDF → Markdown (Docling) → Semantic chunking (466 chunks) → Q&A generation (qwen3-14b teacher) → Quality filtering → Merged dataset
- Training examples: 5,717 instruction-response pairs (3,678 v2 high-quality + 2,039 best of v1)
- Coverage: All 466 document chunks, 16 categories
- Format: ChatML with system prompt, user question, and assistant answer
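A training record in this format might look like the following; the question and answer here are illustrative placeholders, not drawn from the actual dataset:

```python
import json

# Hypothetical instruction-response pair in chat format
example = {
    "messages": [
        {"role": "system",
         "content": "You are an expert on Basic Fantasy Role-Playing Game (BFRPG)."},
        {"role": "user",
         "content": "What die do Fighters use for hit points?"},
        {"role": "assistant",
         "content": "Fighters roll a d8 for hit points at each level."},
    ]
}
print(json.dumps(example, indent=2))
```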
Hyperparameters
| Parameter | Value |
|---|---|
| Algorithm | OSFT |
| Unfreeze rank ratio | 0.15 |
| Epochs | 3 |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Effective batch size | 16 |
| Max tokens per GPU | 256 |
| Max sequence length | 1024 |
| Train dtype | float32 |
| Save dtype | bfloat16 |
| Total training steps | 1,074 |
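The reported step count is consistent with the dataset size and effective batch size: 5,717 examples at a batch of 16 give 358 optimizer steps per epoch, or 1,074 over 3 epochs.

```python
import math

examples, batch, epochs = 5717, 16, 3
steps_per_epoch = math.ceil(examples / batch)
print(steps_per_epoch, steps_per_epoch * epochs)  # 358 1074
```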
Training Loss
| Milestone | Loss |
|---|---|
| Step 1 | 1.97 |
| End of Epoch 1 (step 358) | ~1.05 |
| End of Epoch 2 (step 716) | ~0.90 |
| End of Epoch 3 (step 1074) | 0.85 |
Hardware
- GPU: NVIDIA DGX Spark (GB10 Grace Blackwell Superchip)
- Memory: 128 GB unified CPU+GPU
- Peak memory usage: 78.4 GB
- Training throughput: ~75 tokens/sec
- Total training time: ~10 hours (including ~35 min SVD initialization)
Toolchain
- Training Hub v0.4.0 (OSFT implementation)
- Docling (PDF to markdown conversion)
- Custom SDG script with qwen3-14b teacher model
- PyTorch 2.9.0+cu130
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RobbieJ/granite-3.1-8b-bfrpg-osft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are an expert on Basic Fantasy Role-Playing Game (BFRPG). You provide accurate, helpful answers about BFRPG rules, character creation, combat, spells, monsters, equipment, and gameplay."},
    {"role": "user", "content": "What is the Armor Class and cost of plate mail in BFRPG?"},
]

# add_generation_prompt=True appends the assistant turn header so the
# model generates a reply instead of continuing the user message
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
Limitations
- Remaining gap: The fine-tuned model scores 0.14 points below the base model overall, and categories such as Monster Index (-2.5) and Races (-2.1) show persistent regressions.
- Response length: Generates shorter responses than the base model, which may affect perceived completeness.
- Numeric precision: Some specific values (prices, percentages) may be plausible but incorrect; the model learns patterns better than exact data points.
- BFRPG-specific: Tuned for Basic Fantasy RPG only. May not generalize to other RPG systems.
- Inherited limitations: All limitations of the base Granite 3.1 8B model apply.
Intended Use
- BFRPG game masters looking for rules lookup assistance
- Players needing character creation, spell, or equipment reference help
- RPG content creators working with BFRPG material
- Research into domain-specific fine-tuning with knowledge preservation (OSFT)
- Educational reference for LLM fine-tuning pipelines
Ethical Considerations
This model is fine-tuned on open-source RPG content (Basic Fantasy RPG is released under the Open Game License). It generates fictional game content and should not be used for real-world decision-making.