# FlashLM v5 "Thunderbolt" ⚡
A 29.7M parameter matmul-free language model trained entirely on CPU without GPUs.
## Model Description
FlashLM v5 "Thunderbolt" is a revolutionary language model that was pre-trained from scratch on consumer hardware — without any GPUs. It uses a novel MatMul-free architecture called ParallelGatedRecurrence with ternary weights (BitLinear) to achieve dramatic efficiency improvements.
## Key Achievements
- Final PPL: 1.36 — Beats the TinyStories-1M baseline (PPL 1.59)!
- Final BPC: 0.44
- Training Time: ~40 hours on AMD Ryzen 7950X3D
- Training Data: ~1B tokens from TinyStories
## Architecture
FlashLM v5 uses ParallelGatedRecurrence — a matmul-free token mixer that replaces attention with:
- Ternary weights (BitLinear): Quantized to {-1, 0, +1} reducing memory 16x
- Parallel gated recurrence: Learned decay gates for efficient context
- No matrix multiplications in the forward pass!
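The gated recurrence above can be sketched as follows. The exact gate parameterization used in FlashLM is not spelled out here, so this is a minimal reference form, assuming a learned per-channel decay gate `a_t` in (0, 1); it is written as a sequential loop for clarity, though the same recurrence can be evaluated in parallel via cumulative products of the gates:

```python
import torch

def gated_recurrence(v, a):
    """Reference recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * v_t.

    v, a: tensors of shape (batch, seq, dim); a holds decay gates in (0, 1).
    Hypothetical form -- illustrates the mixing scheme, not FlashLM's exact code.
    """
    h = torch.zeros_like(v[:, 0])
    out = []
    for t in range(v.shape[1]):
        h = a[:, t] * h + (1 - a[:, t]) * v[:, t]
        out.append(h)
    return torch.stack(out, dim=1)
```

Note that the update uses only elementwise multiplies and adds, which is what makes the token mixer matmul-free.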
```
Parameters:   29,750,784
Ternary:      26,542,080 (89%)
Float:         3,208,704 (11%)
Ternary size: ~6.6 MB (vs ~119 MB in float32)
```
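As a concrete illustration of the ternary scheme, here is a minimal absmean quantizer in the style of BitLinear. This is a sketch, not the actual FlashLM code. The memory figure above checks out: 26,542,080 ternary weights at 2 bits each come to roughly 6.6 MB.

```python
import torch

def ternary_quantize(w, eps=1e-5):
    """Absmean quantization to {-1, 0, +1} with a per-tensor scale.

    Sketch of the BitLinear-style scheme: scale by the mean absolute
    weight, round, and clip to the ternary set.
    """
    scale = w.abs().mean().clamp(min=eps)
    q = (w / scale).round().clamp(-1, 1)
    return q, scale

# Storage estimate: 2 bits per ternary weight.
ternary_params = 26_542_080
print(f"{ternary_params * 2 / 8 / 1e6:.1f} MB")  # ≈ 6.6 MB
```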
## Usage

### With Gradio Demo
```python
import torch
from tokenizers import Tokenizer
from demo_v5 import ThunderboltLM, load_model

# Load model and tokenizer
model = load_model("FlashLM_v5_Results")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Generate text
prompt = "Once upon a time"
ids = tokenizer.encode(prompt).ids
x = torch.tensor([ids])
out = model.generate(x, max_new_tokens=100, temperature=0.8)
print(tokenizer.decode(out[0].tolist()))
```
### Direct Model Loading
```python
import torch
from ThunderboltLM import ThunderboltLM
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Create model
model = ThunderboltLM(
    vocab=8192,
    d_model=384,
    n_heads=8,
    d_head=48,
    n_layers=18,
    d_ffn=1152,
)

# Load weights
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Generate
ids = tokenizer.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=100)
print(tokenizer.decode(out[0].tolist()))
```
## Training Details
| Metric | Value |
|---|---|
| Parameters | 29.7M |
| Ternary Parameters | 26.5M |
| Vocabulary Size | 8192 |
| Model Dimension | 384 |
| Layers | 18 |
| Attention Heads | 8 |
| Head Dimension | 48 |
| FFN Dimension | 1152 |
| Context Length | 256 |
| Training Tokens | ~958M |
| Training Time | ~40 hours |
| Hardware | AMD Ryzen 7950X3D |
| Final Loss | 0.306 |
| Final PPL | 1.36 |
| Final BPC | 0.44 |
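The reported PPL and BPC follow directly from the final loss: perplexity is `exp(loss)` for a cross-entropy loss in nats, and dividing by ln 2 converts the loss to bits (note this gives bits per *token* under the 8192-token BPE vocabulary; the table labels it BPC). A quick check:

```python
import math

loss = 0.306                # final cross-entropy (nats/token), from the table
ppl = math.exp(loss)        # perplexity ≈ 1.36
bits = loss / math.log(2)   # loss in bits/token ≈ 0.44
print(f"PPL = {ppl:.2f}, bits = {bits:.2f}")
```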
## 🎉 Acknowledgments 🎉

Massive thanks to arki05 🙏, who provided the AMD Ryzen 7950X3D used to train this model. Without this generous contribution of their machine, the project would not have been possible; I would still be stuck on free-tier compute. Thank you, arki05! ⚡
## Comparison with Baselines
| Model | Params | PPL | Training |
|---|---|---|---|
| FlashLM v5 Thunderbolt | 29.7M | 1.36 | ~40h CPU |
| TinyStories-1M (baseline) | 1M | 1.59 | ~24h GPU |
| FlashLM v4 "Bolt" | 4.3M | 15.05 | 2h CPU |
| FlashLM v5.2 "Nova-Ignition" | 5.0M | 10.56 | 2h CPU |
To my knowledge, FlashLM v5 is the first CPU-trained model to beat the TinyStories-1M baseline on a comparable wall-clock training budget.
## Limitations
- Trained only on TinyStories (synthetic short stories)
- No chat capability
- BPE tokenizer trained specifically for this dataset
- CPU inference is slower than GPU
## Citation
If you use this model, please cite:
```bibtex
@misc{flashlm-v5-thunderbolt,
  author = {Chang Cheng},
  title  = {FlashLM v5 Thunderbolt: CPU-Based MatMul-Free Language Model},
  year   = {2026},
  url    = {https://github.com/changcheng967/FlashLM}
}
```
## License
MIT License
FlashLM: Democratizing Language Model Research ⚡