# FlashLM v5 "Thunderbolt" ⚡
A 29.7M parameter matmul-free language model trained entirely on CPU without GPUs.
## Model Description
FlashLM v5 "Thunderbolt" is a revolutionary language model that was pre-trained from scratch on consumer hardware — without any GPUs. It uses a novel MatMul-free architecture called ParallelGatedRecurrence with ternary weights (BitLinear) to achieve dramatic efficiency improvements.
## Key Achievements
- Final PPL: 1.36 — Beats the TinyStories-1M baseline (PPL 1.59)!
- Final BPC: 0.44
- Training Time: ~40 hours on AMD Ryzen 7950X3D
- Training Data: ~1B tokens from TinyStories
## Architecture
FlashLM v5 uses ParallelGatedRecurrence — a matmul-free token mixer that replaces attention with:
- Ternary weights (BitLinear): Quantized to {-1, 0, +1} reducing memory 16x
- Parallel gated recurrence: Learned decay gates for efficient context
- No matrix multiplications in the forward pass!
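The gated recurrence above can be sketched as follows. The exact gate parameterization used in FlashLM is not spelled out here, so this is a minimal reference form, assuming a learned per-channel decay gate `a_t` in (0, 1); it is written as a sequential loop for clarity, though the same recurrence can be evaluated in parallel via cumulative products of the gates:

```python
import torch

def gated_recurrence(v, a):
    """Reference recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * v_t.

    v, a: tensors of shape (batch, seq, dim); a holds decay gates in (0, 1).
    Hypothetical form -- illustrates the mixing scheme, not FlashLM's exact code.
    """
    h = torch.zeros_like(v[:, 0])
    out = []
    for t in range(v.shape[1]):
        h = a[:, t] * h + (1 - a[:, t]) * v[:, t]
        out.append(h)
    return torch.stack(out, dim=1)
```

Note that the update uses only elementwise multiplies and adds, which is what makes the token mixer matmul-free.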
```
Parameters:   29,750,784
Ternary:      26,542,080 (89%)
Float:         3,208,704 (11%)
Ternary size: ~6.6 MB (vs ~119 MB in float32)
```
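As a concrete illustration of the ternary scheme, here is a minimal absmean quantizer in the style of BitLinear. This is a sketch, not the actual FlashLM code. The memory figure above checks out: 26,542,080 ternary weights at 2 bits each come to roughly 6.6 MB.

```python
import torch

def ternary_quantize(w, eps=1e-5):
    """Absmean quantization to {-1, 0, +1} with a per-tensor scale.

    Sketch of the BitLinear-style scheme: scale by the mean absolute
    weight, round, and clip to the ternary set.
    """
    scale = w.abs().mean().clamp(min=eps)
    q = (w / scale).round().clamp(-1, 1)
    return q, scale

# Storage estimate: 2 bits per ternary weight.
ternary_params = 26_542_080
print(f"{ternary_params * 2 / 8 / 1e6:.1f} MB")  # ≈ 6.6 MB
```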
## Usage

### With Gradio Demo
```python
import torch
from tokenizers import Tokenizer
from demo_v5 import ThunderboltLM, load_model

# Load model and tokenizer
model = load_model("FlashLM_v5_Results")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Generate text
prompt = "Once upon a time"
ids = tokenizer.encode(prompt).ids
x = torch.tensor([ids])
out = model.generate(x, max_new_tokens=100, temperature=0.8)
print(tokenizer.decode(out[0].tolist()))
```
### Direct Model Loading
```python
import torch
from ThunderboltLM import ThunderboltLM
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Create model
model = ThunderboltLM(
    vocab=8192,
    d_model=384,
    n_heads=8,
    d_head=48,
    n_layers=18,
    d_ffn=1152,
)

# Load weights
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Generate
ids = tokenizer.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=100)
print(tokenizer.decode(out[0].tolist()))
```
## Training Details
| Metric | Value |
|---|---|
| Parameters | 29.7M |
| Ternary Parameters | 26.5M |
| Vocabulary Size | 8192 |
| Model Dimension | 384 |
| Layers | 18 |
| Attention Heads | 8 |
| Head Dimension | 48 |
| FFN Dimension | 1152 |
| Context Length | 256 |
| Training Tokens | ~958M |
| Training Time | ~40 hours |
| Hardware | AMD Ryzen 7950X3D |
| Final Loss | 0.306 |
| Final PPL | 1.36 |
| Final BPC | 0.44 |
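The reported PPL and BPC follow directly from the final loss: perplexity is `exp(loss)` for a cross-entropy loss in nats, and dividing by ln 2 converts the loss to bits (note this gives bits per *token* under the 8192-token BPE vocabulary; the table labels it BPC). A quick check:

```python
import math

loss = 0.306                # final cross-entropy (nats/token), from the table
ppl = math.exp(loss)        # perplexity ≈ 1.36
bits = loss / math.log(2)   # loss in bits/token ≈ 0.44
print(f"PPL = {ppl:.2f}, bits = {bits:.2f}")
```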
## 🎉 Acknowledgments 🎉

Massive thanks to arki05 🙏, who provided the AMD Ryzen 7950X3D used to train this model. Without this generous contribution of their machine, the project would not have been possible; I would still be stuck on free-tier compute. Thank you, arki05! ⚡
## Comparison with Baselines
| Model | Params | PPL | Training |
|---|---|---|---|
| FlashLM v5 Thunderbolt | 29.7M | 1.36 | ~40h CPU |
| TinyStories-1M (baseline) | 1M | 1.59 | ~24h GPU |
| FlashLM v4 "Bolt" | 4.3M | 15.05 | 2h CPU |
| FlashLM v5.2 "Nova-Ignition" | 5.0M | 10.56 | 2h CPU |
To my knowledge, FlashLM v5 is the first CPU-trained model to beat the TinyStories-1M baseline on a comparable wall-clock training budget.
## Limitations
- Trained only on TinyStories (synthetic short stories)
- No chat capability
- BPE tokenizer trained specifically for this dataset
- CPU inference is slower than GPU
## Citation
If you use this model, please cite:
```bibtex
@misc{flashlm-v5-thunderbolt,
  author = {Chang Cheng},
  title  = {FlashLM v5 Thunderbolt: CPU-Based MatMul-Free Language Model},
  year   = {2026},
  url    = {https://github.com/changcheng967/FlashLM}
}
```
## License
MIT License
FlashLM: Democratizing Language Model Research ⚡