A newer version of this model is available: changcheng967/flashlm-v5.2-nova-ignition

FlashLM v5 "Thunderbolt" ⚡

A 29.7M parameter matmul-free language model trained entirely on CPU without GPUs.

Model Description

FlashLM v5 "Thunderbolt" is a language model pre-trained from scratch on consumer hardware — without any GPUs. It combines a novel matmul-free token mixer, ParallelGatedRecurrence, with ternary weights (BitLinear) to achieve dramatic efficiency improvements.

Key Achievements

  • Final PPL: 1.36 — Beats the TinyStories-1M baseline (PPL 1.59)!
  • Final BPC: 0.44
  • Training Time: ~40 hours on AMD Ryzen 7950X3D
  • Training Data: ~1B tokens from TinyStories

Architecture

FlashLM v5 uses ParallelGatedRecurrence — a matmul-free token mixer that replaces attention with:

  • Ternary weights (BitLinear): Quantized to {-1, 0, +1} reducing memory 16x
  • Parallel gated recurrence: Learned decay gates for efficient context
  • No matrix multiplications in the forward pass!

Parameters:     29,750,784
Ternary:        26,542,080 (89%)
Float:           3,208,704 (11%)
Ternary size:   ~6.6 MB (vs 119 MB float32)
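
The ternary quantization can be sketched in a few lines. This is a minimal illustration assuming the common absmean recipe (scale by the mean absolute weight, round, clip to {-1, 0, +1}); the exact BitLinear recipe used in FlashLM may differ. The last line also checks the storage figure above: 26,542,080 ternary weights at 2 bits each is about 6.6 MB.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Absmean ternary quantization (sketch): scale by the mean
    absolute weight, round, and clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + 1e-8            # per-tensor scale
    w_q = np.clip(np.round(w / scale), -1, 1)  # ternary values
    return w_q.astype(np.int8), scale          # dequantize as w_q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(384, 384)).astype(np.float32)
w_q, scale = ternary_quantize(w)
assert set(np.unique(w_q).tolist()) <= {-1, 0, 1}

# Storage check: each ternary weight fits in 2 bits.
print(26_542_080 * 2 / 8 / 1e6)  # ≈ 6.6 MB, matching the figure above
```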
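
The gated-recurrence mixer can likewise be sketched with elementwise operations only. This is a generic illustration of a learned-decay linear recurrence, not the model's actual ParallelGatedRecurrence code; the point it demonstrates is that the recurrence itself contains no matrix multiplication.

```python
import numpy as np

def gated_recurrence(v: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Learned-decay gated recurrence (sketch):
    h_t = g_t * h_{t-1} + (1 - g_t) * v_t, all elementwise.
    v, g: (seq_len, d) arrays, with decay gates g in (0, 1)."""
    h = np.zeros_like(v[0])
    out = np.empty_like(v)
    for t in range(v.shape[0]):
        h = g[t] * h + (1.0 - g[t]) * v[t]  # elementwise only, no matmul
        out[t] = h
    return out
```

Because the update is linear in h, the whole sequence can also be evaluated with an associative parallel scan in O(log T) depth, which is what makes this style of mixer efficient to train.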

Usage

With Gradio Demo

import torch
from tokenizers import Tokenizer
from demo_v5 import load_model

# Load model (load_model is assumed here to return the model object)
model = load_model("FlashLM_v5_Results")

# Load the tokenizer shipped with the checkpoint
tokenizer = Tokenizer.from_file("tokenizer.json")

# Generate text
prompt = "Once upon a time"
ids = tokenizer.encode(prompt).ids
x = torch.tensor([ids])
out = model.generate(x, max_new_tokens=100, temperature=0.8)
print(tokenizer.decode(out[0].tolist()))

Direct Model Loading

import torch
from ThunderboltLM import ThunderboltLM
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Create model
model = ThunderboltLM(
    vocab=8192,
    d_model=384,
    n_heads=8,
    d_head=48,
    n_layers=18,
    d_ffn=1152
)

# Load weights
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Generate (no gradients needed for inference)
ids = tokenizer.encode("Once upon a time").ids
with torch.no_grad():
    out = model.generate(torch.tensor([ids]), max_new_tokens=100)
print(tokenizer.decode(out[0].tolist()))

Training Details

Metric              Value
Parameters          29.7M
Ternary parameters  26.5M
Vocabulary size     8,192
Model dimension     384
Layers              18
Attention heads     8
Head dimension      48
FFN dimension       1,152
Context length      256
Training tokens     ~958M
Training time       ~40 hours
Hardware            AMD Ryzen 7950X3D
Final loss          0.306
Final PPL           1.36
Final BPC           0.44
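
The loss, perplexity, and BPC rows are mutually consistent under the standard conversions: PPL = e^loss for a cross-entropy in nats per token, and BPC = loss / ln 2 (the latter assuming bits are reported per token):

```python
import math

loss = 0.306                 # final cross-entropy in nats per token
ppl = math.exp(loss)         # perplexity
bpc = loss / math.log(2)     # nats -> bits

print(round(ppl, 2), round(bpc, 2))  # 1.36 0.44
```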

🎉 ACKNOWLEDGMENTS 🎉

Massive Thanks to arki05!!! 🙏🙏🙏

arki05 provided the AMD Ryzen 7950X3D used for training this model!

Without arki05's generous contribution of their machine, this project would not have been possible. I would still be stuck using free tier compute!

THANK YOU ARKI05!!! ⚡⚡⚡

Comparison with Baselines

Model                         Params  PPL    Training
FlashLM v5 "Thunderbolt"      29.7M   1.36   ~40h CPU
TinyStories-1M (baseline)     1M      1.59   ~24h GPU
FlashLM v4 "Bolt"             4.3M    15.05  2h CPU
FlashLM v5.2 "Nova-Ignition"  5.0M    10.56  2h CPU

To our knowledge, FlashLM v5 is the first CPU-trained model to beat the TinyStories-1M baseline while using comparable training compute (~40h on CPU vs ~24h on GPU)!

Limitations

  • Trained only on TinyStories (synthetic short stories)
  • No chat capability
  • BPE tokenizer trained specifically for this dataset
  • CPU inference is slower than GPU

Citation

If you use this model, please cite:

@misc{flashlm-v5-thunderbolt,
  author = {Chang Cheng},
  title = {FlashLM v5 Thunderbolt: CPU-Based MatMul-Free Language Model},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}

License

MIT License


FlashLM: Democratizing Language Model Research
