# Mini-LLM: 80M Parameter Transformer (Pretrained From Scratch)
Mini-LLM is an 80M parameter decoder-only transformer trained fully from scratch using a custom tokenizer, custom architecture, and custom training loop.
It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.
## Key Features
- 80M parameters: compact but fully functional LLM
- Trained from scratch (no borrowed checkpoints)
- Custom SentencePiece BPE tokenizer (32k vocab, byte fallback)
- Modern architecture components (see the sketch after this list):
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU feed-forward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready attention implementation
- Trained on a 2B-token mixed corpus (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots included for transparency
- Released under a permissive license for research & learning
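Since these components are the main departure from a vanilla GPT-style block, here is a minimal PyTorch sketch of RMSNorm, SwiGLU, and RoPE. It only illustrates the techniques; the class/function names, shapes, and defaults are assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS, no mean subtraction and no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: SiLU-gated up-projection to `hidden`, then down-projection."""
    def __init__(self, dim: int = 384, hidden: int = 1536):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rope(q, k, base: float = 10000.0):
    """Rotary position embeddings for q, k of shape (batch, heads, seq, head_dim)."""
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x):
        x1, x2 = x[..., 0::2], x[..., 1::2]           # even/odd feature pairs
        return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

    return rotate(q), rotate(k)
```

RMSNorm drops LayerNorm's mean-centering and bias, and RoPE rotates query/key feature pairs by position-dependent angles instead of adding learned position embeddings.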
## Model Architecture
| Component | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV Heads | 6 |
| MLP Hidden Dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional Encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |
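To make the table concrete, here is a hedged sketch of the attention path through `torch.nn.functional.scaled_dot_product_attention` (SDPA), which dispatches to a FlashAttention kernel when one is available. Module names, the bias-free projections, and the GQA note are illustrative assumptions rather than the repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Causal self-attention via SDPA. With n_kv_heads == n_heads this is plain MHA;
    fewer KV heads would give GQA (repeat k/v to n_heads, or use SDPA's enable_gqa
    flag on recent PyTorch versions)."""
    def __init__(self, dim: int = 384, n_heads: int = 6, n_kv_heads: int = 6):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads                       # 384 / 6 = 64
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # RoPE (see the sketch above) would be applied to q and k here.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))

print(CausalSelfAttention()(torch.randn(1, 8, 384)).shape)   # torch.Size([1, 8, 384])
```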
## Files in This Repo
- `checkpoints/` → Pretrained model state_dict + optimizer
- `safetensors/` → Final consolidated .safetensors file
- `logs/` → Training logs in JSONL
- `plots/` → Train/val loss curves
- `tokenizer.json` → HF-compatible tokenizer
- `spm.model` → SentencePiece model
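A hedged example of inspecting the released artifacts directly; the weight file name inside `safetensors/` is an assumption, so adjust it to the file actually shipped:

```python
import sentencepiece as spm
from safetensors.torch import load_file
from tokenizers import Tokenizer

# Consolidated weights ("model.safetensors" is an assumed file name).
state_dict = load_file("safetensors/model.safetensors")
print(f"{sum(t.numel() for t in state_dict.values()) / 1e6:.1f}M parameters")

# HF-compatible tokenizer (tokenizer.json) ...
tok = Tokenizer.from_file("tokenizer.json")
print(tok.encode("Hello, how are you?").tokens)

# ... or the raw SentencePiece model (spm.model).
sp = spm.SentencePieceProcessor(model_file="spm.model")
print(sp.encode("Hello, how are you?", out_type=str))
```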
## Quick Usage (HF Transformers)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
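Continuing from the snippet above: greedy decoding tends to repeat itself at this model size, so the usual sampling arguments to `generate` help (the values below are illustrative, not tuned for Mini-LLM):

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.8,         # illustrative values, not tuned for this model
    top_p=0.95,
    repetition_penalty=1.1,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```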
## Training Details
### Optimizer
- AdamW (β1 = 0.9, β2 = 0.95, weight decay = 0.1)
- Learning rate: 6e-4 (cosine annealing + warmup)
### Batch & Sequence
- Global batch size = 32
- Sequence length = 2048
- Gradient accumulation = 8
### Hardware
- Trained on 1× NVIDIA A100 80GB
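A minimal sketch of how these settings fit together in one training step, assuming an HF-style model whose forward returns `.loss` and reading global batch 32 as 8 accumulation steps over micro-batches of 4; step counts, warmup length, and gradient clipping are placeholders rather than the actual training configuration:

```python
import math
import torch

def train(model, batches, max_steps=50_000, warmup_steps=2_000, peak_lr=6e-4, micro_steps=8):
    """Sketch: AdamW + linear warmup + cosine decay + gradient accumulation.
    `batches` is an iterator of (input_ids, labels); step counts are placeholders."""
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1)

    def lr_at(step):
        if step < warmup_steps:                                    # linear warmup
            return peak_lr * (step + 1) / warmup_steps
        t = (step - warmup_steps) / (max_steps - warmup_steps)     # cosine decay to ~0
        return 0.5 * peak_lr * (1 + math.cos(math.pi * t))

    for step in range(max_steps):
        for group in opt.param_groups:
            group["lr"] = lr_at(step)
        for _ in range(micro_steps):                               # gradient accumulation
            input_ids, labels = next(batches)
            loss = model(input_ids, labels=labels).loss            # assumes HF-style outputs
            (loss / micro_steps).backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # clip value is an assumption
        opt.step()
        opt.zero_grad(set_to_none=True)
```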
## Training Curve
Final loss reached: ~3.25
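Assuming this is the mean cross-entropy per token in nats, it corresponds to a perplexity of roughly exp(3.25) ≈ 26:

```python
import math
print(math.exp(3.25))  # ≈ 25.8
```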
## Example Outputs
Prompt: "Hello, how are you"
Output: "Hello, how are you?"

Prompt: "Python is a programming language that"
Output: "Python is a programming language that allows the history..."
## Limitations
- Small model: limited reasoning; hallucinations are likely
- Not instruction-tuned
- Not suitable for production usage
- Best viewed as a learning + research artifact
## License
MIT License: free for research, modification, and further training.
## Credits
Developed by Avinash Mynampati
Built from scratch using PyTorch + custom training pipeline.
Want to fine-tune or extend it?
You can:
- Train further with your own dataset
- Add LoRA adapters (see the sketch after this list)
- Use it to learn attention, RoPE, SwiGLU, etc.
- Build a tiny instruction-tuned version (coming soon!)
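For the LoRA route, a hedged sketch with the `peft` library; the `target_modules` names are a guess at how the attention projections are named in the custom model code, so check them against `model.named_modules()` first:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed module names, verify before use
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only the adapter weights are trainable
```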
## Contact
For questions or collaborations:
- GitHub: Ashx098
- LinkedIn: Avinash Mynampati