🧠 Mini-LLM: 80M Parameter Transformer (Pretrained From Scratch)


Mini-LLM is an 80M-parameter decoder-only transformer trained entirely from scratch with a custom tokenizer, custom architecture, and custom training loop.
It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.


✨ Key Features

  • 80M parameters: compact but fully functional
  • Trained from scratch (no borrowed checkpoints)
  • Custom BPE tokenizer (32k vocab, byte fallback)
  • Modern architecture components (see the sketch after this list):
    • RoPE (Rotary Position Embeddings)
    • RMSNorm
    • SwiGLU feed-forward layer
    • FlashAttention-style kernels (via PyTorch SDPA)
    • GQA-ready attention implementation
  • 2B-token mixed corpus (FineWeb + WikiText + Wikipedia)
  • Training logs, checkpoints, and plots included for transparency
  • Released under a permissive license for research and learning
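
As a quick illustration of two of these components, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block (illustrative only, not the repository's exact implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Root-mean-square normalization: no mean subtraction and no bias term
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    # Gated feed-forward: down(SiLU(gate(x)) * up(x))
    def __init__(self, dim=384, hidden_dim=1536):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

The FlashAttention path usually amounts to calling torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True), which dispatches to a fused kernel when one is available.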

πŸ“ Model Architecture

Component Value
Type Decoder-only transformer
Parameters ~80M
Layers 16
Embedding dim 384
Attention heads 6
KV Heads 6
MLP Hidden Dim 1536 (SwiGLU)
Max sequence length 2048
Norm RMSNorm
Positional Encoding RoPE
Tokenizer SentencePiece BPE (32k vocab, byte fallback)
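
For reference, the architecture above maps naturally onto a small configuration object. The class and field names below are hypothetical, not taken from the repository's code:

from dataclasses import dataclass

@dataclass
class MiniLLMConfig:            # hypothetical name
    vocab_size: int = 32_000
    n_layers: int = 16
    d_model: int = 384
    n_heads: int = 6
    n_kv_heads: int = 6         # equal to n_heads here, so effectively MHA, but GQA-ready
    mlp_hidden_dim: int = 1536  # SwiGLU hidden size
    max_seq_len: int = 2048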

📦 Files in This Repo

  • checkpoints/ → Pretrained model state_dict + optimizer state
  • safetensors/ → Final consolidated .safetensors file
  • logs/ → Training logs in JSONL
  • plots/ → Train/val loss curves
  • tokenizer.json → HF-compatible tokenizer
  • spm.model → SentencePiece model
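
If you prefer to work with the raw artifacts rather than the Transformers wrapper, something like the following should work; the .safetensors file name is an assumption, so adjust it to whatever actually sits inside safetensors/:

from safetensors.torch import load_file
from tokenizers import Tokenizer

state_dict = load_file("safetensors/model.safetensors")  # assumed file name
tokenizer = Tokenizer.from_file("tokenizer.json")        # HF-compatible tokenizer

print(len(state_dict), "tensors loaded")
print(tokenizer.encode("Hello, how are you?").tokens)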

🧪 Quick Usage (HF Transformers)

from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is needed because the model uses a custom architecture
model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")

# Greedy decoding by default; see the sampling example below
outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
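
Greedy decoding tends to loop and repeat with a small base model; sampling usually reads better. The settings below are just a reasonable starting point, not tuned values:

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(tok.decode(outputs[0], skip_special_tokens=True))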

🚀 Training Details

Optimizer

  • AdamW (β₁ = 0.9, β₂ = 0.95, weight decay = 0.1)
  • Learning rate: 6e-4 (warmup, then cosine annealing); an equivalent setup is sketched below
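
A minimal sketch of an equivalent setup in plain PyTorch; warmup_steps and total_steps are placeholders rather than values from the actual run, and model is the module being trained:

import math
import torch

optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps, total_steps = 1_000, 100_000  # placeholders

def lr_lambda(step):
    # Linear warmup, then cosine decay toward zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)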

Batch ⨉ Sequence

  • Global batch size = 32
  • Sequence length = 2048
  • Gradient accumulation = 8 (see the training-step sketch below)
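
A rough sketch of how these settings combine in one training step, reusing the optimizer and scheduler above; train_loader, the loss interface, and the gradient-clipping value are assumptions, not details documented here:

accum_steps = 8

optimizer.zero_grad(set_to_none=True)
for step, (input_ids, labels) in enumerate(train_loader):  # train_loader is assumed
    loss = model(input_ids=input_ids, labels=labels).loss
    (loss / accum_steps).backward()  # scale so gradients average across micro-batches
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip value is an assumption
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)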

Hardware

  • Trained on 1× NVIDIA A100 (80 GB)

📊 Training Curve

Final training loss reached: ~3.25 (the full train/val curves are included in plots/).
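
If this value is the usual per-token cross-entropy in nats, it corresponds to a perplexity of roughly exp(3.25) ≈ 26.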

💬 Example Outputs

Prompt: "Hello, how are you"
Output: "Hello, how are you?"

Prompt: "Python is a programming language that"
Output: "Python is a programming language that allows the history..."

⚠️ Limitations

  • Small model → limited reasoning; hallucinations are likely
  • Not instruction-tuned
  • Not suitable for production use
  • Best viewed as a learning and research artifact

📜 License

MIT License: free for research, modification, and further training.

🙌 Credits

Developed by Avinash Mynampati
Built from scratch using PyTorch + custom training pipeline.

Want to fine-tune or extend it?

You can:

  • Train further with your own dataset
  • Add LoRA adapters (see the sketch after this list)
  • Use it to learn attention, RoPE, SwiGLU, etc.
  • Build a tiny instruction-tuned version (coming soon!)
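
For the LoRA route, here is a hedged sketch using the peft library; the target_modules names are guesses, so inspect model.named_modules() to find the actual projection-layer names in this custom architecture:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed module names; verify against the model
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()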

📬 Contact

For questions or collaborations:
