# BERT-DLM
A BERT-base model (110M parameters) trained from scratch with a modern diffusion language model (DLM) objective: absorbing-state diffusion with a uniform noise schedule.
This model is part of a paired experiment comparing classic BERT MLM training against modern DLM training. See AntonXue/BERT-MLM for the counterpart.
## Training Objective
Absorbing-state diffusion with a uniform schedule: sample t ~ U(0,1), mask each token independently with probability t (replacing it with [MASK]), then predict the original tokens at the masked positions. The loss is cross-entropy over masked positions with uniform time weighting (time_weight = 1).
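The corruption step described above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation, not the repo's actual code; the function name and `-100` ignore-index convention are assumptions (the latter matches the standard cross-entropy ignore index).

```python
import torch

def dlm_corrupt(input_ids, mask_token_id, generator=None):
    """Absorbing-state corruption with a uniform schedule:
    sample t ~ U(0,1) per sequence, then mask each token
    independently with probability t (always [MASK], never random)."""
    batch, seq_len = input_ids.shape
    t = torch.rand(batch, 1, generator=generator)              # one t per sequence
    mask = torch.rand(batch, seq_len, generator=generator) < t
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id                            # absorbing state
    labels = input_ids.clone()
    labels[~mask] = -100                                       # loss only on masked positions
    return corrupted, labels
```

With `time_weight = 1`, the masked-position cross-entropy on `(corrupted, labels)` is used directly, with no rescaling by t.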
Key differences from classic BERT MLM:
- Variable mask rate (0-100%) vs fixed 15% — model sees the full spectrum from nearly clean to nearly destroyed
- Always [MASK] replacement (absorbing state) vs 80/10/10 corruption scheme
- Uniform noise schedule — no cosine time weighting
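For contrast, the classic BERT corruption being compared against can be sketched as follows (a hypothetical helper for illustration, not this repo's code):

```python
import torch

def mlm_corrupt(input_ids, mask_token_id, vocab_size, mlm_prob=0.15, generator=None):
    """Classic BERT MLM corruption: select a fixed 15% of tokens;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    selected = torch.rand(input_ids.shape, generator=generator) < mlm_prob
    labels = input_ids.clone()
    labels[~selected] = -100                  # predict only the selected 15%
    corrupted = input_ids.clone()
    r = torch.rand(input_ids.shape, generator=generator)
    mask_pos = selected & (r < 0.8)           # 80%: replace with [MASK]
    rand_pos = selected & (r >= 0.8) & (r < 0.9)  # 10%: replace with a random token
    corrupted[mask_pos] = mask_token_id       # remaining 10%: left unchanged
    corrupted[rand_pos] = torch.randint(vocab_size, input_ids.shape,
                                        generator=generator)[rand_pos]
    return corrupted, labels
```

The DLM objective drops both the fixed rate and the 80/10/10 split: every corrupted position is always [MASK], and the mask rate itself varies per sequence.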
## Dataset
- BookCorpusOpen — ~17K books
- English Wikipedia (20231101.en) — ~6.4M articles
- Split: 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- Train sequences: 10,784,085
- Total train tokens: 5.52B
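The tokenize-and-pack step can be sketched as below (an illustrative implementation, not the repo's): documents are concatenated into one token stream and sliced into fixed 512-token sequences, so no padding tokens are needed.

```python
def pack_sequences(token_streams, seq_len=512):
    """Concatenate tokenized documents and slice into fixed-length
    sequences; the incomplete tail is dropped rather than padded."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

A packed sequence may therefore span a document boundary; this is the standard trade-off for eliminating padding.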
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
## Usage
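Since the model is a standard BERT-base with a masked-LM head, it can be loaded with `AutoModelForMaskedLM` and queried on [MASK] positions. A minimal sketch, assuming the repo id is `AntonXue/BERT-DLM` (inferred from the `AntonXue/BERT-MLM` counterpart, not confirmed):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Repo id assumed from the paired-experiment naming; adjust if it differs.
repo = "AntonXue/BERT-DLM"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and decode the top prediction.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_idx].argmax()))
```

Because training covered mask rates from 0 to 100%, the model can also fill many [MASK] positions at once; iterative unmask-and-refill decoding is possible but not shown here.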
## Code
Training code: github.com/AntonXue/dBERT