# BERT-DLM
A BERT-base model (110M parameters) trained from scratch with a modern diffusion language model (DLM) objective: absorbing-state diffusion with a uniform noise schedule.
This model is part of a paired experiment comparing classic BERT MLM training against modern DLM training. See AntonXue/BERT-MLM for the counterpart.
## Training Objective
Absorbing-state diffusion with a uniform schedule: sample t ~ U(0,1), mask each token independently with probability t (replacing it with [MASK]), then predict the original tokens at the masked positions. The loss is cross-entropy over masked positions with uniform time weighting (time_weight = 1).
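The corruption step described above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation, not the repo's actual code; the function name and `-100` ignore-index convention are assumptions (the latter matches the standard cross-entropy ignore index).

```python
import torch

def dlm_corrupt(input_ids, mask_token_id, generator=None):
    """Absorbing-state corruption with a uniform schedule:
    sample t ~ U(0,1) per sequence, then mask each token
    independently with probability t (always [MASK], never random)."""
    batch, seq_len = input_ids.shape
    t = torch.rand(batch, 1, generator=generator)              # one t per sequence
    mask = torch.rand(batch, seq_len, generator=generator) < t
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id                            # absorbing state
    labels = input_ids.clone()
    labels[~mask] = -100                                       # loss only on masked positions
    return corrupted, labels
```

With `time_weight = 1`, the masked-position cross-entropy on `(corrupted, labels)` is used directly, with no rescaling by t.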
Key differences from classic BERT MLM:
- Variable mask rate (0-100%) vs fixed 15% — model sees the full spectrum from nearly clean to nearly destroyed
- Always [MASK] replacement (absorbing state) vs 80/10/10 corruption scheme
- Uniform noise schedule — no cosine time weighting
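For contrast, the classic BERT corruption being compared against can be sketched as follows (a hypothetical helper for illustration, not this repo's code):

```python
import torch

def mlm_corrupt(input_ids, mask_token_id, vocab_size, mlm_prob=0.15, generator=None):
    """Classic BERT MLM corruption: select a fixed 15% of tokens;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    selected = torch.rand(input_ids.shape, generator=generator) < mlm_prob
    labels = input_ids.clone()
    labels[~selected] = -100                  # predict only the selected 15%
    corrupted = input_ids.clone()
    r = torch.rand(input_ids.shape, generator=generator)
    mask_pos = selected & (r < 0.8)           # 80%: replace with [MASK]
    rand_pos = selected & (r >= 0.8) & (r < 0.9)  # 10%: replace with a random token
    corrupted[mask_pos] = mask_token_id       # remaining 10%: left unchanged
    corrupted[rand_pos] = torch.randint(vocab_size, input_ids.shape,
                                        generator=generator)[rand_pos]
    return corrupted, labels
```

The DLM objective drops both the fixed rate and the 80/10/10 split: every corrupted position is always [MASK], and the mask rate itself varies per sequence.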
## Dataset
- BookCorpusOpen — ~17K books
- English Wikipedia (20231101.en) — ~6.4M articles
- Split: 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- Train sequences: 10,784,085
- Total train tokens: 5.52B
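The tokenize-and-pack step can be sketched as below (an illustrative implementation, not the repo's): documents are concatenated into one token stream and sliced into fixed 512-token sequences, so no padding tokens are needed.

```python
def pack_sequences(token_streams, seq_len=512):
    """Concatenate tokenized documents and slice into fixed-length
    sequences; the incomplete tail is dropped rather than padded."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

A packed sequence may therefore span a document boundary; this is the standard trade-off for eliminating padding.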
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
## Usage
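Since the model is a standard BERT-base with a masked-LM head, it can be loaded with `AutoModelForMaskedLM` and queried on [MASK] positions. A minimal sketch, assuming the repo id is `AntonXue/BERT-DLM` (inferred from the `AntonXue/BERT-MLM` counterpart, not confirmed):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Repo id assumed from the paired-experiment naming; adjust if it differs.
repo = "AntonXue/BERT-DLM"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and decode the top prediction.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_idx].argmax()))
```

Because training covered mask rates from 0 to 100%, the model can also fill many [MASK] positions at once; iterative unmask-and-refill decoding is possible but not shown here.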
## Code
Training code: github.com/AntonXue/dBERT