BERT-DLM

BERT-base (110M parameters) trained from scratch with a modern diffusion language model (DLM) objective: absorbing-state diffusion under a uniform noise schedule.

This model is part of a paired experiment comparing classic BERT MLM training against modern DLM training. See AntonXue/BERT-MLM for the counterpart.

Training Objective

Absorbing-state diffusion with uniform schedule: sample t ~ U(0,1), mask each token independently with probability t (replacing with [MASK]), then predict original tokens at masked positions. Cross-entropy loss on masked positions with uniform time weighting (time_weight = 1).

Key differences from classic BERT MLM:

  • Variable mask rate (0-100%) vs fixed 15% — model sees the full spectrum from nearly clean to nearly destroyed
  • Always [MASK] replacement (absorbing state) vs 80/10/10 corruption scheme
  • Uniform noise schedule — no cosine time weighting
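The corruption step above can be sketched in a few lines. This is a minimal illustration, not the repository's actual training code; `MASK_ID = 103` is BERT's conventional `[MASK]` token id, assumed here, and `-100` is the usual ignore index for cross-entropy.

```python
import random

MASK_ID = 103  # hypothetical: BERT's conventional [MASK] token id

def corrupt(tokens, t=None, rng=random):
    """Absorbing-state corruption: mask each token i.i.d. with probability t.

    Uniform schedule: t ~ U(0,1), so the model sees mask rates from ~0% to ~100%.
    Masked positions always become [MASK] (no 80/10/10 scheme); the loss target
    is the original token at masked positions and -100 (ignored) elsewhere.
    """
    if t is None:
        t = rng.random()  # sample a mask rate for this sequence
    noisy, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            noisy.append(MASK_ID)
            targets.append(tok)      # predict the original token here
        else:
            noisy.append(tok)
            targets.append(-100)     # no loss on unmasked positions
    return noisy, targets
```

Cross-entropy over `targets` with uniform time weighting (`time_weight = 1`) then gives the training loss.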

Dataset

  • BookCorpusOpen — ~17K books
  • English Wikipedia (20231101.en) — ~6.4M articles
  • Split: 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
  • Train sequences: 10,784,085
  • Total train tokens: 5.52B
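The packing step described above can be sketched as follows. This is an assumed implementation, not the repository's code; in particular, whether trailing tokens shorter than one sequence are dropped is an assumption consistent with "no padding".

```python
def pack(token_streams, seq_len=512):
    """Concatenate tokenized documents and slice into fixed-length sequences.

    No padding is used: tokens flow across document boundaries, and any
    leftover shorter than seq_len at the end is dropped (assumption).
    """
    buf = []
    for toks in token_streams:
        buf.extend(toks)
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
```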

Training Configuration

Parameter           Value
Architecture        (fresh random init)
Parameters          109.5M
Sequence length     512
Global batch size   256 (128 per GPU x 2 GPUs)
Training steps      100,000
Tokens seen         ~13.1B
Optimizer           AdamW
Learning rate       1e-4
LR schedule         Constant with warmup
Warmup steps        500
Adam betas          (0.9, 0.999)
Weight decay        0.01
Max grad norm       1.0
Precision           bf16
Hardware            2x NVIDIA H100 NVL
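The "Tokens seen" figure follows directly from the other rows, since every sequence is fully packed (no padding):

```python
steps, batch, seq_len = 100_000, 256, 512
tokens_seen = steps * batch * seq_len
assert tokens_seen == 13_107_200_000  # ~13.1B, matching the table
```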

Usage

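A DLM checkpoint is not decoded left-to-right; generation works by starting from `[MASK]` tokens and iteratively unmasking the most confident positions. The sketch below shows that decoding loop only, with the model abstracted as a `score_fn` stub (the real model would be loaded with, e.g., `transformers`' `AutoModelForMaskedLM`); `MASK_ID`, `score_fn`, and the confidence-based schedule are illustrative assumptions, not the repository's API.

```python
MASK_ID = 103  # hypothetical [MASK] token id

def iterative_unmask(tokens, score_fn, steps=4):
    """Greedy iterative decoding for an absorbing-state DLM.

    score_fn(tokens) -> list of (prob, token_id) per position: a stand-in for
    the model's argmax prediction and its probability. At each step, the most
    confident fraction of remaining [MASK] positions is filled in.
    """
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        if not masked:
            break
        preds = score_fn(tokens)
        # unmask roughly 1/(steps - step) of the remaining masks, best first
        k = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda i: -preds[i][0])[:k]:
            tokens[i] = preds[i][1]
    return tokens
```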
Code

Training code: github.com/AntonXue/dBERT
