KL3M 500M, 7th Gen Model, Step 37000 (4x Stacked Architecture)
A 500M parameter language model created via 4x cyclic layer duplication (G_stack method, NeurIPS 2024) from the KL3M 170M Phase 2+A checkpoint. This checkpoint represents 37,000 training steps on the 120-layer architecture.
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 500.3M (487M non-embedding)
- Layers: 120 (4x stacked from 30)
- Source: alea-institute/kl3m-006-170m-checkpoint-63000
- Stacking Method: G_stack cyclic duplication
- Training Status: 37,000 steps (3.92B tokens processed)
- Precision: BF16
Training Progress
- Steps: 37,000 / 100,000 planned
- Tokens Processed: 3,917,824,000 (~3.92B)
- Valid Tokens: 3,914,825,728
- Dataset: alea-institute/kl3m-data-sample-006-medium
Model Architecture
- Hidden Size: 576 (unchanged from source)
- Layers: 120 (4× from 30)
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536 (unchanged from source)
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
- Parameter Growth: 181.7M → 500.3M (2.75× increase)
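For reference, these dimensions map onto a standard Hugging Face LlamaConfig as sketched below; max_position_embeddings and tie_word_embeddings are not stated on this card and are assumptions.

from transformers import LlamaConfig

# Sketch of the architecture listed above; values marked as assumptions are not
# confirmed by this card.
config = LlamaConfig(
    vocab_size=131072,
    hidden_size=576,
    intermediate_size=1536,
    num_hidden_layers=120,          # 4x stacked from 30
    num_attention_heads=9,
    num_key_value_heads=3,          # GQA
    rope_theta=100000.0,
    max_position_embeddings=4096,   # assumption: matches the 4096-token training sequence length
    tie_word_embeddings=True,       # assumption: consistent with the stated 500.3M total
)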
G_stack Methodology
Cyclic Layer Duplication
Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):
Stacking Pattern:
Source (30 layers): [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
                     └─ cyclic repetition, 4 times
Preserved Components:
- Embedding layer (wte): Copied once
- Final layer norm (ln_f): Copied once
- LM head: Copied once
Duplicated Components:
- All 30 transformer blocks repeated 4 times
- Each block contains: self-attention (q/k/v/o_proj), MLP (gate/up/down_proj), layer norms
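A minimal sketch of the cyclic duplication, assuming Hugging Face Llama parameter naming (the card lists the GPT-style names wte and ln_f; the mapping is the same for embed_tokens, model.norm, and lm_head):

import re

def g_stack_cyclic(source_sd: dict, source_layers: int = 30, growth: int = 4) -> dict:
    """Cyclic G_stack duplication: target layer i copies source layer i % source_layers.
    Embeddings, final norm, and LM head are copied once. Assumes Hugging Face Llama
    parameter names (model.layers.N...., model.embed_tokens, model.norm, lm_head)."""
    layer_pat = re.compile(r"^model\.layers\.(\d+)\.(.+)$")
    target = {}
    for name, tensor in source_sd.items():
        m = layer_pat.match(name)
        if m is None:
            target[name] = tensor.clone()          # non-layer params: copied once
        else:
            src_idx, suffix = int(m.group(1)), m.group(2)
            for rep in range(growth):              # place this block at layers src, src+30, src+60, src+90
                tgt_idx = rep * source_layers + src_idx
                target[f"model.layers.{tgt_idx}.{suffix}"] = tensor.clone()
    return target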
Expected Training Efficiency
Per G_stack paper findings:
- Token efficiency: ~54% fewer tokens to reach target loss vs training 120L from scratch
- Computational savings: ~46% reduction in FLOPs
- Convergence: Duplicated layers naturally diverge during training
Training Configuration
Optimizer: Muon
Depth-scaled learning rates:
- Muon LR: 0.00365 (depth-scaled: base × √(16/120))
- Aux LR: 5e-4
- Weight Decay: 1e-5 (Muon), 1e-3 (Aux)
- Momentum: 0.95
Layer-specific LR multipliers:
- q_proj/o_proj: 0.5×
- k_proj/v_proj: 0.75×
- mlp: 1.0×
- lm_head: 0.9×
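A minimal sketch of how the depth-scaled rate and per-projection multipliers combine; the base learning rate of 0.01 is inferred from 0.00365 / √(16/120) and is not stated on this card.

import math

base_muon_lr = 0.01                      # inferred base rate (assumption)
depth_scale = math.sqrt(16 / 120)        # scale from a 16-layer reference depth to 120 layers
muon_lr = base_muon_lr * depth_scale     # ~= 0.00365

lr_multipliers = {
    "q_proj": 0.5, "o_proj": 0.5,
    "k_proj": 0.75, "v_proj": 0.75,
    "mlp": 1.0,
    "lm_head": 0.9,
}

def lr_for(param_name: str) -> float:
    """Depth-scaled Muon LR times the matching projection multiplier (1.0 by default)."""
    for key, mult in lr_multipliers.items():
        if key in param_name:
            return muon_lr * mult
    return muon_lr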
Spectral Clamping (Layer-Specific)
Attention layers (q/k/v/o_proj):
- Frequency: Every 100 steps
- Max condition: 3000
- Sigma floor: 5e-5
MLP layers (gate/up/down):
- Frequency: Every 100 steps
- Max condition: 2000
- Sigma floor: 5e-5
LM head:
- Frequency: Every 100 steps
- Max condition: 2000
- Sigma floor: 5e-5
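The exact clamping routine is not published on this card; the sketch below illustrates one common SVD-based formulation that bounds the condition number and floors the singular values using the settings listed above.

import torch

@torch.no_grad()
def spectral_clamp(weight, max_condition, sigma_floor):
    """SVD-based clamp: floor the singular values at max(sigma_max / max_condition,
    sigma_floor) so the condition number stays bounded, then reconstruct."""
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = torch.clamp(s.max() / max_condition, min=sigma_floor)
    return (u @ torch.diag(torch.clamp(s, min=floor)) @ vh).to(weight.dtype)

# Applied every 100 steps with the layer-specific settings above, e.g.
# attention projections: spectral_clamp(w, max_condition=3000, sigma_floor=5e-5)
# MLP and lm_head:       spectral_clamp(w, max_condition=2000, sigma_floor=5e-5)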
Regularization
- Label Smoothing: 0.001
- Entropy Bonus Weight: 0.001
- Entropy Target: 6.5 bits
- Entropy Target Weight: 0.001
- Activation Norm Weight: 0.0006
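How these terms combine is not specified on this card; the sketch below is one plausible formulation that uses the listed weights and the 6.5-bit entropy target, with everything else assumed.

import math
import torch
import torch.nn.functional as F

def regularized_lm_loss(logits, labels, hidden_states):
    """One plausible combination of the regularizers listed above. Weights and the
    6.5-bit entropy target come from this card; the exact KL3M formulation is an assumption."""
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        label_smoothing=0.001,
    )
    probs = F.softmax(logits, dim=-1)
    entropy_nats = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    entropy_bits = entropy_nats / math.log(2.0)
    entropy_bonus = -0.001 * entropy_bits                    # reward higher predictive entropy
    entropy_target = 0.001 * (entropy_bits - 6.5) ** 2       # pull mean entropy toward 6.5 bits
    activation_norm = 0.0006 * hidden_states.pow(2).mean()   # discourage large hidden activations
    return ce + entropy_bonus + entropy_target + activation_norm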
Batch Configuration
- Micro batch: 5
- Gradient accumulation: 32
- Effective batch: 160 sequences
- Sequence length: 4096 tokens
- Effective tokens/batch: 655,360
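The effective figures follow directly from the micro-batch settings:

micro_batch = 5
grad_accum = 32
seq_len = 4096

effective_batch = micro_batch * grad_accum    # 160 sequences per optimizer step
tokens_per_step = effective_batch * seq_len   # 655,360 tokens per optimizer step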
Training Details
Data Processing
- Dataset: Multi-domain legal corpus (kl3m-data-sample-006-medium)
- Streaming: Yes (shuffle buffer: 32)
- Packing: Cross-record packing enabled
- Append EOS: Yes
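A minimal sketch of the streaming and cross-record packing setup described above, assuming the datasets library, a train split, and a "text" field in the dataset schema; the actual KL3M data pipeline is not published here.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step37000")

# Stream the corpus with a small shuffle buffer, as listed above.
stream = load_dataset(
    "alea-institute/kl3m-data-sample-006-medium",
    split="train",
    streaming=True,
).shuffle(buffer_size=32)

def packed_blocks(records, seq_len=4096):
    """Cross-record packing: tokenize each record, append EOS, concatenate across
    record boundaries, and yield fixed-length blocks of seq_len token ids.
    The "text" field name is an assumption about the dataset schema."""
    buffer = []
    for record in records:
        buffer.extend(tokenizer(record["text"])["input_ids"])
        buffer.append(tokenizer.eos_token_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]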
Optimization
- Gradient Checkpointing: Enabled (non-reentrant)
- Mixed Precision: BF16
- Flash Attention: Auto
- Gradient Clipping: Adaptive (max 64.0, coeff 2.0, beta 0.9)
- TF32: Auto
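The adaptive gradient clipping rule is not documented beyond the constants above; the sketch below shows one interpretation that tracks an EMA of the global gradient norm (beta 0.9) and clips at min(64.0, 2.0 × EMA).

import torch

class AdaptiveGradClipper:
    """One interpretation of "adaptive (max 64.0, coeff 2.0, beta 0.9)": keep an
    EMA of the global gradient norm and clip at min(max_norm, coeff * ema)."""

    def __init__(self, max_norm=64.0, coeff=2.0, beta=0.9):
        self.max_norm, self.coeff, self.beta = max_norm, coeff, beta
        self.ema = None

    def clip(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params]))
        self.ema = total_norm if self.ema is None else self.beta * self.ema + (1 - self.beta) * total_norm
        threshold = min(self.max_norm, self.coeff * float(self.ema))
        torch.nn.utils.clip_grad_norm_(params, threshold)
        return threshold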
Learning Rate Schedule
- Scheduler: Cosine decay
- Warmup: 100 steps
- Min LR ratio: 0.1
- Max steps: 100,000
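A sketch of the schedule with the settings above (linear warmup is assumed; the card does not state the warmup shape):

import math

def lr_at_step(step, peak_lr, warmup=100, max_steps=100_000, min_lr_ratio=0.1):
    """Cosine decay matching the settings above: warm up to peak_lr over 100 steps,
    then decay to min_lr_ratio * peak_lr by step 100,000."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / (max_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)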
Usage
from transformers import pipeline

# Load the text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-007-500m-step37000",
    device_map="auto"
)

# Generate text
output = generator(
    "The United States Constitution establishes",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95
)
print(output[0]['generated_text'])
Alternative: Using model and tokenizer directly
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step37000",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step37000")
inputs = tokenizer("The United States Constitution establishes", return_tensors="pt", return_token_type_ids=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Generation Quality
This checkpoint has undergone 37,000 steps of training on the 120-layer architecture. The model is still in active training and quality continues to improve.
Recommended sampling parameters (from project testing):
- Temperature: 0.8
- Top-p: 0.95
- Repetition penalty: 1.0
Why G_stack?
Advantages over Training from Scratch
- Proven warm start: All 120 layers begin with learned representations
- Faster convergence: 54% fewer tokens expected
- Stable training: Avoids cold-start instabilities
- Spectral inheritance: Good conditioning from Phase A source
- Simple implementation: Just cyclic duplication, no complex initialization
Alternative Approaches Not Used
- Function-preserving init: More complex, similar results per paper
- Random initialization: Much slower convergence
- Progressive stacking: More complex training schedule
- Width expansion first: Different scaling dimension
Model Comparison
| Model | Layers | Params | Training Status | Tokens Processed |
|---|---|---|---|---|
| kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | Trained (63K steps) | ~15.8B |
| kl3m-007-500m-step0 | 120 | 500.3M | Step 0 (init only) | 0 |
| kl3m-007-500m-step37000 | 120 | 500.3M | Step 37K (in training) | ~3.92B |
Training Philosophy
G_stack enables efficient depth scaling:
- Start with a proven 170M model (63K steps, 15.83B tokens)
- Stack to 4× depth (120 layers)
- Continue training with the ~54% token-efficiency gain
- Achieve 500M-scale quality in ~100-150K steps (vs ~250K from scratch)
This approach leverages transfer learning in the depth dimension rather than traditional width scaling or fine-tuning.
Next Steps
This checkpoint will continue training toward:
- Target steps: 100,000
- Expected quality match: 170M@300K by step ~150K
- Final model: kl3m-007-500m-final
Follow training progress in the kl3m-007-500m-checkpoint-* series.
Model Card Authors
Alea Institute
Citation
For G_stack technical details:
@inproceedings{gstack2024,
title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
booktitle={NeurIPS},
year={2024},
note={arXiv:2405.15319}
}
@misc{kl3m2025,
title={KL3M: Knowledge-Guided Language Model Training with G_stack Depth Expansion},
author={Alea Institute},
year={2025},
url={https://github.com/alea-institute/alea-models},
note={500M model via 4x cyclic duplication from 170M Phase 2+A}
}
License
Apache 2.0