KL3M 500M, 7th Gen Model, Step 37000 (4x Stacked Architecture)

A 500M parameter language model created via 4x cyclic layer duplication (G_stack method, NeurIPS 2024) from the KL3M 170M Phase 2+A checkpoint. This checkpoint represents 37,000 training steps on the 120-layer architecture.

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 500.3M (~424.8M non-embedding)
  • Layers: 120 (4x stacked from 30)
  • Source: alea-institute/kl3m-006-170m-checkpoint-63000
  • Stacking Method: G_stack cyclic duplication
  • Training Status: 37,000 steps (3.92B tokens processed)
  • Precision: BF16

Training Progress

  • Steps: 37,000 / 100,000 planned
  • Tokens Processed: 3,917,824,000 (~3.92B)
  • Valid Tokens: 3,914,825,728
  • Dataset: alea-institute/kl3m-data-sample-006-medium

Model Architecture

  • Hidden Size: 576 (unchanged from source)
  • Layers: 120 (4× from 30)
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536 (unchanged from source)
  • Vocabulary: 131,072 tokens
  • RoPE Theta: 100,000
  • Parameter Growth: 181.7M → 500.3M (2.75× increase)
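
As a sanity check, the totals above can be reproduced from these config values, assuming a standard Llama decoder block (GQA attention plus a gated MLP with two RMSNorms) and tied input/output embeddings. The back-of-the-envelope calculation below is illustrative rather than an official accounting.

# Rough parameter count from the config values listed above.
hidden, intermediate, vocab = 576, 1536, 131_072
n_heads, n_kv_heads = 9, 3
head_dim = hidden // n_heads                      # 64

attn = hidden * (n_heads * head_dim)              # q_proj
attn += 2 * hidden * (n_kv_heads * head_dim)      # k_proj + v_proj
attn += (n_heads * head_dim) * hidden             # o_proj
mlp = 3 * hidden * intermediate                   # gate/up/down_proj
norms = 2 * hidden                                # input + post-attention RMSNorm
per_layer = attn + mlp + norms                    # ~3.54M per block

embed = vocab * hidden                            # ~75.5M, shared with lm_head if tied
for layers in (30, 120):
    total = layers * per_layer + embed + hidden   # + final layer norm
    print(f"{layers:3d} layers: {total / 1e6:.1f}M parameters")
# Prints approximately 181.7M for 30 layers and 500.3M for 120 layers.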

G_stack Methodology

Cyclic Layer Duplication

Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):

Stacking Pattern:

Source (30 layers):  [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
                     └─ Cyclic repetition 4 times

Preserved Components:

  • Embedding layer (wte): Copied once
  • Final layer norm (ln_f): Copied once
  • LM head: Copied once

Duplicated Components:

  • All 30 transformer blocks repeated 4 times
  • Each block contains: self-attention (q/k/v/o_proj), MLP (gate/up/down_proj), layer norms
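
A minimal sketch of the duplication step, assuming a Hugging Face LlamaForCausalLM source model whose decoder blocks live under model.model.layers; the grow_depth_cyclic helper below is illustrative, not the exact KL3M stacking code.

import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def grow_depth_cyclic(model, factor: int = 4):
    """Repeat the existing decoder blocks cyclically: [0..N-1] -> [0..N-1] * factor.
    Embeddings, final layer norm, and lm_head are copied once (left untouched)."""
    src_layers = model.model.layers
    model.model.layers = nn.ModuleList(
        copy.deepcopy(src_layers[i % len(src_layers)])
        for i in range(factor * len(src_layers))
    )
    model.config.num_hidden_layers = len(model.model.layers)
    # Note: attention modules that track a layer_idx (for KV caching) may need
    # that index re-assigned after duplication.
    return model

# Illustrative usage with the source checkpoint named in this card (30 -> 120 layers):
model = AutoModelForCausalLM.from_pretrained("alea-institute/kl3m-006-170m-checkpoint-63000")
model = grow_depth_cyclic(model, factor=4)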

Expected Training Efficiency

Per G_stack paper findings:

  • Token efficiency: ~54% fewer tokens to reach the target loss vs. training a 120-layer model from scratch
  • Computational savings: ~46% reduction in FLOPs
  • Convergence: Duplicated layers gradually differentiate during continued training rather than remaining identical

Training Configuration

Optimizer: Muon

Depth-scaled learning rates:

  • Muon LR: 0.00365 (depth-scaled: base × √(16/120))
  • Aux LR: 5e-4
  • Weight Decay: 1e-5 (Muon), 1e-3 (Aux)
  • Momentum: 0.95

Layer-specific LR multipliers:

  • q_proj/o_proj: 0.5×
  • k_proj/v_proj: 0.75×
  • mlp: 1.0×
  • lm_head: 0.9×
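
The sketch below shows how a depth-scaled base rate and the per-layer multipliers can combine. The 0.01 base rate and reference depth of 16 are inferred from the 0.00365 figure above, and the substring-matching scheme is illustrative rather than the exact training configuration.

import math

# Base rate and reference depth chosen so that base * sqrt(16 / 120) ~ 0.00365.
BASE_LR, REF_DEPTH, NUM_LAYERS = 0.01, 16, 120
muon_lr = BASE_LR * math.sqrt(REF_DEPTH / NUM_LAYERS)   # ~0.00365

# Per-parameter multipliers keyed on substrings of the parameter name.
MULTIPLIERS = {
    "q_proj": 0.5, "o_proj": 0.5,
    "k_proj": 0.75, "v_proj": 0.75,
    "mlp": 1.0,
    "lm_head": 0.9,
}

def lr_for(param_name: str, base: float = muon_lr) -> float:
    """Return the learning rate for a parameter: the base Muon LR times the
    first matching multiplier (1.0 if no rule matches)."""
    for key, mult in MULTIPLIERS.items():
        if key in param_name:
            return base * mult
    return base

print(f"{muon_lr:.5f}")                                            # ~0.00365
print(f"{lr_for('model.layers.7.self_attn.q_proj.weight'):.5f}")   # 0.5x of that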

Spectral Clamping (Layer-Specific)

Attention layers (q/k/v/o_proj):

  • Frequency: Every 100 steps
  • Max condition: 3000
  • Sigma floor: 5e-5

MLP layers (gate/up/down):

  • Frequency: Every 100 steps
  • Max condition: 2000
  • Sigma floor: 5e-5

LM head:

  • Frequency: Every 100 steps
  • Max condition: 2000
  • Sigma floor: 5e-5
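
One way to read these settings is periodic SVD-based clamping: recompute each weight matrix's singular values, floor them at the sigma floor, and raise the smallest ones so that the condition number (largest over smallest singular value) stays below the listed maximum. The sketch below is an interpretation of the knobs above, not the project's exact routine.

import torch

@torch.no_grad()
def spectral_clamp_(weight: torch.Tensor, max_condition: float, sigma_floor: float):
    """Clamp singular values in place so every sigma >= sigma_floor and
    sigma_max / sigma_min <= max_condition."""
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = torch.clamp(s.max() / max_condition, min=sigma_floor)
    weight.copy_((u * torch.clamp(s, min=floor)) @ vh)

# Illustrative per-layer rules mirroring the values above (applied every 100 steps).
CLAMP_RULES = [
    (("q_proj", "k_proj", "v_proj", "o_proj"), dict(max_condition=3000, sigma_floor=5e-5)),
    (("gate_proj", "up_proj", "down_proj"),    dict(max_condition=2000, sigma_floor=5e-5)),
    (("lm_head",),                             dict(max_condition=2000, sigma_floor=5e-5)),
]

def apply_spectral_clamping(model, step: int, every: int = 100):
    if step % every != 0:
        return
    for name, param in model.named_parameters():
        if param.ndim != 2:
            continue
        for keys, cfg in CLAMP_RULES:
            if any(key in name for key in keys):
                spectral_clamp_(param, **cfg)
                break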

Regularization

  • Label Smoothing: 0.001
  • Entropy Bonus Weight: 0.001
  • Entropy Target: 6.5 bits
  • Entropy Target Weight: 0.001
  • Activation Norm Weight: 0.0006
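
One plausible composition of these terms is cross-entropy with label smoothing, minus a small entropy bonus, plus a squared penalty for deviating from the 6.5-bit entropy target and an activation-norm penalty on the hidden states. The sketch below is an interpretation of the listed weights (argument names are illustrative), not the project's exact loss.

import math
import torch.nn.functional as F

def regularized_loss(logits, labels, hidden_states,
                     label_smoothing=0.001,
                     entropy_bonus_weight=0.001,
                     entropy_target_bits=6.5,
                     entropy_target_weight=0.001,
                     activation_norm_weight=0.0006):
    """logits: (B, T, V), labels: (B, T), hidden_states: (B, T, H)."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                         label_smoothing=label_smoothing)
    # Mean predictive entropy, in nats and bits.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy_nats = -(log_probs.exp() * log_probs).sum(-1).mean()
    entropy_bits = entropy_nats / math.log(2)

    loss = ce - entropy_bonus_weight * entropy_nats               # reward some entropy
    loss = loss + entropy_target_weight * (entropy_bits - entropy_target_bits) ** 2
    loss = loss + activation_norm_weight * hidden_states.pow(2).mean()
    return loss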

Batch Configuration

  • Micro batch: 5
  • Gradient accumulation: 32
  • Effective batch: 160 sequences
  • Sequence length: 4096 tokens
  • Effective tokens/batch: 655,360
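
The effective batch figures follow directly from these settings:

micro_batch, grad_accum, seq_len = 5, 32, 4096
sequences_per_step = micro_batch * grad_accum   # 160 sequences
tokens_per_step = sequences_per_step * seq_len  # 655,360 tokens
print(sequences_per_step, tokens_per_step)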

Training Details

Data Processing

  • Dataset: Multi-domain legal corpus (kl3m-data-sample-006-medium)
  • Streaming: Yes (shuffle buffer: 32)
  • Packing: Cross-record packing enabled
  • Append EOS: Yes
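
A rough sketch of this data path with the Hugging Face datasets API: stream the corpus, shuffle with a small buffer, append EOS after each record, and pack the resulting token stream into fixed 4096-token sequences that can cross record boundaries. The "train" split, the "text" column name, and the packing helper are assumptions for illustration.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step37000")

# Stream the corpus with a small shuffle buffer, as configured above.
stream = load_dataset(
    "alea-institute/kl3m-data-sample-006-medium", split="train", streaming=True
).shuffle(buffer_size=32, seed=42)

def packed_sequences(records, seq_len=4096, text_key="text"):
    """Tokenize records, append EOS to each, and yield fixed-length blocks
    that may span record boundaries (cross-record packing)."""
    buffer = []
    for record in records:
        buffer.extend(tokenizer(record[text_key])["input_ids"])
        buffer.append(tokenizer.eos_token_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Example: materialize one packed 4096-token training sequence.
print(len(next(packed_sequences(stream))))   # 4096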

Optimization

  • Gradient Checkpointing: Enabled (non-reentrant)
  • Mixed Precision: BF16
  • Flash Attention: Auto
  • Gradient Clipping: Adaptive (max 64.0, coeff 2.0, beta 0.9)
  • TF32: Auto
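
Adaptive gradient clipping is sketched below in one common form that matches the listed knobs: track an exponential moving average of the global gradient norm with decay beta, clip to coeff times that average, and never let the threshold exceed the hard maximum. The exact rule used in training may differ.

import torch

class AdaptiveGradClipper:
    """Clip the global grad norm to min(max_norm, coeff * EMA of recent norms)."""
    def __init__(self, max_norm=64.0, coeff=2.0, beta=0.9):
        self.max_norm, self.coeff, self.beta = max_norm, coeff, beta
        self.ema = None

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        grad_norm = torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(p.grad) for p in params])
        ).item()
        self.ema = grad_norm if self.ema is None else (
            self.beta * self.ema + (1 - self.beta) * grad_norm
        )
        threshold = min(self.max_norm, self.coeff * self.ema)
        torch.nn.utils.clip_grad_norm_(params, threshold)
        return threshold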

Learning Rate Schedule

  • Scheduler: Cosine decay
  • Warmup: 100 steps
  • Min LR ratio: 0.1
  • Max steps: 100,000
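
The schedule implied by these settings, as a small sketch: linear warmup over the first 100 steps, then cosine decay from the peak rate down to the min LR ratio times the peak over the remaining steps (an interpretation, not the exact scheduler code).

import math

def lr_at(step, peak_lr=0.00365, warmup=100, max_steps=100_000, min_ratio=0.1):
    """Linear warmup, then cosine decay from peak_lr to min_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

for step in (0, 100, 37_000, 100_000):
    print(step, f"{lr_at(step):.6f}")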

Usage

from transformers import pipeline

# Load the text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-007-500m-step37000",
    device_map="auto"
)

# Generate text
output = generator(
    "The United States Constitution establishes",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95
)
print(output[0]['generated_text'])

Alternative: Using model and tokenizer directly

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint and tokenizer; dtype and device placement are resolved automatically
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step37000",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step37000")

# Tokenize the prompt, move it to the model's device, and sample a continuation
inputs = tokenizer("The United States Constitution establishes", return_tensors="pt", return_token_type_ids=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Generation Quality

This checkpoint has undergone 37,000 steps of training on the 120-layer architecture. The model is still in active training and quality continues to improve.

Recommended sampling parameters (from project testing):

  • Temperature: 0.8
  • Top-p: 0.95
  • Repetition penalty: 1.0

Why G_stack?

Advantages over Training from Scratch

  1. Proven warm start: All 120 layers begin with learned representations
  2. Faster convergence: 54% fewer tokens expected
  3. Stable training: Avoids cold-start instabilities
  4. Spectral inheritance: Good conditioning from Phase A source
  5. Simple implementation: Just cyclic duplication, no complex initialization

Alternative Approaches Not Used

  • Function-preserving init: More complex, similar results per paper
  • Random initialization: Much slower convergence
  • Progressive stacking: More complex training schedule
  • Width expansion first: Different scaling dimension

Model Comparison

Model | Layers | Params | Training Status | Tokens Processed
kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | Trained (63K steps) | ~15.8B
kl3m-007-500m-step0 | 120 | 500.3M | Step 0 (init only) | 0
kl3m-007-500m-step37000 | 120 | 500.3M | Step 37K (in training) | ~3.92B

Training Philosophy

G_stack enables efficient depth scaling:

  • Start with proven 170M model (63K steps, 15.83B tokens)
  • Stack to 4× depth (120 layers)
  • Continue training with 54% efficiency gain
  • Achieve 500M quality in ~100-150K steps (vs 250K from scratch)

This approach leverages transfer learning in the depth dimension rather than traditional width scaling or fine-tuning.

Next Steps

This checkpoint will continue training toward:

  • Target steps: 100,000
  • Expected quality match: 170M@300K by step ~150K
  • Final model: kl3m-007-500m-final

Follow training progress in the kl3m-007-500m-checkpoint-* series.

Model Card Authors

Alea Institute

Citation

For G_stack technical details:

@inproceedings{gstack2024,
  title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
  author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
  booktitle={NeurIPS},
  year={2024},
  note={arXiv:2405.15319}
}

@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training with G_stack Depth Expansion},
  author={Alea Institute},
  year={2025},
  url={https://github.com/alea-institute/alea-models},
  note={500M model via 4x cyclic duplication from 170M Phase 2+A}
}

License

Apache 2.0
