KL3M 170M, 6th Gen Model, 57K Checkpoint (Phase 2)
A 170M parameter language model trained on multi-domain legal text using the Muon optimizer with spectral clamping. This checkpoint represents Phase 2 multi-domain training following Phase 1 single-domain training (37K steps).
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 181.7M (170M non-embedding)
- Training Steps: 57,000 (37K Phase 1 + 20K Phase 2)
- Tokens Processed: 15.53 billion
- Sequence Length: 4,096 tokens
- Precision: BF16
- Optimizer: Muon with spectral regularization (max condition: 3000)
Model Architecture
- Hidden Size: 576
- Layers: 30
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
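For reference, the architecture above corresponds to a standard Hugging Face LlamaConfig. The snippet below is a minimal sketch of that configuration using the values from this card; any field not listed is left at the transformers default, and the printed parameter count may differ slightly from the 181.7M reported above depending on embedding tying.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of the architecture described above (values from this card).
config = LlamaConfig(
    vocab_size=131_072,
    hidden_size=576,
    intermediate_size=1536,
    num_hidden_layers=30,
    num_attention_heads=9,
    num_key_value_heads=3,          # GQA: 3 KV heads shared by 9 query heads
    max_position_embeddings=4096,
    rope_theta=100_000.0,
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```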
Training Configuration
Phase 2 Multi-Domain Dataset
- Source: alea-institute/kl3m-data-sample-004-balanced
- Type: Multi-domain legal corpus
  - RECAP (Court filings): 48-52%
  - GovInfo (Federal regulations): 27-32%
  - USPTO (Patents): 3-5%
  - EDGAR (SEC filings): 5-7%
- Format: Streaming, shuffled with buffer=32
- Domain Quality: Balanced for broad legal knowledge
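A minimal sketch of how a streaming, shuffled pass over this corpus can be set up with the datasets library; the split name and the fields available in each record are assumptions, not confirmed details of the training pipeline.

```python
from datasets import load_dataset

# Stream the multi-domain corpus without materializing it on disk,
# shuffling with a small buffer as described above (buffer=32).
ds = load_dataset(
    "alea-institute/kl3m-data-sample-004-balanced",
    split="train",        # assumed split name
    streaming=True,
)
ds = ds.shuffle(buffer_size=32, seed=42)

for example in ds.take(2):
    print(list(example.keys()))  # inspect the available fields
```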
Optimizer (Muon)
- Muon Learning Rate: 7.30e-5 (depth-scaled from 1e-4)
- Auxiliary Learning Rate: 9e-5
- Muon Weight Decay: 1e-4
- Auxiliary Weight Decay: 1e-3
- Muon Momentum: 0.95
- Muon Newton-Schulz (NS) Steps: 3 (vs Phase 1's 5, for faster updates; see the sketch after this list)
- Batch Size: 6 per device
- Gradient Accumulation: 2 steps (effective batch: 12)
- Warmup Steps: 0 (continuing from Phase 1)
- LR Scheduler: Cosine with min ratio 0.1
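For context, Muon orthogonalizes each 2-D momentum update with a short Newton-Schulz iteration before applying it. The sketch below illustrates that step with the 3 iterations used here; the quintic coefficients follow the publicly available Muon reference implementation, not necessarily this run's exact code.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 3) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix, as Muon does.

    Sketch only: coefficients are the quintic ones from the public Muon
    reference implementation; Phase 2 uses 3 iterations (vs 5 in Phase 1).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.bfloat16()
    x = x / (x.norm() + 1e-7)            # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)

update = newton_schulz_orthogonalize(torch.randn(576, 1536), steps=3)
```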
Regularization
- Spectral Clamping (see the sketch after this list):
  - Enabled on q_proj, o_proj, and lm_head
  - Max condition number: 3000 (relaxed from Phase 1's 2000)
  - Sigma floor: 5e-5 (relaxed from Phase 1's 1e-4)
  - Applied every 160 steps (vs every 10 steps in Phase 1)
  - Rationale: relaxed constraints allow faster exploration of multi-domain manifolds
- Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0)
- Label Smoothing: 0.01
- Entropy Regularization:
  - Entropy bonus weight: 0.003
  - Entropy target: 6.5 bits (weight: 0.003)
- Activation norm weight: 0.0006
- Loss chunk size: 1024 tokens
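As a rough illustration of the spectral clamping settings above, the sketch below clamps a single weight matrix via SVD so that its smallest singular value stays above the floor and its condition number stays below the cap. This is a hypothetical helper written for this card, not the training run's actual code.

```python
import torch

@torch.no_grad()
def spectral_clamp(weight: torch.Tensor,
                   max_condition: float = 3000.0,
                   sigma_floor: float = 5e-5) -> torch.Tensor:
    """Hypothetical sketch of spectral clamping on one weight matrix.

    Singular values are floored at sigma_floor and raised so that
    sigma_max / sigma_min never exceeds max_condition (Phase 2 values).
    """
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    s = s.clamp(min=sigma_floor)                # enforce the sigma floor
    s = s.clamp(min=s.max() / max_condition)    # cap the condition number
    return (u @ torch.diag(s) @ vh).to(weight.dtype)

# Applied every 160 steps to q_proj, o_proj, and lm_head, e.g.:
# w = layer.self_attn.q_proj.weight
# w.copy_(spectral_clamp(w, max_condition=3000.0, sigma_floor=5e-5))
```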
Training Infrastructure
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled (non-reentrant)
- Flash Attention: Auto-enabled
- TF32 Mode: Auto
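A brief sketch of how these infrastructure settings map onto standard transformers and PyTorch calls when loading the checkpoint; whether the training run used exactly these calls is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-57000",
    torch_dtype=torch.bfloat16,               # BF16 precision
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
)

# Non-reentrant gradient checkpointing
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)

# TF32 matmuls on Ampere or newer GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```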
Spectral Health (Step 57K)
Analysis of weight matrix conditioning shows acceptable manifold quality with expected degradation from relaxed regularization:
Condition Numbers
- Attention Layers:
  - Median: 2307 (↑ 42% from Phase 1's 1620)
  - Mean: ~2500
  - P95: ~3200
  - Max: 3632 (↑ 46% from Phase 1's 2481)
- MLP Layers:
  - Median: ~5-10 (healthy)
  - Mean: ~8-12
  - Max: ~15-20 (excellent conditioning)
- LM Head: ~300-400 (good)
Singular Values
- Smallest σ_min: ~3.42e-4 (above the σ floor of 5e-5)
- Worst-conditioned layers:
  - Attention projection layers (q_proj, o_proj) approaching the condition-number ceiling
  - K/V projections moderately conditioned
Key Finding: Attention layers show higher condition numbers than Phase 1 due to intentionally relaxed spectral regularization. This trade-off enables faster multi-domain learning and exploration while maintaining numerical stability.
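The conditioning statistics above can be approximately reproduced from the published checkpoint with a short diagnostic like the one below; matching layers by the substrings q_proj and o_proj is an assumption about the module names.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-57000", torch_dtype=torch.float32
)

# Condition number (sigma_max / sigma_min) of each attention projection
conds = {}
for name, param in model.named_parameters():
    if param.ndim == 2 and ("q_proj" in name or "o_proj" in name):
        s = torch.linalg.svdvals(param.detach())
        conds[name] = (s.max() / s.min()).item()

vals = sorted(conds.values())
print(f"median={vals[len(vals) // 2]:.0f}  max={max(vals):.0f}")
```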
Training Dynamics (Phase 2: Steps 37K-57K)
- Loss: Competitive across all domains
- Gradient Norm: Stable with adaptive clipping
- Learning Rate: Gradual cosine decay
- Multi-Domain Quality: Balanced performance on RECAP, GovInfo, USPTO, EDGAR
Performance Comparison vs Phase 1
| Metric | Phase 1 (37K) | Phase 2 (57K) | Change |
|---|---|---|---|
| Attention Conditioning | Median: 1620 | Median: 2307 | ↑ 42% |
| Generation Quality | 5.7/10 | 8.5/10 | ↑ 49% |
| Multi-Domain Coverage | EDGAR only | 4 domains | ↑ |
| Spectral Clamp Freq | Every 10 steps | Every 160 steps | 16x relaxed |
Result: Intentional conditioning trade-off yields significantly better generation quality and multi-domain capability.
Generation Quality
The model generates coherent, fluent legal text across multiple domains without repetition issues, a significant improvement over Phase 1:
- Court documents (RECAP): Natural legal argumentation
- Federal regulations (GovInfo): Proper regulatory structure
- Patents (USPTO): Technical claim language
- SEC filings (EDGAR): Financial/corporate disclosure style
Suitable for multi-domain legal content generation, analysis, and understanding tasks.
Usage
```python
from transformers import pipeline

# Create a text-generation pipeline from this checkpoint
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-57000",
    torch_dtype="auto",
    device_map="auto",
)

# Generate a legal-style continuation
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(outputs[0]["generated_text"])
```
Training Philosophy
This checkpoint demonstrates the intentional conditioning vs quality trade-off:
- Phase 1: Tight spectral regularization → excellent conditioning, limited quality
- Phase 2: Relaxed regularization → moderate conditioning, excellent quality
The 16x reduction in spectral clamping frequency allows the model to explore richer multi-domain manifolds while maintaining acceptable numerical stability (max condition < 4000).
Model Card Authors
Alea Institute
Citation
For technical details, see the paper: https://arxiv.org/abs/2504.07854
```bibtex
@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer and multi-domain spectral regularization}
}
```
License
Apache 2.0