KL3M 170M, 6th Gen Model, 57K Checkpoint (Phase 2)

A 170M parameter language model trained on multi-domain legal text using the Muon optimizer with spectral clamping. This checkpoint represents Phase 2 multi-domain training following Phase 1 single-domain training (37K steps).

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 181.7M (170M non-embedding)
  • Training Steps: 57,000 (37K Phase 1 + 20K Phase 2)
  • Tokens Processed: 15.53 billion
  • Sequence Length: 4,096 tokens
  • Precision: BF16
  • Optimizer: Muon with spectral regularization (max condition: 3000)

Model Architecture

  • Hidden Size: 576
  • Layers: 30
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536
  • Vocabulary: 131,072 tokens
  • RoPE Theta: 100,000
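
The head arithmetic implied by these numbers can be checked with a small sketch. This is a generic illustration of Grouped Query Attention shape-sharing (each KV head serving a group of query heads), not the model's actual attention code; the dimensions are taken from the card above.

```python
import numpy as np

# Architecture constants from the model card
hidden_size = 576
n_heads = 9                                # query heads
n_kv_heads = 3                             # shared key/value heads (GQA)
head_dim = hidden_size // n_heads          # 576 / 9 = 64
group_size = n_heads // n_kv_heads         # 3 query heads per KV head

seq_len = 8
rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, seq_len, head_dim))
k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
v = rng.standard_normal((n_kv_heads, seq_len, head_dim))

# GQA: repeat each K/V head across its group so shapes match the queries
k_rep = np.repeat(k, group_size, axis=0)   # (9, seq, 64)
v_rep = np.repeat(v, group_size, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # softmax over keys
out = weights @ v_rep                      # (9, seq, 64)
```

With 3 KV heads instead of 9, the KV cache is one third the size of full multi-head attention at inference time.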

Training Configuration

Phase 2 Multi-Domain Dataset

  • Source: alea-institute/kl3m-data-sample-004-balanced
  • Type: Multi-domain legal corpus
    • RECAP (Court filings): 48-52%
    • GovInfo (Federal regulations): 27-32%
    • USPTO (Patents): 3-5%
    • EDGAR (SEC filings): 5-7%
  • Format: Streaming, shuffled with buffer=32
  • Domain Quality: Balanced for broad legal knowledge

Optimizer (Muon)

  • Muon Learning Rate: 7.30e-5 (depth-scaled from 1e-4)
  • Auxiliary Learning Rate: 9e-5
  • Muon Weight Decay: 1e-4
  • Auxiliary Weight Decay: 1e-3
  • Muon Momentum: 0.95
  • Muon NS Steps: 3 (faster convergence vs Phase 1's 5)
  • Batch Size: 6 per device
  • Gradient Accumulation: 2 steps (effective batch: 12)
  • Warmup Steps: 0 (continuing from Phase 1)
  • LR Scheduler: Cosine with min ratio 0.1
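
The "NS Steps" setting refers to the Newton-Schulz iterations Muon runs to approximately orthogonalize each update matrix before applying it. The sketch below uses the classic cubic Newton-Schulz iteration as a minimal illustration; the actual Muon implementation uses a tuned quintic polynomial, so treat this as conceptual, not the training code.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=3):
    """Approximately orthogonalize a matrix, as Muon does to its
    momentum buffer before the weight update. Classic cubic iteration
    (the real Muon optimizer uses tuned quintic coefficients)."""
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # pushes singular values toward 1
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((16, 16))
o = newton_schulz_orthogonalize(g, steps=3)  # 3 steps, as in Phase 2
```

Each step pushes the singular values toward 1, so fewer steps (3 vs Phase 1's 5) yields a rougher but cheaper orthogonalization per update.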

Regularization

  • Spectral Clamping:
    • Enabled on q_proj, o_proj, and lm_head
    • Max condition number: 3000 (relaxed from Phase 1's 2000)
    • Sigma floor: 5e-5 (relaxed from Phase 1's 1e-4)
    • Applied every 160 steps (relaxed from Phase 1's 10 steps)
    • Rationale: Relaxed constraints allow faster exploration of multi-domain manifolds
  • Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0)
  • Label Smoothing: 0.01
  • Entropy Regularization:
    • Entropy bonus weight: 0.003
    • Entropy target: 6.5 bits
    • Activation norm weight: 0.0006
    • Loss chunk size: 1024 tokens
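
The spectral clamping rule above can be sketched with an SVD: rescale singular values so the condition number stays under the ceiling and no singular value falls below the floor. This is a hypothetical re-implementation of the rule as described; the training code's exact formulation may differ.

```python
import numpy as np

def spectral_clamp(w, max_condition=3000.0, sigma_floor=5e-5):
    """Clamp a weight matrix so that sigma_max / sigma_min <= max_condition
    and sigma_min >= sigma_floor (the Phase 2 settings above).
    Illustrative sketch of the clamping rule, not the training code."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    floor = max(sigma_floor, s[0] / max_condition)  # whichever bound is tighter
    s = np.maximum(s, floor)
    return u @ np.diag(s) @ vt

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))
w[:, 0] *= 1e-8  # make one direction nearly degenerate
w_clamped = spectral_clamp(w, max_condition=3000.0, sigma_floor=5e-5)
```

Applying this every 160 steps (rather than every 10) lets the weights drift further between corrections, which is the exploration-vs-conditioning trade-off discussed below.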

Training Infrastructure

  • Mixed Precision: BF16
  • Gradient Checkpointing: Enabled (non-reentrant)
  • Flash Attention: Auto-enabled
  • TF32 Mode: Auto

Spectral Health (Step 57K)

Analysis of weight matrix conditioning shows acceptable manifold quality with expected degradation from relaxed regularization:

Condition Numbers

  • Attention Layers:
    • Median: 2307 (↑ 42% from Phase 1's 1620)
    • Mean: ~2500
    • P95: ~3200
    • Max: 3632 (↑ 46% from Phase 1's 2481)
  • MLP Layers:
    • Median: ~5-10 (healthy)
    • Mean: ~8-12
    • Max: ~15-20 (excellent conditioning)
  • LM Head: ~300-400 (good)

Singular Values

  • Smallest σ_min: ~3.42e-4 (above the σ floor of 5e-5)
  • Worst-conditioned layers:
    • Attention projection layers (q_proj, o_proj) approaching ceiling
    • K/V projections moderately conditioned
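
The statistics above (per-layer condition numbers and the global smallest singular value) can be computed directly from an SVD of each weight matrix. A minimal diagnostic sketch, using a random matrix in place of real checkpoint weights:

```python
import numpy as np

def layer_conditioning(weights):
    """Given {name: 2-D weight matrix}, return per-layer condition numbers
    (sigma_max / sigma_min) and the global smallest singular value,
    the two spectral-health statistics reported above."""
    conds, sigma_min = {}, np.inf
    for name, w in weights.items():
        s = np.linalg.svd(w, compute_uv=False)  # descending singular values
        conds[name] = float(s[0] / s[-1])
        sigma_min = min(sigma_min, float(s[-1]))
    return conds, sigma_min

rng = np.random.default_rng(2)
# Placeholder weights; in practice these would come from the checkpoint
weights = {f"layers.{i}.self_attn.q_proj": rng.standard_normal((576, 576))
           for i in range(2)}
conds, smin = layer_conditioning(weights)
```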

Key Finding: Attention layers show higher condition numbers than Phase 1 due to intentionally relaxed spectral regularization. This trade-off enables faster multi-domain learning and exploration while maintaining numerical stability.

Training Dynamics (Phase 2: Steps 37K-57K)

  • Loss: Competitive across all domains
  • Gradient Norm: Stable with adaptive clipping
  • Learning Rate: Gradual cosine decay
  • Multi-Domain Quality: Balanced performance on RECAP, GovInfo, USPTO, EDGAR

Performance Comparison vs Phase 1

| Metric                  | Phase 1 (37K)  | Phase 2 (57K)   | Change      |
|-------------------------|----------------|-----------------|-------------|
| Attention Conditioning  | Median: 1620   | Median: 2307    | ↑ 42%       |
| Generation Quality      | 5.7/10         | 8.5/10          | ↑ 49%       |
| Multi-Domain Coverage   | EDGAR only     | 4 domains       | ✓           |
| Spectral Clamp Freq     | Every 10 steps | Every 160 steps | 16x relaxed |

Result: Intentional conditioning trade-off yields significantly better generation quality and multi-domain capability.

Generation Quality

The model generates coherent, fluent legal text across multiple domains with no repetition issues, a significant improvement over Phase 1:

  • Court documents (RECAP): Natural legal argumentation
  • Federal regulations (GovInfo): Proper regulatory structure
  • Patents (USPTO): Technical claim language
  • SEC filings (EDGAR): Financial/corporate disclosure style

Suitable for multi-domain legal content generation, analysis, and understanding tasks.

Usage

from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-57000",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(outputs[0]['generated_text'])

Training Philosophy

This checkpoint demonstrates the intentional conditioning vs quality trade-off:

  • Phase 1: Tight spectral regularization β†’ Excellent conditioning, limited quality
  • Phase 2: Relaxed regularization β†’ Moderate conditioning, excellent quality

The 16x reduction in spectral clamping frequency allows the model to explore richer multi-domain manifolds while maintaining acceptable numerical stability (max condition < 4000).

Model Card Authors

Alea Institute

Citation

For technical details, see the paper: https://arxiv.org/abs/2504.07854

@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer and multi-domain spectral regularization}
}

License

Apache 2.0
