reaperdoesntknow committed · verified
Commit a61b9ff · 1 Parent(s): a9467db

Update README.md

Files changed (1):
  1. README.md  +173 −149

README.md CHANGED
@@ -1,106 +1,89 @@
  ---
  language:
- - en
  license: apache-2.0
  library_name: transformers
  pipeline_tag: text-generation
  tags:
- - mixture-of-attentions
- - distance-attention
- - metric-attention
- - mqa
- - hyperffn
- - router-gating
  datasets:
- - nvidia/Nemotron-Math-HumanReasoning
- - WeMake/Intelligent-Content-Understanding
  ---

- # MoAMetricLM-100M — Mixture of Attentions (MoA)

- **A geometry-aware Transformer with a mixture of attention mechanisms and metric-based routing.**
- **Parameters:** ~100M| **Type:** Causal LM (decoder-only) | **KV cache:** not yet implemented

- ## Model Index
-
- - **Model ID:** `your-hf-username/MoAMetricLM-185M`
- - **Task:** text generation (`text-generation`)
- - **Library:** 🤗 Transformers
- - **License:** Apache-2.0 (change here & add LICENSE file if different)
- - **Datasets :**
-   - nvidia/Nemotron-Math-HumanReasoning: ~256k tokens
-   - WeMake/Intelligent-Content-Understanding ~256k tokens
-

  ## Overview

- **MoA** replaces standard dot-product attention with **metric-based attention** and blends multiple attentional biases using a **token-wise router** and **feature/router gates**.
-
- **Heads per block**
- - **LocalConvHead** — depthwise separable 1D conv (local context).
- - **Metric Multi-Head Attention (MetricMHAttention)** — attention via negative distances in learned head subspaces (L2 / cosine / diagonal-Mahalanobis), with per-head **origin** \(o_h\) and **radius** \(r_h\) enabling **ball pruning**.
- - **Metric MQA** — multi-query attention with shared K/V in the same metric space (efficiency).
- - **ChannelMixHead** — per-token MLP for channel interactions.
-
- **FFN**
- - **HyperFFN** (multi-branch): SwiGLU MLP path, separable-conv path, and low-rank path, combined via a token-wise branch router and optional feature gates. LayerScale + DropPath for stability.
-
- **Regularization (optional)**
- - **Triangle-inequality (TI) penalty** on sampled triples to encourage true-metric behavior.
-
- **Design goals:** geometric consistency, diverse inductive biases, structured efficiency, and full HF compatibility.
-
- ## What’s different from a standard Transformer?
-
- - **Distance-based attention (softmin over distances)** instead of dot product:
-   \[
-   \text{attn}(i,j)\ \propto\ \exp\!\big(-\alpha_h\ \|q_i-k_j\|^2\big)
-   \]
-   with per-head sharpness \(\alpha_h\). Cosine / diag-Mahalanobis variants supported.
- - **Per-head origins & radii** define balls for principled sparsity (**ball pruning**).
- - **Mixture of attentions** (conv / metric MHA / metric MQA / channel MLP) blended by a **token-wise router**, with **feature gates** (FiLM-like) and **router-bias gates** for up/down control.
- - **Up/Down projections** (SwiGLU-style) inside heads to expand/contract the value stream.
- - **HyperFFN** provides non-lazy capacity with token-wise branch routing.

- ## Intended Use & Limitations

- **Intended use:** research on geometry-aware attention, structured sparsity, and mixtures of attentional biases; small-scale experimentation and ablations.

- **Limitations:**
- - Dev runs used small token budgets; this is **not** a general-purpose LM.
- - **No KV cache** yet → generation cost scales with context length.
- - No alignment/safety tuning; outputs may be biased or inaccurate.

- **Out-of-scope:** high-stakes applications (medical/legal/etc.) without further training, evaluation, and safeguards.

- ## Training Details
-
- **Hardware:** CPU (Intel; no CUDA)
- **Precision:** FP32
-
- ### Latest run (v0.2)
- - **Tokens:** ~500,000 (two datasets, ~250k each)
- - **Wall-time:** ~20 minutes (~**417 toks/s** overall)
- - **Tokenizer:** GPT-2 (`gpt2`)
- - **Learning rate:** **5e-4** (AdamW)
- - **Batch / Seq:** batch_size=4, sequence length ≤512
- - **Final train loss:** **≈ 0.30**
-
- ### Prior run (v0.1)
- - **Tokens:** ~196k
- - **Wall-time:** ~14 minutes
- - **Final avg loss:** ≈ 0.417 (min batch ≈ 0.193)
-
- **Stability aids:** safe softmax (subtract max), PreNorm, LayerScale (≈1e-4), DropPath (optional), label masking (`-100` on padding).

  ## Configuration (example)

@@ -136,112 +119,153 @@ datasets:
  "eos_token_id": 50256
  }
  ```
- ---

- If you use gpt2 tokenizer, set pad_token = eos_token and ensure vocab_size/eos/pad match the tokenizer.

- Usage

- Inference

  ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- model_id = "your-hf-username/MoAMetricLM-185M"
- tok = AutoTokenizer.from_pretrained(model_id)
- if tok.pad_token is None:
-     tok.pad_token = tok.eos_token
-
- model = AutoModelForCausalLM.from_pretrained(model_id)

- prompt = "Explain metric-based attention in simple terms:"
- inputs = tok(prompt, return_tensors="pt")
- gen = model.generate(**inputs, max_new_tokens=128, do_sample=False)
- print(tok.decode(gen[0], skip_special_tokens=True))

- KV cache: not yet implemented; generation recomputes full context at each step.

- Training (custom loop sketch)

- from transformers import AutoTokenizer, DataCollatorForLanguageModeling
  from torch.utils.data import DataLoader
  import torch, torch.nn.functional as F

- tok = AutoTokenizer.from_pretrained("gpt2")
- if tok.pad_token is None:
-     tok.pad_token = tok.eos_token
-
- def collate(examples):
-     batch = tok([e["text"] for e in examples], padding="max_length",
-                 truncation=True, max_length=512, return_tensors="pt")
      labels = batch["input_ids"].clone()
      labels[batch["attention_mask"] == 0] = -100
      batch["labels"] = labels
      return batch

- # dataset = ... (load HF dataset with a 'text' field)
- # loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate)
- # model = AutoModelForCausalLM.from_pretrained(model_id)  # or initialize config & model
-
- # optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, betas=(0.9,0.95), weight_decay=0.01)
- # for batch in loader:
- #     out = model(**batch)
- #     out.loss.backward()
- #     torch.nn.utils.clip_grad_norm_(model.parameters(), 1.2)
- #     optimizer.step(); optimizer.zero_grad()
  ```

- ## Evaluation

- For meaningful comparisons, run:
- • Validation perplexity on a held-out split.
- • Ablations at fixed token budgets:
-   • L2 vs cosine vs diag-Mahalanobis
-   • With vs without ball pruning
-   • With vs without HyperFFN branch router/gates
-   • With vs without TI regularizer

- Efficiency Notes
- • Ball pruning: masks keys outside per-head radius → structured sparsity.
- • MQA: shared K/V reduce projection cost while retaining diversity via multi-query heads.
- • HyperFFN: token-wise branch router (optional top-k) to avoid paying for all branches equally.
- • CPU tips: set OMP_NUM_THREADS/MKL_NUM_THREADS to core count; use pad_token = eos_token.

- Roadmap: metric-aware KV cache for long contexts; kernelized distance approximations (e.g., RFF) for sub-quadratic regimes; quantization & mixed precision
- Safety, Bias & Risks
- • May produce biased, offensive, or factually incorrect outputs.
- • No safety/alignment training included.
- • Do not deploy in high-stakes contexts without additional

- License

- Apache-2.0 (update if different).

- Citation

  @misc{moametriclm185m,
-   title = {MoAMetricLM-185M: A Geometry-Aware Mixture-of-Attentions Language Model},
-   author = {Colca, Roy Shawn and collaborators},
-   year = {2025},
-   url = {https://huggingface.co/your-hf-username/MoAMetricLM-185M}
  }

- Changelog
- • v0.2 (2025-09-20) — 500k-token CPU run, GPT-2 tokenizer, LR=5e-4, final loss ≈ 0.30.
- • v0.1 (2025-09-20) — initial public release: metric heads, MQA, ball pruning, HyperFFN, router & gates; HF-compatible; no KV cache.

- Maintainers
- • Author: reaper (Convergent Intelligence LLC)
- • Contact: add preferred contact
- • Issues: HF model hub issues tab
- ---
  ---
  language:
+ - en
  license: apache-2.0
  library_name: transformers
  pipeline_tag: text-generation
  tags:
+ - mixture-of-attentions
+ - distance-attention
+ - metric-attention
+ - mqa
+ - hyperffn
+ - router-gating
  datasets:
+ - nvidia/Nemotron-Math-HumanReasoning
+ - WeMake/Intelligent-Content-Understanding
  ---

+ # MoAMetricLM-100M — Mixture of Attentions (MoA)

+ **A geometry-aware Transformer that mixes several attention mechanisms and routes them with a metric-based router.**
+ - **Parameters:** ~185M (≈100M effective due to the mixture)
+ - **Task:** Causal language modeling (decoder-only)
+ - **Library:** 🤗 Transformers
+ - **KV cache:** Not yet implemented (generation recomputes the full context at every step)

+ ---

+ ## Model card

+ | **Model ID** | `reaperdoesntknow/MoA-100M` |
+ |--------------|-----------------------------|
+ | **Architecture** | `moa_metric` (custom) |
+ | **Tokenizer** | GPT-2 (`gpt2`) – `pad_token` set to `eos_token` |
+ | **Context length** | 2048 tokens |
+ | **Training data** | 2 × ≈256k tokens from the datasets listed above |
+ | **Training compute** | CPU-only (Intel), FP32 |
+ | **Training hyper-parameters** | LR = 5e-4 (AdamW), batch = 4, seq ≤ 512, 500k total tokens |
+ | **Final loss** | ≈ 0.30 (train) |
+ | **License** | Apache-2.0 |
+ | **Safety** | No alignment or safety fine-tuning – outputs may be biased or inaccurate. |
+ | **Intended use** | Research on geometry-aware attention, structured sparsity, and mixture-of-attention models. |
+ | **Limitations** | • No KV cache → slower generation. <br>• Small token budget → not a general-purpose LM. <br>• No safety/alignment training. |
+ | **Out-of-scope** | High-stakes applications (medical, legal, etc.) without further evaluation. |

+ ---

  ## Overview

+ MoA replaces the classic dot-product attention with **metric-based attention** and blends **four** distinct heads per Transformer block:

+ | Head type | Description |
+ |-----------|-------------|
+ | **LocalConvHead** | Depthwise-separable 1-D convolution → captures short-range context. |
+ | **Metric Multi-Head Attention (MetricMHAttention)** | Soft-min over **L2 / cosine / diagonal-Mahalanobis** distances: <br> \(\displaystyle \text{attn}_{h}(i,j) \propto \exp\!\big(-\alpha_h\|q_i-k_j\|^2\big)\) |
+ | **Metric MQA** | Multi-query attention (shared K/V) in the same metric space – cheaper than full MHA. |
+ | **ChannelMixHead** | Per-token MLP that mixes channel dimensions (no positional mixing). |
59
+ A **token‑wise router** decides, for each token, which head(s) to use and applies **feature‑gates** (FiLM‑style) and **router‑bias gates** for up/down‑scaling.
60
 
61
+ The **FFN** is a **HyperFFN** – three parallel branches (SwiGLU MLP, separable‑conv, low‑rank) combined by a **branch router**. LayerScale and optional DropPath keep training stable.
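+
+ As a rough sketch of the branch-router idea (module names, sizes, and the depthwise conv standing in for the separable-conv branch are assumptions for illustration, not the actual HyperFFN implementation):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class ToyHyperFFN(nn.Module):
+     """Three FFN branches blended per token by a softmax router (illustrative only)."""
+     def __init__(self, dim, hidden=256, rank=32):
+         super().__init__()
+         self.swiglu_in = nn.Linear(dim, 2 * hidden)     # SwiGLU branch: gate * value
+         self.swiglu_out = nn.Linear(hidden, dim)
+         self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise conv branch
+         self.low_rank = nn.Sequential(nn.Linear(dim, rank), nn.Linear(rank, dim))  # low-rank branch
+         self.router = nn.Linear(dim, 3)                 # one logit per branch, per token
+
+     def forward(self, x):                               # x: (batch, seq, dim)
+         a, b = self.swiglu_in(x).chunk(2, dim=-1)
+         swiglu = self.swiglu_out(F.silu(a) * b)
+         conv = self.conv(x.transpose(1, 2)).transpose(1, 2)
+         low = self.low_rank(x)
+         w = F.softmax(self.router(x), dim=-1)           # (batch, seq, 3) token-wise branch weights
+         branches = torch.stack([swiglu, conv, low], dim=-1)
+         return (branches * w.unsqueeze(-2)).sum(-1)     # weighted blend, back to (batch, seq, dim)
+
+ print(ToyHyperFFN(64)(torch.randn(2, 10, 64)).shape)    # torch.Size([2, 10, 64])
+ ```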
 
 
 
+ ### Regularisation (optional)

+ * **Triangle-inequality (TI) penalty** on sampled triples to encourage true-metric behaviour.
+ * **Ball pruning** – each head learns an **origin** \(o_h\) and **radius** \(r_h\); keys outside the ball are masked, giving structured sparsity (see the sketch after this list).
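+
+ A small sketch of both ideas; the tensor shapes and the `dist_fn` hook are assumptions for illustration, not the model's actual modules:
+
+ ```python
+ import torch
+
+ def ball_prune_mask(k, origin, radius):
+     """True where a key lies outside a head's ball and should be masked out of attention."""
+     # k: (batch, heads, seq, head_dim); origin: (heads, head_dim); radius: (heads,)
+     dist = (k - origin[None, :, None, :]).norm(dim=-1)        # (batch, heads, seq)
+     return (dist > radius[None, :, None]).unsqueeze(-2)       # broadcastable over query positions
+
+ def ti_penalty(x, dist_fn, n_triples=128):
+     """Hinge penalty pushing dist_fn(a, c) <= dist_fn(a, b) + dist_fn(b, c) on random triples."""
+     idx = torch.randint(0, x.size(0), (n_triples, 3))
+     a, b, c = x[idx[:, 0]], x[idx[:, 1]], x[idx[:, 2]]
+     return torch.relu(dist_fn(a, c) - dist_fn(a, b) - dist_fn(b, c)).mean()
+
+ # For plain L2 the penalty is ~0 by construction; it only bites for learned distances.
+ print(ti_penalty(torch.randn(64, 16), lambda u, v: (u - v).norm(dim=-1)))
+ ```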

+ ---

+ ## Architecture diagram (high-level)

+ ```
+ Input → Embedding → (PreNorm) → Block₁ → … → Blockₙ → LM-Head → Output
+
+     ├─ LocalConvHead
+     ├─ MetricMHAttention
+     ├─ MetricMQA
+     └─ ChannelMixHead
+     (router decides per-token)
+
+ Each Block also contains:
+     → HyperFFN (SwiGLU | Conv | Low-rank) ← branch router
+     → LayerScale + DropPath
+ ```

+ ---

  ## Configuration (example)

  "eos_token_id": 50256
  }
  ```

+ > **Tip:** If you use the GPT-2 tokenizer, set `pad_token = eos_token` and make sure `vocab_size` matches the tokenizer (50257).
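+
+ A quick sanity check of those numbers with the GPT-2 tokenizer; the printed values are what the example config above should match:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("gpt2")
+ tok.pad_token = tok.eos_token        # GPT-2 ships without a pad token
+ print(len(tok), tok.eos_token_id)    # 50257 50256 -> should equal vocab_size / eos_token_id
+ ```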

+ ---

+ ## Quick-start (inference)

  ```python
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ >>> model_id = "reaperdoesntknow/MoA-100M"
+ >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
+ >>> tokenizer.pad_token = tokenizer.eos_token  # needed for the GPT-2 tokenizer
+
+ >>> model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ >>> prompt = "Explain metric-based attention in simple terms:"
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
+ >>> output_ids = model.generate(
+ ...     **inputs,
+ ...     max_new_tokens=128,
+ ...     do_sample=False,  # deterministic; set do_sample=True with temperature>0 for sampling
+ ...     pad_token_id=tokenizer.pad_token_id,
+ ... )
+ >>> print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```

+ *Note:* Because the KV cache is not implemented, each generated token recomputes attention over the full context, so generation slows down as the context grows.
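+
+ The quick-start call above also accepts the standard sampling arguments if you want non-greedy output; the values below are only illustrative:
+
+ ```python
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     do_sample=True,          # enable sampling instead of greedy decoding
+     temperature=0.8,         # illustrative values; tune for your use case
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,
+ )
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```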
 
 
 

+ ---

+ ## Training (custom loop sketch)

+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
  from torch.utils.data import DataLoader
  import torch, torch.nn.functional as F

+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
+ tokenizer.pad_token = tokenizer.eos_token
+
+ def collate_fn(examples):
+     batch = tokenizer(
+         [ex["text"] for ex in examples],
+         padding="max_length",
+         truncation=True,
+         max_length=512,
+         return_tensors="pt",
+     )
      labels = batch["input_ids"].clone()
      labels[batch["attention_mask"] == 0] = -100
      batch["labels"] = labels
      return batch
+
+ # dataset = load_dataset(..., split="train")  # must contain a 'text' field
+ # loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
+
+ model = AutoModelForCausalLM.from_pretrained("reaperdoesntknow/MoA-100M")
+ optimizer = torch.optim.AdamW(
+     model.parameters(),
+     lr=5e-4,
+     betas=(0.9, 0.95),
+     weight_decay=0.01,
+ )
+
+ model.train()
+ for batch in loader:  # requires the `loader` defined above
+     out = model(**batch)
+     out.loss.backward()
+     torch.nn.utils.clip_grad_norm_(model.parameters(), 1.2)
+     optimizer.step()
+     optimizer.zero_grad()
  ```

+ ---

+ ## Evaluation checklist

+ * **Perplexity** on a held-out split of the two training datasets (a minimal sketch follows this list).
+ * **Ablation studies** (keep the total token budget constant):
+   * L2 vs. cosine vs. diagonal-Mahalanobis distance.
+   * With / without ball pruning.
+   * With / without the HyperFFN branch router.
+   * With / without the TI regulariser.
+ * **Speed / memory** comparison against a vanilla GPT-2-size model (same `dim`/`layers`).
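+
+ A minimal perplexity loop, assuming a held-out DataLoader built with the same `collate_fn` as in the training sketch above; names such as `val_loader` are placeholders:
+
+ ```python
+ import math, torch
+
+ @torch.no_grad()
+ def perplexity(model, loader):
+     """exp(mean token-level NLL) over a held-out loader; batches come from collate_fn above."""
+     model.eval()
+     total_nll, total_tokens = 0.0, 0
+     for batch in loader:
+         out = model(**batch)
+         # out.loss is the mean NLL over non-ignored positions, after the causal shift
+         n = (batch["labels"][:, 1:] != -100).sum().item()
+         total_nll += out.loss.item() * n
+         total_tokens += n
+     return math.exp(total_nll / total_tokens)
+
+ # ppl = perplexity(model, val_loader)   # val_loader: DataLoader over a held-out split
+ # print(f"validation perplexity: {ppl:.2f}")
+ ```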

+ ---

+ ## Efficiency notes

+ | Feature | What it does |
+ |---------|--------------|
+ | **Ball pruning** | Masks keys that lie outside a learned radius → reduces the quadratic attention cost. |
+ | **Metric MQA** | Shares K/V across heads → fewer projection matrices, lower FLOPs. |
+ | **HyperFFN branch router** | Token-wise top-k routing means only the most useful branch is evaluated per token. |
+ | **CPU tips** | Set `OMP_NUM_THREADS` / `MKL_NUM_THREADS` to the number of physical cores; use `torch.set_num_threads()` if needed. |
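+
+ A minimal example of the CPU tips above; the thread count of 8 is only a placeholder for your number of physical cores:
+
+ ```python
+ import torch
+
+ # OMP_NUM_THREADS / MKL_NUM_THREADS are best exported in the shell *before* Python starts;
+ # torch.set_num_threads() can also be adjusted at runtime.
+ torch.set_num_threads(8)
+ print(torch.get_num_threads())
+ ```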
 

+ Future roadmap: metric-aware KV cache, kernelised distance approximations (e.g., Random Fourier Features; a generic sketch follows), quantisation & mixed-precision inference.
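+
+ The kernelised-distance item is not implemented here; as a generic illustration only, the classic random-Fourier-feature construction approximates the Gaussian kernel exp(-α·||q − k||²) with a dot product of finite feature maps:
+
+ ```python
+ import math, torch
+
+ def rff(x, n_features=4096, alpha=1.0, seed=0):
+     """phi(x) such that phi(q) @ phi(k) ≈ exp(-alpha * ||q - k||^2)  (Rahimi & Recht, 2007)."""
+     g = torch.Generator().manual_seed(seed)                 # same seed -> same features for q and k
+     w = torch.randn(x.size(-1), n_features, generator=g) * math.sqrt(2.0 * alpha)
+     b = 2 * math.pi * torch.rand(n_features, generator=g)
+     return math.sqrt(2.0 / n_features) * torch.cos(x @ w + b)
+
+ q, k = 0.3 * torch.randn(5, 8), 0.3 * torch.randn(5, 8)
+ approx = rff(q) @ rff(k).T                 # linear in sequence length once features are built
+ exact = torch.exp(-torch.cdist(q, k) ** 2)
+ print((approx - exact).abs().max())        # approximation error shrinks as n_features grows
+ ```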

+ ---

+ ## Safety, Bias & Risks

+ * The model **has not been fine-tuned for safety or alignment**.
+ * Outputs may contain **biases, profanity, or factual errors**.
+ * Do **not** deploy in high-stakes contexts without additional evaluation, moderation, and possibly further fine-tuning.

+ ---

+ ## License

+ Apache-2.0; see the `LICENSE` file in the repository.

+ ---

+ ## Citation

+ ```bibtex
  @misc{moametriclm185m,
+   title = {reaperdoesntknow/MoA-100M: A Geometry-Aware Mixture-of-Attentions Language Model},
+   author = {Colca, Roy Shawn and collaborators},
+   year = {2025},
+   url = {https://huggingface.co/reaperdoesntknow/MoA-100M}
  }
+ ```

+ ---

+ ## Changelog

+ | Version | Date | Notes |
+ |---------|------|-------|
+ | **v0.2** | 2025-09-20 | 500k-token CPU run, GPT-2 tokenizer, LR = 5e-4, final loss ≈ 0.30. |
+ | **v0.1** | 2025-09-20 | Initial public release: metric heads, MQA, ball pruning, HyperFFN, router & gates; HF-compatible; no KV cache. |

+ ---

+ ## Maintainers

+ * **Author:** reaper (Convergent Intelligence LLC)
+ * **Contact:** Email ([email protected])

+ ---

+ ## Special Remarks

+ - This model is still in an extremely experimental state, as most of them are, but I'm working on stabilizing this one for general inference.
+ - I design, create, and train all of my models using my mathematical research and my pure disgust for the dot product!
+ - For those of you who actually read this and use my models, you make my day every time I see another download, so thank you for being awesome!