reaperdoesntknow committed · verified
Commit a61b9ff · 1 Parent(s): a9467db

Update README.md

Files changed (1):
  1. README.md  +173 −149

README.md CHANGED
@@ -1,106 +1,89 @@
  ---
  language:
- - en
  license: apache-2.0
  library_name: transformers
  pipeline_tag: text-generation
  tags:
- - mixture-of-attentions
- - distance-attention
- - metric-attention
- - mqa
- - hyperffn
- - router-gating
  datasets:
- - nvidia/Nemotron-Math-HumanReasoning
- - WeMake/Intelligent-Content-Understanding
  ---

- # MoAMetricLM-100M — Mixture of Attentions (MoA)

- **A geometry-aware Transformer with a mixture of attention mechanisms and metric-based routing.**
- **Parameters:** ~100M| **Type:** Causal LM (decoder-only) | **KV cache:** not yet implemented

- ## Model Index
-
- - **Model ID:** `your-hf-username/MoAMetricLM-185M`
- - **Task:** text generation (`text-generation`)
- - **Library:** 🤗 Transformers
- - **License:** Apache-2.0 (change here & add LICENSE file if different)
- - **Datasets :**
-   - nvidia/Nemotron-Math-HumanReasoning: ~256k tokens
-   - WeMake/Intelligent-Content-Understanding ~256k tokens
-

  ## Overview

- **MoA** replaces standard dot-product attention with **metric-based attention** and blends multiple attentional biases using a **token-wise router** and **feature/router gates**.
-
- **Heads per block**
- - **LocalConvHead** — depthwise separable 1D conv (local context).
- - **Metric Multi-Head Attention (MetricMHAttention)** — attention via negative distances in learned head subspaces (L2 / cosine / diagonal-Mahalanobis), with per-head **origin** \(o_h\) and **radius** \(r_h\) enabling **ball pruning**.
- - **Metric MQA** — multi-query attention with shared K/V in the same metric space (efficiency).
- - **ChannelMixHead** — per-token MLP for channel interactions.
-
- **FFN**
- - **HyperFFN** (multi-branch): SwiGLU MLP path, separable-conv path, and low-rank path, combined via a token-wise branch router and optional feature gates. LayerScale + DropPath for stability.
-
- **Regularization (optional)**
- - **Triangle-inequality (TI) penalty** on sampled triples to encourage true-metric behavior.
-
- **Design goals:** geometric consistency, diverse inductive biases, structured efficiency, and full HF compatibility.
-
- ## What’s different from a standard Transformer?
-
- - **Distance-based attention (softmin over distances)** instead of dot product:
-   \[
-   \text{attn}(i,j)\ \propto\ \exp\!\big(-\alpha_h\ \|q_i-k_j\|^2\big)
-   \]
-   with per-head sharpness \(\alpha_h\). Cosine / diag-Mahalanobis variants supported.
- - **Per-head origins & radii** define balls for principled sparsity (**ball pruning**).
- - **Mixture of attentions** (conv / metric MHA / metric MQA / channel MLP) blended by a **token-wise router**, with **feature gates** (FiLM-like) and **router-bias gates** for up/down control.
- - **Up/Down projections** (SwiGLU-style) inside heads to expand/contract the value stream.
- - **HyperFFN** provides non-lazy capacity with token-wise branch routing.

- ## Intended Use & Limitations

- **Intended use:** research on geometry-aware attention, structured sparsity, and mixtures of attentional biases; small-scale experimentation and ablations.

- **Limitations:**
- - Dev runs used small token budgets; this is **not** a general-purpose LM.
- - **No KV cache** yet → generation cost scales with context length.
- - No alignment/safety tuning; outputs may be biased or inaccurate.

- **Out-of-scope:** high-stakes applications (medical/legal/etc.) without further training, evaluation, and safeguards.

- ## Training Details
-
- **Hardware:** CPU (Intel; no CUDA)
- **Precision:** FP32
-
- ### Latest run (v0.2)
- - **Tokens:** ~500,000 (two datasets, ~250k each)
- - **Wall-time:** ~20 minutes (~**417 toks/s** overall)
- - **Tokenizer:** GPT-2 (`gpt2`)
- - **Learning rate:** **5e-4** (AdamW)
- - **Batch / Seq:** batch_size=4, sequence length ≤512
- - **Final train loss:** **≈ 0.30**
-
- ### Prior run (v0.1)
- - **Tokens:** ~196k
- - **Wall-time:** ~14 minutes
- - **Final avg loss:** ≈ 0.417 (min batch ≈ 0.193)
-
- **Stability aids:** safe softmax (subtract max), PreNorm, LayerScale (≈1e-4), DropPath (optional), label masking (`-100` on padding).

  ## Configuration (example)

@@ -136,112 +119,153 @@ datasets:
  "eos_token_id": 50256
  }
  ```
- ---

- If you use gpt2 tokenizer, set pad_token = eos_token and ensure vocab_size/eos/pad match the tokenizer.

- Usage

- Inference

  ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- model_id = "your-hf-username/MoAMetricLM-185M"
- tok = AutoTokenizer.from_pretrained(model_id)
- if tok.pad_token is None:
-     tok.pad_token = tok.eos_token
-
- model = AutoModelForCausalLM.from_pretrained(model_id)

- prompt = "Explain metric-based attention in simple terms:"
- inputs = tok(prompt, return_tensors="pt")
- gen = model.generate(**inputs, max_new_tokens=128, do_sample=False)
- print(tok.decode(gen[0], skip_special_tokens=True))

- KV cache: not yet implemented; generation recomputes full context at each step.

- Training (custom loop sketch)

- from transformers import AutoTokenizer, DataCollatorForLanguageModeling
  from torch.utils.data import DataLoader
  import torch, torch.nn.functional as F

- tok = AutoTokenizer.from_pretrained("gpt2")
- if tok.pad_token is None:
-     tok.pad_token = tok.eos_token
-
- def collate(examples):
-     batch = tok([e["text"] for e in examples], padding="max_length",
-                 truncation=True, max_length=512, return_tensors="pt")
      labels = batch["input_ids"].clone()
      labels[batch["attention_mask"] == 0] = -100
      batch["labels"] = labels
      return batch

- # dataset = ... (load HF dataset with a 'text' field)
- # loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate)
- # model = AutoModelForCausalLM.from_pretrained(model_id)  # or initialize config & model
-
- # optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, betas=(0.9,0.95), weight_decay=0.01)
- # for batch in loader:
- #     out = model(**batch)
- #     out.loss.backward()
- #     torch.nn.utils.clip_grad_norm_(model.parameters(), 1.2)
- #     optimizer.step(); optimizer.zero_grad()
  ```

- ## Evaluation

- For meaningful comparisons, run:
- • Validation perplexity on a held-out split.
- • Ablations at fixed token budgets:
-   • L2 vs cosine vs diag-Mahalanobis
-   • With vs without ball pruning
-   • With vs without HyperFFN branch router/gates
-   • With vs without TI regularizer

- Efficiency Notes
- • Ball pruning: masks keys outside per-head radius → structured sparsity.
- • MQA: shared K/V reduce projection cost while retaining diversity via multi-query heads.
- • HyperFFN: token-wise branch router (optional top-k) to avoid paying for all branches equally.
- • CPU tips: set OMP_NUM_THREADS/MKL_NUM_THREADS to core count; use pad_token = eos_token.

- Roadmap: metric-aware KV cache for long contexts; kernelized distance approximations (e.g., RFF) for sub-quadratic regimes; quantization & mixed precision
- Safety, Bias & Risks
- • May produce biased, offensive, or factually incorrect outputs.
- • No safety/alignment training included.
- • Do not deploy in high-stakes contexts without additional

- License

- Apache-2.0 (update if different).

- Citation

  @misc{moametriclm185m,
-   title = {MoAMetricLM-185M: A Geometry-Aware Mixture-of-Attentions Language Model},
-   author = {Colca, Roy Shawn and collaborators},
-   year = {2025},
-   url = {https://huggingface.co/your-hf-username/MoAMetricLM-185M}
  }

- Changelog
- • v0.2 (2025-09-20) — 500k-token CPU run, GPT-2 tokenizer, LR=5e-4, final loss ≈ 0.30.
- • v0.1 (2025-09-20) — initial public release: metric heads, MQA, ball pruning, HyperFFN, router & gates; HF-compatible; no KV cache.

- Maintainers
- • Author: reaper (Convergent Intelligence LLC)
- • Contact: add preferred contact
- • Issues: HF model hub issues tab
- ---
  ---
  language:
+ - en
  license: apache-2.0
  library_name: transformers
  pipeline_tag: text-generation
  tags:
+ - mixture-of-attentions
+ - distance-attention
+ - metric-attention
+ - mqa
+ - hyperffn
+ - router-gating
  datasets:
+ - nvidia/Nemotron-Math-HumanReasoning
+ - WeMake/Intelligent-Content-Understanding
  ---

+ # MoAMetricLM-100M — Mixture of Attentions (MoA)

+ **A geometry-aware Transformer that mixes several attention mechanisms and routes them with a metric-based router.**
+ - **Parameters:** ~185M (≈100M effective due to the mixture)
+ - **Task:** Causal language modeling (decoder-only)
+ - **Library:** 🤗 Transformers
+ - **KV cache:** Not yet implemented (generation recomputes the full context at every step)

+ ---

+ ## Model card

+ | **Model ID** | `reaperdoesntknow/MoA-100M` |
+ |--------------|-----------------------------|
+ | **Architecture** | `moa_metric` (custom) |
+ | **Tokenizer** | GPT-2 (`gpt2`) – `pad_token` set to `eos_token` |
+ | **Context length** | 2048 tokens |
+ | **Training data** | 2 × ≈256k tokens from the datasets listed above |
+ | **Training compute** | CPU-only (Intel), FP32 |
+ | **Training hyper-parameters** | LR = 5e-4 (AdamW), batch = 4, seq ≤ 512, 500k total tokens |
+ | **Final loss** | ≈ 0.30 (train) |
+ | **License** | Apache-2.0 |
+ | **Safety** | No alignment or safety fine-tuning – outputs may be biased or inaccurate. |
+ | **Intended use** | Research on geometry-aware attention, structured sparsity, and mixture-of-attention models. |
+ | **Limitations** | • No KV cache → slower generation. <br>• Small token budget → not a general-purpose LM. <br>• No safety/alignment training. |
+ | **Out-of-scope** | High-stakes applications (medical, legal, etc.) without further evaluation. |

+ ---

  ## Overview

+ MoA replaces the classic dot-product attention with **metric-based attention** and blends **four** distinct heads per Transformer block:

+ | Head type | Description |
+ |-----------|-------------|
+ | **LocalConvHead** | Depthwise-separable 1-D convolution → captures short-range context. |
+ | **Metric Multi-Head Attention (MetricMHAttention)** | Soft-min over **L2 / cosine / diagonal-Mahalanobis** distances: <br> \(\displaystyle \text{attn}_{h}(i,j) \propto \exp\!\big(-\alpha_h\|q_i-k_j\|^2\big)\) |
+ | **Metric MQA** | Multi-query attention (shared K/V) in the same metric space – cheaper than full MHA. |
+ | **ChannelMixHead** | Per-token MLP that mixes channel dimensions (no positional mixing). |
59
+ A **token‑wise router** decides, for each token, which head(s) to use and applies **feature‑gates** (FiLM‑style) and **router‑bias gates** for up/down‑scaling.
60
 
61
+ The **FFN** is a **HyperFFN** – three parallel branches (SwiGLU MLP, separable‑conv, low‑rank) combined by a **branch router**. LayerScale and optional DropPath keep training stable.
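+
+ As a rough sketch of the branch-router idea (module names, sizes, and the depthwise conv standing in for the separable-conv branch are assumptions for illustration, not the actual HyperFFN implementation):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class ToyHyperFFN(nn.Module):
+     """Three FFN branches blended per token by a softmax router (illustrative only)."""
+     def __init__(self, dim, hidden=256, rank=32):
+         super().__init__()
+         self.swiglu_in = nn.Linear(dim, 2 * hidden)     # SwiGLU branch: gate * value
+         self.swiglu_out = nn.Linear(hidden, dim)
+         self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise conv branch
+         self.low_rank = nn.Sequential(nn.Linear(dim, rank), nn.Linear(rank, dim))  # low-rank branch
+         self.router = nn.Linear(dim, 3)                 # one logit per branch, per token
+
+     def forward(self, x):                               # x: (batch, seq, dim)
+         a, b = self.swiglu_in(x).chunk(2, dim=-1)
+         swiglu = self.swiglu_out(F.silu(a) * b)
+         conv = self.conv(x.transpose(1, 2)).transpose(1, 2)
+         low = self.low_rank(x)
+         w = F.softmax(self.router(x), dim=-1)           # (batch, seq, 3) token-wise branch weights
+         branches = torch.stack([swiglu, conv, low], dim=-1)
+         return (branches * w.unsqueeze(-2)).sum(-1)     # weighted blend, back to (batch, seq, dim)
+
+ print(ToyHyperFFN(64)(torch.randn(2, 10, 64)).shape)    # torch.Size([2, 10, 64])
+ ```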
 
 
 
+ ### Regularisation (optional)

+ * **Triangle-inequality (TI) penalty** on sampled triples to encourage true-metric behaviour.
+ * **Ball pruning** – each head learns an **origin** \(o_h\) and **radius** \(r_h\); keys outside the ball are masked, giving structured sparsity (see the sketch after this list).
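+
+ A small sketch of both ideas; the tensor shapes and the `dist_fn` hook are assumptions for illustration, not the model's actual modules:
+
+ ```python
+ import torch
+
+ def ball_prune_mask(k, origin, radius):
+     """True where a key lies outside a head's ball and should be masked out of attention."""
+     # k: (batch, heads, seq, head_dim); origin: (heads, head_dim); radius: (heads,)
+     dist = (k - origin[None, :, None, :]).norm(dim=-1)        # (batch, heads, seq)
+     return (dist > radius[None, :, None]).unsqueeze(-2)       # broadcastable over query positions
+
+ def ti_penalty(x, dist_fn, n_triples=128):
+     """Hinge penalty pushing dist_fn(a, c) <= dist_fn(a, b) + dist_fn(b, c) on random triples."""
+     idx = torch.randint(0, x.size(0), (n_triples, 3))
+     a, b, c = x[idx[:, 0]], x[idx[:, 1]], x[idx[:, 2]]
+     return torch.relu(dist_fn(a, c) - dist_fn(a, b) - dist_fn(b, c)).mean()
+
+ # For plain L2 the penalty is ~0 by construction; it only bites for learned distances.
+ print(ti_penalty(torch.randn(64, 16), lambda u, v: (u - v).norm(dim=-1)))
+ ```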

+ ---

+ ## Architecture diagram (high-level)

+ ```
+ Input → Embedding → (PreNorm) → Block₁ → … → Blockₙ → LM-Head → Output
+
+     ├─ LocalConvHead
+     ├─ MetricMHAttention
+     ├─ MetricMQA
+     └─ ChannelMixHead
+     (router decides per-token)
+
+ Each Block also contains:
+     → HyperFFN (SwiGLU | Conv | Low-rank) ← branch router
+     → LayerScale + DropPath
+ ```

+ ---

  ## Configuration (example)

  "eos_token_id": 50256
  }
  ```

+ > **Tip:** If you use the GPT-2 tokenizer, set `pad_token = eos_token` and make sure `vocab_size` matches the tokenizer (50257).
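+
+ A quick sanity check of those numbers with the GPT-2 tokenizer; the printed values are what the example config above should match:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("gpt2")
+ tok.pad_token = tok.eos_token        # GPT-2 ships without a pad token
+ print(len(tok), tok.eos_token_id)    # 50257 50256 -> should equal vocab_size / eos_token_id
+ ```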

+ ---

+ ## Quick-start (inference)

  ```python
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ >>> model_id = "reaperdoesntknow/MoA-100M"
+ >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
+ >>> tokenizer.pad_token = tokenizer.eos_token  # needed for the GPT-2 tokenizer
+
+ >>> model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ >>> prompt = "Explain metric-based attention in simple terms:"
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
+ >>> output_ids = model.generate(
+ ...     **inputs,
+ ...     max_new_tokens=128,
+ ...     do_sample=False,  # deterministic; set do_sample=True with temperature>0 for sampling
+ ...     pad_token_id=tokenizer.pad_token_id,
+ ... )
+ >>> print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```

+ *Note:* Because the KV cache is not implemented, each generated token recomputes attention over the full context, so generation slows down as the context grows.
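+
+ The quick-start call above also accepts the standard sampling arguments if you want non-greedy output; the values below are only illustrative:
+
+ ```python
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     do_sample=True,          # enable sampling instead of greedy decoding
+     temperature=0.8,         # illustrative values; tune for your use case
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,
+ )
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```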
 
 
 

+ ---

+ ## Training (custom loop sketch)

+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
  from torch.utils.data import DataLoader
  import torch, torch.nn.functional as F

+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
+ tokenizer.pad_token = tokenizer.eos_token
+
+ def collate_fn(examples):
+     batch = tokenizer(
+         [ex["text"] for ex in examples],
+         padding="max_length",
+         truncation=True,
+         max_length=512,
+         return_tensors="pt",
+     )
      labels = batch["input_ids"].clone()
      labels[batch["attention_mask"] == 0] = -100
      batch["labels"] = labels
      return batch
+
+ # dataset = load_dataset(..., split="train")  # must contain a 'text' field
+ # loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
+
+ model = AutoModelForCausalLM.from_pretrained("reaperdoesntknow/MoA-100M")
+ optimizer = torch.optim.AdamW(
+     model.parameters(),
+     lr=5e-4,
+     betas=(0.9, 0.95),
+     weight_decay=0.01,
+ )
+
+ model.train()
+ for batch in loader:  # requires the `loader` defined above
+     out = model(**batch)
+     out.loss.backward()
+     torch.nn.utils.clip_grad_norm_(model.parameters(), 1.2)
+     optimizer.step()
+     optimizer.zero_grad()
  ```

+ ---

+ ## Evaluation checklist

+ * **Perplexity** on a held-out split of the two training datasets (a minimal sketch follows this list).
+ * **Ablation studies** (keep the total token budget constant):
+   * L2 vs. cosine vs. diagonal-Mahalanobis distance.
+   * With / without ball pruning.
+   * With / without the HyperFFN branch router.
+   * With / without the TI regulariser.
+ * **Speed / memory** comparison against a vanilla GPT-2-size model (same `dim`/`layers`).
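+
+ A minimal perplexity loop, assuming a held-out DataLoader built with the same `collate_fn` as in the training sketch above; names such as `val_loader` are placeholders:
+
+ ```python
+ import math, torch
+
+ @torch.no_grad()
+ def perplexity(model, loader):
+     """exp(mean token-level NLL) over a held-out loader; batches come from collate_fn above."""
+     model.eval()
+     total_nll, total_tokens = 0.0, 0
+     for batch in loader:
+         out = model(**batch)
+         # out.loss is the mean NLL over non-ignored positions, after the causal shift
+         n = (batch["labels"][:, 1:] != -100).sum().item()
+         total_nll += out.loss.item() * n
+         total_tokens += n
+     return math.exp(total_nll / total_tokens)
+
+ # ppl = perplexity(model, val_loader)   # val_loader: DataLoader over a held-out split
+ # print(f"validation perplexity: {ppl:.2f}")
+ ```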

+ ---

+ ## Efficiency notes

+ | Feature | What it does |
+ |---------|--------------|
+ | **Ball pruning** | Masks keys that lie outside a learned radius → reduces the quadratic attention cost. |
+ | **Metric MQA** | Shares K/V across heads → fewer projection matrices, lower FLOPs. |
+ | **HyperFFN branch router** | Token-wise top-k routing means only the most useful branch is evaluated per token. |
+ | **CPU tips** | Set `OMP_NUM_THREADS` / `MKL_NUM_THREADS` to the number of physical cores; use `torch.set_num_threads()` if needed. |
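+
+ A minimal example of the CPU tips above; the thread count of 8 is only a placeholder for your number of physical cores:
+
+ ```python
+ import torch
+
+ # OMP_NUM_THREADS / MKL_NUM_THREADS are best exported in the shell *before* Python starts;
+ # torch.set_num_threads() can also be adjusted at runtime.
+ torch.set_num_threads(8)
+ print(torch.get_num_threads())
+ ```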
 

+ Future roadmap: metric-aware KV cache, kernelised distance approximations (e.g., Random Fourier Features; a generic sketch follows), quantisation & mixed-precision inference.
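+
+ The kernelised-distance item is not implemented here; as a generic illustration only, the classic random-Fourier-feature construction approximates the Gaussian kernel exp(-α·||q − k||²) with a dot product of finite feature maps:
+
+ ```python
+ import math, torch
+
+ def rff(x, n_features=4096, alpha=1.0, seed=0):
+     """phi(x) such that phi(q) @ phi(k) ≈ exp(-alpha * ||q - k||^2)  (Rahimi & Recht, 2007)."""
+     g = torch.Generator().manual_seed(seed)                 # same seed -> same features for q and k
+     w = torch.randn(x.size(-1), n_features, generator=g) * math.sqrt(2.0 * alpha)
+     b = 2 * math.pi * torch.rand(n_features, generator=g)
+     return math.sqrt(2.0 / n_features) * torch.cos(x @ w + b)
+
+ q, k = 0.3 * torch.randn(5, 8), 0.3 * torch.randn(5, 8)
+ approx = rff(q) @ rff(k).T                 # linear in sequence length once features are built
+ exact = torch.exp(-torch.cdist(q, k) ** 2)
+ print((approx - exact).abs().max())        # approximation error shrinks as n_features grows
+ ```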

+ ---

+ ## Safety, Bias & Risks

+ * The model **has not been fine-tuned for safety or alignment**.
+ * Outputs may contain **biases, profanity, or factual errors**.
+ * Do **not** deploy in high-stakes contexts without additional evaluation, moderation, and possibly further fine-tuning.

+ ---

+ ## License

+ Apache-2.0; see the `LICENSE` file in the repository.

+ ---

+ ## Citation

+ ```bibtex
  @misc{moametriclm185m,
+   title = {reaperdoesntknow/MoA-100M: A Geometry-Aware Mixture-of-Attentions Language Model},
+   author = {Colca, Roy Shawn and collaborators},
+   year = {2025},
+   url = {https://huggingface.co/reaperdoesntknow/MoA-100M}
  }
+ ```

+ ---

+ ## Changelog

+ | Version | Date | Notes |
+ |---------|------|-------|
+ | **v0.2** | 2025-09-20 | 500k-token CPU run, GPT-2 tokenizer, LR = 5e-4, final loss ≈ 0.30. |
+ | **v0.1** | 2025-09-20 | Initial public release: metric heads, MQA, ball pruning, HyperFFN, router & gates; HF-compatible; no KV cache. |

+ ---

+ ## Maintainers

+ * **Author:** reaper (Convergent Intelligence LLC)
+ * **Contact:** Email ([email protected])

+ ---

+ ## Special Remarks

+ - This model is still in an extremely experimental state, as most of them are, but I'm working on stabilizing this one for general inference.
+ - I design, create, and train all of my models using my mathematical research and my pure disgust for the dot product!
+ - For those of you who actually read this and use my models, you make my day every time I see another download, so thank you for being awesome!