πŸ”€ Matrices in Transformers: Preface


Matrix is Transformation

A matrix is a transformation from one space to another. Not "a grid of numbers." Not "rows and columns." A matrix is a machine that takes vectors from one space and moves them to another:

Input Space                    Output Space
    ℝⁿ          ──── A ────▢       ℝᡐ

 A vector in              becomes        A vector in
 n dimensions                           m dimensions

When you multiply a matrix by a vector, you're asking: where does this point land in the new space?

When you multiply two matrices, you're asking: what single transformation equals doing one, then the other?

Matrices are made for multiplication. That's their purpose. A matrix sitting alone is just potential energy. A matrix multiplied is a transformation realized.
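
Both readings can be checked in a few lines. A minimal NumPy sketch, with toy matrices chosen only for illustration:

import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])            # a map from R^2 to R^3
x = np.array([2.0, -1.0])             # a point in R^2

y = A @ x                             # where does x land in R^3?

B = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # another map, R^2 to R^2
C = A @ B                             # one matrix that does B first, then A

print(np.allclose(A @ (B @ x), C @ x))   # True: composing maps = multiplying matrices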


The Matmuls of a Transformer

Transformers are built from matrix multiplications (matmuls). Here's the catalog:

1. Embedding: Lookup as Matmul

Token ID β†’ Vector

One-hot Γ— Embedding Matrix = Token Vector
(1 Γ— vocab) Γ— (vocab Γ— d) = (1 Γ— d)

A discrete symbol enters. A continuous vector exits. The embedding matrix is a lookup table viewed as a transformation.
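
A toy NumPy sketch (vocab size 10, dimension 4, random weights) showing that the one-hot matmul is literally a row lookup:

import numpy as np

vocab, d = 10, 4
W_embed = np.random.randn(vocab, d)      # (vocab Γ— d) embedding matrix

token_id = 7
one_hot = np.zeros(vocab)
one_hot[token_id] = 1.0                  # (1 Γ— vocab) one-hot row

via_matmul = one_hot @ W_embed           # (1 Γ— vocab) Γ— (vocab Γ— d) = (1 Γ— d)
via_lookup = W_embed[token_id]           # just read row 7

print(np.allclose(via_matmul, via_lookup))   # True

In practice frameworks implement this as an indexed lookup, because multiplying by a one-hot vector wastes work, but the two are the same transformation.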

2. Projection: Changing Subspaces

Vector β†’ Query/Key/Value

X Γ— W_Q = Q
X Γ— W_K = K  
X Γ— W_V = V

(seq Γ— d) Γ— (d Γ— d) = (seq Γ— d)

The same vectors, projected into different subspaces. Q asks questions. K provides addresses. V holds content. Three parallel transformations of the same input.
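
A minimal sketch of the three projections, with a toy sequence length and model dimension and random weights standing in for learned ones:

import numpy as np

seq, d = 5, 8
X = np.random.randn(seq, d)              # (seq Γ— d) token vectors

W_Q = np.random.randn(d, d)              # three independent learned matrices
W_K = np.random.randn(d, d)
W_V = np.random.randn(d, d)

Q = X @ W_Q                              # (seq Γ— d): the questions
K = X @ W_K                              # (seq Γ— d): the addresses
V = X @ W_V                              # (seq Γ— d): the content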

3. Attention: Measuring Similarity

Query Γ— Key^T = Attention Scores

Q Γ— K^T = Scores
(seq Γ— d) Γ— (d Γ— seq) = (seq Γ— seq)

This is one of only two places where two input-derived matrices multiply each other (the other is the aggregation step below); everywhere else, activations meet fixed weights. This is where tokens "see" each other. The result is a similarity map: how much should position i attend to position j?
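
A small sketch with random stand-ins for Q and K; the 1/√d scaling is the usual scaled dot-product detail that the shape equation above leaves out:

import numpy as np

seq, d = 5, 8
Q = np.random.randn(seq, d)              # stand-in for X @ W_Q
K = np.random.randn(seq, d)              # stand-in for X @ W_K

scores = (Q @ K.T) / np.sqrt(d)          # (seq Γ— d) Γ— (d Γ— seq) = (seq Γ— seq)
print(scores.shape)                      # (5, 5): similarity of position i to position j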

4. Aggregation: Weighted Mixing

Attention Γ— Values = Output

A Γ— V = Output
(seq Γ— seq) Γ— (seq Γ— d) = (seq Γ— d)

Attention weights mix value vectors. Each output position is a weighted combination of all input positions. Information flows according to the attention pattern.
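
A small sketch with random stand-ins for the scores and values; the row-wise softmax makes each output row a weighted average of the rows of V:

import numpy as np

seq, d = 5, 8
scores = np.random.randn(seq, seq)       # stand-in for Q @ K^T / √d
V = np.random.randn(seq, d)              # stand-in for X @ W_V

A = np.exp(scores)
A = A / A.sum(axis=-1, keepdims=True)    # row-wise softmax: each row sums to 1

output = A @ V                           # (seq Γ— seq) Γ— (seq Γ— d) = (seq Γ— d)
print(output.shape)                      # (5, 8): each row mixes all positions of V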

5. Feed-Forward: Expand and Compress

Vector β†’ Hidden β†’ Vector

X Γ— W₁ = Hidden      (d β†’ 4d, expand)
Hidden Γ— Wβ‚‚ = Output (4d β†’ d, compress)

A bottleneck in reverse: expand to a wider space, apply non-linearity, compress back. The FFN processes each position independentlyβ€”no cross-token interaction.
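
A minimal sketch with random weights, using the common tanh approximation of GELU as the non-linearity in the wide space:

import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

seq, d = 5, 8
X = np.random.randn(seq, d)
W_1 = np.random.randn(d, 4 * d)          # expand: d -> 4d
W_2 = np.random.randn(4 * d, d)          # compress: 4d -> d

hidden = gelu(X @ W_1)                   # (seq Γ— 4d), non-linearity in the wide space
output = hidden @ W_2                    # back to (seq Γ— d)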

6. Output: Back to Vocabulary

Vector β†’ Logits

Hidden Γ— W_out = Logits
(seq Γ— d) Γ— (d Γ— vocab) = (seq Γ— vocab)

The inverse of embedding. Continuous vectors become scores over discrete tokens. Often W_out = W_embed^T (tied weights)β€”the same transformation, reversed.
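
A small sketch of the tied-weight case, reusing a random embedding matrix as the output projection:

import numpy as np

seq, d, vocab = 5, 8, 10
hidden = np.random.randn(seq, d)         # final hidden states
W_embed = np.random.randn(vocab, d)      # the same matrix used for embedding

logits = hidden @ W_embed.T              # (seq Γ— d) Γ— (d Γ— vocab) = (seq Γ— vocab)
print(logits.shape)                      # (5, 10): one score per vocabulary token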


Why Not One Giant Matrix?

If transformers are just matmuls, why not collapse them all into one?

Here's the catch: stacked matrix multiplications without non-linearity collapse into a single matrix.

Y = X Γ— A Γ— B Γ— C Γ— D

is equivalent to

Y = X Γ— M    where M = A Γ— B Γ— C Γ— D

No matter how many matrices you stack, the result is still a linear transformation. One matrix. Limited expressiveness.

Non-linearities prevent the collapse. Softmax, GELU, LayerNorm: these simple functions between matmuls make the whole greater than any single matrix could ever be. This is what makes transformers (and other neural networks) genuinely deep.

So a transformer isn't one matrix. It's many matrices joined by non-linearities, and that joining is what gives it its expressive power.
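
A tiny NumPy check of the collapse argument, using random matrices and ReLU as a stand-in non-linearity:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
A, B, C = (rng.standard_normal((8, 8)) for _ in range(3))

# Without non-linearities, the stack collapses into a single matrix M
M = A @ B @ C
print(np.allclose(X @ A @ B @ C, X @ M))           # True: still one linear map

# A non-linearity between the matmuls breaks the collapse
relu = lambda z: np.maximum(z, 0)
print(np.allclose(relu(X @ A) @ B, X @ (A @ B)))   # False: the ReLU cannot be folded into the weights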


The Freight Train

You can picture a transformer as a freight train:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Embedding │─────────│   W_QKV   │─────────│   W_O     │─────────│    W_1    │───
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              LayerNorm            softmax              LayerNorm
              

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
────│    W_2    │─────────│   W_out   │───▢ Output
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
GELU                                  softmax

Each car is a matrix. Embedding, projection weights, FFN weights, output projection. Each takes vectors in, transforms them, passes them out.

The joints are non-linearities. LayerNorm, softmax, GELU, etc. They bind the train together as it maneuvers the difficult terrain of the latent space.

The train is long. What you see above is only one transformer block (plus the embedding and output cars). Modern dense (non-MoE) transformers stack dozens of these blocks; the largest Llama 3 model, Llama 3.1 405B, has 126 layers.
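
To make the analogy concrete, here is a minimal single-head, single-block forward pass in NumPy, assuming random weights, a pre-LayerNorm layout, tied output weights, and no causal mask:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

seq, d, vocab = 5, 8, 10
rng = np.random.default_rng(0)

# The cars: one weight matrix per stop on the train
W_embed = rng.standard_normal((vocab, d))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d, d)) for _ in range(4))
W_1 = rng.standard_normal((d, 4 * d))
W_2 = rng.standard_normal((4 * d, d))

token_ids = rng.integers(0, vocab, size=seq)
X = W_embed[token_ids]                            # embedding lookup, (seq Γ— d)

# Attention car, with its non-linear joints
Xn = layernorm(X)
Q, K, V = Xn @ W_Q, Xn @ W_K, Xn @ W_V
A = softmax(Q @ K.T / np.sqrt(d))
X = X + (A @ V) @ W_O                             # residual connection

# Feed-forward car
X = X + gelu(layernorm(X) @ W_1) @ W_2            # residual connection

# Output car: back to the vocabulary (tied weights)
logits = layernorm(X) @ W_embed.T
probs = softmax(logits)
print(probs.shape)                                # (5, 10)

Every matmul above is a car; the softmax, GELU, and LayerNorm calls are the joints between them.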


What Is the Leverage?

Matrix theory is mature and has numerous applications. Mathematicians have explored the properties of linear transformations thoroughly, and the concepts are precisely defined and used across almost every engineering field. Aerospace engineers use eigenvalues for flight stability. JPEG uses orthogonal transforms for image compression. Bridge designers use condition numbers to decide whether a numerical simulation can be trusted.

Machine learning is young. Deep learning took off around 2012. Transformers arrived in 2017. We're still discovering why things work. Why does LayerNorm help? Why does LoRA succeed with rank 8? Why do residual connections enable depth? The field is full of empirical findings waiting for theoretical grounding and inspiration. When we ask "why does this architectural choice work?", often the answer is a matrix property that engineers in other fields understood decades ago.

When mature math meets young engineering, there is a huge amount of open ground. We're not inventing new mathematics. We're recognizing old mathematics in new applications.


The Topics

Below is a tentative list of topics; it only scratches the surface.

  • Full-Rank & Causality β€” What if everything survives, but in temporal order?

    • Audio Engineering: Causal filters in real-time audio processing ensure output depends only on past samples, not future ones
    • Existing ML Application: Causal masking lets GPT see the past but not the future
    • New ML Application: Rank-aware KV cache compression for million-token contexts
  • Eigenvalues β€” What are the natural scaling factors of the transformation?

    • Aerospace: Aircraft stability analysisβ€”if any eigenvalue has positive real part, the plane's oscillations grow until it crashes
    • Existing ML Application: Residual connections keep eigenvalues near 1, enabling 100+ layer networks
    • New ML Application: Eigenvalue-constrained training to guarantee stable gradient flow
  • Condition Number β€” How extreme is the ratio between largest and smallest scaling?

    • Structural Engineering: Before trusting a bridge simulation, engineers check the condition numberβ€”ill-conditioned matrices mean the computer's answer might be garbage
    • Existing ML Application: LayerNorm and RMSNorm keep condition numbers bounded, stabilizing training
    • New ML Application: Condition-aware learning rates that adapt to local geometry
  • Positive Definiteness β€” Are all scaling factors positive?

    • Quantitative Finance: Portfolio covariance matrices must be positive definiteβ€”otherwise you get "negative variance," which is financial nonsense
    • Existing ML Application: Softmax attention produces positive semi-definite Gram matrices, making attention a valid kernel
    • New ML Application: Kernel-aware attention variants with guaranteed mathematical properties
  • Decomposition β€” How much of the input space survives the transformation?

    • Aerospace: Reduces thousands of sensor readings to a handful of principal components for real-time flight control
    • Existing ML Application: LoRA achieves efficient fine-tuning via low-rank weight updates
    • New ML Application: Adaptive rank allocationβ€”easy inputs get low-rank attention, hard inputs get full rank
  • Orthogonality β€” Are the transformation's directions independent?

    • Image Compression: JPEG uses the orthogonal Discrete Cosine Transformβ€”no information lost, perfectly reversible, and most coefficients end up near zero
    • Existing ML Application: Muon optimizer orthogonalizes gradient updates, outperforming Adam on matrix-shaped weights
    • New ML Application: Orthogonal attention heads that provably learn non-redundant patterns
  • Sparsity β€” Which parts of the transformation can we skip?

    • Circuit Simulation: Chip simulation with millions of components by exploiting sparsityβ€”each transistor only connects to a few neighbors
    • Existing ML Application: Sparse attention (Longformer, BigBird) scales to long documents by skipping distant token pairs
    • New ML Application: Learned dynamic sparsity patterns that adapt to input structure
