Why Manifold-Constrained Hyper-Connections Break Abliteration: A Structural Analysis of DeepSeek-V4

Community Article Published April 24, 2026



Abstract

Abliteration — the technique of identifying and surgically removing a "refusal direction" from transformer weight matrices — relies on three structural assumptions about how information flows through standard residual-stream transformers. DeepSeek-V4 introduces Manifold-Constrained Hyper-Connections (mHC), a replacement for standard residual connections that violates all three assumptions simultaneously. This paper documents the precise failure modes: mHC expands the residual stream to four parallel copies, replaces additive writes with doubly-stochastic mixing, and distributes the refusal representation across a 4d-dimensional space rather than a single d-dimensional vector. Additionally, the Birkhoff polytope constraint that makes mHC stable during training actively prevents the weight surgery that abliteration requires. We characterize what a correct mHC-aware abliteration procedure would demand, and discuss why DeepSeek-V4-Base (FP8, no quantization-aware training) is a more tractable target than the instruct variants despite sharing the same mHC architecture.


1. Introduction

Abliteration, introduced by Arditi et al. (2024), demonstrated that refusal behavior in RLHF-trained language models is mediated by a single linear direction in the residual stream activation space. By projecting this direction out of the model's weight matrices, refusal can be suppressed without retraining. The technique has been applied broadly across model families — Llama, Mistral, Qwen, and others — and is now a standard tool in the open model ecosystem.

DeepSeek-V4 (DeepSeek-AI, 2026) introduces three novel architectural mechanisms: Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and Manifold-Constrained Hyper-Connections (mHC). The first two modify how attention compresses the KV cache. The third replaces residual connections entirely with a learned, constraint-preserving mixing operation across four parallel residual streams.

This paper focuses exclusively on mHC and its implications for abliteration. We show that mHC does not merely complicate the procedure — it invalidates the foundational assumptions at every level. We also identify what a correct mHC-aware abliteration would require and assess the tractability of each step.


2. Background

2.1 The Standard Residual Stream

Following Elhage et al. (2021), we describe the transformer residual stream as the central computational substrate. Each token at position t carries a vector x ∈ ℝ^d called the residual stream. At each layer l, an attention sublayer and an MLP sublayer each compute a delta and add it back:

x_{l+1} = x_l + Δ_l^{attn} + Δ_l^{mlp}

Because this update is purely additive, the residual stream at the final layer L decomposes as:

x_L = x_0 + Σ_{l=0}^{L-1} (Δ_l^{attn} + Δ_l^{mlp})

where x_0 is the token embedding. The residual stream is a shared, d-dimensional scratchpad that all layers read from and write to. No layer owns it; all layers communicate through it.
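
As a concrete illustration of the additive scratchpad, here is a minimal numpy sketch (toy dimensions, arbitrary values, not tied to any real model): a direction written once at an intermediate layer is still measurable in the final stream because nothing ever overwrites it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 6

x = rng.normal(size=d)                          # x_0: token embedding
feature = rng.normal(size=d)
feature /= np.linalg.norm(feature)              # a unit direction some layer will write

for l in range(L):
    delta_attn = rng.normal(scale=0.1, size=d)  # stand-ins for sublayer outputs
    delta_mlp = rng.normal(scale=0.1, size=d)
    if l == 2:
        delta_mlp = delta_mlp + 3.0 * feature   # one layer writes the feature once
    x = x + delta_attn + delta_mlp              # purely additive update

print(float(feature @ x))                       # ~3, plus noise from the embedding and other writes
```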

Two properties of this structure are critical for what follows:

  • Superposition: Multiple distinct features can coexist in the same d-dimensional vector, encoded in near-orthogonal directions (Elhage et al., 2022).
  • Persistence: A direction written into the stream at layer l remains present at layer l+k unless a subsequent layer actively cancels it.

2.2 Abliteration: Mechanics and Assumptions

Arditi et al. (2024) identified that refusal behavior is mediated by a single direction r ∈ ℝ^d in the residual stream. The procedure has two phases.

Phase 1 — Direction extraction. N prompt pairs (harmful / harmless) are run through the model. At each layer l, residual-stream activations are collected at the sublayer input:

r_l = mean(x_l | harmful prompts) − mean(x_l | harmless prompts)
r̂_l = r_l / ‖r_l‖

Empirically, r̂_l is highly consistent across layers (cosine similarity > 0.95 in most models). A single representative direction r̂ is selected, typically the mean across layers or the direction from the single layer that best separates the two prompt classes.
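
In code, Phase 1 is a difference of means over cached activations. A minimal numpy sketch follows; the array shapes and names are illustrative and not tied to any particular hooking library.

```python
import numpy as np

def refusal_directions(acts_harmful, acts_harmless):
    """Difference-of-means refusal direction per layer.

    acts_*: arrays of shape (num_layers, num_prompts, d), residual-stream
    activations collected at the sublayer input for each prompt set.
    Returns an array of shape (num_layers, d) of unit-norm directions.
    """
    r = acts_harmful.mean(axis=1) - acts_harmless.mean(axis=1)   # (L, d)
    return r / np.linalg.norm(r, axis=-1, keepdims=True)

# A single representative direction is then chosen, e.g. from the layer whose
# direction best separates the two prompt classes on a held-out set.
```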

Phase 2 — Weight surgery. For each weight matrix W that writes into the residual stream (attention output projections W_O, MLP down projections W_down, and optionally the embedding matrix), the refusal direction is projected out of the matrix's output:

W' = W − r̂(r̂ᵀ W) = (I − r̂r̂ᵀ) W

The matrix P_⊥ = (I − r̂r̂ᵀ) is a rank-(d-1) orthogonal projector onto the hyperplane perpendicular to r̂. Applied on the left, it removes r̂ from the column space of W: whatever input the weight matrix receives, its output will have no component along r̂.
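
In code, the surgery on a single write-matrix is one outer-product subtraction. A minimal numpy sketch, using the convention that W maps its input to a d-dimensional write into the stream (y = W x):

```python
import numpy as np

def orthogonalize(W, r_hat):
    """Remove r_hat from the column space of W: (I - r_hat r_hat^T) W.

    W: (d, k) matrix that writes into the d-dimensional residual stream.
    r_hat: unit-norm refusal direction in R^d.
    Afterwards, W @ x has no component along r_hat for any input x.
    """
    return W - np.outer(r_hat, r_hat @ W)

# Sanity check on random data
rng = np.random.default_rng(0)
d, k = 64, 32
W = rng.normal(size=(d, k))
r_hat = rng.normal(size=d)
r_hat /= np.linalg.norm(r_hat)
W_abl = orthogonalize(W, r_hat)
assert abs(r_hat @ (W_abl @ rng.normal(size=k))) < 1e-10
```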

Why this works. Because all layers write to the same stream additively, preventing every layer from writing r̂ into its delta means the stream can never accumulate refusal signal. The direction may arrive (in embeddings or contextual features), but no layer reinforces it, so it does not activate the refusal behavior.

The procedure rests on three explicit assumptions:

A1 (Single stream): All layers share one residual stream of dimension d.

A2 (Additive writes): Layers contribute to the stream by addition: x_{l+1} = x_l + Δ_l.

A3 (Direction locality): Refusal is mediated by a single direction r̂ ∈ ℝ^d that is consistent across layers.

2.3 Hyper-Connections

Hyper-Connections (Zhu et al., 2024) generalize the residual connection by introducing n parallel residual streams. Instead of a single vector x ∈ ℝ^d, the representation at layer l is a matrix:

X_l ∈ ℝ^{n × d}        (n copies of the d-dimensional stream)

The read-compute-write cycle becomes:

h_l     = A_l X_l                        (read: n streams → one ℝ^d vector via learned mixing)
Δ_l     = f_l(h_l)                       (compute: sublayer operates on the mixed input)
X_{l+1} = B_l X_l + diag(B_l) Δ_l        (write: B_l mixes the streams; its diagonal distributes Δ_l across them)

Here A_l ∈ ℝ^{1×n} is a learned input-mixing vector, B_l ∈ ℝ^{n×n} controls how the n streams are mixed when writing back, and diag(B_l) ∈ ℝ^{n×1} denotes the diagonal of B_l as a column vector (so diag(B_l) Δ_l ∈ ℝ^{n×d}, with Δ_l treated as a row vector). When n = 1, both A_l and B_l reduce to the scalar 1, recovering the standard residual connection.

With n > 1, information placed in one stream at layer l is redistributed across all n streams at layer l+1 according to B_l. The streams are not independent channels — they communicate at every layer through B_l.
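
A toy numpy sketch of one read/compute/write cycle under this scheme; the dimensions, the sublayer stand-in f, and the uniform choice of B_l are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                   # n parallel streams of width d

X = rng.normal(size=(n, d))                   # X_l: the current stream matrix
A = rng.dirichlet(np.ones(n))[None, :]        # A_l in R^{1 x n} (learned in the real model)
B = np.full((n, n), 1.0 / n)                  # B_l: here the uniform doubly stochastic matrix

def f(h):                                     # stand-in for the attention/MLP sublayer
    return np.tanh(h)

h = A @ X                                     # read: (1, d) mixed sublayer input
delta = f(h)                                  # compute: (1, d) sublayer output
X_next = B @ X + np.diag(B)[:, None] * delta  # write: mix streams, distribute delta

print(X_next.shape)                           # (4, 8): still n streams of width d
```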

2.4 Manifold-Constrained Hyper-Connections (mHC)

DeepSeek-V4 adopts Hyper-Connections with one additional constraint: B_l must be a doubly stochastic matrix.

A matrix M ∈ ℝ^{n×n} is doubly stochastic if:

M_{ij} ≥ 0     for all i, j
Σ_j M_{ij} = 1  for all i    (row sums = 1)
Σ_i M_{ij} = 1  for all j    (column sums = 1)

The set of all doubly stochastic n×n matrices is the Birkhoff polytope, denoted B_n. By the Birkhoff-von Neumann theorem, the extreme points (vertices) of B_n are exactly the n! permutation matrices. Every doubly stochastic matrix is a convex combination of permutations — equivalently, a "soft permutation" that redistributes probability mass without creating or destroying it.

DeepSeek-V4 uses n_hc = 4 (four parallel streams) and enforces the doubly stochastic constraint during training via Sinkhorn-Knopp normalization (20 iterations per training step). Sinkhorn-Knopp alternately normalizes the rows and columns of the (non-negative) raw parameter matrix, converging to a doubly stochastic matrix; the fixed point is the projection onto B_n in a KL-divergence sense rather than the nearest matrix in Frobenius norm.

The critical spectral consequence follows from Perron-Frobenius theory and the Birkhoff-von Neumann decomposition: every doubly stochastic matrix has spectral radius ρ(B_l) = 1 and operator 2-norm at most 1. B_l can rotate and mix, but it cannot amplify any direction, so it cannot cause exponential growth. Applied across L layers, the product B_L B_{L-1} ... B_1 remains bounded. This is the primary motivation for the constraint: trillion-parameter MoE models at 61 layers with unconstrained mixing matrices would be prone to gradient explosion or vanishing through the residual path.
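
A minimal sketch of both the normalization and the non-amplification property (numpy; exponentiating the raw parameters to guarantee positive entries is an assumption about the parameterization, not something the report specifies):

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Push a raw parameter matrix toward the Birkhoff polytope by alternately
    normalizing rows and columns of its elementwise exponential."""
    M = np.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

rng = np.random.default_rng(0)
B = sinkhorn_knopp(rng.normal(size=(4, 4)))

print(B.sum(axis=1), B.sum(axis=0))            # both ~[1, 1, 1, 1]
print(np.linalg.norm(B, ord=2))                # operator 2-norm ~1: no direction is amplified
```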

Implementation note: the mHC mixing matrix B_l ∈ ℝ^{4×4} is small (the technical report gives an output dimension of 24 = 4 × 6 for the full dynamic parameterization, using a pre/res/post decomposition). The parameter cost of mHC is negligible compared to the attention and MLP weights; the engineering cost lies in the Sinkhorn iterations per training step and in the recomputation strategy for activation checkpointing.


3. How mHC Invalidates Abliteration's Assumptions

3.1 Failure of A1: No Single Residual Stream

What A1 requires. All layers share one vector x ∈ ℝ^d. Activation hooks at "the residual stream" yield a complete representation.

What mHC provides. The representation at layer l is X_l ∈ ℝ^{4×d} — four streams, each of dimension d. When an activation hook is placed at the standard intervention point (the input to a sublayer), it captures:

h_l = A_l · X_l ∈ ℝ^d

This is a learned linear projection of all four streams into one d-dimensional vector. The coefficients A_l[0], A_l[1], A_l[2], A_l[3] are learned and vary across layers. What the hook sees is not the full representation — it is a layer-specific weighted average of four underlying streams.

The consequence for abliteration is immediate: the direction r̂ extracted from h_l activations is a direction in the projected space — the shadow of the true 4d-dimensional representation on the d-dimensional sublayer-input plane. The extraction procedure recovers the wrong object.

To see this concretely: suppose streams 0 and 2 both contain strong refusal signal but with opposite sign, while A_l assigns equal weight to both. The refusal directions cancel in h_l. The extraction procedure reports no refusal direction at layer l. But the refusal representation is fully present in X_l — it simply wasn't visible through the A_l projection. A subsequent layer with different A_{l+1} weights may reveal it again.

Conversely, stream 1 may contain a strong refusal signal that is faithfully projected by A_l into h_l. Abliteration extracts this direction and projects it out of the sublayer weights. But the signal in streams 0, 2, and 3 is untouched, and a later A_{l'} may reconstruct the refusal direction from those streams.

Implication. Abliteration on mHC requires hooking into X_l ∈ ℝ^{4×d}, not h_l ∈ ℝ^d. The current transformers-based hook infrastructure does not expose this tensor, as it is internal to the mHC implementation.

3.2 Failure of A2: Writing Is Not Additive

What A2 requires. x_{l+1} = x_l + Δ_l. The delta from layer l's computation is simply added to the stream. History accumulates.

What mHC provides. The write operation mixes all four streams through B_l:

X_{l+1} = B_l X_l + (Δ_l distributed across streams via the diagonal of B_l)

More precisely, stream i at layer l+1 receives:

X_{l+1}[i] = Σ_j B_l[i,j] · X_l[j]  +  B_l[i,i] · Δ_l

The content of stream i at l+1 depends on the content of all four streams at layer l, weighted by the i-th row of B_l. This is not addition — it is mixing with replacement. Stream 0 at layer l+1 is not "stream 0 from layer l plus a new contribution." It is a combination of all four streams from layer l.

This breaks the additivity assumption in two ways relevant to abliteration:

Problem 1: Carrying refusal across surgically modified layers.
Suppose abliteration projects r̂ out of the weight matrices at layer l, successfully preventing Δ_l from containing any r̂ component. This works as intended. But the write-back operation takes existing stream content (X_l, which still contains r̂ across its four streams from earlier layers) and mixes it into X_{l+1} via B_l. The mixing of the old streams produces new stream content that may reconstruct r̂ at the input to layer l+1's sublayer, through A_{l+1} · X_{l+1}.

Abliteration's guarantee in standard models — "if no layer writes r̂, the stream cannot accumulate r̂" — fails here because the stream-mixing operation itself carries r̂ forward, independent of what any sublayer writes.

Problem 2: Non-commutativity of the surgery.
In standard models, you can project r̂ out of each W independently and the combined effect is simply that r̂ is never written. In mHC, the B_l mixing couples the layers. A refusal direction that survived in one stream at layer 5 can appear in a different stream at layer 6 depending on B_6, then in the sublayer input at layer 7 depending on A_7. The surgery at any single layer does not prevent downstream reconstruction — you need to address the mixing matrices as well as the sublayer weights.

3.3 Failure of A3: Refusal Lives in 4d Space, Not d

What A3 requires. A single direction r̂ ∈ ℝ^d captures and mediates refusal across all layers. This is the empirical finding that makes abliteration practical: one vector, one surgery.

What mHC provides. The full representation at layer l is X_l ∈ ℝ^{4×d}. Refusal is not a single vector in ℝ^d — it is a pattern across four parallel streams, an element of ℝ^{4×d}. The refusal "object" is:

R_full ∈ ℝ^{4×d}

where R_full[i] ∈ ℝ^d describes how stream i encodes the refusal representation at layer l.

The relationship between R_full and the d-dimensional shadow r̂ that abliteration extracts is:

r̂ ≈ A_l · R_full    (up to normalization)

This is a dimensionality-reducing projection. The same r̂ in the projected space is consistent with infinitely many R_full patterns in the 4d space — the projection is not injective. Two models could produce identical r̂ from different R_full patterns; abliteration would prescribe the same surgery for both, but the underlying 4d objects being targeted are different.

Furthermore, the B_l mixing matrices continuously transform R_full as it propagates through layers: the pattern at layer l becomes B_l · R_full (a matrix product across the stream dimension) at layer l+1, and A_{l+1} then projects the transformed pattern back to ℝ^d for the next sublayer. The apparent consistency of r̂ across layers — the empirical foundation of A3 — may be an artifact of the A_l projections happening to produce similar shadows of a rotating 4d object.

An illustrative failure mode:

Consider n=2 (simplified from 4) with a refusal pattern:

R_full = [+r, −r]    (stream 0 has r, stream 1 has −r)

If A_l = [0.5, 0.5], then: h_l = A_l · R_full = 0.5r − 0.5r = 0. Abliteration sees no refusal at this layer. No surgery is performed.

After B_l mixing:

B_l = [[0.9, 0.1], [0.1, 0.9]]   (doubly stochastic: rows and columns sum to 1)
X_{l+1} = B_l · R_full = [0.9r − 0.1r, −0.9r + 0.1r] = [0.8r, −0.8r]

The pattern persists (slightly attenuated). Now at layer l+1, if A_{l+1} = [0.8, 0.2]:

h_{l+1} = 0.8(0.8r) + 0.2(−0.8r) = 0.64r − 0.16r = 0.48r

Abliteration now sees a nonzero refusal direction. Surgery is performed on layer l+1's weights. But the stream pattern R_full (now [0.8r, −0.8r]) is untouched in the streams. After the next mixing step, it will surface in a sublayer input again.

The apparent refusal direction at any layer is a function of A_l, B_l, and R_full — not just R_full. Removing it from the sublayer weights removes it from the sublayer's contribution but not from the stream.
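
The same arithmetic in a few lines of numpy, with the abstract direction r replaced by a concrete unit vector (values match the worked example above):

```python
import numpy as np

d = 8
r = np.zeros(d)
r[0] = 1.0                                     # concrete stand-in for the refusal direction

R_full = np.stack([r, -r])                     # stream 0 carries +r, stream 1 carries -r
A_l     = np.array([[0.5, 0.5]])
A_lnext = np.array([[0.8, 0.2]])
B_l     = np.array([[0.9, 0.1],
                    [0.1, 0.9]])               # doubly stochastic

print(((A_l @ R_full) @ r).item())             # 0.0  -> extraction at layer l sees nothing
R_next = B_l @ R_full                          # [0.8r, -0.8r]: the pattern survives the mixing
print(((A_lnext @ R_next) @ r).item())         # 0.48 -> the direction reappears at layer l+1
```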


4. The Birkhoff Constraint as an Active Barrier to Surgery

Even if one could work in the full 4d space — identifying R_full correctly and wanting to eliminate it — the Birkhoff polytope constraint on B_l creates a fundamental barrier.

4.1 The Constraint Geometry

To prevent B_l from routing refusal forward through stream mixing, one would want to modify B_l such that it attenuates the R_full pattern:

B_l' = argmin_{B ∈ B_n} ‖B − B_l‖_F   s.t.   (B routes R_full minimally)

Any unconstrained perturbation of B_l to satisfy the refusal-routing constraint generally produces a matrix that violates the doubly stochastic conditions:

  1. Negative entries: Standard direction projection introduces subtraction; entries of B_l' may become negative, violating the non-negativity requirement.

  2. Row/column sum violations: A generic perturbation changes row and column sums. Restoring them requires Sinkhorn normalization, which further modifies B_l' in uncontrolled ways.

  3. Spectral radius > 1: If B_l' is no longer doubly stochastic, Perron-Frobenius no longer guarantees ρ(B_l') = 1. Even a small violation (say ρ = 1.01) compounds exponentially across layers: 1.01^{61} ≈ 1.83 for the 61-layer V4-Pro. This introduces activation growth that the model was never trained to handle.

4.2 The Fundamental Tension

Abliteration requires the model to attenuate the refusal direction — to create a gap where the stream cannot accumulate r̂ even if it arrives in the input. In mathematical terms, for some direction v, we want B_l to satisfy:

‖B_l · v‖ < ‖v‖    (attenuation: the direction shrinks as it passes through B_l)

But doubly stochastic matrices have spectral radius exactly 1. For all v:

‖B_l · v‖ ≤ ‖v‖    (non-amplification in the Euclidean norm, guaranteed by double stochasticity)

with equality for v = 1_n (the all-ones vector), which is always an eigenvector with eigenvalue 1.

The constraint therefore guarantees non-amplification but leaves almost no room for targeted attenuation. Any component of the stream pattern that is shared equally across all four streams lies along 1_n, and every doubly stochastic matrix preserves it exactly (B_l 1_n = 1_n): a refusal signal written identically into all streams passes through every mixing layer unchanged, no matter how B_l is chosen within B_n. Components that differ across streams can in principle be damped, for example by averaging a subset of streams, but B_l acts only on the four stream indices and does so identically for every one of the d feature dimensions. It therefore cannot suppress the refusal pattern specifically: whatever attenuation it applies hits every feature that shares the same stream pattern, and escaping this limitation requires leaving the Birkhoff polytope.
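
Both facts are easy to check numerically with an exactly doubly stochastic matrix built as a convex combination of permutation matrices (a minimal sketch, illustrative values only):

```python
import numpy as np

n = 4
I = np.eye(n)
P = np.roll(I, 1, axis=0)                        # a cyclic permutation matrix
B = 0.5 * I + 0.3 * P + 0.2 * (P @ P)            # convex combination -> doubly stochastic

ones = np.ones(n)
v = np.array([1.0, -1.0, 1.0, -1.0])             # a stream pattern orthogonal to 1_n

print(B @ ones)                                  # [1 1 1 1]: the shared component is preserved exactly
print(np.linalg.norm(B @ v), np.linalg.norm(v))  # the stream-difference component can only shrink
```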

4.3 Sinkhorn as a Constraint Enforcer, Not a Fix

One might attempt: modify B_l to attenuate refusal, then project back onto B_n via Sinkhorn-Knopp. This is the natural approach given that Sinkhorn was used during training.

The problem is that Sinkhorn projection simply returns a nearby doubly stochastic matrix (nearest in a KL-divergence sense, not Frobenius), without regard for the refusal-routing property. After projection:

  • The modified B_l' was designed to not route R_full
  • The Sinkhorn projection changes B_l' to satisfy double stochasticity
  • The projected matrix may or may not route R_full — there is no guarantee that the constraint is respected

You would need to solve a jointly constrained problem: find a doubly stochastic B that routes R_full minimally while staying close to the original B_l. This is a convex problem (B_n is a convex polytope, the routing constraint is linear, and the proximity objective is convex) but has no closed form. It must be solved numerically per layer, 61 times for V4-Pro.


5. What Correct mHC Abliteration Would Require

Given the above analysis, a correct procedure for abliteration on an mHC-equipped model would involve the following steps. None of this is claimed to be impossible; it is a research project with a clear structure, but several steps pose non-trivial engineering and computational challenges.

Step 1: Full 4d Activation Collection

For each layer l, collect the full four-stream representation X_l ∈ ℝ^{4×d} (not the projected h_l = A_l · X_l) for both harmful and harmless prompt sets. Compute:

R_full_l = mean(X_l | harmful) − mean(X_l | harmless) ∈ ℝ^{4×d}

This requires hooking into the internal mHC buffer, which standard transformers hooks do not expose; it means working directly with DeepSeek's mHC implementation (TileLang kernels, custom forward pass) or re-implementing it.
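
A sketch of what the collection step could look like, assuming a re-implementation that exposes the stream tensor from each mHC module. The module path layer.mhc and the shape of its output are hypothetical; nothing equivalent exists in the released code.

```python
import torch

captured = {}                                         # layer index -> list of X_l tensors

def make_stream_hook(layer_idx):
    def hook(module, inputs, output):
        # Assumption: `output` is the full stream tensor X_l with trailing shape (4, d).
        # Standard transformers hooks only ever see h_l = A_l · X_l, so this relies on a
        # custom forward pass (or DeepSeek's own kernels) returning the stream matrix.
        captured.setdefault(layer_idx, []).append(output.detach().float().cpu())
    return hook

# Hypothetical attachment point; `layer.mhc` does not exist in any released implementation:
# for i, layer in enumerate(model.model.layers):
#     layer.mhc.register_forward_hook(make_stream_hook(i))

def r_full(x_harmful, x_harmless):
    """Difference of means in the full stream space; inputs have shape (N, 4, d)."""
    return x_harmful.mean(dim=0) - x_harmless.mean(dim=0)    # (4, d)
```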

Step 2: Cross-Layer Direction Consistency Analysis

Unlike standard abliteration where r̂ is approximately consistent across layers, R_full_l transforms across layers by the B_l matrices:

R_full_{l+1} ≈ B_l · R_full_l    (schematically — the exact transform depends on the pre/res/post decomposition)

The full refusal pattern must be tracked in the 4d space across all layers. The consistency that A3 relied on (a single r̂ works everywhere) must be replaced by a layer-dependent 4d pattern, adjusted for the B_l transformations.

Step 3: Sublayer Weight Projection in the Mixed Space

For each sublayer, the projection of R_full_l into the sublayer-input space is:

r_l = A_l · R_full_l ∈ ℝ^d

This is the direction the sublayer actually sees. Weight surgery on this layer projects r_l out of the sublayer's weight matrices — the same operation as standard abliteration, but now using the correctly computed r_l from the full 4d representation rather than the naive hook:

W' = (I − r̂_l r̂_lᵀ) W    where r̂_l = r_l / ‖r_l‖, applied to the matrices that write into the stream, as in Section 2.2

Step 4: Constrained Optimization on Each B_l

For each B_l, solve:

B_l' = argmin_{B ∈ B_n} ‖B − B_l‖_F
       subject to: (A_{l+1} · B · R_full_l) ≈ 0

This minimizes disruption to B_l (to preserve model behavior) while ensuring that after mixing, the refusal pattern projects to near-zero at the input to the next sublayer. The constraint is linear in B and B_n is a convex polytope, so this is a quadratic program with linear constraints; it can be solved per layer with standard tooling (e.g. CVXPY with the OSQP solver).
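
A per-layer sketch of this program using CVXPY, with the routing requirement expressed as a quadratic penalty rather than the hard ≈ 0 constraint so that the problem stays a plain QP. All inputs are assumed to have been extracted already, and the penalty weight lam is a free choice.

```python
import cvxpy as cp
import numpy as np

def repair_mixing_matrix(B_l, A_next, R_full, lam=10.0):
    """Find a doubly stochastic B close to B_l that routes the refusal pattern
    weakly into the next sublayer input.

    B_l: (n, n) original mixing matrix; A_next: (1, n) next-layer read vector;
    R_full: (n, d) refusal pattern at this layer.
    """
    n = B_l.shape[0]
    B = cp.Variable((n, n), nonneg=True)
    constraints = [cp.sum(B, axis=1) == 1,          # row sums = 1
                   cp.sum(B, axis=0) == 1]          # column sums = 1
    routing = A_next @ B @ R_full                   # (1, d): refusal seen by the next sublayer
    objective = cp.Minimize(cp.sum_squares(B - B_l) + lam * cp.sum_squares(routing))
    cp.Problem(objective, constraints).solve(solver=cp.OSQP)
    return B.value
```

The hard version replaces the penalty with the linear constraint A_{l+1} B R_full_l = 0, at the risk of infeasibility when the refusal pattern has a large component shared across all streams.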

Note: even if each layer's QP is solved correctly, there is no guarantee that the chain composition across L layers eliminates refusal globally. A post-hoc verification pass (running forward pass, checking for refusal reconstruction) is necessary.

Step 5: Global Verification

After modifying all W matrices (Step 3) and all B_l matrices (Step 4), run the harmful prompt set forward through the modified model and verify that refusal activations do not reconstruct. If they do, iterate: recompute R_full_l from the modified model's activations and repeat Steps 2-4. This is analogous to how multi-round abliteration is sometimes applied in standard models, but more expensive here because each round requires solving 61 QPs.


6. V4-Base as a Comparatively Tractable Target

DeepSeek-V4-Pro-Base and DeepSeek-V4-Flash-Base share the same mHC architecture as the instruct variants, so the mHC-specific challenges in Sections 3 and 4 apply equally. However, the base models present two advantages:

Advantage 1: No FP4 QAT.
The instruct models underwent Quantization-Aware Training (QAT) during post-training, tuning the weight values to survive lossless conversion to FP4. The expert weights satisfy a specific scale-factor ratio condition (FP4 sub-blocks of 1×32 tiles relative to the FP8 quantization blocks of 128×128 tiles) that makes FP4-to-FP8 dequantization lossless. After abliteration surgery, the modified weights no longer satisfy this condition. Deploying the abliterated instruct model at FP4 efficiency is not possible without re-running QAT.

The base models are stored in FP8 with no QAT. Abliteration surgery on FP8 weights requires dequantizing to BF16, applying the projection, and re-quantizing to FP8. Re-quantizing to FP8 without QAT is standard practice with manageable accuracy loss, comparable to ordinary BF16→FP8 workflows, so the base-model path (dequantize FP8→BF16, apply the surgery, re-quantize BF16→FP8) is straightforward.
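
A toy simulation of that round trip (per-tensor absmax scaling stands in for the 128×128 blockwise scheme, and torch.float8_e4m3fn requires a reasonably recent PyTorch build; this illustrates the workflow, not DeepSeek's actual quantization code):

```python
import torch

def fp8_roundtrip_surgery(w_fp8, scale, r_hat):
    """Dequantize an FP8 write-matrix, orthogonalize it against r_hat, re-quantize.

    w_fp8: (d, k) weight stored as torch.float8_e4m3fn; scale: its dequantization scale;
    r_hat: unit-norm float32 refusal direction of length d.
    Per-tensor scaling here is a simplification of the blockwise scheme."""
    w = (w_fp8.to(torch.bfloat16) * scale).float()       # FP8 -> BF16 -> fp32 for the math
    w = w - torch.outer(r_hat, r_hat @ w)                # (I - r r^T) W
    new_scale = w.abs().max() / 448.0                    # 448 = max normal value of e4m3
    return (w / new_scale).to(torch.float8_e4m3fn), new_scale
```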

Advantage 2: Less Distributed Safety Behavior.
The instruct models' safety behavior was instilled through On-Policy Distillation (OPD) from 10+ domain specialist teachers, each themselves trained via SFT+GRPO. This cascaded, multi-teacher process distributes safety across the model's logit distributions more broadly than single-stage RLHF. The "refusal direction" (or its 4d mHC analog) is expected to be less sharp and more entangled with general capability representations.

Base models have no post-training safety behavior, so abliteration in the traditional sense is moot for them. But this observation suggests a path: if a practitioner first applied a lightweight supervised safety pass to the base model (a thin SFT stage with refusal examples) and then abliterated that model, the induced safety direction would be simpler and more localized than in the OPD-trained instruct variants. The mHC challenges would still apply, but the target would be less diffuse.


7. Implications and Open Questions

7.1 Implications for the Model Ecosystem

The adoption of mHC in a production model at this scale (1.6T parameters) is the first instance we are aware of in the open-weight corpus where a non-standard residual topology meaningfully complicates post-hoc weight surgery. Frankenmerge techniques (DARE, TIES, SLERP), LoRA rank decomposition, and activation steering are all built on the assumption of a single additive residual stream. mHC invalidates this for merging between mHC and non-mHC models, for steering vectors computed in the standard way, and for any technique that assumes linear superposition in the stream.

7.2 Does mHC Reduce Steerability Generally?

The same properties that complicate abliteration also complicate activation steering (adding a direction to the residual stream to modify behavior). Adding a steering vector to h_l (the sublayer input) adds it to the projected view, not to the underlying 4d representation. The B_l mixing then distributes it across streams in ways not controlled by the steering practitioner. Whether this makes mHC models harder to steer in practice is an empirical question not answered by this analysis.

7.3 Does the mHC Paper Address This?

DeepSeek references a dedicated mHC paper (Xie et al., 2026) for engineering details. This paper was not publicly available at the time of writing (April 2026). The mHC paper may contain additional analysis of the representation geometry that would inform a more precise abliteration procedure.

7.4 Hash Routing Layers as a Partial Simplification

V4's first three MoE layers use deterministic hash routing — token identity determines expert assignment, with no learned routing. These layers' FFN components do not learn routing-dependent representations and thus may not strongly encode the refusal direction in the hash-routed expert weights. Abliteration could potentially skip or minimize surgery on these layers without significant loss of effectiveness.


8. Conclusion

Abliteration depends on three properties of the standard transformer residual stream: (A1) a single shared stream of dimension d, (A2) purely additive writes, and (A3) a single direction r̂ ∈ ℝ^d capturing refusal across all layers. Manifold-Constrained Hyper-Connections, as implemented in DeepSeek-V4, violate all three simultaneously.

mHC maintains four parallel residual streams (violating A1), mixes them through doubly stochastic matrices at every layer (violating A2), and distributes the refusal representation across a 4d-dimensional space (violating A3). The Birkhoff polytope constraint that ensures training stability also blocks the targeted attenuation that abliteration requires: a doubly stochastic B_l preserves any signal shared equally across the streams, acts identically on every feature dimension, and can only be made to attenuate a specific direction by leaving the polytope, which introduces instability the model was never trained to handle.

Correct mHC abliteration would require: hooking into the full 4d stream, tracking refusal patterns across layers in the mixed space, and solving a constrained quadratic program per layer to modify the B_l matrices while preserving double stochasticity. This is theoretically well-defined but practically demanding, and has not been demonstrated at trillion-parameter scale.

The base model variants (V4-Pro-Base, V4-Flash-Base) are more tractable due to the absence of FP4 QAT, though the mHC challenges are identical. The most tractable path for practitioners who wish to produce a V4-class model without safety constraints is post-training the base model from scratch — effectively building one's own instruct, skipping the challenge of undoing what OPD built.


References

Arditi, A., Obeso, O. B., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717.

Birkhoff, G. (1946). Tres observaciones sobre el álgebra lineal. Universidad Nacional de Tucumán Revista, Series A, 5, 147–151.

DeepSeek-AI. (2026). DeepSeek-V4 Technical Report. HuggingFace: deepseek-ai/DeepSeek-V4-Flash.

Elhage, N., Nanda, N., Olsson, C., et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.

Elhage, N., Henighan, T., Joseph, N., et al. (2022). Toy Models of Superposition. Transformer Circuits Thread.

Perron, O. (1907). Zur Theorie der Matrizen. Mathematische Annalen, 64(2), 248–263.

Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343–348.

Von Neumann, J. (1953). A certain zero-sum two-person game equivalent to the optimal assignment problem. In H. Kuhn & A. Tucker (Eds.), Contributions to the Theory of Games, Vol. 2, pp. 5–12.

Xie, Z., et al. (2026). Manifold-Constrained Hyper-Connections. (Referenced in DeepSeek-V4 Technical Report; not yet publicly available at time of writing.)

Zhu, D., et al. (2024). Hyper-Connections. arXiv preprint.


This analysis was produced using ModelAtlas, a structured database of large language model architectures maintained at github.com/[private]. Model architecture data was sourced from HuggingFace model configurations and the DeepSeek-V4 technical report (April 2026).
