mHC: Manifold-Constrained Hyper-Connections
https://www.arxiv.org/pdf/2512.24880
Problem Statement
In the simplest deep neural network, a layer generates an output by multiplying the input by its weights. The output then becomes the input for the next layer.
Consider the simplest case: the input is a single data point (no batching) with only one feature (one dimension). Layer L produces an output, say Y, and Y then becomes the input to layer L + 1, which produces an output Z. The layers are largely independent; layer L + 1 doesn't really need to know about layer L. It needs just the input Y, and it will compute Z. Now, suppose layer L + 1 merely adds some delta of meaning to the existing Y, so Z can be written as Y + Δvalue. Instead of making layer L + 1 determine the whole Z (Y + Δvalue), why not pass Y through directly and have L + 1 learn only that delta? This is the idea of ResNet, which was designed mainly to fix the vanishing-gradient problem.
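The residual idea above can be sketched in a few lines of Python. This is a 1-D toy with a single scalar weight `w`, just to show the change in what the layer has to learn:

```python
def plain_layer(x, w):
    # A plain layer must produce the whole output Z by itself.
    return w * x

def residual_layer(x, w):
    # A residual layer passes Y through and only learns the delta:
    # Z = Y + Δvalue, where here Δvalue = w * Y.
    return x + w * x

y = 2.0
z = residual_layer(y, 0.1)  # Y plus a small learned correction
```

With a residual connection, even if `w` is near zero the layer still passes `Y` through unchanged, which is what keeps gradients flowing in very deep stacks.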
Now, let’s talk about language models. In a Transformer architecture, residual connections are used around each sub-layer (the Multi-Head Attention and the Feed-Forward Network). This allows the model to retain information from earlier processing stages and helps gradients flow through the very deep stacks of blocks typically found in large language models.
The input matrix to a Transformer block is S × C. Ignoring batches for now, S is the number of tokens in the input (we are talking about a single prompt) and C is the model dimension, i.e., the size of each token's embedding. The role of C, in general, is to capture the features of a token in its current context. The larger C is, the more features we can capture. But increasing C has a significant compute cost.
As C grows, the computation in a Transformer block scales roughly as C². And beyond the compute cost, we are trying to fit all concepts into a single space of C dimensions. So, instead of widening a single vector to nC features (1D), we keep C fixed and maintain n parallel copies of the residual, an n × C matrix per token (2D), where n is the number of copies. The hope is that each copy captures either distinct information or a different level of abstraction. These parallel copies are called "data streams."
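The widening step can be sketched as follows. Copying the input into each stream is just one simple way to initialize the n streams, not necessarily what the paper does:

```python
import numpy as np

S, C, n = 4, 8, 2  # tokens, model dimension, number of data streams

x = np.random.randn(S, C)      # the usual single residual stream: S x C
streams = np.stack([x] * n)    # widen to n streams: shape (n, S, C)
```

Note that only the residual state gets wider; the per-layer weight matrices still operate on C-dimensional vectors, which is the whole point of avoiding the C² blowup.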
Just as the attention mechanism lets individual tokens share information among themselves, we want these rows (streams) to share information among themselves. How much information one stream should take from another is decided by a learned weight matrix. This matrix is called Hres.
But this doesn't solve the problem of increased computation, because we have still grown the per-token state from C to nC values. To fix that, we use the same old trick of projecting down and projecting up, used again and again across popular techniques (LoRA, DeepSeek Attention, VAEs, etc.). Take the n × C matrix, project it down to 1 × C (using a learned matrix called Hpre), run the regular Transformer block, then project the block's output back up to an n × C matrix (using another learned matrix, called Hpost).
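Putting the three matrices together, one pass through a block might look like the sketch below. This shows shapes only: `transformer_block` is a placeholder, the matrices are random rather than learned, and the exact wiring (adding the projected output back onto the mixed streams) is a simplification, not the paper's precise formulation:

```python
import numpy as np

n, S, C = 4, 16, 64  # streams, tokens, model dimension
rng = np.random.default_rng(0)

streams = rng.standard_normal((n, S, C))   # n residual data streams
H_res   = rng.standard_normal((n, n))      # mixes streams with each other
H_pre   = rng.standard_normal((1, n))      # projects n streams down to 1
H_post  = rng.standard_normal((n, 1))      # projects the block output back to n

def transformer_block(x):
    return x  # placeholder for attention + feed-forward on an (S, C) input

mixed = np.einsum('ij,jsc->isc', H_res, streams)    # (n, S, C): stream mixing
x_in  = np.einsum('ij,jsc->isc', H_pre, mixed)[0]   # (S, C): project down
x_out = transformer_block(x_in)                     # block runs at single-stream cost
streams_next = mixed + np.einsum('ij,jsc->isc', H_post, x_out[None])  # (n, S, C)
```

The key shape fact is that the expensive block only ever sees an S × C input, so the n streams add mixing and projection overhead but not n× attention/FFN compute.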
Now, the problem is that the weights used for mixing, projecting down, and projecting up aren't stable during training. They can grow very large or shrink very small, and the model never reaches its full potential. That is exactly the problem this paper is trying to solve.
Proposed Solution
To solve this instability problem, the proposed solution is to bound the Hres values. The way we bound them is by finding a matrix which satisfies certain properties:
1. All entries in the matrix are non-negative.
2. Every column sums to 1.
3. Every row sums to 1.
4. The product of two such matrices again satisfies properties 1–3 (so the constraint survives composition across layers).
My understanding of these conditions is as follows:
Non-negativity is what makes (2) and (3) meaningful. More intuitively, we don't want the signals to cancel each other out; we want the mixing to be constructive, not destructive.
The column-sum condition ensures that each input stream's total contribution is preserved after applying Hres: its "mass" is redistributed across output streams, neither amplified nor lost.
The row-sum condition ensures each output stream is a weighted average (a convex combination) of the input streams: some streams contribute more, some less.
The multiplication property ensures that, layer after layer, properties 1–3 continue to hold for the composed mixing.
So, how do we create such an Hres? We use something called the Sinkhorn-Knopp algorithm.
All the values should be non-negative: to achieve this, we exponentiate each entry. This is a common pattern in deep learning architectures (softmax does the same thing).
Each row and each column should sum to 1: for this, we repeatedly normalize the columns and then the rows of Hres. Normalizing means dividing each column by its sum, and then doing the same for each row. Iterated infinitely, this converges to a matrix satisfying conditions 1–3. In practice, we iterate only 20 times.
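A minimal sketch of the Sinkhorn-Knopp procedure described above, in plain NumPy. The `n_iters=20` default matches the iteration count mentioned in the text:

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    # Step 1: exponentiate so every entry is strictly positive.
    m = np.exp(logits)
    # Step 2: alternately normalize columns, then rows.
    for _ in range(n_iters):
        m = m / m.sum(axis=0, keepdims=True)  # each column sums to 1
        m = m / m.sum(axis=1, keepdims=True)  # each row sums to 1
    return m

rng = np.random.default_rng(0)
H = sinkhorn_knopp(rng.standard_normal((4, 4)))
# Rows sum to 1 exactly (the last step normalized them);
# columns are very close to 1 after 20 alternations.
```

Because the last step in each iteration normalizes rows, the row sums are exact while the column sums are only approximately 1, converging as the iteration count grows.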
And that’s the crux of the paper.
Observed Results
If you use the data-stream approach without any constraints, the authors observed that the initial input could be amplified up to 3,000 times. The proposed approach kept this amplification down to just 1.6 times (the ideal would be 1).
The model performed 2% better on reasoning benchmarks.
Despite using 4 data streams, training took only 6.7% extra time.
And the results scale: they were validated across 3B, 9B, and 27B models.
Thank you!
