Overview
- mHC projects each layer's mixing gates onto the Birkhoff polytope, constraining them to doubly stochastic matrices so the residual mix conserves signal and avoids the loss surges seen with unconstrained hyper-connections (see the sketch after this list).
- DeepSeek reports stable training on 3B-, 9B-, and 27B-parameter models, with the 27B trials avoiding the collapse that earlier designs encountered.
- Systems optimizations (TileLang kernel fusion, selective recomputation, and DualPipe scheduling) keep the cost of a roughly 4x wider effective residual stream to about a 6.7% training-time overhead.
- Internal benchmarks on the 27B model show reasoning gains over a standard baseline: +4.4 points on MMLU, +7.1 on GSM8K, and +7.2 on BBH.
- The approach builds on ResNet and ByteDance's 2024 hyper-connections work; outside experts view it as significant, and observers speculate that a related DeepSeek model could arrive before Spring Festival 2026.
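
The first bullet leaves the projection mechanism implicit. A standard way to map an arbitrary gate matrix onto the Birkhoff polytope is Sinkhorn-Knopp normalization; the sketch below assumes that method (the report does not name the exact projection), uses a hypothetical `sinkhorn_project` helper, and checks the signal-conservation property the doubly stochastic constraint provides.

```python
import numpy as np

def sinkhorn_project(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Approximately project a gate matrix onto the Birkhoff polytope
    via Sinkhorn-Knopp: exponentiate to get positive entries, then
    alternately normalize rows and columns until both sum to ~1.
    (Assumed method; the report does not specify the projection.)"""
    m = np.exp(logits - logits.max())        # positive and numerically stable
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)    # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)    # columns sum to 1
    return m

# A 4x4 gate, matching the ~4x residual-stream expansion cited above.
rng = np.random.default_rng(0)
gate = sinkhorn_project(rng.normal(size=(4, 4)))
print(gate.sum(axis=0), gate.sum(axis=1))    # both approach [1, 1, 1, 1]

# Doubly stochastic mixing redistributes the residual streams without
# changing their total activation per hidden dimension.
streams = rng.normal(size=(4, 8))            # 4 streams, hidden dim 8
mixed = gate @ streams
print(streams.sum(axis=0)[:3])               # per-dimension totals...
print(mixed.sum(axis=0)[:3])                 # ...are preserved after mixing
```

Because every row and column of the gate sums to 1, mixing the widened residual streams neither amplifies nor attenuates their aggregate signal, which is the property the overview credits with preventing loss surges.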