Overview
- mHC projects each layer's mixing gates onto the Birkhoff polytope, constraining them to doubly stochastic matrices so the residual mix conserves signal and avoids the loss surges seen with unconstrained hyper-connections (see the sketch after this list).
- DeepSeek reports stable training on 3B-, 9B-, and 27B-parameter models, with the 27B trials avoiding the collapse that earlier designs encountered.
- Systems optimizations (TileLang kernel fusion, selective recomputation, and DualPipe scheduling) keep the cost of a roughly 4x wider effective residual stream to about a 6.7% training-time overhead.
- Internal benchmarks on the 27B model show reasoning gains over a standard baseline: +4.4 points on MMLU, +7.1 on GSM8K, and +7.2 on BBH.
- The approach builds on ResNet and ByteDance's 2024 hyper-connections work; outside experts view it as significant, and observers speculate that a related DeepSeek model could arrive before Spring Festival 2026.
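
The first bullet leaves the projection mechanism implicit. A standard way to map an arbitrary gate matrix onto the Birkhoff polytope is Sinkhorn-Knopp normalization; the sketch below assumes that method (the report does not name the exact projection), uses a hypothetical `sinkhorn_project` helper, and checks the signal-conservation property the doubly stochastic constraint provides.

```python
import numpy as np

def sinkhorn_project(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Approximately project a gate matrix onto the Birkhoff polytope
    via Sinkhorn-Knopp: exponentiate to get positive entries, then
    alternately normalize rows and columns until both sum to ~1.
    (Assumed method; the report does not specify the projection.)"""
    m = np.exp(logits - logits.max())        # positive and numerically stable
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)    # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)    # columns sum to 1
    return m

# A 4x4 gate, matching the ~4x residual-stream expansion cited above.
rng = np.random.default_rng(0)
gate = sinkhorn_project(rng.normal(size=(4, 4)))
print(gate.sum(axis=0), gate.sum(axis=1))    # both approach [1, 1, 1, 1]

# Doubly stochastic mixing redistributes the residual streams without
# changing their total activation per hidden dimension.
streams = rng.normal(size=(4, 8))            # 4 streams, hidden dim 8
mixed = gate @ streams
print(streams.sum(axis=0)[:3])               # per-dimension totals...
print(mixed.sum(axis=0)[:3])                 # ...are preserved after mixing
```

Because every row and column of the gate sums to 1, mixing the widened residual streams neither amplifies nor attenuates their aggregate signal, which is the property the overview credits with preventing loss surges.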