Particle.news
DeepSeek Publishes mHC to Stabilize Wider Residual Paths in Large AI Models

An arXiv paper uploaded by DeepSeek CEO Liang Wenfeng outlines a geometric constraint that keeps widened residual mixers stable at scale.

Overview

  • mHC projects mixing gates onto the Birkhoff polytope, constraining them to doubly stochastic matrices (nonnegative, with every row and column summing to 1). This conserves signal magnitude across the residual mixer and prevents the loss surges seen with unconstrained hyper-connections.
  • DeepSeek reports stable training on 3B-, 9B-, and 27B-parameter models, with the 27B runs avoiding the collapse that earlier designs encountered.
  • Systems optimizations—TileLang kernel fusion, selective recomputation, and DualPipe scheduling—expand effective residual width roughly 4x with about a 6.7% training time overhead.
  • Internal benchmarks on the 27B model show reasoning improvements over a standard baseline, including MMLU (+4.4 points), GSM8K (+7.1), and BBH (+7.2).
  • The approach builds on ResNet and ByteDance’s 2024 hyper-connection work, and outside experts note its significance as observers speculate a related DeepSeek model could arrive before Spring Festival 2026.
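The article does not detail how mHC performs the Birkhoff-polytope projection. One standard way to obtain a doubly stochastic matrix from unconstrained gate logits is Sinkhorn-Knopp iteration, which alternately rescales rows and columns; the sketch below illustrates that idea only, and the function name and iteration count are illustrative, not taken from the paper.

```python
import math
import random

def sinkhorn(logits, n_iters=200):
    """Map an unconstrained square matrix of gate logits toward the
    Birkhoff polytope (doubly stochastic: each row and column sums to 1)
    using Sinkhorn-Knopp iteration. Illustrative sketch, not DeepSeek's
    published procedure."""
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(n_iters):
        # Normalize rows to sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize columns to sum to 1.
        col_sums = [sum(row[j] for row in m) for j in range(n)]
        m = [[row[j] / col_sums[j] for j in range(n)] for row in m]
    return m

random.seed(0)
gates = sinkhorn([[random.gauss(0, 1) for _ in range(4)] for _ in range(4)])
row_sums = [sum(row) for row in gates]
col_sums = [sum(row[j] for row in gates) for j in range(4)]
print(all(abs(s - 1) < 1e-3 for s in row_sums))  # rows ~ 1
print(all(abs(s - 1) < 1e-3 for s in col_sums))  # columns ~ 1
```

Because rows and columns of the resulting gate matrix each sum to 1, mixing across residual streams neither amplifies nor attenuates total signal, which is the conservation property the bullet above attributes to mHC.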