Overview
- The Efficiency Leverage (EL) metric quantifies the compute advantage of an MoE model over a dense model of equivalent performance, and forms the basis of new predictive scaling laws (a definition is sketched after this list).
- More than 300 models with up to 28 billion parameters were trained, revealing power-law relationships that link EL to the expert activation ratio and the compute budget, with expert granularity acting as a nonlinear modulator.
- Ling-mini-beta, a pilot MoE model with 0.85 billion active parameters, matched a 6.1 billion-parameter dense LLM on a 1 trillion-token dataset while using over seven times less compute.
- Hessian-Aware Quantization uses smoothed Hessian information to enable joint 8-bit quantization of activations and weights in sparse models, substantially reducing the accuracy loss caused by outliers (a generic smoothing-based quantization sketch follows below).
- A CPU–GPU collaborative offloading scheme for expert modules lowers GPU memory usage by about 60% and improves inference latency on OPT and Mixtral models evaluated on WikiText-2 and C4 (a minimal offloading pattern is sketched below).
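
As a rough sketch of how EL and its power-law fit can be written (the notation below is assumed for illustration and is not quoted from the paper; fitted constants are omitted, and the nonlinear effect of expert granularity is left out):

```latex
% C_dense(L) and C_MoE(L): training compute a dense and an MoE model need
% to reach the same loss L (notation assumed, not the paper's).
\mathrm{EL}(L) = \frac{C_{\mathrm{dense}}(L)}{C_{\mathrm{MoE}}(L)},
\qquad
\mathrm{EL} \approx k \cdot A^{\alpha} \cdot C^{\beta}
```

Here A is the expert activation ratio, C is the compute budget, and k, α, β are fitted constants whose values are not reproduced here. As a quick arithmetic check on the Ling-mini-beta result, the active-parameter ratio alone is 6.1B / 0.85B ≈ 7.2, consistent with the "over seven times" compute figure.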
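The quantization bullet rests on the idea of migrating activation outliers into the weights before quantizing both to 8 bits. The sketch below shows only that generic smoothing-plus-int8 pattern (in the spirit of SmoothQuant-style scaling); it does not implement the paper's Hessian-aware scale selection, and all function and variable names are illustrative.

```python
import torch

def smooth_and_quantize_w8a8(x, w, alpha=0.5, eps=1e-8):
    """Generic smoothing-based joint 8-bit quantization sketch.

    x: activations, shape (tokens, in_features)
    w: weights, shape (out_features, in_features)
    A per-input-channel scale s moves quantization difficulty from
    activation outliers into the weights; both tensors are then
    quantized to int8 with a single symmetric scale each.
    """
    act_max = x.abs().amax(dim=0).clamp(min=eps)   # per-channel activation range
    w_max = w.abs().amax(dim=0).clamp(min=eps)     # per-channel weight range

    # Migration scale: larger alpha shifts more difficulty from x to w.
    # (A Hessian-aware method would instead derive s from curvature info.)
    s = (act_max ** alpha) / (w_max ** (1.0 - alpha))

    x_s = x / s   # smoothed activations
    w_s = w * s   # compensated weights: x_s @ w_s.T == x @ w.T exactly

    def quantize_int8(t):
        scale = t.abs().amax().clamp(min=eps) / 127.0
        q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    xq, x_scale = quantize_int8(x_s)
    wq, w_scale = quantize_int8(w_s)

    # Emulate the int8 matmul in float and dequantize the result.
    return (xq.float() @ wq.float().T) * (x_scale * w_scale)

# Usage: compare the quantized output against the fp32 reference.
x = torch.randn(16, 64)
w = torch.randn(32, 64)
print((smooth_and_quantize_w8a8(x, w) - x @ w.T).abs().mean())
```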
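For the expert-offloading bullet, the sketch below shows only the basic pattern of keeping expert weights in CPU memory and streaming them to the accelerator on demand. It is a minimal illustration under assumed names, not the scheme evaluated on OPT and Mixtral, which would additionally pin memory, overlap transfers with compute, and cache frequently used experts on the GPU.

```python
import torch
import torch.nn.functional as F

class OffloadedExpert(torch.nn.Module):
    """One MoE expert whose weights stay in CPU RAM and are copied to the
    accelerator only for the tokens routed to it (illustrative pattern)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Master copies of the expert weights live on the CPU.
        self.w1 = torch.nn.Parameter(torch.empty(d_ff, d_model).normal_(std=0.02))
        self.w2 = torch.nn.Parameter(torch.empty(d_model, d_ff).normal_(std=0.02))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Copy weights to wherever the activations live (e.g. the GPU),
        # run the expert FFN, and let the temporary copies be freed.
        w1 = self.w1.to(x.device, non_blocking=True)
        w2 = self.w2.to(x.device, non_blocking=True)
        return F.linear(F.relu(F.linear(x, w1)), w2)

# Usage sketch: activations on the accelerator, expert weights on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
expert = OffloadedExpert(d_model=512, d_ff=2048)   # parameters stay on CPU
tokens = torch.randn(8, 512, device=device)        # tokens routed to this expert
out = expert(tokens)
print(out.shape, out.device)
```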