Studies Unveil Scaling Laws and Edge Quantization for Mixture-of-Experts Language Models

Researchers validated a metric for predicting the compute efficiency of sparse models, while separate work on Hessian-aware low-bit inference with expert offloading cuts on-device GPU memory by roughly 60%

Overview

  • The Efficiency Leverage (EL) metric quantifies the compute advantage of MoE models over dense equivalents and forms the basis of new predictive scaling laws (a minimal sketch of the metric follows this list).
  • Over 300 models with up to 28 billion parameters were trained to reveal power-law links between EL, expert activation ratios, and compute budgets, with expert granularity acting as a nonlinear modulator (see the fitting sketch below).
  • Ling-mini-beta, a pilot MoE model with 0.85 billion active parameters, matched a 6.1 billion-parameter dense LLM on a 1 trillion-token dataset while using less than one-seventh of the compute.
  • Hessian-Aware Quantization uses smoothed Hessian information to enable joint 8-bit quantization of activations and weights in sparse models, significantly reducing the accuracy loss caused by outliers (sketched below).
  • A CPU–GPU collaborative offloading scheme for expert modules lowers GPU memory usage by about 60% and improves inference latency on OPT and Mixtral models evaluated on WikiText-2 and C4 (a minimal sketch closes this overview).
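
To make the EL metric concrete, here is a minimal sketch that assumes EL is defined as the ratio of dense-equivalent compute to MoE compute at matched model quality, using the common ~6N FLOPs-per-token approximation; the function names and this exact formulation are assumptions, not lifted from the papers.

```python
def flops_per_token(active_params: float) -> float:
    """Rough forward+backward FLOPs per token (~6N rule of thumb)."""
    return 6.0 * active_params

def efficiency_leverage(dense_params: float, moe_active_params: float) -> float:
    """Assumed EL definition: dense-equivalent compute divided by MoE
    compute at matched model quality."""
    return flops_per_token(dense_params) / flops_per_token(moe_active_params)

# Numbers from the summary: a 0.85B active-parameter MoE matching a 6.1B dense LLM
print(efficiency_leverage(6.1e9, 0.85e9))  # ≈ 7.2, i.e. "over seven times"
```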
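The power-law links themselves can be recovered from training runs by least squares in log space, since log EL is linear in log activation ratio under a power law. The sketch below fits synthetic data drawn from an assumed law; the constants are illustrative only, not the study's fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demo data from an assumed law EL = a * r^(-b) plus noise;
# the constants 1.3 and 0.8 are illustrative, not the paper's values.
r = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125])  # expert activation ratios
el = 1.3 * r ** -0.8 * np.exp(rng.normal(0.0, 0.05, r.shape))

# A power law is linear in log space: log EL = log a - b * log r
X = np.column_stack([np.ones_like(r), np.log(r)])
(log_a, neg_b), *_ = np.linalg.lstsq(X, np.log(el), rcond=None)
print(f"EL ≈ {np.exp(log_a):.2f} * r^({neg_b:.2f})")  # recovers ≈ 1.3 * r^-0.80
```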
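The Hessian-aware bullet can be illustrated with a SmoothQuant-style sketch: a smoothed diagonal Hessian proxy (mean squared calibration activations) sets per-channel scales that shift outlier difficulty from activations into weights before both sides are quantized to 8 bits. The function and its simplifications are assumptions, not the paper's algorithm.

```python
import torch

def hessian_aware_w8a8(w: torch.Tensor, x: torch.Tensor, alpha: float = 0.5):
    """Sketch of joint 8-bit weight/activation quantization.

    w: (out_features, in_features) weight matrix.
    x: (tokens, in_features) calibration activations.
    """
    # Smoothed diagonal Hessian proxy H ≈ E[x^2] per input channel
    h_diag = x.pow(2).mean(dim=0) + 1e-6
    # Per-channel scale migrating quantization difficulty from
    # activations to weights; alpha balances the two sides
    s = h_diag.sqrt().pow(alpha) / w.abs().amax(dim=0).clamp_min(1e-6).pow(1.0 - alpha)
    w_s, x_s = w * s, x / s  # (x/s) @ (w*s).T == x @ w.T before rounding

    def q8(t: torch.Tensor) -> torch.Tensor:
        """Symmetric per-tensor int8 fake-quantization."""
        scale = t.abs().max() / 127.0
        return (t / scale).round().clamp(-127, 127) * scale

    return q8(w_s), q8(x_s)
```

Because the scale s cancels inside the matrix product, the rescaling is exact before rounding; the 8-bit error is then spread across channels whose outliers have been tamed.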
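Finally, the collaborative offloading scheme can be sketched as an on-demand expert cache: expert weights stay in host memory, only the experts the router actually selects are copied to the GPU, and LRU eviction bounds device memory. Class and method names here are hypothetical stand-ins for the paper's mechanism.

```python
import copy
from collections import OrderedDict

import torch

class ExpertOffloader:
    """Keep MoE experts on CPU; copy routed experts to GPU with LRU reuse."""

    def __init__(self, experts, cache_size: int = 2, device: str = "cuda"):
        self.cpu_experts = [e.to("cpu") for e in experts]  # host master copies
        self.cache = OrderedDict()  # expert_id -> GPU-resident module
        self.cache_size = cache_size
        self.device = device

    def fetch(self, idx: int) -> torch.nn.Module:
        if idx in self.cache:
            self.cache.move_to_end(idx)  # mark as most recently used
            return self.cache[idx]
        if len(self.cache) >= self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used copy
        # Deep-copy so the CPU master stays resident after eviction
        gpu_expert = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
        self.cache[idx] = gpu_expert
        return gpu_expert

    def forward_tokens(self, tokens: torch.Tensor, routed_ids) -> torch.Tensor:
        # Run each token through its routed expert, loading experts on demand
        return torch.stack([self.fetch(i)(t) for t, i in zip(tokens, routed_ids)])
```

GPU memory then scales with cache_size rather than with the total expert count, which is the kind of lever the reported ~60% memory reduction relies on.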