Studies Unveil Scaling Laws and Edge Quantization for Mixture-of-Experts Language Models

Researchers validated a metric for predicting the compute efficiency of sparse models, while separate work on Hessian-aware low-bit inference with expert offloading cuts on-device GPU memory by roughly 60%

Overview

  • The Efficiency Leverage (EL) metric quantifies the compute advantage of MoE models over dense equivalents and forms the basis of new predictive scaling laws (a minimal sketch of the metric follows this list).
  • Over 300 models with up to 28 billion parameters were trained to reveal power-law links between EL, expert activation ratios, and compute budgets, with expert granularity acting as a nonlinear modulator (see the fitting sketch below).
  • Ling-mini-beta, a pilot MoE model with 0.85 billion active parameters, matched a 6.1 billion-parameter dense LLM on a 1 trillion-token dataset while using less than one-seventh of the compute.
  • Hessian-Aware Quantization uses smoothed Hessian information to enable joint 8-bit quantization of activations and weights in sparse models, significantly reducing the accuracy loss caused by outliers (sketched below).
  • A CPU–GPU collaborative offloading scheme for expert modules lowers GPU memory usage by about 60% and improves inference latency on OPT and Mixtral models evaluated on WikiText-2 and C4 (a minimal sketch closes this overview).
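
To make the EL metric concrete, here is a minimal sketch that assumes EL is defined as the ratio of dense-equivalent compute to MoE compute at matched model quality, using the common ~6N FLOPs-per-token approximation; the function names and this exact formulation are assumptions, not lifted from the papers.

```python
def flops_per_token(active_params: float) -> float:
    """Rough forward+backward FLOPs per token (~6N rule of thumb)."""
    return 6.0 * active_params

def efficiency_leverage(dense_params: float, moe_active_params: float) -> float:
    """Assumed EL definition: dense-equivalent compute divided by MoE
    compute at matched model quality."""
    return flops_per_token(dense_params) / flops_per_token(moe_active_params)

# Numbers from the summary: a 0.85B active-parameter MoE matching a 6.1B dense LLM
print(efficiency_leverage(6.1e9, 0.85e9))  # ≈ 7.2, i.e. "over seven times"
```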
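The power-law links themselves can be recovered from training runs by least squares in log space, since log EL is linear in log activation ratio under a power law. The sketch below fits synthetic data drawn from an assumed law; the constants are illustrative only, not the study's fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demo data from an assumed law EL = a * r^(-b) plus noise;
# the constants 1.3 and 0.8 are illustrative, not the paper's values.
r = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125])  # expert activation ratios
el = 1.3 * r ** -0.8 * np.exp(rng.normal(0.0, 0.05, r.shape))

# A power law is linear in log space: log EL = log a - b * log r
X = np.column_stack([np.ones_like(r), np.log(r)])
(log_a, neg_b), *_ = np.linalg.lstsq(X, np.log(el), rcond=None)
print(f"EL ≈ {np.exp(log_a):.2f} * r^({neg_b:.2f})")  # recovers ≈ 1.3 * r^-0.80
```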
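The Hessian-aware bullet can be illustrated with a SmoothQuant-style sketch: a smoothed diagonal Hessian proxy (mean squared calibration activations) sets per-channel scales that shift outlier difficulty from activations into weights before both sides are quantized to 8 bits. The function and its simplifications are assumptions, not the paper's algorithm.

```python
import torch

def hessian_aware_w8a8(w: torch.Tensor, x: torch.Tensor, alpha: float = 0.5):
    """Sketch of joint 8-bit weight/activation quantization.

    w: (out_features, in_features) weight matrix.
    x: (tokens, in_features) calibration activations.
    """
    # Smoothed diagonal Hessian proxy H ≈ E[x^2] per input channel
    h_diag = x.pow(2).mean(dim=0) + 1e-6
    # Per-channel scale migrating quantization difficulty from
    # activations to weights; alpha balances the two sides
    s = h_diag.sqrt().pow(alpha) / w.abs().amax(dim=0).clamp_min(1e-6).pow(1.0 - alpha)
    w_s, x_s = w * s, x / s  # (x/s) @ (w*s).T == x @ w.T before rounding

    def q8(t: torch.Tensor) -> torch.Tensor:
        """Symmetric per-tensor int8 fake-quantization."""
        scale = t.abs().max() / 127.0
        return (t / scale).round().clamp(-127, 127) * scale

    return q8(w_s), q8(x_s)
```

Because the scale s cancels inside the matrix product, the rescaling is exact before rounding; the 8-bit error is then spread across channels whose outliers have been tamed.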
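Finally, the collaborative offloading scheme can be sketched as an on-demand expert cache: expert weights stay in host memory, only the experts the router actually selects are copied to the GPU, and LRU eviction bounds device memory. Class and method names here are hypothetical stand-ins for the paper's mechanism.

```python
import copy
from collections import OrderedDict

import torch

class ExpertOffloader:
    """Keep MoE experts on CPU; copy routed experts to GPU with LRU reuse."""

    def __init__(self, experts, cache_size: int = 2, device: str = "cuda"):
        self.cpu_experts = [e.to("cpu") for e in experts]  # host master copies
        self.cache = OrderedDict()  # expert_id -> GPU-resident module
        self.cache_size = cache_size
        self.device = device

    def fetch(self, idx: int) -> torch.nn.Module:
        if idx in self.cache:
            self.cache.move_to_end(idx)  # mark as most recently used
            return self.cache[idx]
        if len(self.cache) >= self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used copy
        # Deep-copy so the CPU master stays resident after eviction
        gpu_expert = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
        self.cache[idx] = gpu_expert
        return gpu_expert

    def forward_tokens(self, tokens: torch.Tensor, routed_ids) -> torch.Tensor:
        # Run each token through its routed expert, loading experts on demand
        return torch.stack([self.fetch(i)(t) for t, i in zip(tokens, routed_ids)])
```

GPU memory then scales with cache_size rather than with the total expert count, which is the kind of lever the reported ~60% memory reduction relies on.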