Particle.news

Alibaba’s Aegaeon Slashes LLM Inference GPUs by 82% in Peer-Reviewed Tests

Peer review at SOSP lends the results credibility, though analysts note uncertainty about whether the approach applies beyond Alibaba’s integrated cloud.

Overview

  • Aegaeon pooled GPU resources at the token level during a multi‑month production beta, cutting the fleet for dozens of LLMs from 1,192 to 213 accelerators.
  • The system raised effective output up to ninefold over Alibaba’s prior serverless setup by scheduling tiny slices of work across shared GPUs.
  • Coverage cites Nvidia H20s as the test hardware, reflecting China’s constrained access to high‑end accelerators under U.S. export controls.
  • The research was presented at the 2025 ACM SOSP in Seoul by authors from Peking University and Alibaba, including CTO Jingren Zhou.
  • Analyses say results may hinge on Alibaba’s vertically integrated stack and network fabric, and some on Wall Street linked the reports to weakness in data‑center‑related stocks.
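The token-level pooling the coverage describes can be pictured as a scheduler that hands a shared accelerator one decode step at a time, rotating across requests for different models rather than pinning a GPU to one model until its request finishes. The sketch below is a toy illustration of that idea under stated assumptions; the class names, the round-robin policy, and the stubbed decode step are this example's inventions, not Aegaeon's actual design.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    model: str          # which LLM this request targets
    max_tokens: int     # tokens still to generate
    tokens: list = field(default_factory=list)

def decode_one_token(req: Request) -> str:
    # Stand-in for a single per-token decode step on the shared GPU.
    # A real system would swap in the right model's weights/KV cache here.
    return f"<{req.model}:tok{len(req.tokens)}>"

def token_level_schedule(requests: list[Request]) -> list[Request]:
    """Round-robin one decode step per request per turn, so one shared
    accelerator is multiplexed across many models at token granularity."""
    queue = deque(requests)
    while queue:
        req = queue.popleft()
        req.tokens.append(decode_one_token(req))
        if len(req.tokens) < req.max_tokens:
            queue.append(req)   # not done: rejoin the rotation
    return requests

# Two requests for different models interleave on the same "GPU"
done = token_level_schedule([Request("model-a", 3), Request("model-b", 2)])
```

Auto-scaling at this granularity is what lets a small pool serve many sporadically used models; the hard parts the paper addresses (fast model switching, memory management) are elided by the stub above.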