Overview
- The research reports an 82% cut in Nvidia H20 GPUs, from 1,192 to 213, needed to serve dozens of models of up to 72 billion parameters.
- Aegaeon was beta-tested for more than three months in Alibaba Cloud’s model marketplace before the results were presented at the Symposium on Operating Systems Principles (SOSP).
- The team measured heavily skewed demand: 17.7% of allocated GPUs handled only 1.35% of inference requests.
- The system pools GPU resources and autoscales at the token level; secondary reports state that one H20 can serve up to seven LLMs and that model-switching latency fell by 97% (see the sketch after this list).
- The paper lists researchers from Peking University and Alibaba Cloud as authors, including Alibaba Cloud CTO Zhou Jingren, while broader coverage links the work to China’s evolving AI chip landscape.
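
To make the token-level autoscaling idea concrete, here is a minimal, hypothetical sketch: a scheduler that keeps several models resident on one GPU and decides after every decoded token which model's request runs next, rather than dedicating a GPU to one model per request. All names (`GpuWorker`, `ensure_loaded`, and so on) are illustrative assumptions, not Aegaeon's actual interfaces; the seven-model capacity mirrors the figure reported in secondary coverage.

```python
# Illustrative sketch of token-level autoscaling: scheduling decisions are
# made at each decoding step (token boundary) rather than per request, so
# one GPU can be reassigned between models mid-generation. All names are
# hypothetical; this is not Aegaeon's actual API.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    model: str        # which LLM this request targets
    prompt: str
    tokens_left: int  # decode steps remaining

@dataclass
class GpuWorker:
    resident: set[str] = field(default_factory=set)  # models currently loaded
    capacity: int = 7  # ceiling reported in secondary coverage: up to seven LLMs per H20

    def ensure_loaded(self, model: str) -> None:
        """Swap a model in on demand, evicting another if at capacity."""
        if model in self.resident:
            return
        if len(self.resident) >= self.capacity:
            self.resident.pop()   # evict an arbitrary model (placeholder policy)
        self.resident.add(model)  # weight/KV-state loading elided

    def decode_one_token(self, req: Request) -> None:
        req.tokens_left -= 1      # run a single decoding step

def run(worker: GpuWorker, queue: deque[Request]) -> None:
    """Round-robin at token granularity: after every token, the scheduler
    may switch to a different model's request instead of finishing one
    request (or one model) at a time."""
    while queue:
        req = queue.popleft()
        worker.ensure_loaded(req.model)
        worker.decode_one_token(req)
        if req.tokens_left > 0:
            queue.append(req)     # requeue; another model may run next

requests = deque([
    Request("qwen-72b", "hello", tokens_left=3),
    Request("small-chat-7b", "hi", tokens_left=2),
])
run(GpuWorker(), requests)
```

In a real serving system, "loading" a model means moving weights and key-value cache state onto the GPU, which is costly; that is why the reported 97% reduction in model-switching latency is what makes scheduling at this fine a granularity practical.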