Overview
- The research reports an 82% cut in Nvidia H20 GPUs, from 1,192 to 213, needed to serve dozens of models of up to 72 billion parameters.
- Aegaeon was beta-tested for more than three months in Alibaba Cloud’s model marketplace before the results were presented at the Symposium on Operating Systems Principles (SOSP).
- The team measured heavily skewed demand: 17.7% of allocated GPUs handled only 1.35% of inference requests.
- The system pools GPU resources and autoscales at the token level; secondary reports state that one H20 can serve up to seven LLMs and that model-switching latency fell by 97% (see the sketch after this list).
- The paper lists researchers from Peking University and Alibaba Cloud as authors, including Alibaba Cloud CTO Zhou Jingren, while broader coverage links the work to China’s evolving AI chip landscape.
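
To make the token-level autoscaling idea concrete, here is a minimal, hypothetical sketch: a scheduler that keeps several models resident on one GPU and decides after every decoded token which model's request runs next, rather than dedicating a GPU to one model per request. All names (`GpuWorker`, `ensure_loaded`, and so on) are illustrative assumptions, not Aegaeon's actual interfaces; the seven-model capacity mirrors the figure reported in secondary coverage.

```python
# Illustrative sketch of token-level autoscaling: scheduling decisions are
# made at each decoding step (token boundary) rather than per request, so
# one GPU can be reassigned between models mid-generation. All names are
# hypothetical; this is not Aegaeon's actual API.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    model: str        # which LLM this request targets
    prompt: str
    tokens_left: int  # decode steps remaining

@dataclass
class GpuWorker:
    resident: set[str] = field(default_factory=set)  # models currently loaded
    capacity: int = 7  # ceiling reported in secondary coverage: up to seven LLMs per H20

    def ensure_loaded(self, model: str) -> None:
        """Swap a model in on demand, evicting another if at capacity."""
        if model in self.resident:
            return
        if len(self.resident) >= self.capacity:
            self.resident.pop()   # evict an arbitrary model (placeholder policy)
        self.resident.add(model)  # weight/KV-state loading elided

    def decode_one_token(self, req: Request) -> None:
        req.tokens_left -= 1      # run a single decoding step

def run(worker: GpuWorker, queue: deque[Request]) -> None:
    """Round-robin at token granularity: after every token, the scheduler
    may switch to a different model's request instead of finishing one
    request (or one model) at a time."""
    while queue:
        req = queue.popleft()
        worker.ensure_loaded(req.model)
        worker.decode_one_token(req)
        if req.tokens_left > 0:
            queue.append(req)     # requeue; another model may run next

requests = deque([
    Request("qwen-72b", "hello", tokens_left=3),
    Request("small-chat-7b", "hi", tokens_left=2),
])
run(GpuWorker(), requests)
```

In a real serving system, "loading" a model means moving weights and key-value cache state onto the GPU, which is costly; that is why the reported 97% reduction in model-switching latency is what makes scheduling at this fine a granularity practical.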