Particle.news

Nvidia Introduces Rubin CPX to Accelerate Million-Token AI by Splitting Inference

The new chip uses cost‑efficient GDDR7 memory to handle context processing alongside HBM-equipped GPUs in an NVL144 CPX rack, with first systems targeted for late 2026.

Overview

  • Nvidia’s Rubin CPX is a context/prefill accelerator designed for very long inputs, offloading that phase so HBM-equipped GPUs focus on generation/decoding.
  • Within the Vera Rubin NVL144 CPX platform, 144 CPX GPUs work with 144 Rubin GPUs and 36 Vera CPUs to deliver about 8 exaflops (NVFP4), 100 TB of fast memory, and 1.7 PB/s bandwidth in a single rack.
  • Each CPX device provides roughly 30 petaflops (NVFP4) and up to 128 GB of GDDR7, with integrated video encode/decode and a reported 3x attention speedup versus GB300-class systems.
  • Nvidia says CPX trays can ship with new racks or be added to existing Vera Rubin NVL144 deployments to scale long‑context inference without relying solely on costly HBM.
  • The company projects major gains and returns, citing up to 7.5x rack-level AI performance versus GB300 NVL72 and estimating $5 billion in token revenue per $100 million invested, though real‑world results remain to be proven.
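The prefill/decode split described above can be illustrated with a toy sketch: the compute-heavy prefill phase attends over the whole prompt once and produces a KV cache, which the bandwidth-heavy decode phase then rereads at every generated token. Everything here, including the function names and the stand-in "model", is an illustrative assumption, not Nvidia's actual software stack.

```python
# Toy sketch of disaggregated inference (assumption: simplified stand-in,
# not Nvidia's real serving code).

def prefill(prompt_tokens):
    """Compute-bound phase: process the full prompt once and return a
    KV cache (here, just a list standing in for key/value tensors).
    In the Rubin CPX design, this phase would run on the GDDR7 CPX GPU."""
    return list(prompt_tokens)

def decode(kv_cache, num_new_tokens):
    """Bandwidth-bound phase: generate tokens one at a time, rereading
    the whole cache at each step. This phase would stay on the
    HBM-equipped Rubin GPUs, which have the memory bandwidth it needs."""
    generated = []
    for _ in range(num_new_tokens):
        next_token = sum(kv_cache) % 100  # toy next-token rule
        generated.append(next_token)
        kv_cache.append(next_token)       # cache grows as tokens are emitted
    return generated

prompt = [3, 1, 4, 1, 5]
cache = prefill(prompt)   # would run on a CPX-style context accelerator
out = decode(cache, 3)    # would run on an HBM-equipped generation GPU
```

The point of the split is that the two phases stress different resources: prefill is dominated by compute over a long input, while decode is dominated by repeatedly streaming the growing cache, which is why the article's architecture pairs cheaper high-capacity GDDR7 for the former with HBM bandwidth for the latter.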