Particle.news

OpenAI Open-Sources MRC, a Networking Protocol to Steady Giant AI Training Runs

The system tackles slow last-mile links in synchronized training by spreading packets across many preset paths that can switch in microseconds.

Overview

  • OpenAI released the Multipath Reliable Connection protocol through the Open Compute Project, developed with NVIDIA, AMD, Intel, Broadcom and Microsoft.
  • MRC is a low-level networking protocol for training clusters that aims to keep jobs moving even when links fail, with recovery measured in microseconds and support for more than 100,000 GPUs.
  • The design flattens the network into two layers by splitting an 800 Gb/s port into smaller links that feed separate switches, which cuts switch count, power and single points of failure.
  • Traffic uses adaptive packet spraying that sends pieces of one transfer over hundreds of routes, with packet headers carrying the target memory address so out-of-order packets land in the right place.
  • Source routing with SRv6 puts the full path in the packet at the sender, which avoids slow control-plane convergence and replaces dynamic routing protocols like BGP.
  • OpenAI says MRC is already running on GB200 systems at OCI Abilene and Microsoft’s Fairwater.
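
To illustrate why a destination address in each packet header makes out-of-order delivery harmless, here is a minimal Python sketch of the packet-spraying idea described above. It is a toy model, not MRC's wire format or API: the function names (`spray`, `deliver`, `transfer`), the dictionary packet layout, the chunk size, and the random path choice standing in for "adaptive" selection are all assumptions made for illustration.

```python
import random


def spray(payload: bytes, chunk_size: int, num_paths: int, seed: int = 0):
    """Split a transfer into packets, each tagged with a path id and the
    byte offset where its data belongs in the receiver's buffer.
    (Hypothetical layout; random choice stands in for adaptive selection.)"""
    rng = random.Random(seed)
    packets = []
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        path = rng.randrange(num_paths)
        packets.append({"path": path, "offset": offset, "data": chunk})
    return packets


def deliver(packets, total_len: int, seed: int = 1) -> bytes:
    """Simulate packets arriving in arbitrary order and write each one
    directly at the offset carried in its header."""
    rng = random.Random(seed)
    arrived = list(packets)
    rng.shuffle(arrived)  # different paths, so arrival order is not send order
    buf = bytearray(total_len)
    for pkt in arrived:
        start = pkt["offset"]
        buf[start:start + len(pkt["data"])] = pkt["data"]
    return bytes(buf)


def transfer(payload: bytes, chunk_size: int = 4, num_paths: int = 8) -> bytes:
    """Spray a payload over several paths and reassemble it at the receiver."""
    return deliver(spray(payload, chunk_size, num_paths), len(payload))
```

Because every packet names its own destination offset, the receiver needs no reordering queue: `transfer(b"gradient shard 0042")` returns the original bytes however the packets are shuffled in transit.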