Overview
- OpenAI released the Multipath Reliable Connection protocol through the Open Compute Project, developed with NVIDIA, AMD, Intel, Broadcom and Microsoft.
- MRC is a low-level fabric for training clusters that aims to keep jobs moving even when links fail, with recovery measured in microseconds and support for 100,000-plus GPUs.
- The design flattens the network into two layers by splitting an 800 Gb/s port into smaller links that feed separate switches, which cuts switch count, power and single points of failure.
- Traffic uses adaptive packet spraying that sends pieces of one transfer over hundreds of routes, with packet headers carrying the target memory address so out-of-order packets land in the right place.
- Source routing with SRv6 puts the full path in the packet at the sender, avoiding slow control-plane convergence and replacing protocols like BGP, and OpenAI says it is already running on GB200 systems at OCI Abilene and Microsoft’s Fairwater.