Particle: NVIDIA Releases Nemotron‑Labs Diffusion With Three Decoding Modes in One Checkpoint

Overview

NVIDIA announced the Nemotron‑Labs Diffusion family on Saturday, May 23, 2026, and published open checkpoints (3B, 8B, and 14B text models plus an 8B vision‑language model), training recipes, and a technical report on Hugging Face and GitHub.
A single Nemotron checkpoint can run in three selectable modes: standard autoregressive decoding; block-wise diffusion, which drafts and refines a block of tokens at a time; and self‑speculation, which drafts with diffusion and then verifies the draft with autoregressive decoding.
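The draft-then-verify idea behind self‑speculation can be illustrated with a toy sketch. This is not NVIDIA's implementation: the "models" below are hypothetical stand-in functions over integers (`toy_ar_next`, `toy_draft_block`), verification is greedy exact-match, and the verifier is called once per position for readability, whereas a real system would score the whole draft in a single batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]        # autoregressive verifier (hypothetical)
DraftFn = Callable[[List[Token], int], List[Token]]  # block drafter (hypothetical)


def self_speculative_step(
    prefix: List[Token], draft_block: DraftFn, ar_next: NextTokenFn, block_size: int
) -> List[Token]:
    """Draft a block, then keep the longest prefix the verifier agrees with.

    On the first disagreement, the verifier's own token is emitted and the
    rest of the draft is discarded, so the output matches greedy AR decoding.
    """
    draft = draft_block(prefix, block_size)
    accepted: List[Token] = []
    for tok in draft:
        expected = ar_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)  # correct the first mismatch, then stop
            break
    return accepted


def generate(
    prefix: List[Token],
    draft_block: DraftFn,
    ar_next: NextTokenFn,
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Repeat draft-then-verify steps; return the tokens and the step count."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(self_speculative_step(out, draft_block, ar_next, block_size))
        steps += 1
    return out[: len(prefix) + max_new_tokens], steps


def toy_ar_next(context: List[Token]) -> Token:
    """Toy verifier: the 'correct' next token is always last + 1 (mod 100)."""
    return (context[-1] + 1) % 100


def toy_draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Toy drafter: counts up correctly, but always gets its third token wrong."""
    last = context[-1]
    block = []
    for i in range(block_size):
        tok = (last + i + 1) % 100
        if i == 2:
            tok = (tok + 7) % 100  # deliberate drafting error
        block.append(tok)
    return block


if __name__ == "__main__":
    tokens, steps = generate([5], toy_draft_block, toy_ar_next, block_size=4, max_new_tokens=6)
    print(tokens, steps)
```

In this toy setup each step yields three tokens (two accepted draft tokens plus one correction), so six new tokens take two steps instead of six one-token steps. The output is identical to what the verifier alone would produce, which is the property that lets a draft-then-verify scheme trade extra drafting work for fewer sequential steps.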
NVIDIA reports large throughput gains from the new modes in its own benchmarks: about 2.6× tokens per forward pass for diffusion and up to roughly 6× to 6.4× for self‑speculation, with accuracy it describes as comparable to or slightly better than prior 8B models.
The release emphasizes practical deployment: the models ship under permissive licenses, and inference support is being added to SGLang, where it is currently usable through an open GitHub issue/PR but has not yet been merged to main.
The models were built by continuing autoregressive pretraining with a joint AR+diffusion objective, an approach meant to preserve AR behavior and KV‑cache compatibility. Independent benchmarking is still needed to validate the vendor‑reported speed and accuracy across real workloads and hardware.