Overview
- NVIDIA announced the Nemotron‑Labs Diffusion family on Saturday, May 23, 2026, and published open checkpoints (3B, 8B, 14B text models and an 8B vision‑language model), training recipes, and a technical report on Hugging Face and GitHub.
- A single Nemotron checkpoint can run in three selectable modes: standard autoregressive (AR) decoding; block-wise diffusion, which drafts a block of tokens and then refines it; or self‑speculation, which drafts with diffusion and then verifies the draft with autoregressive decoding.
- NVIDIA reports large throughput gains from the new modes in its own benchmarks: about 2.6× the tokens per forward pass for diffusion decoding and roughly 6× to 6.4× for self‑speculation. It also claims comparable or slightly improved accuracy versus prior 8B models.
- Practical deployment is emphasized: the models are released under permissive licenses, and inference support is being added to SGLang. Working code is available through an open GitHub issue/PR but has not yet been merged to main.
- The models were built by continuing autoregressive pretraining under a joint AR+diffusion objective, which is intended to preserve AR behavior and KV‑cache compatibility.
- Independent benchmarking is still needed to validate the vendor‑reported speed and accuracy across real workloads and hardware.
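
The draft-then-verify idea behind the self‑speculation mode can be illustrated with a minimal toy sketch. This is not NVIDIA's implementation and does not use the Nemotron or SGLang APIs. The functions `ar_next`, `draft_block`, `verify`, and `self_speculative_decode` are hypothetical stand-ins, and the acceptance rule shown is the generic greedy one used in speculative decoding: keep the longest draft prefix that matches what greedy AR decoding would have produced, then take the AR token at the first mismatch.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def ar_next(prefix: List[int]) -> int:
    """Toy stand-in for one greedy autoregressive step."""
    return (7 * sum(prefix) + 3 * len(prefix) + 1) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy stand-in for a diffusion draft of k tokens.

    It agrees with ar_next except at every fourth absolute position,
    where an error is injected so that verification has work to do.
    """
    ctx = list(prefix)
    block = []
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # deliberately wrong draft token
        block.append(tok)
        ctx.append(tok)
    return block


def verify(prefix: List[int], block: List[int]) -> List[int]:
    """Accept the longest draft prefix that matches greedy AR output.

    At the first mismatch, the AR token replaces the draft token and the
    rest of the block is discarded. If every draft token matches, one
    extra AR token is appended. A real model would score all positions
    in a single forward pass; this toy loops over them instead.
    """
    ctx = list(prefix)
    accepted = []
    for tok in block:
        target = ar_next(ctx)
        if tok != target:
            accepted.append(target)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(ar_next(ctx))
    return accepted


def self_speculative_decode(
    prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Generate n_new tokens; return the sequence and the round count."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        block = draft_block(out, k)
        out.extend(verify(out, block))
        rounds += 1
    return out[: len(prompt) + n_new], rounds


def ar_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy AR decoding, one token per step, for comparison."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(ar_next(out))
    return out


if __name__ == "__main__":
    tokens, rounds = self_speculative_decode([1, 2, 3], n_new=20)
    print("matches plain AR:", tokens == ar_decode([1, 2, 3], 20))
    print("draft/verify rounds for 20 tokens:", rounds)
```

In this toy, the output is identical to plain greedy AR decoding while using fewer draft/verify rounds than generated tokens. Accepting several tokens per round is the general mechanism behind speculative speedups. The actual gains depend on how often the real model's drafts are accepted, which this sketch does not measure.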