Particle.news

NVIDIA Releases Nemotron‑Labs Diffusion With Three Decoding Modes in One Checkpoint

The open-source models let developers switch among autoregressive, diffusion, and self‑speculation modes to speed up single-query and low-batch inference.

Overview

  • NVIDIA announced the Nemotron‑Labs Diffusion family on Saturday, May 23, 2026, and published open checkpoints (3B, 8B, 14B text models and an 8B vision‑language model), training recipes, and a technical report on Hugging Face and GitHub.
  • A single Nemotron checkpoint can run in three selectable modes: standard autoregressive decoding; block-wise diffusion drafting and refinement; or self‑speculation, which drafts with diffusion and then verifies the draft with autoregressive decoding.
  • NVIDIA reports large throughput gains from the new modes in its own benchmarks (diffusion at about 2.6× tokens per forward pass and self‑speculation at up to roughly 6×–6.4×), while claiming comparable or slightly improved accuracy versus prior 8B models.
  • Practical deployment is emphasized: the models are released under permissive licenses, and SGLang inference support is usable through an open GitHub issue/PR that has not yet been merged to main.
  • The models were built by continuing autoregressive pretraining with a joint AR+diffusion objective, which is meant to preserve AR behavior and KV‑cache compatibility. Independent benchmarking is still needed to validate the vendor‑reported speed and accuracy across real workloads and hardware.
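
The draft-then-verify idea behind the self‑speculation mode can be illustrated with a toy sketch. This is not NVIDIA's implementation and does not use any Nemotron or SGLang API: the function names (`self_speculate`, `toy_draft_block`, `toy_ar_next`) and the integer "tokens" are hypothetical stand-ins, and greedy verification is assumed. A drafter proposes a block of tokens, an autoregressive check accepts the longest matching prefix, and the first mismatch is replaced with the autoregressive token, so the output matches plain greedy autoregressive decoding.

```python
from typing import Callable, List, Tuple

Token = int

def self_speculate(
    prefix: List[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    ar_next: Callable[[List[Token]], Token],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop.

    Returns the generated tokens and the number of draft/verify rounds.
    A real system would verify a whole block in one batched forward pass;
    here ar_next is called per token purely for readability.
    """
    out = list(prefix)
    target_len = len(prefix) + max_new_tokens
    rounds = 0
    while len(out) < target_len:
        rounds += 1
        draft = draft_block(out, block_size)
        accepted: List[Token] = []
        for tok in draft:
            expected = ar_next(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # first mismatch: keep the AR token
                break                      # discard the rest of the draft
        out.extend(accepted)
    return out[len(prefix):target_len], rounds

# Hypothetical toy "models": the AR model counts upward modulo 10.
def toy_ar_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10

# The toy drafter also counts, but deliberately emits 0 where 7 is correct.
def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    block: List[Token] = []
    last = ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 7:
            nxt = 0  # injected draft error
        block.append(nxt)
        last = nxt
    return block

tokens, rounds = self_speculate(
    [0], toy_draft_block, toy_ar_next, block_size=4, max_new_tokens=8
)
print(tokens, rounds)
```

In this toy run, eight tokens are produced in three draft/verify rounds rather than eight sequential autoregressive steps, and the drafter's injected error is corrected during verification. Actual speedups depend on how often drafts are accepted, which is why the reported figures are worth checking on real workloads.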