Particle.news

Google DeepMind’s DiffusionGemma Brings 4x Faster Text Generation

The experimental open-weight model trades some output quality for much faster on-device inference, but practical local use depends on a specialized drafter runtime.

Overview

  • DiffusionGemma, which Google DeepMind released June 10, 2026, is a 26-billion-parameter mixture-of-experts model that activates 3.8 billion parameters at inference and is available under an Apache 2.0 license on Hugging Face.
  • The model uses diffusion-style block generation to draft and iteratively refine up to 256 tokens in parallel, a design that shifts the bottleneck from memory bandwidth to compute and yields vendor-reported speedups of up to about 4x.
  • Google and NVIDIA published day-one benchmarks showing throughput figures such as 1,000+ tokens per second on an H100 and 700+ tokens per second on an RTX 5090. NVIDIA is also providing platform playbooks and optimizations across RTX and DGX systems.
  • DiffusionGemma is explicitly experimental and trades raw output quality for speed, with Google recommending standard Gemma 4 for applications that require the highest quality.
  • Practical local deployment is limited today because the model relies on a specialized drafter/speculative-decoding component that many common public runtimes do not yet support. Community toolchain work and independent benchmarking will determine real-world adoption and performance.
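
To make the block-generation point above concrete, here is a minimal back-of-the-envelope sketch in Python that counts model forward passes for token-by-token decoding versus drafting and refining blocks of up to 256 tokens in parallel. The function names and the refine_steps value are hypothetical and chosen only for illustration; the number of refinement steps DiffusionGemma actually uses is not stated here.

```python
import math


def forward_passes_autoregressive(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_tokens


def forward_passes_block_diffusion(
    num_tokens: int,
    block_size: int = 256,   # block size cited in the overview
    refine_steps: int = 16,  # ASSUMPTION: illustrative value, not a published figure
) -> int:
    """Block generation: each block is drafted and refined in parallel,
    costing refine_steps forward passes per block regardless of block length."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refine_steps


if __name__ == "__main__":
    n = 1024
    ar = forward_passes_autoregressive(n)
    bd = forward_passes_block_diffusion(n)
    print(f"autoregressive passes: {ar}")
    print(f"block-diffusion passes: {bd}")
    print(f"pass-count ratio: {ar / bd:.1f}")
```

The pass-count ratio is not a wall-clock speedup. Each block pass processes many token positions at once and so does more computation than a single-token pass, which is consistent with the bottleneck moving from memory bandwidth to compute.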