Overview
- Diffusion LLMs (dLLMs) differ from autoregressive models in that they denoise masked tokens under bidirectional attention, so each token's context changes from one denoising step to the next, which blocks standard speculative decoding.
- SimSD restores token-level speculative verification by injecting reference tokens and a custom attention mask so that a diffusion model can verify draft tokens in one pass; it reports up to 7.46x higher decoding throughput on benchmark dLLMs.
- dLLM-Cache reuses stable intermediate results across denoising iterations with an adaptive, training-free cache; it reports up to a 9.1x reduction in FLOPs and brings latency close to that of autoregressive models on tested workloads.
- FLARE demonstrates a hybrid-attention conversion that lets a single checkpoint support both autoregressive verification and diffusion denoising, but it finds that the quality of the transfer data strongly determines how much model capability is preserved.
- All three approaches are experimental and largely training-free, so their real-world impact will hinge on replication across models, integration with serving stacks and hardware, and fresh work on transfer data and training objectives.
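
To make the phrase "token-level speculative verification" concrete, here is a minimal sketch of the generic greedy acceptance rule used in speculative decoding: keep the draft tokens up to the first disagreement with the verifier, then take the verifier's own token at that position. This is an illustration only, not SimSD's implementation. The function name `verify_draft` is hypothetical, and the sketch assumes the verifier's per-position predictions have already been obtained in a single forward pass; the reference-token injection and custom attention mask that would make this possible in a diffusion model are not modeled.

```python
from typing import List, Sequence


def verify_draft(draft: Sequence[int], verifier_preds: Sequence[int]) -> List[int]:
    """Generic greedy speculative verification (hypothetical helper).

    draft          -- token ids proposed by the draft model.
    verifier_preds -- the verifier's greedy prediction at each draft position,
                      plus one extra prediction for the position after the
                      draft (assumed to come from a single forward pass).
    Returns the accepted draft prefix followed by one verifier token.
    """
    if len(verifier_preds) != len(draft) + 1:
        raise ValueError("verifier_preds must have len(draft) + 1 entries")

    accepted: List[int] = []
    for position, token in enumerate(draft):
        if token != verifier_preds[position]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(token)

    # Append the verifier's token at the first unverified position. This is
    # a correction after a mismatch, or a bonus token if all drafts matched.
    accepted.append(verifier_preds[len(accepted)])
    return accepted


if __name__ == "__main__":
    # The third draft token (9) disagrees with the verifier (4).
    print(verify_draft([5, 7, 9, 2], [5, 7, 4, 2, 8]))  # [5, 7, 4]
```

Because every call yields at least one verifier token, decoding always makes progress even when the whole draft is rejected; the speedup depends on how often long draft prefixes are accepted.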