Overview
- Two arXiv preprints published Tuesday propose complementary techniques for bringing autoregressive-style token verification to diffusion LLMs (dLLMs) and hybrid-attention models, with the goal of lower-latency text generation.
- SimSD is a training-free, plug-in speculative-decoding method that inserts reference tokens and a carefully designed attention mask so a diffusion model can verify multiple drafted tokens in a single forward pass.
- The SimSD paper reports up to 7.46x higher decoding throughput on SDAR-family dLLMs while preserving or improving average generation quality, and states that the method is compatible with KV caches and blockwise decoding.
- FLARE is a conversion and inference framework that combines three pieces: a token-equal AR-and-diffusion training objective for hybrid-attention checkpoints, hardware-aware kernels, and unified serving. Its authors find that the quality of the transfer data is the main factor in retaining model capabilities after conversion.
- Both results are early-stage preprints benchmarked on SDAR-family models, so broader replication and real-world deployment tests are still needed before these techniques can be judged ready for production use.
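
For readers unfamiliar with speculative decoding, the single-pass verification idea mentioned in the SimSD bullet can be illustrated with a minimal sketch of the generic greedy accept-or-correct step. This is not SimSD's algorithm: the reference tokens and attention mask described in the paper are not modeled here, and the names `verify_draft`, `predict_all`, and `toy_predict_all` are hypothetical. The sketch assumes a verifier that, in one call, returns its greedy next-token prediction at every position of the input.

```python
from typing import Callable, List, Sequence

# Hypothetical verifier interface (an assumption, not SimSD's API):
# given a token sequence, return for each position i the verifier's
# greedy prediction for the token that should follow tokens[: i + 1].
PredictAll = Callable[[List[int]], List[int]]


def verify_draft(prefix: Sequence[int], draft: Sequence[int],
                 predict_all: PredictAll) -> List[int]:
    """Accept the longest draft prefix the verifier agrees with.

    Returns the accepted draft tokens followed by one verifier token:
    a correction at the first mismatch, or a bonus token if every
    drafted token was accepted.
    """
    if not prefix:
        raise ValueError("prefix must contain at least one token")

    # One call stands in for a single forward pass over prefix + draft.
    preds = predict_all(list(prefix) + list(draft))

    # preds[start] is the verifier's prediction for the first draft slot.
    start = len(prefix) - 1
    accepted: List[int] = []
    for i, token in enumerate(draft):
        expected = preds[start + i]
        if token != expected:
            # First disagreement: keep the verifier's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every drafted token matched, so the pass also yields one extra token.
    accepted.append(preds[start + len(draft)])
    return accepted


def toy_predict_all(tokens: List[int]) -> List[int]:
    """Toy verifier for illustration: the next token is (token + 1) mod 10."""
    return [(t + 1) % 10 for t in tokens]


if __name__ == "__main__":
    # The third drafted token (9) disagrees with the toy verifier (6).
    print(verify_draft([1, 2, 3], [4, 5, 9, 7], toy_predict_all))
```

In this sketch each verification call yields at least one new token, and up to `len(draft) + 1` when the whole draft is accepted, which is the general source of the throughput gains that speculative decoding aims for.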