Overview
- The paper identifies a U-shaped sparsity-allocation law: validation loss is lowest when roughly 20–25% of the sparse-parameter budget is assigned to Engram.
- Under matched parameter and FLOPs budgets, Engram-27B outperforms a MoE-27B baseline on knowledge, reasoning, code, and math benchmarks (e.g., MMLU +3.0, BBH +5.0, HumanEval +3.0, MATH +2.4).
- The approach improves long-context and variable-tracking performance in the authors' tests, including RULER Multi-Query NIAH rising from 84.2 to 97.0 and Variable Tracking from 77.0 to 89.0.
- Because retrieval is deterministic, the very large embedding tables can be offloaded to CPU memory and prefetched ahead of use; reported inference-throughput overhead is around 3% on H800-class GPUs (see the sketch after this list).
- Media reports suggest DeepSeek may integrate Engram into an upcoming V4 sparse model before the Lunar New Year, though this has not been independently confirmed.
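Below is a minimal PyTorch sketch of the offload-and-prefetch pattern described above. The `OffloadedEngramTable` class, the n-gram hash used as the deterministic key function, and the side CUDA stream are illustrative assumptions, not the paper's implementation; the actual keying scheme, table sizes, and transfer machinery may differ.

```python
import torch


class OffloadedEngramTable:
    """CPU-resident embedding table with asynchronous GPU prefetch (illustrative).

    Assumption (not taken from the paper's code): rows are indexed by a
    deterministic hash of adjacent token pairs, so the indices are known from
    the input IDs alone and the rows can be fetched before the layer that
    consumes them.
    """

    def __init__(self, num_rows: int, dim: int, device: str = "cuda"):
        # The (very large) table lives in pinned host memory; sized small here
        # purely for illustration.
        self.table = torch.empty(num_rows, dim).pin_memory()
        self.num_rows = num_rows
        self.device = device
        self.prefetch_stream = torch.cuda.Stream()  # side stream for H2D copies

    def keys_from_tokens(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Hypothetical deterministic key: hash each token together with its
        # predecessor into the table. No learned router is involved.
        prev = torch.roll(token_ids, shifts=1, dims=-1)
        return (token_ids * 1_000_003 + prev) % self.num_rows

    def prefetch(self, keys: torch.Tensor) -> torch.Tensor:
        # Gather the needed rows into a pinned staging buffer, then start a
        # non-blocking copy to the GPU on the side stream so the transfer
        # overlaps with earlier layers of the forward pass.
        staging = self.table[keys].pin_memory()
        with torch.cuda.stream(self.prefetch_stream):
            return staging.to(self.device, non_blocking=True)

    def wait(self) -> None:
        # The consuming layer waits for the prefetch stream before reading.
        torch.cuda.current_stream().wait_stream(self.prefetch_stream)


if __name__ == "__main__" and torch.cuda.is_available():
    table = OffloadedEngramTable(num_rows=1 << 20, dim=128)
    token_ids = torch.randint(0, 32_000, (4, 1024))
    keys = table.keys_from_tokens(token_ids)  # known before the forward pass
    rows = table.prefetch(keys)               # issued early, copy runs in background
    table.wait()                              # copy has finished by consumption time
    print(rows.shape)                         # torch.Size([4, 1024, 128])
```

Because the keys depend only on the input token IDs, the gather and host-to-device copy can be issued before the consuming layer runs, which is what would keep the reported overhead small.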