Apple’s MLX Taps M5 Neural Accelerators for Up to 4x Faster Local LLMs

Apple says macOS 26.2 lets MLX use the M5 GPU’s Neural Accelerators, exposed through Metal’s TensorOps, for on‑device inference gains.

Overview

  • Apple’s Machine Learning Research post reports time to first token up to roughly 4x faster on M5 than on M4 when running LLMs with MLX.
  • In Apple’s benchmarks, first‑token latency drops to under 10 seconds for a dense 14B model and under 3 seconds for a 30B mixture‑of‑experts model (a timing sketch follows this list).
  • Throughput for subsequent tokens rises by about 19–27%, which Apple attributes to the M5’s higher memory bandwidth of 153 GB/s versus 120 GB/s on M4 (a back‑of‑the‑envelope check follows the list).
  • Image generation with FLUX‑dev‑4bit at 1024×1024 runs more than 3.8x faster on M5 in Apple’s tests.
  • To realize these gains, users need an M5 Mac running macOS 26.2 or later. MLX’s quantization support and Apple silicon’s unified memory let models such as an 8B model in BF16 or a 30B MoE at 4‑bit fit on a 24 GB MacBook Pro (rough weight‑size arithmetic below).
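
For readers who want to reproduce the first‑token measurement, here is a minimal sketch using the open‑source mlx-lm package rather than Apple’s benchmark harness; the checkpoint name is an assumption, and any MLX‑format model can be substituted.

```python
# Minimal time-to-first-token timing sketch with mlx-lm (pip install mlx-lm).
# Assumption: the checkpoint below is hypothetical -- substitute any
# MLX-format model available locally or on the Hugging Face Hub.
import time

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Qwen3-14B-4bit")  # hypothetical checkpoint

prompt = "Summarize the benefits of unified memory on Apple silicon."

start = time.perf_counter()
ttft = None
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=128):
    if ttft is None:
        # The first yielded token marks the end of prompt processing (prefill),
        # the compute-bound phase that the Neural Accelerators speed up.
        ttft = time.perf_counter() - start
print(f"time to first token: {ttft:.2f} s")
```

Running the same script on an M4 and an M5 machine gives a direct A/B comparison of the prefill speedup Apple describes.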
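On the 19–27% figure: token‑by‑token decoding is typically memory‑bandwidth bound, because generating each new token requires streaming the model weights from memory, so the bandwidth ratio sets a rough ceiling on the per‑token speedup:

```latex
\frac{B_{\text{M5}}}{B_{\text{M4}}} = \frac{153\ \text{GB/s}}{120\ \text{GB/s}} \approx 1.28
```

That puts the observed gains essentially at the bandwidth‑implied limit of about 28%.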
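And a rough weight‑size check behind the 24 GB claim (weights only; the KV cache and activations need extra headroom):

```latex
8 \times 10^{9}\ \text{params} \times 2\ \text{bytes} \approx 16\ \text{GB (BF16)}, \qquad
30 \times 10^{9}\ \text{params} \times 0.5\ \text{bytes} \approx 15\ \text{GB (4-bit)}
```

Both fit on a 24 GB machine because unified memory lets the GPU address the full RAM pool rather than a separate VRAM carve‑out.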