Overview
- The on-device foundation model, at roughly 3 billion parameters, is split into two blocks: Block 1 holds 62.5 percent of the transformer layers, while Block 2 holds the remaining 37.5 percent but omits key and value projections, cutting KV-cache memory use and time-to-first-token by about 37.5 percent (a KV-sharing sketch follows this list).
- Apple’s server-side model uses a custom Parallel-Track Mixture-of-Experts (PT-MoE) architecture on its Private Cloud Compute platform, processing tokens across parallel transformer tracks and activating only the relevant expert subnetworks for each token to improve scalability and throughput (see the routing sketch after this list).
- Training data comes from filtered web crawls collected by Applebot, licensed publisher content, synthetic data for fine-tuning and vision-language tasks, and over 10 billion image-caption pairs refined with Apple’s own models (an illustrative mixture-sampling sketch appears below).
- Apple’s privacy-first approach ensures that Applebot honors robots.txt exclusions during data collection and that all inference for on-device features runs locally, safeguarding user information (a robots.txt check is sketched below).
- Multilingual coverage rose from 8 percent to 30 percent of the training data, and the tokenizer vocabulary grew by 50 percent to 150,000 tokens, yielding notable gains on non-English benchmarks (a toy vocabulary sketch closes the section).
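
The two-block split in the first bullet can be illustrated with a minimal NumPy sketch: Block 1 layers compute and cache keys and values, while Block 2 layers carry no K/V projection weights and instead attend over an existing cache. This is not Apple’s implementation; the toy sizes and the exact sharing map (Block 2 reusing the last Block 1 layer’s cache) are assumptions for illustration.

```python
import numpy as np

D, N_LAYERS, SPLIT = 64, 8, 0.625            # toy sizes; the split mirrors the 62.5/37.5 ratio
BLOCK1_LAYERS = int(N_LAYERS * SPLIT)        # layers that own their K/V projections
rng = np.random.default_rng(0)

# Per-layer weights. Block 2 layers carry no W_k / W_v at all, which is where
# the KV-cache and parameter savings come from.
layers = []
for i in range(N_LAYERS):
    layer = {"W_q": rng.normal(size=(D, D)) * 0.05, "W_o": rng.normal(size=(D, D)) * 0.05}
    if i < BLOCK1_LAYERS:
        layer["W_k"] = rng.normal(size=(D, D)) * 0.05
        layer["W_v"] = rng.normal(size=(D, D)) * 0.05
    layers.append(layer)

def attend(q, ks, vs):
    """Single-head scaled dot-product attention over the cached sequence."""
    scores = q @ ks.T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vs

def decode_step(h, kv_cache):
    """One decoding step; Block 2 layers attend with the KV cache of the last
    Block 1 layer (the sharing map is an assumption of this sketch)."""
    for i, layer in enumerate(layers):
        if i < BLOCK1_LAYERS:                        # Block 1: project and cache K/V
            kv_cache[i]["k"].append(h @ layer["W_k"])
            kv_cache[i]["v"].append(h @ layer["W_v"])
            cache = kv_cache[i]
        else:                                        # Block 2: no K/V projections
            cache = kv_cache[BLOCK1_LAYERS - 1]
        q = h @ layer["W_q"]
        h = h + attend(q, np.stack(cache["k"]), np.stack(cache["v"])) @ layer["W_o"]
    return h

kv_cache = {i: {"k": [], "v": []} for i in range(BLOCK1_LAYERS)}
h = rng.normal(size=(D,))
for _ in range(4):                                   # decode a few toy tokens
    h = decode_step(h, kv_cache)
print(f"{len(kv_cache)} cached layers for {N_LAYERS} total layers")
```

Only five of the eight toy layers ever allocate cache entries, which is the same 37.5 percent saving the bullet describes, just at miniature scale.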
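
For the server-side bullet, the sketch below shows the two ideas in miniature: sparse expert activation (each token is routed to only its top-scoring experts) and parallel tracks that process tokens independently and merge only at a block boundary. The track count, expert count, and the averaging used to merge tracks are assumptions of this sketch, not the published PT-MoE design.

```python
import numpy as np

D, N_EXPERTS, TOP_K, N_TRACKS = 32, 8, 2, 2   # toy sizes, not the production config
rng = np.random.default_rng(1)

def make_track():
    """One toy track: a router plus a bank of small expert FFNs."""
    return {
        "router": rng.normal(size=(D, N_EXPERTS)) * 0.02,
        "experts": [(rng.normal(size=(D, 4 * D)) * 0.02,
                     rng.normal(size=(4 * D, D)) * 0.02) for _ in range(N_EXPERTS)],
    }

tracks = [make_track() for _ in range(N_TRACKS)]

def moe_layer(track, x):
    """Token-choice routing: each token activates only its TOP_K highest-scoring experts."""
    logits = x @ track["router"]                       # (tokens, experts)
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    chosen = np.argsort(logits, axis=-1)[:, -TOP_K:]   # sparse expert selection
    out = np.zeros_like(x)
    for t in range(len(x)):
        for e in chosen[t]:
            w1, w2 = track["experts"][e]
            out[t] += gates[t, e] * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return x + out                                     # residual connection

def pt_moe_block(x):
    """Each track processes the tokens independently; outputs are merged only at
    the block boundary (averaging here is an assumed synchronization rule)."""
    return np.mean([moe_layer(tr, x) for tr in tracks], axis=0)

tokens = rng.normal(size=(8, D))
print(pt_moe_block(tokens).shape)   # (8, 32)
```

The point of the structure is that only TOP_K of the N_EXPERTS feed-forward networks run per token, and the tracks need no per-layer synchronization, which is where the scalability and throughput gains come from.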
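
The training-data bullet amounts to a weighted mixture over several sources. The sketch below shows one simple way such a mixture can be expressed and sampled; the weights are invented for illustration and are not Apple’s published ratios.

```python
import random

# Illustrative mixture only: the source names mirror the bullet above, but the
# weights are made up for this sketch.
DATA_MIXTURE = {
    "applebot_filtered_web": 0.6,
    "licensed_publisher_content": 0.2,
    "synthetic_fine_tuning_and_vision_language": 0.1,
    "image_caption_pairs": 0.1,
}

def sample_source(rng=random):
    """Pick the source of the next training document according to the mixture weights."""
    sources, weights = zip(*DATA_MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

print(sample_source())
```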
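
Honoring robots.txt is a standard crawler behavior, and Python’s standard library can express the check directly. The snippet below is a generic sketch of that mechanism, not Apple’s crawler code; the URLs are placeholders, and "Applebot" is simply the user-agent string publishers use to opt out.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt, then ask whether the Applebot
# user agent is allowed to crawl a given page.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("Applebot", "https://example.com/some/article"):
    print("allowed: page may be crawled")
else:
    print("excluded: publisher opted out, page is skipped")
```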
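
Finally, the effect of the vocabulary growth in the last bullet (100,000 to 150,000 entries, as the 50 percent expansion implies) can be seen with a toy byte-level BPE experiment: a larger vocabulary trained on more multilingual text splits non-English sentences into fewer tokens. The tiny corpus, the scaled-down vocabulary sizes, and the use of the Hugging Face `tokenizers` library are all stand-ins for illustration, not Apple’s tokenizer pipeline.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Write a tiny bilingual corpus so the example is self-contained.
CORPUS = "toy_corpus.txt"
with open(CORPUS, "w", encoding="utf-8") as f:
    f.write("Guten Morgen, wie geht es dir heute?\n" * 200)
    f.write("Good morning, how are you today?\n" * 200)

def train_bpe(vocab_size):
    """Train a byte-level BPE tokenizer with the given vocabulary budget."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    tok.train([CORPUS], trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"]))
    return tok

small, large = train_bpe(100), train_bpe(150)   # toy stand-ins for the 100k -> 150k expansion
sample = "Guten Morgen, wie geht es dir?"
print(len(small.encode(sample).tokens), "tokens vs", len(large.encode(sample).tokens), "tokens")
```

Fewer tokens per non-English sentence means shorter sequences and better-aligned subwords, which is the mechanism behind the benchmark gains the bullet reports.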