Overview
- The on-device foundation model, at roughly 3 billion parameters, is split into two blocks: Block 1 holds 62.5 percent of the transformer layers, while Block 2 holds the remaining 37.5 percent but omits key and value projections, cutting KV-cache memory use and time-to-first-token by about 37.5 percent (a KV-sharing sketch follows this list).
- Apple’s server-side model uses a custom Parallel-Track Mixture-of-Experts (PT-MoE) architecture on its Private Cloud Compute platform, processing tokens across parallel transformer tracks and activating only the relevant expert subnetworks for each token to improve scalability and throughput (see the routing sketch after this list).
- Training data comes from filtered web crawls collected by Applebot, licensed publisher content, synthetic data for fine-tuning and vision-language tasks, and over 10 billion image-caption pairs refined with Apple’s own models (an illustrative mixture-sampling sketch appears below).
- Apple’s privacy-first approach ensures that Applebot honors robots.txt exclusions during data collection and that all inference for on-device features runs locally, safeguarding user information (a robots.txt check is sketched below).
- Multilingual coverage rose from 8 percent to 30 percent of the training data, and the tokenizer vocabulary grew by 50 percent to 150,000 tokens, yielding notable gains on non-English benchmarks (a toy vocabulary sketch closes the section).
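
The two-block split in the first bullet can be illustrated with a minimal NumPy sketch: Block 1 layers compute and cache keys and values, while Block 2 layers carry no K/V projection weights and instead attend over an existing cache. This is not Apple’s implementation; the toy sizes and the exact sharing map (Block 2 reusing the last Block 1 layer’s cache) are assumptions for illustration.

```python
import numpy as np

D, N_LAYERS, SPLIT = 64, 8, 0.625            # toy sizes; the split mirrors the 62.5/37.5 ratio
BLOCK1_LAYERS = int(N_LAYERS * SPLIT)        # layers that own their K/V projections
rng = np.random.default_rng(0)

# Per-layer weights. Block 2 layers carry no W_k / W_v at all, which is where
# the KV-cache and parameter savings come from.
layers = []
for i in range(N_LAYERS):
    layer = {"W_q": rng.normal(size=(D, D)) * 0.05, "W_o": rng.normal(size=(D, D)) * 0.05}
    if i < BLOCK1_LAYERS:
        layer["W_k"] = rng.normal(size=(D, D)) * 0.05
        layer["W_v"] = rng.normal(size=(D, D)) * 0.05
    layers.append(layer)

def attend(q, ks, vs):
    """Single-head scaled dot-product attention over the cached sequence."""
    scores = q @ ks.T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vs

def decode_step(h, kv_cache):
    """One decoding step; Block 2 layers attend with the KV cache of the last
    Block 1 layer (the sharing map is an assumption of this sketch)."""
    for i, layer in enumerate(layers):
        if i < BLOCK1_LAYERS:                        # Block 1: project and cache K/V
            kv_cache[i]["k"].append(h @ layer["W_k"])
            kv_cache[i]["v"].append(h @ layer["W_v"])
            cache = kv_cache[i]
        else:                                        # Block 2: no K/V projections
            cache = kv_cache[BLOCK1_LAYERS - 1]
        q = h @ layer["W_q"]
        h = h + attend(q, np.stack(cache["k"]), np.stack(cache["v"])) @ layer["W_o"]
    return h

kv_cache = {i: {"k": [], "v": []} for i in range(BLOCK1_LAYERS)}
h = rng.normal(size=(D,))
for _ in range(4):                                   # decode a few toy tokens
    h = decode_step(h, kv_cache)
print(f"{len(kv_cache)} cached layers for {N_LAYERS} total layers")
```

Only five of the eight toy layers ever allocate cache entries, which is the same 37.5 percent saving the bullet describes, just at miniature scale.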
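
For the server-side bullet, the sketch below shows the two ideas in miniature: sparse expert activation (each token is routed to only its top-scoring experts) and parallel tracks that process tokens independently and merge only at a block boundary. The track count, expert count, and the averaging used to merge tracks are assumptions of this sketch, not the published PT-MoE design.

```python
import numpy as np

D, N_EXPERTS, TOP_K, N_TRACKS = 32, 8, 2, 2   # toy sizes, not the production config
rng = np.random.default_rng(1)

def make_track():
    """One toy track: a router plus a bank of small expert FFNs."""
    return {
        "router": rng.normal(size=(D, N_EXPERTS)) * 0.02,
        "experts": [(rng.normal(size=(D, 4 * D)) * 0.02,
                     rng.normal(size=(4 * D, D)) * 0.02) for _ in range(N_EXPERTS)],
    }

tracks = [make_track() for _ in range(N_TRACKS)]

def moe_layer(track, x):
    """Token-choice routing: each token activates only its TOP_K highest-scoring experts."""
    logits = x @ track["router"]                       # (tokens, experts)
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    chosen = np.argsort(logits, axis=-1)[:, -TOP_K:]   # sparse expert selection
    out = np.zeros_like(x)
    for t in range(len(x)):
        for e in chosen[t]:
            w1, w2 = track["experts"][e]
            out[t] += gates[t, e] * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return x + out                                     # residual connection

def pt_moe_block(x):
    """Each track processes the tokens independently; outputs are merged only at
    the block boundary (averaging here is an assumed synchronization rule)."""
    return np.mean([moe_layer(tr, x) for tr in tracks], axis=0)

tokens = rng.normal(size=(8, D))
print(pt_moe_block(tokens).shape)   # (8, 32)
```

The point of the structure is that only TOP_K of the N_EXPERTS feed-forward networks run per token, and the tracks need no per-layer synchronization, which is where the scalability and throughput gains come from.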
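
The training-data bullet amounts to a weighted mixture over several sources. The sketch below shows one simple way such a mixture can be expressed and sampled; the weights are invented for illustration and are not Apple’s published ratios.

```python
import random

# Illustrative mixture only: the source names mirror the bullet above, but the
# weights are made up for this sketch.
DATA_MIXTURE = {
    "applebot_filtered_web": 0.6,
    "licensed_publisher_content": 0.2,
    "synthetic_fine_tuning_and_vision_language": 0.1,
    "image_caption_pairs": 0.1,
}

def sample_source(rng=random):
    """Pick the source of the next training document according to the mixture weights."""
    sources, weights = zip(*DATA_MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

print(sample_source())
```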
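
Honoring robots.txt is a standard crawler behavior, and Python’s standard library can express the check directly. The snippet below is a generic sketch of that mechanism, not Apple’s crawler code; the URLs are placeholders, and "Applebot" is simply the user-agent string publishers use to opt out.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt, then ask whether the Applebot
# user agent is allowed to crawl a given page.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("Applebot", "https://example.com/some/article"):
    print("allowed: page may be crawled")
else:
    print("excluded: publisher opted out, page is skipped")
```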
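
Finally, the effect of the vocabulary growth in the last bullet (100,000 to 150,000 entries, as the 50 percent expansion implies) can be seen with a toy byte-level BPE experiment: a larger vocabulary trained on more multilingual text splits non-English sentences into fewer tokens. The tiny corpus, the scaled-down vocabulary sizes, and the use of the Hugging Face `tokenizers` library are all stand-ins for illustration, not Apple’s tokenizer pipeline.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Write a tiny bilingual corpus so the example is self-contained.
CORPUS = "toy_corpus.txt"
with open(CORPUS, "w", encoding="utf-8") as f:
    f.write("Guten Morgen, wie geht es dir heute?\n" * 200)
    f.write("Good morning, how are you today?\n" * 200)

def train_bpe(vocab_size):
    """Train a byte-level BPE tokenizer with the given vocabulary budget."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    tok.train([CORPUS], trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"]))
    return tok

small, large = train_bpe(100), train_bpe(150)   # toy stand-ins for the 100k -> 150k expansion
sample = "Guten Morgen, wie geht es dir?"
print(len(small.encode(sample).tokens), "tokens vs", len(large.encode(sample).tokens), "tokens")
```

Fewer tokens per non-English sentence means shorter sequences and better-aligned subwords, which is the mechanism behind the benchmark gains the bullet reports.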