Overview
- The family ships in compact encoder–decoder configurations of 270M-270M (~370M total parameters, excluding the vision encoder), 1B-1B (~1.7B), and 4B-4B (~7B), suitable for on-device use.
- Tied encoder–decoder embeddings and a merged decoder self- and cross-attention layer reduce the parameter count and simplify the architecture for more efficient inference (see the first sketch after this list).
- A built-in vision encoder enables image-plus-text understanding for tasks such as visual question answering, and training data expands coverage to more than 140 languages.
- The models handle context windows of up to 128K tokens using Gemma 3’s alternating local and global attention (see the second sketch after this list), with reported quality gains over both Gemma 3 and the original T5Gemma.
- The paper is live on arXiv, and pre-trained checkpoints are available on Kaggle, Hugging Face, Colab, and Vertex AI; Google is not releasing post-trained or instruction-tuned checkpoints.
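
Below is a minimal PyTorch sketch of the two design choices named above: tied embeddings shared between encoder, decoder, and output head, and a decoder layer whose single attention call covers both the encoder memory and the decoder prefix. The module names, sizes, and layer layout here are illustrative assumptions, not the actual T5Gemma 2 implementation.

```python
# Sketch: merged self-/cross-attention decoder layer + tied embeddings.
# Names and hyperparameters are illustrative, not the T5Gemma 2 code.
import torch
import torch.nn as nn


def merged_mask(enc_len: int, dec_len: int, device=None) -> torch.Tensor:
    """True = blocked. Decoder tokens may attend to every encoder token
    and only to earlier decoder tokens (causal over the decoder part)."""
    mask = torch.zeros(dec_len, enc_len + dec_len, dtype=torch.bool, device=device)
    mask[:, enc_len:] = torch.triu(
        torch.ones(dec_len, dec_len, dtype=torch.bool, device=device), diagonal=1
    )
    return mask


class MergedDecoderLayer(nn.Module):
    """One attention sub-layer whose keys/values are the concatenation of
    encoder outputs and decoder states, replacing separate self- and
    cross-attention sub-layers."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, dec_x, enc_out, attn_mask=None):
        xn = self.norm1(dec_x)
        kv = torch.cat([enc_out, xn], dim=1)  # encoder memory + decoder prefix
        h, _ = self.attn(xn, kv, kv, attn_mask=attn_mask)
        x = dec_x + h
        return x + self.ffn(self.norm2(x))


class TinyEncDec(nn.Module):
    """Toy encoder-decoder whose embedding matrix is shared by the encoder,
    the decoder, and the output projection (weight tying)."""

    def __init__(self, vocab: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = MergedDecoderLayer(d_model, n_heads)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)
        self.lm_head.weight = self.embed.weight  # tie embeddings and output head

    def forward(self, src_ids, tgt_ids):
        enc = self.encoder(self.embed(src_ids))
        mask = merged_mask(enc.size(1), tgt_ids.size(1), device=enc.device)
        dec = self.decoder(self.embed(tgt_ids), enc, attn_mask=mask)
        return self.lm_head(dec)
```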
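
And a sketch of the alternating local/global attention pattern that keeps 128K-token contexts tractable: most layers use a causal sliding window, while periodic layers attend over the full sequence. The window size and local:global ratio below are placeholder values, not the exact Gemma 3 / T5Gemma 2 configuration.

```python
# Sketch: alternating local (sliding-window) and global causal attention
# masks; window size and local:global ratio are placeholder values.
import torch


def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True = blocked. Each token attends to itself and the previous
    `window - 1` tokens, so attention cost scales with seq_len * window."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    visible = (j <= i) & (j > i - window)
    return ~visible


def global_causal_mask(seq_len: int) -> torch.Tensor:
    """Standard causal mask: each token attends to all earlier tokens."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)


def mask_for_layer(layer_idx: int, seq_len: int,
                   window: int = 1024, local_per_global: int = 5) -> torch.Tensor:
    # Interleave: every (local_per_global + 1)-th layer attends globally,
    # the rest stay within the sliding window.
    if (layer_idx + 1) % (local_per_global + 1) == 0:
        return global_causal_mask(seq_len)
    return local_causal_mask(seq_len, window)


if __name__ == "__main__":
    print(mask_for_layer(layer_idx=0, seq_len=8, window=3).int())  # local layer
    print(mask_for_layer(layer_idx=5, seq_len=8, window=3).int())  # global layer
```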