
Google Releases T5Gemma 2, a Compact Multimodal Encoder-Decoder With 128K Context

Pre-trained checkpoints are available today across major platforms for developers to post-train before deployment.

Overview

  • The family ships in three compact encoder-decoder configurations suited to on-device use: 270M-270M (~370M total parameters, excluding the vision encoder), 1B-1B (~1.7B), and 4B-4B (~7B).
  • Tied encoder–decoder embeddings and a single merged layer for decoder self- and cross-attention reduce the parameter count and simplify the architecture for more efficient inference (a toy sketch of both ideas follows the list below).
  • A built-in vision encoder enables image-plus-text understanding for tasks such as visual question answering, and expanded training data covers more than 140 languages.
  • The models handle context windows of up to 128K tokens using Gemma 3’s alternating local and global attention (see the mask sketch after this list), with reported quality gains over both Gemma 3 and the original T5Gemma.
  • The paper is live on arXiv, and pre-trained checkpoints are available on Kaggle, Hugging Face, Colab, and Vertex AI (a loading sketch follows the list); Google is not releasing post-trained or instruction-tuned checkpoints.
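
The tied-embedding and merged-attention ideas can be illustrated in a few lines. The module below is a toy sketch, not T5Gemma 2’s actual implementation: one embedding table is shared between encoder, decoder, and output head, and a single attention call over the concatenation of decoder states and encoder outputs stands in for separate self- and cross-attention sublayers. All class names and dimensions here are made up for illustration.

```python
# Toy sketch of two T5Gemma 2 design ideas (not the real implementation):
# (1) tied encoder/decoder embeddings, (2) a decoder layer whose self- and
# cross-attention are merged into one attention call over [decoder; encoder].
import torch
import torch.nn as nn

class TiedSeq2Seq(nn.Module):
    def __init__(self, vocab_size=256, d_model=64, n_heads=4):
        super().__init__()
        # One embedding table serves both encoder and decoder (tied weights).
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Merged attention: one MultiheadAttention whose keys/values are the
        # concatenation of decoder states and encoder outputs, replacing
        # separate self-attention and cross-attention sublayers.
        self.merged_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # Output projection tied to the embedding matrix as well.
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight

    def forward(self, src_ids, tgt_ids):
        enc = self.encoder(self.embed(src_ids))   # (B, S, D)
        dec = self.embed(tgt_ids)                 # (B, T, D)
        kv = torch.cat([dec, enc], dim=1)         # (B, T+S, D)
        # Causal over the decoder part, fully visible over the encoder part.
        T, S = tgt_ids.size(1), src_ids.size(1)
        mask = torch.zeros(T, T + S, dtype=torch.bool)
        mask[:, :T] = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.merged_attn(dec, kv, kv, attn_mask=mask)
        return self.lm_head(self.ffn(out + dec))

model = TiedSeq2Seq()
logits = model(torch.randint(0, 256, (2, 10)), torch.randint(0, 256, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 256])
```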
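
Gemma 3’s long-context recipe, which the 128K bullet above refers to, interleaves cheap sliding-window (local) attention layers with full (global) attention layers. The mask construction below is a minimal sketch under assumed toy settings; Gemma 3’s published recipe uses a much larger window and a different local-to-global ratio.

```python
# Minimal sketch of alternating local/global attention masks.
# Window size and the local:global pattern are toy values, not Gemma 3's.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """True where attention is ALLOWED: position i may see all j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def local_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal sliding-window mask: i may see j in [i - window + 1, i]."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

def layer_masks(n_layers: int, seq_len: int, window: int, global_every: int):
    """Every `global_every`-th layer gets full causal attention; the rest
    get the cheaper sliding-window mask."""
    return [causal_mask(seq_len) if (layer + 1) % global_every == 0
            else local_mask(seq_len, window)
            for layer in range(n_layers)]

masks = layer_masks(n_layers=6, seq_len=8, window=3, global_every=3)
print(masks[0].int())  # local layer: banded lower triangle
print(masks[2].int())  # global layer: full lower triangle
```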
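
Since only pre-trained checkpoints ship, a typical first step is to load one and verify generation before post-training. The snippet below is a hedged sketch using Hugging Face transformers Auto classes; the model id is a guess at the naming scheme, so confirm the actual id on the hub before running it.

```python
# Hedged sketch: load a pre-trained T5Gemma 2 checkpoint for post-training.
# The model id below is an assumed naming scheme -- confirm the real id on
# the Hugging Face hub. Only pre-trained (not instruction-tuned) checkpoints
# are released, so expect raw language-model behavior, not chat behavior.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-2-270m-270m"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Translate to French: Hello, world.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```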