Particle.news

Alibaba Cloud Open-Sources Qwen3-Omni, TTS and Image-Edit Models With Native Multimodality

Alibaba says the suite matches top proprietary systems on many audiovisual tests, a claim the developer community will now scrutinize.

Overview

  • The Sept. 23 release makes Qwen3-Omni, Qwen3-TTS and Qwen3-TTS-Flash, Qwen-Image-Edit-2509, and new Qwen3-Next-80B variants publicly available with code and demos on GitHub, Hugging Face and ModelScope.
  • Qwen3-Omni accepts text, image, audio and video inputs and streams responses as text or natural speech in real time, using a mixture-of-experts (MoE) Thinker–Talker design with AuT pretraining and a multi-codebook scheme to lower latency.
  • Alibaba reports coverage of 119 text languages plus 19 speech input languages and 10 speech output languages across the suite.
  • Alibaba's own benchmarks claim the best results among open-source models on 32 of 36 audio and video tests, and performance comparable to Gemini 2.5 Pro on speech recognition (ASR) and audio understanding.
  • Qwen3-TTS offers 17 voices, each spanning 10 languages, with support for multiple Chinese dialects; Qwen3-TTS-Flash targets lower first-packet latency and greater stability; and Qwen-Image-Edit-2509 adds multi-image editing, improved single-image consistency and native ControlNet support.