Overview
- LongCat-Video uses a single Diffusion Transformer that natively handles text-to-video, image-to-video, and multi-frame video continuation by varying the number of conditioning frames (see the first sketch after this list).
- The team reports stable generation of videos roughly five minutes long, with improved cross-frame consistency and physically plausible motion.
- The base model has about 13.6 billion parameters; it is pretrained on the video-continuation task, uses Block-Causal Attention (second sketch below), and is post-trained with GRPO to strengthen long-sequence coherence.
- Efficiency features include block-sparse attention and conditional token caching (third sketch below), combined with a two-stage coarse-to-fine generation pipeline and distillation that the team says can deliver up to ~10.1× faster inference in certain settings.
- Code and weights are publicly available on GitHub and Hugging Face; evaluations cited by the team, including VBench, report state-of-the-art results among open-source models, though these await broader independent verification.
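
To make the task unification concrete, here is a minimal sketch of how one model can cover all three tasks purely through the number of conditioning frames. The function name `build_model_inputs`, the tensor shapes, and the concatenation scheme are illustrative assumptions, not LongCat-Video's actual API.

```python
import torch

def build_model_inputs(noise_frames: torch.Tensor,
                       cond_frames: torch.Tensor | None) -> tuple[torch.Tensor, int]:
    """Prepend clean conditioning frames (if any) to the noisy frames to
    be denoised. The same model then serves all three tasks:
      - text-to-video:      cond_frames is None   (0 conditioning frames)
      - image-to-video:     cond_frames has 1 frame
      - video continuation: cond_frames has N previous frames
    Returns the full frame sequence and the number of conditioning frames.
    """
    if cond_frames is None:
        return noise_frames, 0
    return torch.cat([cond_frames, noise_frames], dim=1), cond_frames.shape[1]

# Shapes are (batch, frames, channels, height, width) -- illustrative only.
noise = torch.randn(1, 16, 4, 32, 32)        # latent frames to denoise
first_frame = torch.randn(1, 1, 4, 32, 32)   # a single conditioning image

t2v_inputs, n_cond = build_model_inputs(noise, None)         # 0 cond frames
i2v_inputs, n_cond = build_model_inputs(noise, first_frame)  # 1 cond frame
```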
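Block-Causal Attention, as the name suggests, plausibly means causality at block (e.g. frame-group) granularity: tokens attend bidirectionally within their own block and causally to all earlier blocks. The sketch below builds such a mask under that assumption; the exact block definition in LongCat-Video may differ.

```python
import torch

def block_causal_mask(num_blocks: int, tokens_per_block: int) -> torch.Tensor:
    """Boolean attention mask: True where attention is allowed.

    Tokens attend bidirectionally inside their own block and causally
    to every earlier block, but never to later blocks -- causality at
    block granularity rather than per token.
    """
    n = num_blocks * tokens_per_block
    block_id = torch.arange(n) // tokens_per_block  # block index of each token
    # query token i may attend to key token j iff block(j) <= block(i)
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

mask = block_causal_mask(num_blocks=3, tokens_per_block=2)
# mask[i, j] is True when token i may attend to token j; the 6x6 result
# is block lower-triangular with dense 2x2 blocks on the diagonal.
```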
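Conditional token caching likely exploits the fact that conditioning frames are clean and unchanged across denoising steps, so their attention keys and values need only be computed once. The `CondKVCache` class below is a hypothetical illustration of that idea, not the repository's implementation.

```python
import torch

class CondKVCache:
    """Cache keys/values for conditioning-frame tokens.

    In image-to-video or continuation, the conditioning frames are
    identical at every denoising step, so their keys and values can be
    computed once and reused; only the noisy tokens are re-projected.
    """
    def __init__(self):
        self.k = None
        self.v = None

    def get_or_compute(self, cond_tokens, k_proj, v_proj):
        if self.k is None:  # first denoising step: compute and store
            self.k = k_proj(cond_tokens)
            self.v = v_proj(cond_tokens)
        return self.k, self.v

# Illustrative usage inside one attention layer:
d = 64
k_proj = torch.nn.Linear(d, d)
v_proj = torch.nn.Linear(d, d)
cache = CondKVCache()
cond_tokens = torch.randn(1, 128, d)  # tokens from conditioning frames
for _ in range(4):                    # denoising loop (sketch)
    k_cond, v_cond = cache.get_or_compute(cond_tokens, k_proj, v_proj)
    # ...concatenate with keys/values of the noisy tokens, run attention...
```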