Meituan’s LongCat Team Open-Sources LongCat-Video, a Unified Model for Long-Form Video Generation

The team frames video generation as a step toward broader world models.

Overview

  • LongCat-Video uses a single Diffusion Transformer that natively handles text-to-video, image-to-video, and multi-frame video continuation by varying the number of conditional frames (see the first sketch after this list).
  • The team reports stable long-form outputs of around five minutes with improved cross-frame consistency and physically plausible motion.
  • The base model has about 13.6 billion parameters, is pretrained on the video-continuation task, and uses Block-Causal Attention (sketched below) plus GRPO post-training to boost long-sequence coherence.
  • Efficiency features include block-sparse attention and conditional token caching (see the caching sketch below), along with a two-stage coarse-to-fine pipeline and distillation that the team says can deliver up to ~10.1× faster inference in certain settings.
  • Code and weights are publicly available on GitHub and Hugging Face, and team-reported evaluations, including VBench, claim open-source state-of-the-art results, though these await broader independent verification.
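
To make the unified-task design concrete, here is a minimal sketch of how one diffusion transformer could cover all three tasks simply by varying how many clean conditional frames are prepended to the noisy target latents. The function name, tensor layout, and API are illustrative assumptions, not LongCat-Video's actual code.

```python
import torch

def build_model_input(noisy_latents, cond_frames=None):
    """Illustrative sketch (not the official API): one diffusion
    transformer handles three tasks by varying the conditional prefix.

    noisy_latents: (B, T_target, C, H, W) latents to be denoised
    cond_frames:   (B, T_cond, C, H, W) clean frames, or None
                   T_cond = 0 -> text-to-video
                   T_cond = 1 -> image-to-video
                   T_cond > 1 -> multi-frame video continuation
    """
    if cond_frames is None:
        # Pure text-to-video: no visual conditioning, text guides denoising.
        return noisy_latents
    # Conditional frames stay clean (noise-free) and are concatenated
    # along the time axis; the model learns to denoise only the tail.
    return torch.cat([cond_frames, noisy_latents], dim=1)
```

In this framing, video continuation becomes the natural pretraining task: the same model that continues N frames also generates from a single image or from nothing at all.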
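The Block-Causal Attention mentioned above can be pictured as a mask over temporal blocks of frame tokens: tokens attend freely within their own block and to all earlier blocks, but never to future ones, which is what lets generation extend block by block for long videos. The following is a hedged sketch under that reading; the mask construction below is a generic illustration, not the team's implementation.

```python
import torch

def block_causal_mask(num_blocks, tokens_per_block):
    """Boolean attention mask of shape (n, n), where n is the total
    token count: mask[i, j] is True where query token i may attend to
    key token j, i.e. j lies in the same or an earlier temporal block."""
    n = num_blocks * tokens_per_block
    block_id = torch.arange(n) // tokens_per_block
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)
```

For example, block_causal_mask(3, 2) lets tokens 4-5 (the last block) attend to everything, while tokens 0-1 see only their own block, mirroring causal ordering at block granularity.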
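Conditional token caching likely exploits the fact that conditioning frames do not change across denoising steps, so their per-layer key/value projections can be computed once and reused instead of being recomputed at every step. This is a speculative sketch of that idea; the class, method names, and structure are hypothetical.

```python
class CondTokenCache:
    """Hedged sketch of conditional token caching: keys/values for the
    fixed conditional prefix are computed once per layer and reused on
    every subsequent denoising step, cutting redundant attention work
    on long conditional prefixes."""

    def __init__(self):
        self._kv = {}  # layer index -> cached (keys, values) tuple

    def get_or_compute(self, layer_idx, cond_tokens, kv_proj):
        # kv_proj is the layer's key/value projection; it only ever
        # runs once per layer because cond_tokens never change.
        if layer_idx not in self._kv:
            self._kv[layer_idx] = kv_proj(cond_tokens)
        return self._kv[layer_idx]
```

Combined with block-sparse attention and the distilled coarse-to-fine pipeline, caching of this kind is the sort of mechanism that could plausibly account for the reported ~10.1× inference speedup in favorable settings.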