Core Components
Text Encoder:

  • Qwen2.5-VL generates rich text embeddings, refined by a Linguistic Token Refiner to enhance prompt alignment and remove positional bias.

Visual Encoder:

  • FLUX VAE is used for spatial latent encoding in Image Editing and Image-to-Video tasks.
  • HunyuanVideo 3D VAE produces compact, temporally consistent latents for both image and video inputs.

CrossDiT Backbone:

  • The core generative engine, where the number of blocks scales with model size.
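The multimodal fusion performed by a CrossDiT-style block can be sketched as cross-attention in which visual latent tokens (queries) attend to text tokens (keys/values). The following NumPy toy is purely illustrative; the dimensions, random projection weights, and single-head setup are assumptions, not the Kandinsky implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual_tokens, text_tokens, d_head=64, rng=None):
    """Visual latent tokens (queries) attend to text tokens (keys/values)."""
    if rng is None:
        rng = np.random.default_rng(0)
    d_v, d_t = visual_tokens.shape[-1], text_tokens.shape[-1]
    Wq = rng.standard_normal((d_v, d_head)) / np.sqrt(d_v)  # query projection
    Wk = rng.standard_normal((d_t, d_head)) / np.sqrt(d_t)  # key projection
    Wv = rng.standard_normal((d_t, d_head)) / np.sqrt(d_t)  # value projection
    q = visual_tokens @ Wq
    k = text_tokens @ Wk
    v = text_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_head))  # (num_visual, num_text)
    return attn @ v

visual = np.random.default_rng(1).standard_normal((16, 128))  # 16 latent patches
text = np.random.default_rng(2).standard_normal((8, 96))      # 8 text tokens
out = cross_attention(visual, text)
print(out.shape)  # (16, 64)
```

In the real backbone this operation is repeated across many blocks (their number scales with model size), interleaved with self-attention and MLP layers.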
Model Inputs
  1. Text
  • Embeddings from Qwen2.5-VL (transformer decoder).
  • Augmented with 1D Rotary Position Embeddings (RoPE).
  • Refined by the Linguistic Token Refiner with bidirectional attention, which prepares text tokens for cross-attention inside the DiT.
  2. CLIP Text Embedding
  • A single global embedding of the full video description.
  • Provides semantic conditioning in addition to token-level embeddings.
  3. Time
  • Diffusion step index.
  • Encoded with sinusoidal positional encoding + MLP.
  • Fused with the global CLIP text embedding of the video description.
  4. Visual
  • Latents from the HunyuanVideo 3D VAE.
  • Equipped with 3D Rotary Position Embeddings for spatial–temporal alignment.
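The Time input above (sinusoidal encoding of the step index, passed through an MLP and fused with the global CLIP text embedding) can be sketched in a few lines. This is a hedged toy: the hidden width, ReLU activation, and additive fusion are assumptions for illustration, not the actual model's choices:

```python
import numpy as np

def sinusoidal_embedding(t, dim=256, max_period=10000.0):
    """Standard sinusoidal encoding of a scalar diffusion step index."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def time_conditioning(t, clip_embedding, hidden=512, rng=None):
    """Sinusoidal encoding -> 2-layer MLP, fused with a global CLIP text embedding."""
    if rng is None:
        rng = np.random.default_rng(0)
    emb = sinusoidal_embedding(t)
    W1 = rng.standard_normal((emb.size, hidden)) / np.sqrt(emb.size)
    W2 = rng.standard_normal((hidden, clip_embedding.size)) / np.sqrt(hidden)
    h = np.maximum(emb @ W1, 0.0) @ W2  # MLP (ReLU is an assumed activation)
    return h + clip_embedding           # additive fusion (an assumption)

clip_vec = np.zeros(768)                # stand-in for a real CLIP embedding
cond = time_conditioning(t=37, clip_embedding=clip_vec)
print(cond.shape)  # (768,)
```

The resulting conditioning vector is what modulates the DiT blocks at each diffusion step.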
Model Overview
All models in the Kandinsky 5.0 family share a unified foundation based on:
  • A latent diffusion pipeline trained with the Flow Matching paradigm.
  • A scalable Diffusion Transformer (CrossDiT) backbone for multimodal fusion of text and visual data.
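One training example under Flow Matching, as used by this family, can be sketched as regressing the constant velocity along a straight-line path between data and noise. The toy below assumes the common linear-interpolation formulation; the dummy model is hypothetical and only there to make the snippet runnable:

```python
import numpy as np

def flow_matching_loss(x0, model, rng):
    """One flow-matching training example: sample t, interpolate
    x_t = (1 - t) * x0 + t * noise, and regress the velocity (noise - x0)."""
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * noise
    target_velocity = noise - x0        # d x_t / d t for the linear path
    pred = model(x_t, t)
    return np.mean((pred - target_velocity) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))        # stand-in for a VAE latent
dummy_model = lambda x_t, t: np.zeros_like(x_t)  # hypothetical untrained model
loss = flow_matching_loss(x0, dummy_model, rng)
print(loss >= 0.0)
```

At sampling time, integrating the learned velocity field from noise toward data recovers a generation trajectory, which is what the distilled variants later compress into fewer steps.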
Kandinsky 5.0 Image is a high-resolution text-to-image generation model (6B parameters) that achieves state-of-the-art visual quality and prompt alignment. It outperforms leading open-source models such as FLUX.1[dev] and Qwen-Image in aesthetic realism and compositional accuracy.

We provide a comprehensive suite of optimized variants for different workflows:

  • RL-finetuned model — delivers the highest visual fidelity and realism.
  • SFT-soup model — excels in prompt following and overall visual quality.
  • Pretrain checkpoint — designed for researchers to conduct further fine-tuning and experimentation.

Additionally, we provide Kandinsky 5.0 Image Editing, a specialized variant derived from the base Image model and fine-tuned on an instructive dataset for precise, context-aware image editing (e.g., inpainting, object replacement, style transfer).

All models are available for generating images at resolutions up to 1024×1024 pixels.
Kandinsky 5.0 Video Lite is a line-up of lightweight (2B parameters), high-speed models for text-to-video and image-to-video generation of up to 10-second clips at up to 768×512 resolution. It achieves state-of-the-art visual quality, motion consistency, and prompt alignment.

We deliver eight optimized variants for different use cases:
  • Supervised Fine-Tuned (SFT) models — provide the highest generation quality after fine-tuning on a curated dataset of high-quality videos and images.
  • No-CFG distilled models — offer 2× faster inference by removing classifier-free guidance.
  • Distilled 16-step models (Flash) — enable ultra-fast generation with only 16 function evaluations, achieving a 6× speedup while preserving visual fidelity via Trajectory Segmented Consistency Distillation and adversarial post-training.
  • Pretrain checkpoints — designed for researchers to conduct further fine-tuning and experimentation.
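The 2× speedup of the No-CFG variants follows directly from how classifier-free guidance works: a standard guided sampler calls the model twice per step (conditional and unconditional) and extrapolates between the two, so distilling the guided behavior into a single pass halves the function evaluations. A minimal sketch, with a hypothetical toy model standing in for the real denoiser:

```python
import numpy as np

def cfg_step(model, x_t, t, text_cond, guidance_scale=5.0):
    """Classifier-free guidance: two forward passes per sampling step,
    extrapolated by the guidance scale. A CFG-distilled model folds this
    into one pass, halving the function evaluations per step."""
    cond = model(x_t, t, text_cond)   # conditional prediction
    uncond = model(x_t, t, None)      # unconditional prediction
    return uncond + guidance_scale * (cond - uncond)

# Toy model: conditioning simply shifts the prediction by 1.
toy = lambda x, t, c: x + (1.0 if c is not None else 0.0)
x = np.zeros(3)
out = cfg_step(toy, x, t=0.5, text_cond="prompt", guidance_scale=5.0)
print(out)  # [5. 5. 5.]
```

The Flash variants go further: rather than removing one of two passes, consistency distillation shortens the sampling trajectory itself to 16 evaluations in total.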

All models are available in both 5-second and 10-second versions.
Kandinsky 5.0 Video Pro is a line-up of high-capacity (19B parameters) models for text-to-video and image-to-video generation of up to 10-second clips at high resolution. It delivers state-of-the-art visual fidelity, cinematic motion dynamics, and precise prompt adherence, outperforming leading open and proprietary systems on complex compositional tasks.

We deliver the following optimized variants for different use cases:

  • Supervised Fine-Tuned (SFT) models — provide the highest generation quality after fine-tuning on a curated dataset of high-quality videos and images.
  • Distilled 16-step models (Flash) — enable ultra-fast generation with only 16 function evaluations, achieving a 6× speedup while preserving visual fidelity through Trajectory Segmented Consistency Distillation and adversarial post-training.
  • Pretrain checkpoints — designed for researchers to conduct further fine-tuning and experimentation.

All models are available in both 5-second and 10-second versions.