# Video Generation Comparison

Compare the leading open-source video generation models for deployment on Clore.ai GPU servers.

{% hint style="info" %}
**AI video generation** has exploded in 2024-2025. This guide compares the top open-source models — Hunyuan Video, Wan2.1, CogVideoX, Mochi 1, and LTX-Video — covering quality, speed, VRAM requirements, and use cases.
{% endhint %}

***

## Quick Decision Matrix

|                    | Hunyuan Video | Wan2.1     | CogVideoX  | Mochi 1    | LTX-Video  |
| ------------------ | ------------- | ---------- | ---------- | ---------- | ---------- |
| **Developer**      | Tencent       | Alibaba    | Zhipu AI   | Genmo      | Lightricks |
| **Quality**        | ⭐⭐⭐⭐⭐         | ⭐⭐⭐⭐⭐      | ⭐⭐⭐⭐       | ⭐⭐⭐⭐       | ⭐⭐⭐        |
| **Speed**          | Slow          | Medium     | Medium     | Medium     | **Fast**   |
| **Min VRAM**       | 24GB          | 16GB       | 16GB       | 24GB       | **8GB**    |
| **Max resolution** | 1280×720      | 1280×720   | 1440×960   | 848×480    | 1216×704   |
| **Max length**     | 5s            | 5s         | 6s         | 5.4s       | 2min       |
| **License**        | CLA           | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| **GitHub stars**   | 10K+          | 7K+        | 6K+        | 4K+        | 5K+        |

***

## Overview

### Hunyuan Video

Tencent's Hunyuan Video is widely considered the best open-source video generation model as of early 2025. It uses a diffusion-transformer architecture and delivers exceptional motion quality.

**Key specs**: 13B parameters, 5s at 720p, requires 24GB+ VRAM

### Wan2.1

Alibaba's Wan2.1 (from the Tongyi Wanxiang family) is a strong competitor to Hunyuan, offering similar quality with lower minimum VRAM requirements. Available in 1.3B and 14B parameter variants.

**Key specs**: 1.3B (lite) or 14B, 5s at 720p, 16GB+ VRAM for 1.3B

### CogVideoX

Zhipu AI's CogVideoX focuses on precise text-following and coherent long-form video. It's particularly strong for cinematic content and story-driven generation.

**Key specs**: 2B/5B parameters, 6s at 1440×960, 16GB+ VRAM

### Mochi 1

Genmo's Mochi 1 is known for smooth, fluid motion and realistic physics. It uses a novel AsymmDiT architecture. Available fully open-source (weights + training code).

**Key specs**: 10B parameters, 5.4s at 848×480, 24GB VRAM

### LTX-Video

Lightricks' LTX-Video prioritizes inference speed above all. It can generate video in real-time or near-real-time on modern GPUs — ideal for interactive applications.

**Key specs**: 2B parameters, up to 2 minutes of video, 8GB VRAM

***

## Quality Comparison

### Benchmark Scores (VBench / EvalCrafter, 2025)

{% hint style="info" %}
Quality is subjective. These scores reflect community consensus from VBench and EvalCrafter benchmarks.
{% endhint %}

| Model         | VBench Score | Motion Quality | Text Alignment | Aesthetic  |
| ------------- | ------------ | -------------- | -------------- | ---------- |
| Hunyuan Video | **83.2**     | **Excellent**  | Excellent      | Excellent  |
| Wan2.1 (14B)  | **82.8**     | Excellent      | Excellent      | Excellent  |
| CogVideoX-5B  | 79.6         | Good           | **Very Good**  | Good       |
| Mochi 1       | 77.4         | Very Good      | Good           | Good       |
| LTX-Video     | 71.2         | Good           | Good           | Acceptable |

### Qualitative Strengths

| Model         | Best At                            | Weaknesses                  |
| ------------- | ---------------------------------- | --------------------------- |
| Hunyuan Video | Overall quality, cinematography    | Very slow, VRAM hungry      |
| Wan2.1        | Balance of quality/efficiency, I2V | Occasionally over-saturated |
| CogVideoX     | Long-form narrative, text accuracy | Less dynamic motion         |
| Mochi 1       | Fluid motion, physics              | Lower resolution limit      |
| LTX-Video     | Speed, long videos                 | Quality gap vs others       |

***

## Speed Benchmarks

### Generation Time (A100 80GB, single GPU)

| Model         | 480p 5s    | 720p 5s   | 1080p 5s |
| ------------- | ---------- | --------- | -------- |
| Hunyuan Video | 45 min     | \~3 hours | ❌ OOM    |
| Wan2.1 (14B)  | 15 min     | 45 min    | ❌ OOM    |
| Wan2.1 (1.3B) | 3 min      | 8 min     | ❌ OOM    |
| CogVideoX-5B  | 10 min     | 25 min    | ❌ OOM    |
| Mochi 1       | 8 min      | ❌ OOM     | ❌ OOM    |
| LTX-Video     | **45 sec** | **3 min** | 8 min    |

{% hint style="warning" %}
**Times are approximate** and vary with sampler steps (20-50), guidance scale, and hardware. Use fewer steps for previews.
{% endhint %}
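Since sampling time scales roughly linearly with the step count, you can estimate preview turnaround from the table above. A back-of-the-envelope sketch (the `estimate_time` helper is illustrative, not part of any library; baseline times are the approximate figures from the table):

```python
# Diffusion sampling time scales roughly linearly with the number
# of inference steps, so a 20-step preview takes ~40% of a 50-step
# final render.

def estimate_time(full_time_min: float, full_steps: int = 50,
                  preview_steps: int = 20) -> float:
    """Estimate generation time (minutes) at a reduced step count."""
    return full_time_min * preview_steps / full_steps

# Wan2.1-14B at 720p: ~45 min at 50 steps → ~18 min at 20 steps
print(round(estimate_time(45), 1))  # → 18.0
```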

### With Optimization (TeaCache / FORA / Step Distillation)

Optimized inference can reduce generation time significantly:

| Model         | With Cache      | Speedup |
| ------------- | --------------- | ------- |
| Hunyuan Video | \~15 min (720p) | 4×      |
| Wan2.1        | \~12 min (720p) | \~4×    |
| CogVideoX     | \~8 min (720p)  | \~3×    |
| LTX-Video     | \~45s (720p)    | 4×      |

***

## VRAM Requirements

### Minimum VRAM by Model and Resolution

| Model         | 480p    | 720p  | 1080p |
| ------------- | ------- | ----- | ----- |
| Hunyuan Video | 24GB    | 40GB+ | ❌     |
| Wan2.1 (14B)  | 24GB    | 40GB+ | ❌     |
| Wan2.1 (1.3B) | **8GB** | 16GB  | 24GB  |
| CogVideoX-5B  | 16GB    | 24GB  | ❌     |
| CogVideoX-2B  | **8GB** | 16GB  | ❌     |
| Mochi 1       | 24GB    | ❌     | ❌     |
| LTX-Video     | **8GB** | 12GB  | 24GB  |

### Memory Optimization Techniques

#### Reduced Precision

```python
# CogVideoX in fp16 — half the memory of fp32 weights
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # Further reduces VRAM
pipe.vae.enable_slicing()        # Decode frames in slices
pipe.vae.enable_tiling()         # Decode in spatial tiles
```

#### CPU Offloading

```python
# Wan2.1 with CPU offload for lower VRAM
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
```

***

## Hunyuan Video: Deep Dive

### Architecture

* **13B DiT** (Diffusion Transformer) parameters
* Full attention over all spatial and temporal tokens
* Trained on 1B+ video clips

### Deployment on Clore.ai

```bash
# Clone and install
git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo
pip install -r requirements.txt

# Download weights (~87GB)
huggingface-cli download tencent/HunyuanVideo --local-dir ./weights

# Generate
python sample_video.py \
  --video-size 720 1280 \
  --video-length 129 \
  --infer-steps 50 \
  --prompt "A majestic eagle soaring over snow-capped mountains" \
  --flow-shift 7.0 \
  --embedded-cfg-scale 6.0 \
  --save-path ./outputs
```

### Via ComfyUI

```bash
# Install HunyuanVideo nodes for ComfyUI
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
pip install -r ComfyUI-HunyuanVideoWrapper/requirements.txt
```

**Best for**: Highest-quality cinematic video generation when VRAM is not a constraint

***

## Wan2.1: Deep Dive

### Architecture

* **Two variants**: Wan2.1-T2V-1.3B and Wan2.1-T2V-14B
* **Image-to-Video** (I2V) model also available
* Strong multilingual (Chinese + English) prompts

### Deployment on Clore.ai

```python
from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# 1.3B model — fits in 8-16GB VRAM
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

output = pipe(
    prompt="A serene Japanese garden with cherry blossoms falling",
    negative_prompt="low quality, blurry",
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "wan_video.mp4", fps=16)
```
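Note that `num_frames=81` at `fps=16` works out to roughly 5 seconds. These models' temporal VAEs typically compress time 4×, so frame counts follow a 4k+1 pattern (81 = 4·20+1 for Wan2.1, 49 = 4·12+1 for CogVideoX). A small illustrative helper — `frames_for_duration` is not part of diffusers — for picking a valid frame count:

```python
# Pick a valid (4k + 1) num_frames for a target clip duration.
# Wan2.1's temporal VAE compresses 4x, so frame counts must
# satisfy 4k + 1 (e.g. 81 frames ≈ 5 s at 16 fps).

def frames_for_duration(seconds: float, fps: int = 16) -> int:
    """Nearest valid (4k + 1) frame count for `seconds` at `fps`."""
    raw = round(seconds * fps)
    k = max(0, round((raw - 1) / 4))
    return 4 * k + 1

print(frames_for_duration(5))         # 5 s at 16 fps → 81
print(frames_for_duration(6, fps=8))  # 6 s at 8 fps → 49 (CogVideoX default)
```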

### Image-to-Video with Wan2.1

```python
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video
from PIL import Image
import torch

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = Image.open("input.jpg")
output = pipe(
    image=image,
    prompt="The person walks forward confidently",
    num_frames=81,
).frames[0]

export_to_video(output, "wan_i2v.mp4", fps=16)
```

**Best for**: Balance of quality and efficiency, I2V, multilingual

***

## CogVideoX: Deep Dive

### Architecture

* **Expert Transformer** with 3D full attention
* **2B and 5B** parameter variants
* CogView3 image encoder for visual quality

### Deployment on Clore.ai

```python
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A time-lapse of a city at night with light trails from cars",
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "cogvideo.mp4", fps=8)
```

**Best for**: Precise text-to-video, narrative content, long-form generation

***

## Mochi 1: Deep Dive

### Architecture

* **AsymmDiT** — asymmetric diffusion transformer
* Focus on temporal consistency and fluid motion
* Fully open-source including training code

### Deployment on Clore.ai

```bash
pip install mochi-preview
```

```python
# Mochi 1 single-GPU pipeline (paths assume pre-downloaded weights)
from mochi_preview.pipelines import (
    DecoderModelFactory,
    DitModelFactory,
    MochiSingleGPUPipeline,
    T5ModelFactory,
)

pipeline = MochiSingleGPUPipeline(
    text_encoder_factory=T5ModelFactory(),
    dit_factory=DitModelFactory(model_path="./weights/mochi-dit.safetensors"),
    decoder_factory=DecoderModelFactory(model_path="./weights/mochi-vae.safetensors"),
    cpu_offload=True,          # Trade speed for lower VRAM
    decode_type="tiled_full",  # Tiled VAE decode
)

video = pipeline(
    height=480, width=848,
    num_frames=163,
    num_inference_steps=64,
    sigma_schedule_type="linear_quadratic",
    cfg_schedule_type="linear",
    conditioning_args={"prompt": "A dolphin leaping through ocean waves at sunset"},
)
```

**Best for**: Fluid motion, realistic physics, research use cases

***

## LTX-Video: Deep Dive

### Architecture

* **2B parameter** DiT — smaller, faster
* Native **long video** support (up to 2 minutes)
* Designed for real-time or near-real-time generation

### Deployment on Clore.ai

```python
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
import torch

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

video = pipe(
    prompt="A butterfly landing on a flower in a summer garden",
    negative_prompt="worst quality, inconsistent motion, blurry",
    width=704,
    height=480,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]

export_to_video(video, "ltx_video.mp4", fps=24)
```

**Best for**: Fast generation, interactive applications, long videos, limited VRAM (8GB)

***

## Feature Comparison

### Capabilities Overview

| Feature           | Hunyuan | Wan2.1 | CogVideoX | Mochi | LTX |
| ----------------- | ------- | ------ | --------- | ----- | --- |
| Text-to-Video     | ✅       | ✅      | ✅         | ✅     | ✅   |
| Image-to-Video    | ✅       | ✅      | ✅         | ❌     | ✅   |
| Video-to-Video    | ❌       | ❌      | ✅         | ❌     | ✅   |
| ControlNet        | Partial | ❌      | ✅         | ❌     | ❌   |
| LoRA support      | ✅       | ✅      | ✅         | ❌     | ✅   |
| ComfyUI nodes     | ✅       | ✅      | ✅         | ✅     | ✅   |
| Long video (>10s) | ❌       | ❌      | Partial   | ❌     | ✅   |
| Chinese prompts   | ✅       | ✅      | ✅         | ❌     | ❌   |

***

## Clore.ai GPU Recommendations

### For Each Model

| Model         | Minimum GPU     | Recommended  | Ideal       |
| ------------- | --------------- | ------------ | ----------- |
| Hunyuan Video | RTX 3090 (24GB) | A6000 (48GB) | A100 (80GB) |
| Wan2.1 14B    | RTX 3090 (24GB) | A6000 (48GB) | A100 (80GB) |
| Wan2.1 1.3B   | RTX 3080 (10GB) | RTX 3090     | RTX 4090    |
| CogVideoX-5B  | RTX 3090 (24GB) | A6000 (48GB) | A100        |
| CogVideoX-2B  | RTX 3080 (10GB) | RTX 3090     | RTX 4090    |
| Mochi 1       | RTX 3090 (24GB) | A6000 (48GB) | A100        |
| LTX-Video     | RTX 3080 (10GB) | RTX 4080     | RTX 4090    |

### Cost Estimate per Video

```
Hunyuan Video (720p, 5s) on A100 80GB (~$1.50/hr):
  Time: ~45 min → Cost: ~$1.12 per video

Wan2.1-1.3B (480p, 5s) on RTX 3090 (~$0.50/hr):
  Time: ~3 min → Cost: ~$0.025 per video

LTX-Video (720p, 5s) on RTX 4090 (~$0.60/hr):
  Time: ~3 min → Cost: ~$0.030 per video
```
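The arithmetic above generalizes to any model/GPU pairing. A minimal sketch (the `cost_per_video` helper is illustrative; the rates are the example marketplace prices quoted above):

```python
# Cost per video = (generation time / 60) x hourly rental rate.

def cost_per_video(minutes: float, hourly_rate: float) -> float:
    """Rental cost in dollars for one generation run."""
    return round(minutes / 60 * hourly_rate, 3)

print(cost_per_video(45, 1.50))  # Hunyuan 720p on A100     → 1.125
print(cost_per_video(3, 0.50))   # Wan2.1-1.3B on RTX 3090  → 0.025
print(cost_per_video(3, 0.60))   # LTX-Video on RTX 4090    → 0.03
```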

***

## When to Use Which

### Decision Guide

```
Maximum quality (no cost limit)?
  → Hunyuan Video on A100

Best quality/cost balance?
  → Wan2.1 14B on A6000

Limited VRAM (8-12GB)?
  → LTX-Video or Wan2.1 1.3B

Need fast generation?
  → LTX-Video

Need Image-to-Video?
  → Wan2.1 I2V or CogVideoX

Need long videos (>10s)?
  → LTX-Video

Research/fine-tuning?
  → Mochi 1 (open training code) or CogVideoX

ComfyUI workflow?
  → All supported; Hunyuan and Wan have the most mature nodes
```
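For scripted deployments, the guide above can be condensed into a selection function. A sketch only — `pick_model`, its priority labels, and the 40GB threshold are illustrative simplifications of the guide, not an API:

```python
# Condensed version of the decision guide above.

def pick_model(vram_gb: int, priority: str = "quality") -> str:
    """Return a model name given available VRAM and a priority
    of 'quality', 'speed', or 'long_video'."""
    if vram_gb < 16:
        return "LTX-Video" if priority == "speed" else "Wan2.1-1.3B"
    if priority in ("speed", "long_video"):
        return "LTX-Video"
    if vram_gb >= 40:
        return "Hunyuan Video"   # Max quality needs A6000/A100-class VRAM
    return "Wan2.1-14B"          # Best quality/cost balance on 24GB cards

print(pick_model(8))                     # → Wan2.1-1.3B
print(pick_model(80))                    # → Hunyuan Video
print(pick_model(24, priority="speed"))  # → LTX-Video
```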

***

## Useful Links

* [Hunyuan Video GitHub](https://github.com/Tencent/HunyuanVideo)
* [Wan2.1 on HuggingFace](https://huggingface.co/Wan-AI)
* [CogVideoX GitHub](https://github.com/THUDM/CogVideo)
* [Mochi 1 GitHub](https://github.com/genmoai/mochi)
* [LTX-Video GitHub](https://github.com/Lightricks/LTX-Video)
* [Video Generation Leaderboard](https://huggingface.co/spaces/ArtificialAnalysis/video-generation-arena-leaderboard)

***

## Summary

| Model             | Use When                                   |
| ----------------- | ------------------------------------------ |
| **Hunyuan Video** | Best quality matters most, A100+ available |
| **Wan2.1**        | Best balance of quality and efficiency     |
| **CogVideoX**     | Precise text-to-video, long narrative      |
| **Mochi 1**       | Fluid motion, physics, open research       |
| **LTX-Video**     | Speed, low VRAM, long videos               |

The open-source video generation ecosystem moves fast. For most Clore.ai deployments, **Wan2.1** (1.3B for budget, 14B for quality) offers the best combination of quality, speed, and resource efficiency.

