# Video Generation Comparison

A comparison of the leading open-source video generation models that can be deployed on Clore.ai GPU servers.

{% hint style="info" %}
**AI video generation** took off in 2024–2025. This guide compares the top open-source models (HunyuanVideo, Wan2.1, CogVideoX, Mochi 1, and LTX-Video) on quality, speed, VRAM requirements, and use cases.
{% endhint %}

***

## Quick Decision Matrix

|                    | HunyuanVideo                      | Wan2.1     | CogVideoX  | Mochi 1    | LTX-Video  |
| ------------------ | --------------------------------- | ---------- | ---------- | ---------- | ---------- |
| **Developer**      | Tencent                           | Alibaba    | Zhipu AI   | Genmo      | Lightricks |
| **Quality**        | ⭐⭐⭐⭐⭐                             | ⭐⭐⭐⭐⭐      | ⭐⭐⭐⭐       | ⭐⭐⭐⭐       | ⭐⭐⭐        |
| **Speed**          | Slow                              | Medium     | Medium     | Medium     | **Fast**   |
| **Min. VRAM**      | 24GB                              | 16GB       | 16GB       | 24GB       | **8GB**    |
| **Max resolution** | 1280×720                          | 1280×720   | 1440×960   | 848×480    | 1216×704   |
| **Max duration**   | 5 s                               | 5 s        | 6 s        | 5.4 s      | 2 min      |
| **License**        | Tencent Hunyuan Community License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| **GitHub stars**   | 10k+                              | 7k+        | 6k+        | 4k+        | 5k+        |

***

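## Overview

### HunyuanVideo

Tencent's HunyuanVideo is widely regarded as the best open-source video generation model as of early 2025. It uses a Transformer-based architecture and produces excellent motion quality.

**Key specs**: 13B parameters, 5 seconds at 720p, requires 24GB+ VRAM

### Wan2.1

Alibaba's Wan2.1 is a strong competitor to HunyuanVideo, offering similar quality with a lower minimum VRAM requirement. It comes in 1.3B and 14B parameter variants.

**Key specs**: 1.3B (lite) or 14B, 5 seconds at 720p, 16GB+ VRAM for the 1.3B variant

### CogVideoX

Zhipu AI's CogVideoX emphasizes precise prompt adherence and coherent long-form video generation. It is especially strong at cinematic content and narrative-driven generation.

**Key specs**: 2B/5B parameters, 6 seconds at 1440×960, 16GB+ VRAM

### Mochi 1

Genmo's Mochi 1 is known for smooth, fluid motion and realistic physics. It uses a novel AsymmDiT architecture and is fully open source (weights + training code).

**Key specs**: 10B parameters, 5.4 seconds at 848×480, 24GB VRAM

### LTX-Video

Lightricks' LTX-Video puts inference speed first. It can generate video in real time or near real time on modern GPUs, which makes it a good fit for interactive applications.

**Key specs**: 2B parameters, videos up to 2 minutes long, 8GB VRAM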

***

## Quality Comparison

### VBench and EvalCrafter Benchmarks (2025)

{% hint style="info" %}
Quality is subjective. These ratings reflect community consensus drawn from the VBench and EvalCrafter benchmarks.
{% endhint %}

| Model         | VBench score | Motion quality | Text alignment | Aesthetics |
| ------------- | ------------ | -------------- | -------------- | ---------- |
| HunyuanVideo  | **83.2**     | **Excellent**  | Excellent      | Excellent  |
| Wan2.1 (14B)  | **82.8**     | Excellent      | Excellent      | Excellent  |
| CogVideoX-5B  | 79.6         | Good           | **Very good**  | Good       |
| Mochi 1       | 77.4         | Very good      | Good           | Good       |
| LTX-Video     | 71.2         | Good           | Good           | Acceptable |

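### Qualitative Strengths

| Model        | Best at                                          | Weaknesses                          |
| ------------ | ------------------------------------------------ | ----------------------------------- |
| HunyuanVideo | Overall quality, cinematic look                  | Very slow, high VRAM requirements   |
| Wan2.1       | Quality/efficiency balance, image-to-video (I2V) | Occasional oversaturation           |
| CogVideoX    | Long-form narrative, text accuracy               | Less dynamic motion                 |
| Mochi 1      | Smooth motion, realistic physics                 | Limited to lower resolutions        |
| LTX-Video    | Speed, long videos                               | Quality gap versus the other models |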

***

## Speed Benchmarks

### Generation Time (A100 80GB, Single GPU)

| Model         | 480p, 5 s | 720p, 5 s | 1080p, 5 s |
| ------------- | --------- | --------- | ---------- |
| HunyuanVideo  | 45 min    | ~3 h      | ❌ OOM      |
| Wan2.1 (14B)  | 15 min    | 45 min    | ❌ OOM      |
| Wan2.1 (1.3B) | 3 min     | 8 min     | ❌ OOM      |
| CogVideoX-5B  | 10 min    | 25 min    | ❌ OOM      |
| Mochi 1       | 8 min     | ❌ OOM     | ❌ OOM      |
| LTX-Video     | **45 s**  | **3 min** | 8 min      |

{% hint style="warning" %}
**Times are approximate** and vary with the number of sampling steps (20–50), guidance scale, and hardware. Use fewer steps for previews.
{% endhint %}
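
To illustrate the note on sampling steps, the sketch below rescales a benchmark time to a different step count. It assumes generation time grows linearly with the number of steps and that the listed times correspond to 50 steps; both are simplifying assumptions for illustration rather than measured properties, and `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(baseline_minutes: float, steps: int, baseline_steps: int = 50) -> float:
    """Scale a benchmark time to another step count (assumes linear scaling)."""
    if steps <= 0 or baseline_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_minutes * steps / baseline_steps


# CogVideoX-5B at 720p is listed at about 25 minutes in the table above.
preview = estimate_minutes(25, steps=20)
print(f"20-step preview: about {preview:.0f} minutes")
```

Under these assumptions a 20-step preview costs 40% of a full 50-step run, which is usually enough to judge composition before committing to a final render.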

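### With Optimization (TeaCache / FORA / Step Distillation)

Optimized inference can cut generation time substantially:

| Model        | With caching enabled | Speedup |
| ------------ | -------------------- | ------- |
| HunyuanVideo | ~15 min (720p)       | 4×      |
| Wan2.1       | ~12 min (720p)       | ~4×     |
| CogVideoX    | ~8 min (720p)        | ~3×     |
| LTX-Video    | ~45 s (720p)         | 4×      |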

***

## VRAM Requirements

### Minimum VRAM by Model and Resolution

| Model         | 480p    | 720p  | 1080p |
| ------------- | ------- | ----- | ----- |
| HunyuanVideo  | 24GB    | 40GB+ | ❌     |
| Wan2.1 (14B)  | 24GB    | 40GB+ | ❌     |
| Wan2.1 (1.3B) | **8GB** | 16GB  | 24GB  |
| CogVideoX-5B  | 16GB    | 24GB  | ❌     |
| CogVideoX-2B  | **8GB** | 16GB  | ❌     |
| Mochi 1       | 24GB    | ❌     | ❌     |
| LTX-Video     | **8GB** | 12GB  | 24GB  |
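
The table can be turned into a small lookup when shortlisting models for a given server. The sketch below copies the figures above (treating "40GB+" as 40) and returns the models that fit a VRAM budget at a chosen resolution; `MIN_VRAM_GB` and `models_that_fit` are hypothetical names used only for this illustration.

```python
# Minimum VRAM in GB per resolution, copied from the table above.
# None means the table marks that combination as unsupported.
MIN_VRAM_GB = {
    "HunyuanVideo": {"480p": 24, "720p": 40, "1080p": None},
    "Wan2.1 (14B)": {"480p": 24, "720p": 40, "1080p": None},
    "Wan2.1 (1.3B)": {"480p": 8, "720p": 16, "1080p": 24},
    "CogVideoX-5B": {"480p": 16, "720p": 24, "1080p": None},
    "CogVideoX-2B": {"480p": 8, "720p": 16, "1080p": None},
    "Mochi 1": {"480p": 24, "720p": None, "1080p": None},
    "LTX-Video": {"480p": 8, "720p": 12, "1080p": 24},
}


def models_that_fit(vram_gb: int, resolution: str) -> list[str]:
    """Return the models whose listed minimum VRAM fits the given budget."""
    fits = []
    for model, per_resolution in MIN_VRAM_GB.items():
        needed = per_resolution[resolution]
        if needed is not None and needed <= vram_gb:
            fits.append(model)
    return fits


print(models_that_fit(24, "720p"))
```

Because these are minimums, a card that only just qualifies will probably need the offloading and tiling options described in the next section.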

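### Memory Optimization Techniques

#### Reduced Precision

```python
# Load CogVideoX in FP16 (half the memory of FP32 weights), then lower
# peak VRAM further with CPU offload and VAE slicing/tiling.
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # move idle submodules to the CPU
pipe.vae.enable_slicing()        # decode the batch one sample at a time
pipe.vae.enable_tiling()         # decode each frame in spatial tiles
```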
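
To see why lower precision matters, the sketch below estimates the size of the model weights alone at different precisions. It is a back-of-the-envelope calculation: the text encoder, VAE, and activations add to the real footprint, and `weight_gb` is a hypothetical helper, not a library function.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("CogVideoX-5B", 5e9), ("HunyuanVideo", 13e9)]:
    sizes = {dtype: weight_gb(params, dtype) for dtype in ("fp32", "fp16", "int8")}
    print(name, sizes)
```

Each halving of precision halves the weight footprint, which is why a 13B model is impractical on a 24GB card at FP32 but workable at 16-bit precision with offloading.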

#### CPU Offload

```python
# Use CPU offload with Wan2.1 to reduce VRAM usage
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
```

***

## HunyuanVideo: Deep Dive

### Architecture

* **13B-parameter DiT** (Diffusion Transformer)
* Full attention across all spatial and temporal tokens
* Trained on 1B+ video clips
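
Full attention over every spatial and temporal token is a large part of why this model is slow and memory-hungry, since attention cost grows with the square of the token count. The sketch below counts tokens for one clip, assuming 4× temporal and 8× spatial VAE compression followed by 2×2 patching. These factors are illustrative assumptions that this guide does not state, and `latent_tokens` is a hypothetical helper.

```python
def latent_tokens(frames: int, height: int, width: int,
                  t_stride: int = 4, s_stride: int = 8, patch: int = 2) -> int:
    """Count transformer tokens for one clip under the assumed compression factors."""
    latent_frames = (frames - 1) // t_stride + 1
    latent_h = height // s_stride // patch
    latent_w = width // s_stride // patch
    return latent_frames * latent_h * latent_w


tokens_720p = latent_tokens(129, 720, 1280)
tokens_480p = latent_tokens(129, 480, 848)

# Attention cost scales roughly with the square of the sequence length.
relative_cost = (tokens_720p / tokens_480p) ** 2
print(tokens_720p, tokens_480p, round(relative_cost, 1))
```

Under these assumptions, moving from 480p to 720p roughly doubles the token count and multiplies the attention work by about five.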

### Deploying on Clore.ai

```bash
# Clone and install
git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo
pip install -r requirements.txt

# Download the weights (about 87GB)
huggingface-cli download tencent/HunyuanVideo --local-dir ./weights

# Generate
python sample_video.py \
  --video-size 720 1280 \
  --video-length 129 \
  --infer-steps 50 \
  --prompt "A majestic eagle soaring over snow-covered mountains" \
  --flow-shift 7.0 \
  --embedded-cfg-scale 6.0 \
  --save-path ./outputs
```

### Via ComfyUI

```bash
# Install the HunyuanVideo nodes for ComfyUI
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
pip install -r ComfyUI-HunyuanVideoWrapper/requirements.txt
```

**Best for**: highest-quality cinematic video generation; the first choice when VRAM is not a constraint

***

## Wan2.1: Deep Dive

### Architecture

* **Two variants**: Wan2.1-T2V-1.3B and Wan2.1-T2V-14B
* **Image-to-video** (I2V) models are also available
* Strong multilingual (Chinese + English) prompt support

### Deploying on Clore.ai

```python
from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# 1.3B model: runs in 8–16GB of VRAM
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

output = pipe(
    prompt="A serene Japanese garden with falling cherry blossoms",
    negative_prompt="low quality, blurry",
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "wan_video.mp4", fps=16)
```
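
Clip length follows directly from `num_frames` and the `fps` passed to `export_to_video`. The sketch below computes it for the frame counts and frame rates used in this guide's deployment snippets; `clip_seconds` is a hypothetical helper.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Duration of a clip in seconds for a given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# (num_frames, fps) pairs taken from the deployment snippets in this guide.
settings = {"Wan2.1": (81, 16), "CogVideoX": (49, 8), "LTX-Video": (161, 24)}
for model, (frames, fps) in settings.items():
    print(f"{model}: {clip_seconds(frames, fps):.2f} s")
```

Exporting at a different frame rate than the one the model was sampled for changes playback speed rather than adding content, so both values are best adjusted together.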

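### Image-to-Video with Wan2.1

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video
from PIL import Image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep idle submodules on the CPU

image = Image.open("input.jpg")
output = pipe(
    image=image,
    prompt="The person walks forward confidently",
    num_frames=81,
).frames[0]

export_to_video(output, "wan_i2v.mp4", fps=16)
```

**Best for**: balancing quality and efficiency, I2V, multilingual prompts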

***

## CogVideoX: Deep Dive

### Architecture

* **Expert Transformer** with 3D full attention
* **2B and 5B** parameter variants
* Uses the CogView3 image encoder for better visual quality

### Deploying on Clore.ai

```python
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A time-lapse of a city at night, with car headlights leaving light trails",
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "cogvideo.mp4", fps=8)
```

**Best for**: precise text-to-video, narrative content, long-form generation

***

## Mochi 1: Deep Dive

### Architecture

* **AsymmDiT**: Asymmetric Diffusion Transformer
* Focuses on temporal consistency and smooth motion
* Fully open source, including training code

### Deploying on Clore.ai

```bash
pip install mochi-preview

python -c "
from mochi_preview.pipelines import DecoderModelFactory, DitModelFactory, MochiSingleGPUPipeline, T5ModelFactory

pipeline = MochiSingleGPUPipeline(
    text_encoder_factory=T5ModelFactory(),
    dit_factory=DitModelFactory(model_path='./weights/mochi-dit.safetensors'),
    decoder_factory=DecoderModelFactory(model_path='./weights/mochi-vae.safetensors'),
    cpu_offload=True,
    decode_type='tiled_full',
)

video = pipeline(
    height=480, width=848,
    num_frames=163,
    num_inference_steps=64,
    sigma_schedule_type='linear_quadratic',
    cfg_schedule_type='linear',
    conditioning_args={'prompt': 'A dolphin leaping through the waves at sunset'},
)
"
```

**Best for**: smooth motion, realistic physics, research use

***

## LTX-Video: Deep Dive

### Architecture

* **2B-parameter** DiT: smaller and faster
* Native **long-video** support (up to 2 minutes)
* Designed for real-time or near-real-time generation

### Deploying on Clore.ai

```python
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
import torch

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

video = pipe(
    prompt="A butterfly landing on a flower in a summer garden",
    negative_prompt="worst quality, inconsistent motion, blurry",
    width=704,
    height=480,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]

export_to_video(video, "ltx_video.mp4", fps=24)
```
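
The frame counts used in this guide all have the form stride × k + 1 (for example 81 = 4 × 20 + 1 and 161 = 8 × 20 + 1). Assuming that pattern is required by each model's temporal compression, which this guide does not state, the sketch below picks the closest valid frame count for a target duration. `nearest_frame_count` and the stride values are illustrative assumptions.

```python
def nearest_frame_count(seconds: float, fps: int, stride: int) -> int:
    """Return the frame count of the form stride * k + 1 closest to seconds * fps."""
    target = seconds * fps
    k = max(1, round((target - 1) / stride))
    return stride * k + 1


# About 5 seconds at 16 fps with an assumed stride of 4.
print(nearest_frame_count(5, fps=16, stride=4))
# About 5 seconds at 24 fps with an assumed stride of 8.
print(nearest_frame_count(5, fps=24, stride=8))
```

Check each model's documentation for the stride it actually requires before relying on these values.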

**Best for**: fast generation, interactive applications, long videos, limited VRAM (8GB)

***

## Feature Comparison

### Capability Overview

| Feature            | HunyuanVideo | Wan2.1 | CogVideoX | Mochi 1 | LTX-Video |
| ------------------ | ------------ | ------ | --------- | ------- | --------- |
| Text-to-video      | ✅            | ✅      | ✅         | ✅       | ✅         |
| Image-to-video     | ✅            | ✅      | ✅         | ❌       | ✅         |
| Video-to-video     | ❌            | ❌      | ✅         | ❌       | ✅         |
| ControlNet         | Partial      | ❌      | ✅         | ❌       | ❌         |
| LoRA support       | ✅            | ✅      | ✅         | ❌       | ✅         |
| ComfyUI nodes      | ✅            | ✅      | ✅         | ✅       | ✅         |
| Long video (>10 s) | ❌            | ❌      | Partial   | ❌       | ✅         |
| Chinese prompts    | ✅            | ✅      | ✅         | ❌       | ❌         |
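
When several capabilities are needed at once, the table can be queried as a set of features per model. The sketch below encodes only the fully supported entries (partial support is left out), and the feature keys and `models_with` helper are hypothetical names chosen for this example.

```python
# Fully supported features per model, copied from the table above.
# Entries listed as partial support in the table are deliberately omitted.
FEATURES = {
    "HunyuanVideo": {"t2v", "i2v", "lora", "comfyui", "chinese_prompts"},
    "Wan2.1": {"t2v", "i2v", "lora", "comfyui", "chinese_prompts"},
    "CogVideoX": {"t2v", "i2v", "v2v", "controlnet", "lora", "comfyui", "chinese_prompts"},
    "Mochi 1": {"t2v", "comfyui"},
    "LTX-Video": {"t2v", "i2v", "v2v", "lora", "comfyui", "long_video"},
}


def models_with(*required: str) -> list[str]:
    """Return the models that fully support every requested feature."""
    needed = set(required)
    return [model for model, features in FEATURES.items() if needed <= features]


print(models_with("i2v", "v2v"))
print(models_with("chinese_prompts", "lora"))
```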

***

## GPU Recommendations for Clore.ai

### Per Model

| Model        | Minimum GPU     | Recommended  | Ideal       |
| ------------ | --------------- | ------------ | ----------- |
| HunyuanVideo | 24GB GPU        | A6000 (48GB) | A100 (80GB) |
| Wan2.1 14B   | 24GB GPU        | A6000 (48GB) | A100 (80GB) |
| Wan2.1 1.3B  | RTX 3080 (10GB) | RTX 3090     | RTX 4090    |
| CogVideoX-5B | 24GB GPU        | A6000 (48GB) | A100        |
| CogVideoX-2B | RTX 3080 (10GB) | RTX 3090     | RTX 4090    |
| Mochi 1      | 24GB GPU        | A6000 (48GB) | A100        |
| LTX-Video    | RTX 3080 (10GB) | RTX 4080     | RTX 4090    |

### Cost Estimate per Video

```
HunyuanVideo (720p, 5 s) on an A100 80GB (~$1.50/hour):
  Time: ~45 minutes → cost: ~$1.12 per video

Wan2.1-1.3B (480p, 5 s) on an RTX 3090 (~$0.50/hour):
  Time: ~3 minutes → cost: ~$0.025 per video

LTX-Video (720p, 5 s) on an RTX 4090 (~$0.60/hour):
  Time: ~3 minutes → cost: ~$0.030 per video
```
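
The arithmetic behind these estimates is the hourly rate multiplied by the generation time in hours. A minimal sketch, using the approximate rates and times quoted above (`cost_per_video` is a hypothetical helper, and actual Clore.ai prices vary by listing):

```python
def cost_per_video(hourly_rate_usd: float, minutes: float) -> float:
    """Rental cost of one generation: hourly rate times run time in hours."""
    return hourly_rate_usd * minutes / 60


estimates = {
    "HunyuanVideo on A100 80GB": cost_per_video(1.50, 45),
    "Wan2.1-1.3B on RTX 3090": cost_per_video(0.50, 3),
    "LTX-Video on RTX 4090": cost_per_video(0.60, 3),
}
for setup, cost in estimates.items():
    print(f"{setup}: about ${cost:.3f} per video")
```

The same function can be applied to any row of the speed benchmarks to compare a cheaper, slower GPU against a faster, more expensive one.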

***

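## When to Use Which

### Decision Guide

```
Want the highest quality (cost no object)?
  → HunyuanVideo on an A100

Best quality/cost balance?
  → Wan2.1 14B on an A6000

Limited VRAM (8–12GB)?
  → LTX-Video or Wan2.1 1.3B

Need fast generation?
  → LTX-Video

Need image-to-video?
  → Wan2.1 I2V or CogVideoX

Need long videos (>10 seconds)?
  → LTX-Video

Research/fine-tuning?
  → Mochi 1 (open training code) or CogVideoX

ComfyUI workflows?
  → All are supported; HunyuanVideo and Wan have the best nodes
```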
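
The guide above can be approximated as a small function for scripted model selection. This is one possible encoding, not a definitive rule set: the order in which the questions are checked and the VRAM thresholds (80GB for an A100, 48GB for an A6000) are assumptions made for illustration, and `recommend` is a hypothetical helper.

```python
def recommend(vram_gb: int, need_speed: bool = False,
              need_long_video: bool = False, need_i2v: bool = False) -> str:
    """Map a few requirements to a model suggestion, loosely following the guide."""
    # Speed and long-video needs point to the same model in the guide.
    if need_speed or need_long_video:
        return "LTX-Video"
    if need_i2v:
        return "Wan2.1 I2V or CogVideoX"
    # Assumed thresholds: A100 = 80GB, A6000 = 48GB.
    if vram_gb >= 80:
        return "HunyuanVideo"
    if vram_gb >= 48:
        return "Wan2.1 14B"
    # Simplification: anything below 48GB is treated as VRAM-constrained.
    return "LTX-Video or Wan2.1 1.3B"


print(recommend(80))
print(recommend(48))
print(recommend(12))
```

A real selection script would probably also weigh budget and target resolution, which this sketch leaves out.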

***

## Useful Links

* [HunyuanVideo GitHub](https://github.com/Tencent/HunyuanVideo)
* [Wan2.1 on Hugging Face](https://huggingface.co/Wan-AI)
* [CogVideoX GitHub](https://github.com/THUDM/CogVideo)
* [Mochi 1 GitHub](https://github.com/genmoai/mochi)
* [LTX-Video GitHub](https://github.com/Lightricks/LTX-Video)
* [Video Generation Leaderboard](https://huggingface.co/spaces/ArtificialAnalysis/video-generation-arena-leaderboard)

***

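## Summary

| Model            | When to use it                                        |
| ---------------- | ----------------------------------------------------- |
| **HunyuanVideo** | When top quality matters most and an A100+ is available |
| **Wan2.1**       | Best balance of quality and efficiency                |
| **CogVideoX**    | Precise text-to-video, long-form narrative            |
| **Mochi 1**      | Smooth motion, realistic physics, open research       |
| **LTX-Video**    | Speed, low VRAM, long videos                          |

The open-source video generation ecosystem is evolving quickly. For most Clore.ai deployments, **Wan2.1** (1.3B for tight budgets, 14B for quality) offers the best combination of quality, speed, and resource efficiency.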

