# CogVideoX Video Generation

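CogVideoX is a family of open-weight video diffusion Transformer models from Zhipu AI (Tsinghua). The models generate coherent 6-second clips at 720×480 and 8 fps from a text prompt (T2V) or from a reference image plus a prompt (I2V). Two parameter sizes are available, 2B for fast iteration and 5B for higher fidelity, and both are natively supported in `diffusers` through `CogVideoXPipeline`.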

Running CogVideoX on a GPU rented from [Clore.ai](https://clore.ai/) lets you sidestep local hardware limits and generate video at scale at low cost.

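## Key Features

* **Text-to-video (T2V):** describe a scene and get a 6-second, 720×480 clip at 8 fps (49 frames).
* **Image-to-video (I2V):** supply a reference image and a prompt; the model animates the image with temporal consistency.
* **Two sizes:** CogVideoX-2B (fast, about 12 GB VRAM) and CogVideoX-5B (higher quality, about 20 GB VRAM).
* **Native diffusers support:** first-class `CogVideoXPipeline` and `CogVideoXImageToVideoPipeline` classes.
* **3D causal VAE:** compresses the 49 frames into a compact latent space for efficient denoising.
* **Open weights:** the 2B variant is licensed under Apache-2.0; the 5B variant uses a research license.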
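
To make the 3D causal VAE bullet more concrete, the sketch below works out the latent tensor shape for one clip. The compression factors (4× temporal, 8× spatial) and the 16 latent channels are assumptions taken from the published CogVideoX architecture rather than from this page, and `latent_shape` is a hypothetical helper, not part of `diffusers`.

```python
def latent_shape(num_frames=49, height=480, width=720,
                 temporal_factor=4, spatial_factor=8, latent_channels=16):
    """Return (channels, frames, height, width) of the latent video.

    All default factors are assumptions about the CogVideoX VAE.
    """
    # A causal VAE keeps the first frame on its own and compresses the
    # remaining frames in groups of `temporal_factor`.
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (latent_channels,
            latent_frames,
            height // spatial_factor,
            width // spatial_factor)


print(latent_shape())  # (16, 13, 60, 90)
```

Under these assumptions the 49 frames (6 s × 8 fps, plus the initial frame) shrink to 13 latent frames of 60×90, which is the tensor the Transformer actually denoises.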

## Requirements

| Component  | Minimum           | Recommended       |
| ---------- | ----------------- | ----------------- |
| GPU VRAM   | 16 GB (2B, fp16)  | 24 GB (5B, bf16)  |
| System RAM | 32 GB             | 64 GB             |
| Disk       | 30 GB             | 50 GB             |
| Python     | 3.10+             | 3.11              |
| CUDA       | 12.1+             | 12.4              |

**GPU recommendations on Clore.ai:** An **RTX 4090** (24 GB, about $0.5–2/day) handles both the 2B and 5B variants comfortably. An **RTX 3090** (24 GB, about $0.3–1/day) also runs 5B well in bf16 and is the budget pick.
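
As a rough illustration of the requirements table, the sketch below maps a card's nominal VRAM to a checkpoint and dtype. `pick_variant` is a hypothetical helper, and the 16 GB and 24 GB thresholds are simply the table's minimum and recommended values, not limits enforced by the library.

```python
def pick_variant(vram_gb):
    """Map nominal GPU memory (in GB) to a (model id, dtype name) pair.

    Thresholds follow the requirements table; treat them as guidance.
    """
    if vram_gb >= 24:
        return ("THUDM/CogVideoX-5b", "bfloat16")
    if vram_gb >= 16:
        return ("THUDM/CogVideoX-2b", "float16")
    raise ValueError(f"{vram_gb} GB is below the 16 GB minimum for CogVideoX-2B")


print(pick_variant(24))  # ('THUDM/CogVideoX-5b', 'bfloat16')
```

Pass the VRAM figure shown in the marketplace listing; the memory a driver reports at runtime is usually slightly below the nominal size, so round up before comparing.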

## Quick Start

```bash
# Install dependencies (quotes keep the shell from expanding the brackets)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece "imageio[ffmpeg]"

# Verify that the GPU is visible
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
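
Before downloading any weights, it can help to confirm that the installs above actually landed in the active environment. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the module names are assumed to match the package names from the `pip install` lines.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed in the quick start.
required = ["torch", "diffusers", "transformers", "accelerate",
            "sentencepiece", "imageio"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for large packages such as `torch`.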

## Usage Examples

### Text-to-Video (5B)

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
# No pipe.to("cuda") here: CPU offload moves each module to the GPU on demand.
pipe.enable_model_cpu_offload()      # saves roughly 4 GB of peak VRAM
pipe.vae.enable_tiling()             # needed for 720x480 decoding on a 24 GB card

prompt = (
    "A golden retriever running through a sunflower field at sunset, "
    "cinematic lighting, slow motion, 4K quality"
)

video_frames = pipe(
    prompt=prompt,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for repeatable output
).frames[0]

export_to_video(video_frames, "retriever_sunset.mp4", fps=8)
print("Saved retriever_sunset.mp4")
```

### Image-to-Video (5B)

```python
import torch
from PIL import Image
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16,
)
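# No pipe.to("cuda") here: CPU offload moves each module to the GPU on demand.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Resize the reference image to the model's native 720x480 resolution
image = Image.open("reference.png").resize((720, 480))

video_frames = pipe(
    prompt="The camera slowly orbits the subject while a gentle breeze blows",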
    image=image,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]

export_to_video(video_frames, "animated.mp4", fps=8)
```

### Fast Generation with the 2B Variant

```python
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
pipe.vae.enable_tiling()

frames = pipe(
    prompt="Time-lapse of a cherry tree coming into bloom",
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=30,       # fewer steps → faster
).frames[0]

export_to_video(frames, "cherry_blossom.mp4", fps=8)
```

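## Tips for Clore.ai Users

1. **Enable VAE tiling.** Without `pipe.vae.enable_tiling()`, the 3D VAE runs out of memory (OOM) during decoding on a 24 GB card.
2. **Use `enable_model_cpu_offload()`.** It automatically moves idle modules to system RAM, adding about 10% to total runtime while saving more than 4 GB of peak VRAM.
3. **Use bf16 for 5B and fp16 for 2B.** The 5B checkpoint was trained in bf16; running it in fp16 can produce NaN outputs.
4. **Persist the models.** Mount a Clore.ai persistent volume at `/models` and set `HF_HOME=/models/hf` so the weights survive container restarts.
5. **Batch overnight.** Queue a long list of prompts with a simple Python loop; Clore.ai bills by the hour, so keep the GPU saturated.
6. **SSH + tmux.** Run generation inside `tmux` so a dropped connection does not kill the process.
7. **Pick the right GPU.** Filter the Clore.ai marketplace for cards with ≥24 GB VRAM, then sort by price to find the cheapest RTX 3090 / 4090.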
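
Tips 4 and 5 can be combined into a small resumable batch loop. The sketch below is one possible shape, not a prescribed workflow: `run_batch` and the `generate(prompt, path)` callable are hypothetical names, and `/models` is assumed to be the persistent volume's mount point.

```python
import os
from pathlib import Path

# Assumption: the persistent volume is mounted at /models (tip 4).
# setdefault keeps an HF_HOME that is already set; run this before importing diffusers.
os.environ.setdefault("HF_HOME", "/models/hf")


def run_batch(prompts, generate, out_dir="outputs"):
    """Render each prompt to its own file, skipping clips that already exist.

    `generate(prompt, path)` is a caller-supplied function, for example one that
    calls the pipeline and then export_to_video(frames, str(path), fps=8).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, prompt in enumerate(prompts):
        target = out / f"clip_{index:04d}.mp4"
        if target.exists():
            continue  # already rendered in an earlier run
        generate(prompt, target)
        written.append(target)
    return written
```

Because finished clips are skipped, the same script can be restarted inside `tmux` after an interruption without paying again for work that is already done.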

## Troubleshooting

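| Problem                              | Fix                                                                                                             |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` during VAE decode | Call `pipe.vae.enable_tiling()` before inference                                                                |
| NaN / black frames with 5B           | Switch to `torch.bfloat16`; the 5B variant does not support fp16                                                |
| `ImportError: imageio`               | `pip install "imageio[ffmpeg]"`; MP4 export needs the ffmpeg plugin                                             |
| First run is very slow               | The model download is about 20 GB; later runs use the cached weights                                            |
| CUDA version mismatch                | Make sure PyTorch's CUDA version matches the driver: `python -c "import torch; print(torch.version.cuda)"`      |
| Chaotic motion / flicker             | Raise `num_inference_steps` to 50; lower `guidance_scale` to 5.0                                                |
| Container killed mid-download        | Point `HF_HOME` at a persistent volume and restart; partial downloads resume automatically                      |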
