# CogVideoX Video Generation

CogVideoX is a family of open-weight video diffusion transformers from Zhipu AI (Tsinghua). The models generate coherent 6-second clips at 720×480 resolution and 8 fps from either a text prompt (T2V) or a reference image plus prompt (I2V). Two parameter scales are available — 2B for fast iteration and 5B for higher fidelity — both with native `diffusers` integration through `CogVideoXPipeline`.

Running CogVideoX on a rented GPU from [Clore.ai](https://clore.ai/) lets you skip local hardware constraints and generate video at scale for pennies per clip.

## Key Features

* **Text-to-Video (T2V)** — describe a scene and get a 6-second 720×480 clip at 8 fps (49 frames).
* **Image-to-Video (I2V)** — supply a reference image plus prompt; the model animates it with temporal consistency.
* **Two scales** — CogVideoX-2B (fast, \~12 GB VRAM) and CogVideoX-5B (higher quality, \~20 GB VRAM).
* **Native diffusers support** — first-class `CogVideoXPipeline` and `CogVideoXImageToVideoPipeline` classes.
* **3D causal VAE** — compresses 49 frames into a compact latent space for efficient denoising.
* **Open weights** — Apache-2.0 license for the 2B variant; research license for 5B.
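The 3D causal VAE bullet can be made concrete with a quick shape calculation. The 4× temporal / 8× spatial compression ratios and 16 latent channels below are the commonly published figures for CogVideoX; the helper name is ours, not part of any API:

```python
def latent_shape(frames=49, height=480, width=720,
                 t_ratio=4, s_ratio=8, channels=16):
    """Approximate latent grid of a causal video VAE: the first
    frame is kept as-is, the rest are compressed t_ratio-to-1."""
    t = (frames - 1) // t_ratio + 1
    return (channels, t, height // s_ratio, width // s_ratio)

# 49 RGB frames at 720x480 denoise as a much smaller 4D latent
print(latent_shape())  # (16, 13, 60, 90)
```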

## Requirements

| Component  | Minimum          | Recommended      |
| ---------- | ---------------- | ---------------- |
| GPU VRAM   | 16 GB (2B, fp16) | 24 GB (5B, bf16) |
| System RAM | 32 GB            | 64 GB            |
| Disk       | 30 GB            | 50 GB            |
| Python     | 3.10+            | 3.11             |
| CUDA       | 12.1+            | 12.4             |

**Clore.ai GPU recommendation:** An **RTX 4090** (24 GB, \~$0.5–2/day) handles both the 2B and 5B variants comfortably. An **RTX 3090** (24 GB, \~$0.3–1/day) works equally well for 5B at bf16 and is the budget pick.
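A rough back-of-envelope explains the VRAM rows: the transformer weights alone for 5B in bf16 come to about 9.3 GiB, with activations, the text encoder, and VAE decode consuming the rest of the budget (the helper is ours, for illustration only):

```python
def weight_gib(params: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in GiB. Peak usage adds
    activations, the text encoder, and VAE decode on top."""
    return params * bytes_per_param / 2**30

# 5B parameters in bf16 (2 bytes each) vs 2B in fp16
print(round(weight_gib(5e9, 2), 1))  # 9.3
print(round(weight_gib(2e9, 2), 1))  # 3.7
```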

## Quick Start

```bash
# Create environment
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece imageio[ffmpeg]

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

## Usage Examples

### Text-to-Video (5B)

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()      # saves ~4 GB peak VRAM; manages device placement itself, so no pipe.to("cuda")
pipe.vae.enable_tiling()             # required to decode 720x480 on 24 GB cards

prompt = (
    "A golden retriever running through a sunflower field at sunset, "
    "cinematic lighting, slow motion, 4K quality"
)

video_frames = pipe(
    prompt=prompt,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(42),
).frames[0]

export_to_video(video_frames, "retriever_sunset.mp4", fps=8)
print("Saved retriever_sunset.mp4")
```

### Image-to-Video (5B)

```python
import torch
from PIL import Image
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()      # handles device placement, so no pipe.to("cuda")
pipe.vae.enable_tiling()

image = Image.open("reference.png").resize((720, 480))

video_frames = pipe(
    prompt="The camera slowly orbits around the subject, gentle wind",
    image=image,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]

export_to_video(video_frames, "animated.mp4", fps=8)
```
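The plain `resize()` in the example distorts reference images that aren't already 3:2. A center-crop to the target aspect ratio first keeps proportions; this pure-arithmetic helper (the name is ours) computes the crop box to pass to Pillow:

```python
def center_crop_box(w: int, h: int, target_w: int = 720, target_h: int = 480):
    """Crop box (left, top, right, bottom) that trims a w x h image
    to the target aspect ratio around the center, before resizing."""
    if w * target_h > h * target_w:           # too wide: trim width
        new_w = h * target_w // target_h
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    new_h = w * target_h // target_w          # too tall: trim height
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)

# With Pillow:
#   img = Image.open("reference.png")
#   img = img.crop(center_crop_box(*img.size)).resize((720, 480))
print(center_crop_box(1920, 1080))  # (150, 0, 1770, 1080)
```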

### Fast Generation with the 2B Variant

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
pipe.vae.enable_tiling()

frames = pipe(
    prompt="Timelapse of a blooming cherry blossom tree",
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=30,       # fewer steps → faster, at some quality cost
).frames[0]

export_to_video(frames, "cherry_blossom.mp4", fps=8)
```

## Tips for Clore.ai Users

1. **Enable VAE tiling** — without `pipe.vae.enable_tiling()` the 3D VAE will OOM on 24 GB cards during decode.
2. **Use `enable_model_cpu_offload()`** — shifts idle modules to RAM automatically; adds \~10 % wall-time but saves 4+ GB peak VRAM.
3. **bf16 for 5B, fp16 for 2B** — the 5B checkpoint was trained in bf16; using fp16 can cause NaN outputs.
4. **Persist models** — mount a Clore.ai persistent volume to `/models` and set `HF_HOME=/models/hf` so weights survive container restarts.
5. **Batch overnight** — queue long prompt lists with a simple Python loop; Clore.ai billing is per-hour, so saturate the GPU.
6. **SSH + tmux** — run generation inside `tmux` so a dropped connection doesn't kill the process.
7. **Select the right GPU** — filter Clore.ai marketplace for ≥24 GB VRAM cards; sort by price to find the cheapest RTX 3090 / 4090 available.
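The "simple Python loop" from tip 5 can be sketched as a resumable queue; `slug` and `pending` are our helper names, and the generation step is the same `pipe(...)` pattern shown in the examples above:

```python
import os
import re

def slug(prompt: str, max_len: int = 48) -> str:
    """Turn a prompt into a filesystem-safe output name."""
    s = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return s[:max_len].rstrip("-")

def pending(prompts, out_dir="out"):
    """(prompt, path) pairs not yet rendered, so a restarted
    job resumes instead of regenerating finished clips."""
    os.makedirs(out_dir, exist_ok=True)
    jobs = []
    for p in prompts:
        path = os.path.join(out_dir, slug(p) + ".mp4")
        if not os.path.exists(path):
            jobs.append((p, path))
    return jobs

# Inside tmux, drain the queue with the pipeline from above:
#   for prompt, path in pending(open("prompts.txt").read().splitlines()):
#       frames = pipe(prompt=prompt, num_frames=49,
#                     guidance_scale=6.0, num_inference_steps=50).frames[0]
#       export_to_video(frames, path, fps=8)
```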

## Troubleshooting

| Problem                              | Fix                                                                                                   |
| ------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` during VAE decode | Call `pipe.vae.enable_tiling()` before inference                                                      |
| NaN / black frames with 5B           | Switch to `torch.bfloat16`; fp16 is not supported for the 5B variant                                  |
| `ImportError: imageio`               | `pip install imageio[ffmpeg]` — the ffmpeg plugin is needed for MP4 export                            |
| Very slow first run                  | Model download is \~20 GB; subsequent runs use the cached weights                                     |
| CUDA version mismatch                | Ensure PyTorch CUDA version matches the driver: `python -c "import torch; print(torch.version.cuda)"` |
| Garbled motion / flickering          | Increase `num_inference_steps` to 50; lower `guidance_scale` to 5.0                                   |
| Container killed mid-download        | Set `HF_HOME` to a persistent volume and restart — partial downloads resume automatically             |
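Tips 4 and 6 combine into a short session setup. `/models` is wherever your persistent volume is mounted, and `batch_generate.py` stands in for your own generation script:

```shell
# Point the Hugging Face cache at the persistent volume (tip 4)
export HF_HOME=/models/hf
mkdir -p "$HF_HOME"

# Run generation inside tmux so an SSH drop doesn't kill it (tip 6)
tmux new -s cogvideo
python batch_generate.py   # your prompt-loop script
```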
