# CogVideoX Video Generation

CogVideoX is a family of open-weight video diffusion transformers from Zhipu AI (Tsinghua). The models generate coherent 6-second clips at 720×480 resolution and 8 fps from either a text prompt (T2V) or a reference image plus prompt (I2V). Two parameter scales are available — 2B for fast iteration and 5B for higher fidelity — both with native `diffusers` integration through `CogVideoXPipeline`.

Running CogVideoX on a rented GPU from [Clore.ai](https://clore.ai/) lets you skip local hardware constraints and generate video at scale for pennies per clip.

## Key Features

* **Text-to-Video (T2V)** — describe a scene and get a 6-second 720×480 clip at 8 fps (49 frames).
* **Image-to-Video (I2V)** — supply a reference image plus prompt; the model animates it with temporal consistency.
* **Two scales** — CogVideoX-2B (fast, \~12 GB VRAM) and CogVideoX-5B (higher quality, \~20 GB VRAM).
* **Native diffusers support** — first-class `CogVideoXPipeline` and `CogVideoXImageToVideoPipeline` classes.
* **3D causal VAE** — compresses 49 frames into a compact latent space for efficient denoising.
* **Open weights** — Apache-2.0 license for the 2B variant; research license for 5B.
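The 3D causal VAE bullet can be made concrete with a quick shape calculation. The 4× temporal / 8× spatial compression ratios and 16 latent channels below are the commonly published figures for CogVideoX; the helper name is ours, not part of any API:

```python
def latent_shape(frames=49, height=480, width=720,
                 t_ratio=4, s_ratio=8, channels=16):
    """Approximate latent grid of a causal video VAE: the first
    frame is kept as-is, the rest are compressed t_ratio-to-1."""
    t = (frames - 1) // t_ratio + 1
    return (channels, t, height // s_ratio, width // s_ratio)

# 49 RGB frames at 720x480 denoise as a much smaller 4D latent
print(latent_shape())  # (16, 13, 60, 90)
```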

## Requirements

| Component  | Minimum          | Recommended      |
| ---------- | ---------------- | ---------------- |
| GPU VRAM   | 16 GB (2B, fp16) | 24 GB (5B, bf16) |
| System RAM | 32 GB            | 64 GB            |
| Disk       | 30 GB            | 50 GB            |
| Python     | 3.10+            | 3.11             |
| CUDA       | 12.1+            | 12.4             |

**Clore.ai GPU recommendation:** An **RTX 4090** (24 GB, \~$0.5–2/day) handles both the 2B and 5B variants comfortably. An **RTX 3090** (24 GB, \~$0.3–1/day) works equally well for 5B at bf16 and is the budget pick.
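A rough back-of-envelope explains the VRAM rows: the transformer weights alone for 5B in bf16 come to about 9.3 GiB, with activations, the text encoder, and VAE decode consuming the rest of the budget (the helper is ours, for illustration only):

```python
def weight_gib(params: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in GiB. Peak usage adds
    activations, the text encoder, and VAE decode on top."""
    return params * bytes_per_param / 2**30

# 5B parameters in bf16 (2 bytes each) vs 2B in fp16
print(round(weight_gib(5e9, 2), 1))  # 9.3
print(round(weight_gib(2e9, 2), 1))  # 3.7
```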

## Quick Start

```bash
# Create environment
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece imageio[ffmpeg]

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

## Usage Examples

### Text-to-Video (5B)

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()      # saves ~4 GB peak VRAM; manages device placement itself, so no pipe.to("cuda")
pipe.vae.enable_tiling()             # required to decode 720x480 on 24 GB cards

prompt = (
    "A golden retriever running through a sunflower field at sunset, "
    "cinematic lighting, slow motion, 4K quality"
)

video_frames = pipe(
    prompt=prompt,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(42),
).frames[0]

export_to_video(video_frames, "retriever_sunset.mp4", fps=8)
print("Saved retriever_sunset.mp4")
```

### Image-to-Video (5B)

```python
import torch
from PIL import Image
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()      # handles device placement, so no pipe.to("cuda")
pipe.vae.enable_tiling()

image = Image.open("reference.png").resize((720, 480))

video_frames = pipe(
    prompt="The camera slowly orbits around the subject, gentle wind",
    image=image,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]

export_to_video(video_frames, "animated.mp4", fps=8)
```
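The plain `resize()` in the example distorts reference images that aren't already 3:2. A center-crop to the target aspect ratio first keeps proportions; this pure-arithmetic helper (the name is ours) computes the crop box to pass to Pillow:

```python
def center_crop_box(w: int, h: int, target_w: int = 720, target_h: int = 480):
    """Crop box (left, top, right, bottom) that trims a w x h image
    to the target aspect ratio around the center, before resizing."""
    if w * target_h > h * target_w:           # too wide: trim width
        new_w = h * target_w // target_h
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    new_h = w * target_h // target_w          # too tall: trim height
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)

# With Pillow:
#   img = Image.open("reference.png")
#   img = img.crop(center_crop_box(*img.size)).resize((720, 480))
print(center_crop_box(1920, 1080))  # (150, 0, 1770, 1080)
```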

### Fast Generation with the 2B Variant

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
pipe.vae.enable_tiling()

frames = pipe(
    prompt="Timelapse of a blooming cherry blossom tree",
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=30,       # fewer steps → faster, at some quality cost
).frames[0]

export_to_video(frames, "cherry_blossom.mp4", fps=8)
```

## Tips for Clore.ai Users

1. **Enable VAE tiling** — without `pipe.vae.enable_tiling()` the 3D VAE will OOM on 24 GB cards during decode.
2. **Use `enable_model_cpu_offload()`** — shifts idle modules to RAM automatically; adds \~10 % wall-time but saves 4+ GB peak VRAM.
3. **bf16 for 5B, fp16 for 2B** — the 5B checkpoint was trained in bf16; using fp16 can cause NaN outputs.
4. **Persist models** — mount a Clore.ai persistent volume to `/models` and set `HF_HOME=/models/hf` so weights survive container restarts.
5. **Batch overnight** — queue long prompt lists with a simple Python loop; Clore.ai billing is per-hour, so saturate the GPU.
6. **SSH + tmux** — run generation inside `tmux` so a dropped connection doesn't kill the process.
7. **Select the right GPU** — filter Clore.ai marketplace for ≥24 GB VRAM cards; sort by price to find the cheapest RTX 3090 / 4090 available.
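The "simple Python loop" from tip 5 can be sketched as a resumable queue; `slug` and `pending` are our helper names, and the generation step is the same `pipe(...)` pattern shown in the examples above:

```python
import os
import re

def slug(prompt: str, max_len: int = 48) -> str:
    """Turn a prompt into a filesystem-safe output name."""
    s = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return s[:max_len].rstrip("-")

def pending(prompts, out_dir="out"):
    """(prompt, path) pairs not yet rendered, so a restarted
    job resumes instead of regenerating finished clips."""
    os.makedirs(out_dir, exist_ok=True)
    jobs = []
    for p in prompts:
        path = os.path.join(out_dir, slug(p) + ".mp4")
        if not os.path.exists(path):
            jobs.append((p, path))
    return jobs

# Inside tmux, drain the queue with the pipeline from above:
#   for prompt, path in pending(open("prompts.txt").read().splitlines()):
#       frames = pipe(prompt=prompt, num_frames=49,
#                     guidance_scale=6.0, num_inference_steps=50).frames[0]
#       export_to_video(frames, path, fps=8)
```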

## Troubleshooting

| Problem                              | Fix                                                                                                   |
| ------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` during VAE decode | Call `pipe.vae.enable_tiling()` before inference                                                      |
| NaN / black frames with 5B           | Switch to `torch.bfloat16`; fp16 is not supported for the 5B variant                                  |
| `ImportError: imageio`               | `pip install imageio[ffmpeg]` — the ffmpeg plugin is needed for MP4 export                            |
| Very slow first run                  | Model download is \~20 GB; subsequent runs use the cached weights                                     |
| CUDA version mismatch                | Ensure PyTorch CUDA version matches the driver: `python -c "import torch; print(torch.version.cuda)"` |
| Garbled motion / flickering          | Increase `num_inference_steps` to 50; lower `guidance_scale` to 5.0                                   |
| Container killed mid-download        | Set `HF_HOME` to a persistent volume and restart — partial downloads resume automatically             |
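Tips 4 and 6 combine into a short session setup. `/models` is wherever your persistent volume is mounted, and `batch_generate.py` stands in for your own generation script:

```shell
# Point the Hugging Face cache at the persistent volume (tip 4)
export HF_HOME=/models/hf
mkdir -p "$HF_HOME"

# Run generation inside tmux so an SSH drop doesn't kill it (tip 6)
tmux new -s cogvideo
python batch_generate.py   # your prompt-loop script
```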
