CogVideoX Video Generation

Generate 6-second videos from text or images with Zhipu AI's CogVideoX diffusion transformer on Clore.ai GPUs.

CogVideoX is a family of open-weight video diffusion transformers from Zhipu AI (Tsinghua). The models generate coherent 6-second clips at 720×480 resolution and 8 fps from either a text prompt (T2V) or a reference image plus prompt (I2V). Two parameter scales are available — 2B for fast iteration and 5B for higher fidelity — both with native diffusers integration through CogVideoXPipeline.

Running CogVideoX on a rented GPU from Clore.ai lets you skip local hardware constraints and generate video at scale for pennies per clip.

Key Features

  • Text-to-Video (T2V) — describe a scene and get a 6-second 720×480 clip at 8 fps (49 frames).

  • Image-to-Video (I2V) — supply a reference image plus prompt; the model animates it with temporal consistency.

  • Two scales — CogVideoX-2B (fast, ~12 GB VRAM) and CogVideoX-5B (higher quality, ~20 GB VRAM).

  • Native diffusers support — first-class CogVideoXPipeline and CogVideoXImageToVideoPipeline classes.

  • 3D causal VAE — compresses 49 frames into a compact latent space for efficient denoising.

  • Open weights — Apache-2.0 license for the 2B variant; research license for 5B.

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 16 GB (2B, fp16) | 24 GB (5B, bf16) |
| System RAM | 32 GB | 64 GB |
| Disk | 30 GB | 50 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 12.1+ | 12.4 |

Clore.ai GPU recommendation: An RTX 4090 (24 GB, ~$0.5–2/day) handles both the 2B and 5B variants comfortably. An RTX 3090 (24 GB, ~$0.3–1/day) works equally well for 5B at bf16 and is the budget pick.

Quick Start
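A minimal environment setup for a fresh Clore.ai instance might look like the following. This is a sketch: package pins are omitted, the `cu121` index matches the CUDA 12.1 minimum above, and the `/models/hf` cache path assumes you mounted a persistent volume at `/models` as described in the tips below.

```shell
# Install PyTorch with CUDA 12.1 wheels, then the diffusers stack
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate sentencepiece "imageio[ffmpeg]"

# Point the Hugging Face cache at persistent storage so the ~20 GB of
# weights survive container restarts
export HF_HOME=/models/hf
```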

Usage Examples

Text-to-Video (5B)
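A minimal T2V sketch using the `THUDM/CogVideoX-5b` checkpoint via `CogVideoXPipeline`. The prompt and seed are illustrative; the bf16 dtype, VAE tiling, and CPU offload follow the tips in this guide.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B checkpoint in bf16 — the precision it was trained in
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # park idle modules in RAM, saving peak VRAM
pipe.vae.enable_tiling()         # tile the 3D VAE decode to avoid OOM on 24 GB cards

video = pipe(
    prompt="A golden retriever puppy runs through a sunflower field at sunset",
    num_frames=49,               # 6 seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

On an RTX 4090 expect several minutes per clip at 50 steps; the first run also pays the one-time model download.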

Image-to-Video (5B)
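I2V uses the separate `THUDM/CogVideoX-5b-I2V` checkpoint through `CogVideoXImageToVideoPipeline`. A sketch, assuming a local `reference.jpg` as the conditioning image:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# The reference image anchors the first frame; the prompt describes the motion
image = load_image("reference.jpg")
video = pipe(
    prompt="The sailboat drifts slowly as clouds roll across the sky",
    image=image,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "animated.mp4", fps=8)
```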

Fast Generation with the 2B Variant
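For quick prompt iteration, the 2B checkpoint in fp16 with fewer denoising steps trades quality for speed. Values here are illustrative:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# The 2B checkpoint runs in fp16 and fits in ~12 GB VRAM
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt="Raindrops hitting a window in slow motion, city lights blurred behind",
    num_frames=49,
    num_inference_steps=30,  # fewer steps for faster drafts; raise to 50 for final renders
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "draft.mp4", fps=8)
```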

Tips for Clore.ai Users

  1. Enable VAE tiling — without pipe.vae.enable_tiling() the 3D VAE will OOM on 24 GB cards during decode.

  2. Use enable_model_cpu_offload() — shifts idle modules to RAM automatically; adds ~10 % wall-time but saves 4+ GB peak VRAM.

  3. bf16 for 5B, fp16 for 2B — the 5B checkpoint was trained in bf16; using fp16 can cause NaN outputs.

  4. Persist models — mount a Clore.ai persistent volume to /models and set HF_HOME=/models/hf so weights survive container restarts.

  5. Batch overnight — queue long prompt lists with a simple Python loop; Clore.ai billing is per-hour, so saturate the GPU.

  6. SSH + tmux — run generation inside tmux so a dropped connection doesn't kill the process.

  7. Select the right GPU — filter Clore.ai marketplace for ≥24 GB VRAM cards; sort by price to find the cheapest RTX 3090 / 4090 available.
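The overnight batching from tip 5 can be sketched as a plain loop over prompts. The `slugify` helper and `/outputs` path are illustrative, and the pipeline calls are commented out so the skeleton stands alone; substitute the `pipe` built in the examples above.

```python
import re

def slugify(prompt: str, max_len: int = 40) -> str:
    """Turn a prompt into a filesystem-safe output filename stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return slug[:max_len]

prompts = [
    "A lighthouse in a storm, waves crashing",
    "Timelapse of a city skyline from day to night",
]

for prompt in prompts:
    out_path = f"/outputs/{slugify(prompt)}.mp4"
    # video = pipe(prompt=prompt, num_frames=49, num_inference_steps=50).frames[0]
    # export_to_video(video, out_path, fps=8)
    print(out_path)
```

Run it inside tmux (tip 6) so the queue keeps the hourly-billed GPU saturated even if your SSH session drops.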

Troubleshooting

| Problem | Fix |
| --- | --- |
| `OutOfMemoryError` during VAE decode | Call `pipe.vae.enable_tiling()` before inference |
| NaN / black frames with 5B | Switch to `torch.bfloat16`; fp16 is not supported for the 5B variant |
| `ImportError: imageio` | `pip install imageio[ffmpeg]` — the ffmpeg plugin is needed for MP4 export |
| Very slow first run | Model download is ~20 GB; subsequent runs use the cached weights |
| CUDA version mismatch | Ensure the PyTorch CUDA version matches the driver: `python -c "import torch; print(torch.version.cuda)"` |
| Garbled motion / flickering | Increase `num_inference_steps` to 50; lower `guidance_scale` to 5.0 |
| Container killed mid-download | Set `HF_HOME` to a persistent volume and restart — partial downloads resume automatically |
