# Wan2.1 Video

Generate high-quality videos with Alibaba's Wan2.1 text-to-video and image-to-video models on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Wan2.1?

* **High quality** - State-of-the-art video generation
* **Multiple modes** - Text-to-video, image-to-video
* **Various sizes** - 1.3B to 14B parameters
* **Long videos** - Up to 81 frames (\~5 seconds at 16 fps)
* **Open weights** - Apache 2.0 license

## Model Variants

| Model           | Parameters | VRAM | Resolution | Frames |
| --------------- | ---------- | ---- | ---------- | ------ |
| Wan2.1-T2V-1.3B | 1.3B       | 8GB  | 480p       | 81     |
| Wan2.1-T2V-14B  | 14B        | 24GB | 720p       | 81     |
| Wan2.1-I2V-14B  | 14B        | 24GB | 720p       | 81     |
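The table above maps directly to a VRAM budget. A minimal sketch of picking a variant at runtime (the thresholds come from the table; the `-Diffusers` repo suffix is what the `diffusers` examples in this guide load):

```python
def pick_wan_model(vram_gb: float, i2v: bool = False) -> str:
    """Pick a Wan2.1 checkpoint that fits the given VRAM budget (per the variants table)."""
    if i2v:
        if vram_gb < 24:
            raise ValueError("Wan2.1 I2V needs at least 24GB VRAM")
        return "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
    if vram_gb >= 24:
        return "Wan-AI/Wan2.1-T2V-14B-Diffusers"
    if vram_gb >= 8:
        return "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
    raise ValueError("Wan2.1 T2V needs at least 8GB VRAM")

print(pick_wan_model(24))  # the 14B T2V model fits on a 24GB card
```

Pair this with `torch.cuda.get_device_properties(0).total_memory` to choose automatically on whatever server you rented.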

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install diffusers transformers accelerate gradio imageio imageio-ffmpeg && \
python -c "
import gradio as gr
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained('Wan-AI/Wan2.1-T2V-1.3B-Diffusers', torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # manages device placement itself; do not also call pipe.to('cuda')

def generate(prompt, steps, frames, seed):
    generator = torch.Generator('cuda').manual_seed(int(seed)) if seed >= 0 else None
    output = pipe(prompt, num_frames=frames, num_inference_steps=steps, generator=generator)
    export_to_video(output.frames[0], 'output.mp4', fps=16)
    return 'output.mp4'

gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label='Prompt'),
        gr.Slider(20, 100, value=50, label='Steps'),
        gr.Slider(17, 81, value=49, step=8, label='Frames'),
        gr.Number(value=-1, label='Seed')
    ],
    outputs=gr.Video(),
    title='Wan2.1 - Text to Video'
).launch(server_name='0.0.0.0', server_port=7860)
"
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
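The same endpoint can also be called programmatically instead of through the browser. A sketch using `gradio_client` (`pip install gradio_client`); the `abc123.clorecloud.net` host is a placeholder, and the input order assumes the Quick Deploy interface above:

```python
def service_url(http_pub_host: str) -> str:
    """Build the public HTTPS URL for a deployed CLORE.AI service."""
    return f"https://{http_pub_host}"

def generate_remote(http_pub_host: str, prompt: str, steps: int = 50,
                    frames: int = 49, seed: int = -1) -> str:
    """Call the Quick Deploy Gradio app remotely; returns the path of the downloaded video."""
    from gradio_client import Client  # lazy import: only needed for remote calls
    client = Client(service_url(http_pub_host))
    # Input order matches the Quick Deploy interface: prompt, steps, frames, seed
    return client.predict(prompt, steps, frames, seed)

print(service_url("abc123.clorecloud.net"))  # https://abc123.clorecloud.net
```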

## Hardware Requirements

| Model    | Minimum GPU   | Recommended   | Optimal   |
| -------- | ------------- | ------------- | --------- |
| 1.3B T2V | RTX 3070 8GB  | RTX 3090 24GB | RTX 4090  |
| 14B T2V  | RTX 4090 24GB | A100 40GB     | A100 80GB |
| 14B I2V  | RTX 4090 24GB | A100 40GB     | A100 80GB |

## Installation

```bash
pip install diffusers transformers accelerate torch
```

## Text-to-Video

### Basic Usage (1.3B)

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.float16
)
# enable_model_cpu_offload() manages device placement itself —
# do not combine it with pipe.to("cuda")
pipe.enable_model_cpu_offload()

prompt = "A cat playing with a ball in a sunny garden"

output = pipe(
    prompt=prompt,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=7.0
)

export_to_video(output.frames[0], "cat_video.mp4", fps=16)
```

### High Quality (14B)

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # manages device placement; do not also call pipe.to("cuda")
pipe.enable_vae_tiling()

prompt = "Cinematic shot of a dragon flying over mountains at sunset, 4K, detailed"

output = pipe(
    prompt=prompt,
    negative_prompt="blurry, low quality, distorted",
    num_frames=81,
    height=720,
    width=1280,
    num_inference_steps=50,
    guidance_scale=7.0
)

export_to_video(output.frames[0], "dragon.mp4", fps=24)
```

## Image-to-Video

### Animate an Image

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # manages device placement; do not also call pipe.to("cuda")

# Load input image
image = load_image("input.jpg")

prompt = "The person in the image starts walking forward"

output = pipe(
    prompt=prompt,
    image=image,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=7.0
)

export_to_video(output.frames[0], "animated.mp4", fps=16)
```

## Image-to-Video with Wan2.1-I2V-14B

{% hint style="info" %}
Wan2.1-I2V-14B animates a static image using a text prompt to guide the motion. Requires **24GB VRAM** (RTX 4090 or A100 40GB recommended).
{% endhint %}

### Model Details

| Property       | Value                             |
| -------------- | --------------------------------- |
| Model ID       | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` |
| Parameters     | 14 Billion                        |
| VRAM Required  | **24GB**                          |
| Max Resolution | 480p (854×480) or 720p (1280×720) |
| Max Frames     | 81                                |
| License        | Apache 2.0                        |

### Hardware Requirements

| GPU       | VRAM | Status         |
| --------- | ---- | -------------- |
| RTX 4090  | 24GB | ✅ Recommended  |
| RTX 3090  | 24GB | ✅ Supported    |
| A100 40GB | 40GB | ✅ Optimal      |
| A100 80GB | 80GB | ✅ Best quality |
| RTX 3080  | 10GB | ❌ Insufficient |

### Quick CLI Script

Save as `generate_i2v.py` and run:

```bash
python generate_i2v.py --model Wan-AI/Wan2.1-I2V-14B-480P-Diffusers --image input.jpg --prompt "camera slowly zooms out"
```

### generate\_i2v.py — Full Script

```python
#!/usr/bin/env python3
"""
Wan2.1 Image-to-Video CLI script.
Usage: python generate_i2v.py --model Wan-AI/Wan2.1-I2V-14B-480P-Diffusers \
           --image input.jpg --prompt "camera slowly zooms out"
"""

import argparse
import os
import sys
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import load_image, export_to_video
from PIL import Image


def parse_args():
    parser = argparse.ArgumentParser(description="Wan2.1 Image-to-Video Generator")
    parser.add_argument(
        "--model",
        type=str,
        default="Wan-AI/Wan2.1-I2V-14B-480P",
        help="Model ID from Hugging Face (default: Wan-AI/Wan2.1-I2V-14B-480P)",
    )
    parser.add_argument(
        "--image",
        type=str,
        required=True,
        help="Path to input image (JPEG or PNG)",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        required=True,
        help='Text prompt describing the desired motion (e.g. "camera slowly zooms out")',
    )
    parser.add_argument(
        "--negative-prompt",
        type=str,
        default="blurry, low quality, distorted, jerky motion, artifacts",
        help="Negative prompt to avoid unwanted artifacts",
    )
    parser.add_argument(
        "--frames",
        type=int,
        default=49,
        help="Number of video frames to generate (default: 49, max: 81)",
    )
    parser.add_argument(
        "--steps",
        type=int,
        default=50,
        help="Number of diffusion steps (default: 50)",
    )
    parser.add_argument(
        "--guidance",
        type=float,
        default=7.0,
        help="Classifier-free guidance scale (default: 7.0)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=-1,
        help="Random seed for reproducibility (-1 = random)",
    )
    parser.add_argument(
        "--fps",
        type=int,
        default=16,
        help="Output video FPS (default: 16)",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="output_i2v.mp4",
        help="Output video file path (default: output_i2v.mp4)",
    )
    parser.add_argument(
        "--height",
        type=int,
        default=480,
        help="Output video height in pixels (default: 480)",
    )
    parser.add_argument(
        "--width",
        type=int,
        default=854,
        help="Output video width in pixels (default: 854)",
    )
    parser.add_argument(
        "--cpu-offload",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Enable model CPU offloading to save VRAM (default: on; pass --no-cpu-offload to disable)",
    )
    parser.add_argument(
        "--vae-tiling",
        action="store_true",
        default=False,
        help="Enable VAE tiling for high-resolution outputs",
    )
    return parser.parse_args()


def load_and_resize_image(image_path: str, width: int, height: int) -> Image.Image:
    """Load image from path and resize to target dimensions."""
    if not os.path.exists(image_path):
        print(f"[ERROR] Image not found: {image_path}", file=sys.stderr)
        sys.exit(1)

    img = Image.open(image_path).convert("RGB")
    original_size = img.size
    img = img.resize((width, height), Image.LANCZOS)
    print(f"[INFO] Loaded image: {image_path} ({original_size[0]}x{original_size[1]}) → resized to {width}x{height}")
    return img


def load_pipeline(model_id: str, cpu_offload: bool, vae_tiling: bool):
    """Load the Wan I2V pipeline with memory optimizations."""
    print(f"[INFO] Loading model: {model_id}")
    print(f"[INFO] CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"[INFO] GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
        if vram_gb < 23:
            print("[WARN] Less than 24GB VRAM detected — enable --cpu-offload or use the 1.3B model")

    pipe = WanImageToVideoPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
    )

    if cpu_offload:
        print("[INFO] Enabling model CPU offload")
        pipe.enable_model_cpu_offload()
    else:
        pipe.to("cuda")

    if vae_tiling:
        print("[INFO] Enabling VAE tiling for high-res generation")
        pipe.enable_vae_tiling()

    return pipe


def generate_video(pipe, args) -> None:
    """Run the I2V pipeline and save the output video."""
    image = load_and_resize_image(args.image, args.width, args.height)

    generator = None
    if args.seed >= 0:
        generator = torch.Generator("cuda").manual_seed(args.seed)
        print(f"[INFO] Using seed: {args.seed}")
    else:
        print("[INFO] Using random seed")

    print(f"[INFO] Generating {args.frames} frames at {args.width}x{args.height}")
    print(f"[INFO] Steps: {args.steps} | Guidance: {args.guidance} | FPS: {args.fps}")
    print(f"[INFO] Prompt: {args.prompt}")

    output = pipe(
        prompt=args.prompt,
        negative_prompt=args.negative_prompt,
        image=image,
        num_frames=args.frames,
        height=args.height,
        width=args.width,
        num_inference_steps=args.steps,
        guidance_scale=args.guidance,
        generator=generator,
    )

    export_to_video(output.frames[0], args.output, fps=args.fps)
    print(f"[INFO] Video saved to: {os.path.abspath(args.output)}")
    duration = args.frames / args.fps
    print(f"[INFO] Duration: {duration:.1f}s at {args.fps}fps ({args.frames} frames)")


def main():
    args = parse_args()

    if not torch.cuda.is_available():
        print("[ERROR] CUDA GPU not found. Wan2.1-I2V-14B requires a CUDA-capable GPU.", file=sys.stderr)
        sys.exit(1)

    pipe = load_pipeline(args.model, args.cpu_offload, args.vae_tiling)
    generate_video(pipe, args)
    print("[DONE] Image-to-video generation complete!")


if __name__ == "__main__":
    main()
```

### Advanced I2V Pipeline (Python API)

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import load_image, export_to_video
from PIL import Image

# ── Load pipeline ──────────────────────────────────────────────────────────────
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()   # keeps VRAM under 24GB
pipe.enable_vae_tiling()          # optional: helps for 720p

# ── Load & prepare input image ─────────────────────────────────────────────────
image = load_image("input.jpg").resize((854, 480))

# ── Generate ───────────────────────────────────────────────────────────────────
prompt = "camera slowly zooms out, revealing the full landscape"
negative_prompt = "blurry, low quality, distorted, flickering, artifacts"

generator = torch.Generator("cuda").manual_seed(42)

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=image,
    num_frames=49,          # ~3 seconds at 16fps
    height=480,
    width=854,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
)

export_to_video(output.frames[0], "i2v_output.mp4", fps=16)
print("Saved: i2v_output.mp4")
```

### I2V Prompt Tips

| Goal                | Prompt Example                                     |
| ------------------- | -------------------------------------------------- |
| Camera movement     | `"camera slowly zooms out from the subject"`       |
| Parallax effect     | `"subtle parallax motion, depth of field shift"`   |
| Character animation | `"the figure turns their head and smiles"`         |
| Nature animation    | `"leaves rustle in a gentle breeze, light shifts"` |
| Abstract motion     | `"colors swirl and blend, fluid motion"`           |
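The patterns above combine well (camera motion + subject motion + lighting). A tiny hypothetical helper for composing them into one prompt string:

```python
def build_i2v_prompt(*clauses: str) -> str:
    """Join non-empty motion clauses into a single comma-separated I2V prompt."""
    return ", ".join(c.strip() for c in clauses if c.strip())

prompt = build_i2v_prompt(
    "camera slowly zooms out from the subject",
    "leaves rustle in a gentle breeze",
    "light shifts",
)
print(prompt)
```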

### Memory Tips for I2V (24GB GPUs)

```python
# Mandatory on 24GB GPUs
pipe.enable_model_cpu_offload()

# Optional: reduces peak VRAM ~10%
pipe.enable_vae_tiling()
pipe.enable_vae_slicing()

# Clear between runs
import gc
gc.collect()
torch.cuda.empty_cache()
```

## Prompt Examples

### Nature & Landscapes

```python
prompts = [
    "Time-lapse of clouds moving over mountain peaks, dramatic lighting",
    "Ocean waves crashing on rocks, slow motion, cinematic",
    "Northern lights dancing in the night sky, vibrant colors",
    "Forest in autumn with leaves falling, peaceful atmosphere"
]
```

### Animals & Characters

```python
prompts = [
    "A golden retriever running through a field of flowers",
    "A butterfly emerging from its cocoon, macro shot",
    "Samurai warrior drawing sword, dramatic lighting",
    "Robot walking through futuristic city streets"
]
```

### Abstract & Artistic

```python
prompts = [
    "Colorful paint swirling in water, abstract art",
    "Geometric shapes transforming and morphing, neon colors",
    "Ink drops spreading in milk, macro photography"
]
```

## Advanced Settings

### Quality vs Speed

```python
# Fast preview
output = pipe(
    prompt=prompt,
    num_frames=17,
    num_inference_steps=25,
    guidance_scale=5.0
)

# Balanced
output = pipe(
    prompt=prompt,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=7.0
)

# Maximum quality
output = pipe(
    prompt=prompt,
    num_frames=81,
    num_inference_steps=100,
    guidance_scale=7.5
)
```

### Resolution Options

```python
# 480p (1.3B model)
output = pipe(prompt, height=480, width=854, num_frames=49)

# 720p (14B model)
output = pipe(prompt, height=720, width=1280, num_frames=49)

# 1080p (14B model, high VRAM) — beyond the model's 720p training; expect softer detail
output = pipe(prompt, height=1080, width=1920, num_frames=33)
```

## Batch Generation

```python
import os
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained("alibaba-pai/Wan2.1-T2V-1.3B", torch_dtype=torch.float16)
pipe.to("cuda")
pipe.enable_model_cpu_offload()

prompts = [
    "A rocket launching into space",
    "Fish swimming in coral reef",
    "Rain falling on a city street at night"
]

output_dir = "./videos"
os.makedirs(output_dir, exist_ok=True)

for i, prompt in enumerate(prompts):
    print(f"Generating {i+1}/{len(prompts)}: {prompt[:40]}...")

    output = pipe(
        prompt=prompt,
        num_frames=49,
        num_inference_steps=50
    )

    export_to_video(output.frames[0], f"{output_dir}/video_{i:03d}.mp4", fps=16)
    torch.cuda.empty_cache()
```

## Gradio Interface

```python
import gradio as gr
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video
import tempfile

pipe = WanPipeline.from_pretrained("alibaba-pai/Wan2.1-T2V-1.3B", torch_dtype=torch.float16)
pipe.to("cuda")
pipe.enable_model_cpu_offload()

def generate_video(prompt, negative_prompt, frames, steps, guidance, seed):
    generator = torch.Generator("cuda").manual_seed(seed) if seed > 0 else None

    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_frames=frames,
        num_inference_steps=steps,
        guidance_scale=guidance,
        generator=generator
    )

    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        export_to_video(output.frames[0], f.name, fps=16)
        return f.name

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Textbox(label="Prompt", lines=2),
        gr.Textbox(label="Negative Prompt", value="blurry, low quality"),
        gr.Slider(17, 81, value=49, step=8, label="Frames"),
        gr.Slider(20, 100, value=50, step=5, label="Steps"),
        gr.Slider(3, 12, value=7, step=0.5, label="Guidance"),
        gr.Number(value=-1, label="Seed")
    ],
    outputs=gr.Video(label="Generated Video"),
    title="Wan2.1 - Text to Video Generation",
    description="Generate videos from text prompts. Running on CLORE.AI."
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Memory Optimization

```python
# Enable all optimizations
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()
pipe.enable_vae_slicing()

# For very low VRAM (use instead of enable_model_cpu_offload — much slower)
pipe.enable_sequential_cpu_offload()

# Clear cache between generations
torch.cuda.empty_cache()
```

## Performance

| Model | Resolution | Frames | GPU       | Time      |
| ----- | ---------- | ------ | --------- | --------- |
| 1.3B  | 480p       | 49     | RTX 4090  | \~2 min   |
| 1.3B  | 480p       | 49     | A100 40GB | \~1.5 min |
| 14B   | 720p       | 49     | A100 40GB | \~5 min   |
| 14B   | 720p       | 81     | A100 80GB | \~8 min   |
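For capacity planning, the table above can be encoded as a rough lookup (these are the estimates from the table, not benchmarks, and will vary by settings):

```python
# Approximate minutes per 49-frame generation, keyed by (model, gpu) — from the table above
GEN_MINUTES = {
    ("1.3B", "RTX 4090"): 2.0,
    ("1.3B", "A100 40GB"): 1.5,
    ("14B", "A100 40GB"): 5.0,
    ("14B", "A100 80GB"): 8.0,  # 81 frames at 720p
}

def estimate_minutes(model: str, gpu: str, clips: int = 1) -> float:
    """Estimate wall-clock minutes for `clips` generations on a given GPU."""
    return GEN_MINUTES[(model, gpu)] * clips

print(estimate_minutes("1.3B", "RTX 4090", clips=5))  # 10.0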

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | \~49 frame videos/hour   |
| ------------- | ----------- | ------------------------ |
| RTX 3090 24GB | \~$0.06     | \~20 (1.3B)              |
| RTX 4090 24GB | \~$0.10     | \~30 (1.3B)              |
| A100 40GB     | \~$0.17     | \~40 (1.3B) / \~12 (14B) |
| A100 80GB     | \~$0.25     | \~8 (14B high-res)       |

*Prices vary. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
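Cost per clip follows directly from these numbers. A quick sketch (rates are the marketplace estimates above and fluctuate):

```python
def cost_per_video(hourly_rate: float, videos_per_hour: float) -> float:
    """Dollars per generated clip at a given rental rate and throughput."""
    return hourly_rate / videos_per_hour

# RTX 4090 at ~$0.10/h producing ~30 clips/h with the 1.3B model
print(f"${cost_per_video(0.10, 30):.4f} per clip")  # ≈ $0.0033
```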

## Troubleshooting

### Out of Memory

```python
# Use smaller model
pipe = WanPipeline.from_pretrained("alibaba-pai/Wan2.1-T2V-1.3B")

# Enable all optimizations
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

# Reduce frames
output = pipe(prompt, num_frames=17)

# Reduce resolution
output = pipe(prompt, height=480, width=854)
```

### Poor Quality

* Increase steps (75-100)
* Write more detailed prompts
* Use negative prompts
* Try 14B model for better quality

### Video Too Short

* Increase `num_frames` (max 81)
* Interpolate extra frames with RIFE, then play at the same FPS for a longer clip
* Chain multiple generations
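Chaining can be planned up front: each Wan2.1 clip caps at 81 frames, and valid frame counts take the form 4k+1 (17, 49, 81) because of the model's 4× temporal compression (treat that constraint as an assumption if your diffusers version reports otherwise). A small planning helper:

```python
import math

MAX_FRAMES = 81  # per-clip ceiling for Wan2.1

def valid_frames(n: int) -> int:
    """Round n up to the nearest valid Wan2.1 frame count (4k+1), capped at 81."""
    n = min(max(n, 5), MAX_FRAMES)
    return n if (n - 1) % 4 == 0 else min(((n - 1) // 4 + 1) * 4 + 1, MAX_FRAMES)

def clips_needed(target_seconds: float, fps: int = 16) -> int:
    """How many chained clips cover the target duration (feed each clip's last frame to I2V)."""
    return math.ceil(target_seconds * fps / MAX_FRAMES)

print(valid_frames(50), clips_needed(10))  # 53 2
```

To chain, export the final frame of one generation (`output.frames[0][-1]`) and pass it as the `image` input of the I2V pipeline for the next segment.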

### Artifacts/Flickering

* Increase guidance scale
* Use fixed seed for consistency
* Post-process with video stabilization

## Wan2.1 vs Others

| Feature     | Wan2.1     | Hunyuan   | SVD  | CogVideoX |
| ----------- | ---------- | --------- | ---- | --------- |
| Quality     | Excellent  | Excellent | Good | Great     |
| Speed       | Fast       | Medium    | Fast | Slow      |
| Max Frames  | 81         | 129       | 25   | 49        |
| Resolution  | 720p       | 720p      | 576p | 720p      |
| I2V Support | Yes        | Yes       | Yes  | Yes       |
| License     | Apache 2.0 | Open      | Open | Open      |

**Use Wan2.1 when:**

* Need open-source video generation
* Want fast generation speed
* Apache 2.0 license required
* Balanced quality/speed needed

## Next Steps

* [Hunyuan Video](https://docs.clore.ai/guides/video-generation/hunyuan-video) - Alternative T2V
* [OpenSora](https://docs.clore.ai/guides/video-generation/opensora) - Open-source Sora-style T2V alternative
* [Stable Video Diffusion](https://docs.clore.ai/guides/video-generation/stable-video-diffusion) - Image animation
* [RIFE Interpolation](https://docs.clore.ai/guides/video-processing/rife-interpolation) - Frame interpolation
