
Wan 2.2 VBVR (Motion Control)

Wan 2.2 VBVR (Video-Based Video Reference) is Alibaba's April 2026 extension to the Wan 2.2 image-to-video foundation model. It adds a powerful new capability: you provide a reference video clip to control motion patterns in your generated video, not just a starting image. The result is consistent, controllable animation — the same character, product, or scene follows the motion path from your reference footage.

This guide covers deploying Wan 2.2 VBVR via ComfyUI on a Clore.ai GPU rental.


What Is VBVR (Video-Based Video Reference)?

Traditional image-to-video models take a static image and generate motion from scratch. The motion is guided by your text prompt, but it can be unpredictable — especially for specific gestures, camera moves, or character actions.

VBVR changes the equation. You supply three inputs:

  1. A starting image — your subject (character, product, scene)

  2. A reference motion video — a short clip demonstrating the motion you want

  3. A text prompt — describing the content and style

The model extracts the motion pattern from the reference video and applies it to your starting image, generating a new video where your subject performs that motion naturally.

Example Applications

| Input Image | Reference Video Motion | Output |
|---|---|---|
| Product photo | Hand picking up similar item | Product pick-up animation |
| Character illustration | Actor walking cycle | Character walking |
| Fashion model | Runway walk footage | Clothing in motion |
| Building exterior | Camera pan from drone footage | Cinematic B-roll reveal |


Model Overview

  • Full name: Wan 2.2 I2V-A14B with VBVR (Video-Based Video Reference)

  • Released: April 2026 by Alibaba / Wan-AI team

  • Built on: Wan 2.2 I2V-A14B (Image-to-Video, 14B params, up to 480p resolution)

  • HuggingFace: Wan-AI/Wan2.2-I2V-A14B

  • VBVR workflow: distributed via ComfyUI Manager community nodes

  • License: Apache 2.0

Variants

| Variant | VRAM Required | Quality | Speed |
|---|---|---|---|
| FP8 | 16–24 GB | High | Fast |
| BF16 | 24–40 GB | Highest | Moderate |

The FP8 variant runs on an RTX 3090 (24 GB) and can squeeze into 16 GB cards with a reduced batch size. The BF16 variant delivers the best quality. It fits on an RTX 4090 (24 GB), which sits at the bottom of its 24–40 GB range, and runs comfortably on an A6000 (48 GB).


Hardware Requirements

| GPU | VRAM | Variant | Price on Clore.ai |
|---|---|---|---|
| RTX 3090 | 24 GB | FP8 ✅ | ~$0.30/day |
| RTX 4090 | 24 GB | FP8 ✅ / BF16 ✅ | ~$0.50/day |
| A6000 48GB | 48 GB | BF16 ✅ | ~$1.20/day |
| A100 80GB | 80 GB | BF16 ✅ | ~$2.50/day |

For most users, an RTX 4090 at ~$0.50/day is the best balance of price and quality, running BF16 at full 480p resolution.
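The hardware table can also be treated as data when choosing a card from a script. The following is a minimal sketch: the prices and variant support are copied from the table above (marketplace prices change, so treat them as placeholders), and `GPUS` and `cheapest_gpu` are hypothetical names used only for illustration.

```python
# Hardware table as data. Prices are the approximate figures quoted in this
# guide, not live marketplace values.
GPUS = {
    "RTX 3090": {"vram_gb": 24, "variants": {"FP8"}, "usd_per_day": 0.30},
    "RTX 4090": {"vram_gb": 24, "variants": {"FP8", "BF16"}, "usd_per_day": 0.50},
    "A6000 48GB": {"vram_gb": 48, "variants": {"BF16"}, "usd_per_day": 1.20},
    "A100 80GB": {"vram_gb": 80, "variants": {"BF16"}, "usd_per_day": 2.50},
}


def cheapest_gpu(variant: str) -> str:
    """Return the lowest-priced GPU in GPUS that supports the given variant."""
    candidates = [
        (info["usd_per_day"], name)
        for name, info in GPUS.items()
        if variant in info["variants"]
    ]
    if not candidates:
        raise ValueError(f"no GPU listed for variant {variant!r}")
    return min(candidates)[1]


print(cheapest_gpu("FP8"))   # RTX 3090
print(cheapest_gpu("BF16"))  # RTX 4090
```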


Step-by-Step Setup on Clore.ai

Step 1: Rent a GPU

Visit clore.ai/marketplace:

  • Budget: RTX 3090 (~$0.30/day) — FP8 only

  • Recommended: RTX 4090 (~$0.50/day) — BF16 quality

  • Premium: A6000 (~$1.20/day) — batch processing, high throughput

Choose either a prebuilt ComfyUI Docker image or a base CUDA image. This guide assumes the base CUDA image and installs ComfyUI manually in Step 2.

Step 2: Install ComfyUI
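A manual install usually comes down to cloning ComfyUI and installing its Python requirements. The sketch below scripts those two steps. It assumes the upstream ComfyUI repository URL, that `git` and `pip` are available in the rented container, and that `requirements.txt` pulls in a suitable PyTorch build for the image's CUDA version. `install_commands` and `install` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: the upstream ComfyUI repository and its default branch.
COMFYUI_REPO = "https://github.com/comfyanonymous/ComfyUI.git"


def install_commands(workdir: Path) -> list[list[str]]:
    """Build the commands that clone ComfyUI into workdir and install its requirements."""
    comfy_dir = workdir / "ComfyUI"
    return [
        ["git", "clone", COMFYUI_REPO, str(comfy_dir)],
        [sys.executable, "-m", "pip", "install", "-r",
         str(comfy_dir / "requirements.txt")],
    ]


def install(workdir: Path) -> None:
    """Run each install command in order, stopping at the first failure."""
    for cmd in install_commands(workdir):
        subprocess.run(cmd, check=True)
```

Calling `install(Path("/workspace"))` on the rented machine would perform the install. The `/workspace` path is only an example.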

Step 3: Install VBVR Custom Nodes via ComfyUI Manager

Start ComfyUI:
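One way to start the server so that it is reachable from your browser is to bind it to all interfaces on port 8188. This sketch uses ComfyUI's `--listen` and `--port` options and assumes it is run against the ComfyUI checkout directory. `launch_command` and `launch` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path


def launch_command(port: int = 8188, listen: str = "0.0.0.0") -> list[str]:
    """Build the argv that starts ComfyUI listening on the given address and port."""
    return [sys.executable, "main.py", "--listen", listen, "--port", str(port)]


def launch(comfy_dir: Path, port: int = 8188) -> None:
    """Run ComfyUI in the foreground from its checkout (blocks until it exits)."""
    subprocess.run(launch_command(port), cwd=comfy_dir, check=True)
```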

Open http://YOUR_CLORE_IP:8188 in your browser. Then:

  1. Click the Manager button (top menu)

  2. Search for "Wan 2.2 VBVR" or "WanVideo"

  3. Install the ComfyUI-WanVideo node pack

  4. Restart ComfyUI after installation

Alternatively, install the nodes directly:

Step 4: Download Model Checkpoints

Tip: Use huggingface-cli download --include "*.safetensors" to skip non-essential files and save disk space.
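The same filtering is available from Python through `huggingface_hub.snapshot_download` and its `allow_patterns` argument. This is a sketch under assumptions: the target directory is a hypothetical choice (use whatever path your loader node expects), and if the loader also needs configuration files you may have to add a pattern such as `"*.json"`.

```python
from pathlib import Path

REPO_ID = "Wan-AI/Wan2.2-I2V-A14B"  # repository named in the Model Overview


def download_args(models_dir: Path) -> dict:
    """Arguments for snapshot_download that fetch only .safetensors files."""
    return {
        "repo_id": REPO_ID,
        "allow_patterns": ["*.safetensors"],
        # Hypothetical target folder; adjust to your ComfyUI layout.
        "local_dir": str(models_dir / "Wan2.2-I2V-A14B"),
    }


def download_checkpoint(models_dir: Path) -> str:
    """Download the filtered snapshot and return the local folder path."""
    # Imported here so the module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(models_dir))
```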

Step 5: Download VAE and Text Encoder


Building the VBVR Workflow in ComfyUI

Workflow Overview

The VBVR workflow connects these node groups:

Loading the Workflow

  1. Download the pre-built VBVR workflow JSON from the ComfyUI-WanVideoWrapper repository:

  2. In ComfyUI: Load → select wan22_vbvr.json

Configuring Key Nodes

WanVideoModelLoader

  • model_path: point to Wan2.2-I2V-A14B

  • precision: fp8_e4m3fn for RTX 3090, bf16 for RTX 4090+

VBVRMotionEncoderLoader

  • encoder_path: point to vbvr-motion-encoder

WanVideoSampler

  • steps: 25–30 (quality), 15–20 (speed)

  • cfg: 6.0–7.5 (higher = more prompt-adherent)

  • motion_strength: 0.6–0.9 (how closely to follow reference motion)

  • frames: 25 (approx. 2 seconds at 12fps) or 49 (4 seconds)

  • resolution: 832×480 (default 480p)
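The frame counts above map to clip length by simple division. The sketch below checks the 12 fps figures from this guide and picks a frame count for a target duration. It assumes frame counts of the form 4n + 1, which matches the values used in this guide (17, 25, 49) but is not stated by it. Both function names are hypothetical.

```python
def clip_seconds(frames: int, fps: float = 12.0) -> float:
    """Duration in seconds of a clip with the given frame count."""
    return frames / fps


def frames_for(seconds: float, fps: float = 12.0) -> int:
    """Nearest frame count of the form 4n + 1 (assumed) for a target duration."""
    n = max(1, round((seconds * fps - 1) / 4))
    return 4 * n + 1


print(round(clip_seconds(25), 2))  # 2.08
print(round(clip_seconds(49), 2))  # 4.08
print(frames_for(2))  # 25
print(frames_for(4))  # 49
```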

LoadVideo (Reference)

  • Load your reference motion clip (MP4, GIF, or image sequence)

  • Recommended: 2–5 seconds, roughly the same duration as your target output


Running Your First Generation

Prepare Your Inputs

  1. Starting image: 832×480px or close to it. PNG or JPG. This is your subject.

  2. Reference motion video: ideally 2–5 seconds, shows the motion you want. Resolution doesn't need to match — the model extracts motion vectors, not pixel content.

  3. Text prompt: describe your subject and what it's doing (e.g., "a product bottle rotating smoothly on a white surface, cinematic lighting, 4K, professional photography")

Generation Time Estimates

| GPU | Variant | Frames | Time |
|---|---|---|---|
| RTX 3090 | FP8 | 25 frames | ~3–5 min |
| RTX 4090 | BF16 | 25 frames | ~2–4 min |
| RTX 4090 | FP8 | 25 frames | ~1.5–2.5 min |
| A100 80GB | BF16 | 49 frames | ~3–5 min |
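Combining these times with the daily prices from Hardware Requirements gives a rough cost per clip. The sketch below assumes the GPU is generating for the entire rental period and ignores model loading, downloads, and idle time, so real costs will be higher. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def cost_per_clip(usd_per_day: float, minutes_per_clip: float) -> float:
    """Share of the daily rental price consumed by one generation."""
    return usd_per_day * minutes_per_clip / MINUTES_PER_DAY


def clips_per_day(minutes_per_clip: float) -> int:
    """Upper bound on clips generated in 24 hours of continuous use."""
    return int(MINUTES_PER_DAY // minutes_per_clip)


# RTX 4090, BF16, 25 frames: ~2-4 min per clip at ~$0.50/day
for minutes in (2, 4):
    print(minutes, round(cost_per_clip(0.50, minutes), 5), clips_per_day(minutes))
# 2 0.00069 720
# 4 0.00139 360
```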


Practical Workflows

Character Animation

  1. Image: character illustration or photo

  2. Reference: footage of an actor performing the desired action (walk, wave, run)

  3. Prompt: "cartoon character walking through a forest, smooth animation, consistent style"

  4. motion_strength: 0.85 (high fidelity to reference motion)

Product Demo

  1. Image: clean product shot on white background

  2. Reference: hand unboxing or rotating a similar product

  3. Prompt: "premium product reveal, 360 rotation, soft studio lighting, commercial quality"

  4. motion_strength: 0.70 (some creative freedom for lighting/environment)

Cinematic B-Roll

  1. Image: landscape photo or building exterior

  2. Reference: drone footage or camera pan from a stock clip

  3. Prompt: "aerial cinematic B-roll, golden hour, smooth drone movement, 4K quality"

  4. motion_strength: 0.65 (let the model add naturalistic motion)
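If you switch between these workflows often, it can help to keep them as named presets. The sketch below stores the prompts and motion_strength values listed above and merges them with sampler defaults from the ranges in Configuring Key Nodes. The result is a plain dictionary for your own bookkeeping, not a ComfyUI API call, and `Preset`, `PRESETS`, and `sampler_settings` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    prompt: str
    motion_strength: float


PRESETS = {
    "character": Preset(
        "cartoon character walking through a forest, smooth animation, consistent style",
        0.85,
    ),
    "product": Preset(
        "premium product reveal, 360 rotation, soft studio lighting, commercial quality",
        0.70,
    ),
    "broll": Preset(
        "aerial cinematic B-roll, golden hour, smooth drone movement, 4K quality",
        0.65,
    ),
}


def sampler_settings(name: str, steps: int = 25, cfg: float = 6.0,
                     frames: int = 25) -> dict:
    """Merge a named preset with sampler values to enter in WanVideoSampler."""
    preset = PRESETS[name]
    # 0.6-0.9 is the motion_strength range suggested earlier in this guide.
    if not 0.6 <= preset.motion_strength <= 0.9:
        raise ValueError(f"motion_strength out of range for preset {name!r}")
    return {
        "prompt": preset.prompt,
        "motion_strength": preset.motion_strength,
        "steps": steps,
        "cfg": cfg,
        "frames": frames,
    }
```

For example, `sampler_settings("product", steps=30)` returns the product prompt with motion_strength 0.70 and 30 steps.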


Troubleshooting

Out of memory on RTX 3090 with BF16

  • Switch to FP8 quantization in WanVideoModelLoader

  • Reduce frames from 25 to 17

  • Enable VAE tiling if it is disabled (tiled decoding lowers peak VRAM use)

Motion doesn't match reference video

  • Increase motion_strength to 0.85–0.95

  • Ensure reference video is trimmed to match your target duration

  • Use reference videos with clear, unambiguous motion (avoid camera shake)

Generated video flickers or has artifacts

  • Increase steps to 30

  • Reduce CFG to 6.0

  • Use a reference video with consistent lighting

Slow download / HuggingFace timeout

  • Use HF_ENDPOINT=https://hf-mirror.com environment variable for faster downloads from China

  • Or download via aria2c with multiple connections
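When downloading from Python instead of the shell, the mirror can be selected by setting `HF_ENDPOINT` in the process environment. A minimal sketch, assuming the `huggingface_hub` library is used for the download: the variable is read when that library is imported, so it must be set first. `use_mirror` is a hypothetical helper name.

```python
import os


def use_mirror(endpoint: str = "https://hf-mirror.com") -> None:
    """Point later huggingface_hub imports at a mirror endpoint."""
    # Must run before `import huggingface_hub`; the endpoint is read at import time.
    os.environ["HF_ENDPOINT"] = endpoint


use_mirror()
print(os.environ["HF_ENDPOINT"])  # https://hf-mirror.com
```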


What's Next: Wan 2.7

Alibaba's Wan 2.7 is the next generation of the Wan video model family, featuring:

  • First + last frame generation: specify both the opening and closing frames

  • Video-to-video editing: modify existing video with text instructions

  • Subject referencing: maintain consistent appearance of specific objects/characters across scenes

Wan 2.7 is currently available via Together AI's API. Open-source weights are expected mid-Q2 2026. A full self-hosting guide will be added to this repository when the weights are released.


Summary

Wan 2.2 VBVR brings reference-driven motion control to open-source video generation. Supply a starting image and a reference motion clip, and the model generates a consistent video where your subject follows that motion naturally. FP8 runs on a 24 GB RTX 3090 for ~$0.30/day; BF16 on an RTX 4090 for ~$0.50/day — both on Clore.ai.

Rent a GPU on Clore.ai and start generating motion-controlled videos today.
