Mochi-1 Video
Mochi-1 is Genmo's open-source 10-billion-parameter video generation model, producing 848×480 output at 30 fps with physically realistic motion. It uses an asymmetric diffusion transformer (AsymmDiT) architecture and ranks among the highest-quality open-source video models for motion fidelity. Deploy it on Clore.ai's GPU cloud to generate professional-grade videos at a fraction of commercial API costs.
What is Mochi-1?
Mochi-1 is a 10-billion parameter video diffusion model trained to produce videos with:
Smooth, physically plausible motion
High temporal consistency
Strong prompt adherence
848×480 resolution at 30 fps
It uses an asymmetric diffusion transformer (AsymmDiT) architecture that processes text and video jointly but allocates most of its parameters to the visual stream, enabling efficient inference at scale. The weights are released under the Apache 2.0 license, free for research and commercial use.
Model highlights:
10B parameters
Native 848×480 @ 30 fps output
High-motion fidelity (ranked top in community benchmarks)
Available on Hugging Face with diffusers integration
Gradio demo UI for easy interaction
Prerequisites
| Requirement | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 24 GB | 40–80 GB |
| GPU | RTX 4090 | A100 / H100 |
| RAM | 32 GB | 64 GB |
| Storage | 60 GB | 100 GB |
| CUDA | 11.8+ | 12.1+ |
Mochi-1 is a large model (≈40 GB in fp8 / ≈80 GB in bf16). A single RTX 4090 (24 GB) can run it with quantization. For full quality, use an A100 40 GB or larger. Multi-GPU setups are supported.
Step 1 — Rent a GPU on Clore.ai
Go to clore.ai and sign in.
Click Marketplace and filter:
VRAM: ≥ 24 GB (RTX 4090 minimum, A100 recommended)
For multi-GPU: filter by GPU count ≥ 2
Select your server and click Configure.
Set Docker image to `pytorch/pytorch:2.4.1-cuda12.4-cudnn9-devel` (base image; we install Mochi inside).
Set open ports: `22` (SSH) and `7860` (Gradio UI).
Click Rent.
Clore.ai lists A100 40 GB instances starting from ~$0.60–$0.90/hr. For Mochi-1 at full quality, this is the most cost-effective choice.
Step 2 — Custom Dockerfile
Build your own image or use this Dockerfile to create a ready-to-use Mochi-1 environment:
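A minimal Dockerfile sketch, assuming the upstream repo at github.com/genmoai/mochi; check its README for the exact, current install steps:

```dockerfile
FROM pytorch/pytorch:2.4.1-cuda12.4-cudnn9-devel

# System dependencies: git to clone the repo, ffmpeg for video export
RUN apt-get update && \
    apt-get install -y --no-install-recommends git ffmpeg && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /workspace
RUN git clone https://github.com/genmoai/mochi.git
WORKDIR /workspace/mochi

# Install Mochi plus the demo UI and the Hugging Face download CLI
RUN pip install --no-cache-dir -e . gradio "huggingface_hub[cli]"

EXPOSE 7860
CMD ["sleep", "infinity"]
```

The `sleep infinity` entrypoint keeps the container alive so you can attach over SSH and launch the demo manually.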
Build and Push to Docker Hub
Build the image locally and push it to your own Docker Hub account (replace YOUR_DOCKERHUB_USERNAME with your actual username):
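Assuming the Dockerfile above is saved in the current directory:

```shell
docker build -t YOUR_DOCKERHUB_USERNAME/mochi-1:latest .
docker login
docker push YOUR_DOCKERHUB_USERNAME/mochi-1:latest
```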
Then use YOUR_DOCKERHUB_USERNAME/mochi-1:latest as your Docker image in Clore.ai.
There is no official pre-built Docker image for Mochi-1 on Docker Hub. You need to build from the Dockerfile above. Alternatively, use pytorch/pytorch:2.4.1-cuda12.4-cudnn9-devel as the base image directly and run the setup commands manually via SSH.
Step 3 — Connect via SSH
Once your instance is running:
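Clore.ai exposes SSH through a mapped port shown on your instance page; `<server-ip>` and `<ssh-port>` below are placeholders:

```shell
ssh root@<server-ip> -p <ssh-port>

# Once connected, verify the GPU is visible
nvidia-smi
```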
Step 4 — Download Mochi-1 Weights
The model weights are hosted on Hugging Face. Download them via the huggingface_hub CLI:
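For example (genmo/mochi-1-preview is the official repo id on Hugging Face; the weights land in `./weights`):

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download genmo/mochi-1-preview --local-dir weights
```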
The full bf16 model is approximately 80 GB. The fp8 quantized version is ~40 GB and runs on RTX 4090 (24 GB) with CPU offloading. Specify --include "*fp8*" to download only quantized weights.
Alternative: Download Only fp8 Quantized Weights
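Per the note above, pass `--include` to fetch only the fp8 files (~40 GB):

```shell
huggingface-cli download genmo/mochi-1-preview --include "*fp8*" --local-dir weights
```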
Step 5 — Launch the Gradio Demo
Mochi-1 ships with a Gradio web UI for easy text-to-video generation:
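Assuming the genmoai/mochi repo's demo script layout (`demos/gradio_ui.py`) and weights downloaded to `./weights`; check the upstream README if your paths differ:

```shell
cd mochi
python3 ./demos/gradio_ui.py --model_dir ../weights/
```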
For low-VRAM mode (RTX 4090, 24 GB):
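The same launch with offloading enabled:

```shell
python3 ./demos/gradio_ui.py --model_dir ../weights/ --cpu_offload
```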
The --cpu_offload flag moves model layers to CPU RAM when not in use, reducing peak VRAM to ~18–20 GB at the cost of ~2× slower generation.
Step 6 — Access the Web UI
Open your browser and navigate to:
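Replace `<server-ip>` with your instance's public IP; if Clore.ai maps port 7860 to a different external port, use the mapped port shown on the instance page:

```
http://<server-ip>:7860
```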
You will see the Mochi-1 Gradio interface with:
A text prompt input
Generation settings (steps, guidance scale, seed)
Video output player
Step 7 — Generate Your First Video
Example Prompts
Nature scene (illustrative): "A hummingbird hovering over a red flower in soft morning light, slow motion, shallow depth of field"
Action scene (illustrative): "A skateboarder landing a kickflip in a sunlit concrete plaza, low tracking camera, crisp motion"
Abstract/artistic (illustrative): "Ink clouds swirling and diffusing through water, macro shot, deep blues and gold highlights"
Recommended Settings
| Setting | Recommended value |
| --- | --- |
| Steps | 64 |
| Guidance scale | 4.5 |
| Duration | 5.1 seconds (default) |
| Resolution | 848×480 (native) |
Generation time varies significantly by GPU. On an A100 80 GB, a 5-second video takes approximately 2–4 minutes. On RTX 4090 with CPU offload, expect 8–15 minutes.
Python API Usage
For programmatic generation, use the diffusers pipeline:
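A minimal sketch using the `MochiPipeline` integration in diffusers (v0.31+); it requires a CUDA GPU and downloads the weights from the genmo/mochi-1-preview Hub repo on first run:

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM manageable
pipe.enable_vae_tiling()         # decodes the video in tiles

frames = pipe(
    prompt="A close-up of a hummingbird hovering over a red flower",
    num_inference_steps=64,
    guidance_scale=4.5,
    num_frames=84,
).frames[0]

export_to_video(frames, "output.mp4", fps=30)
```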
Batch Generation Script
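No batch script ships with the demo; the following is a hypothetical driver around the same diffusers pipeline, reading one prompt per line from a `prompts.txt` file (an assumed filename):

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

# One prompt per line; blank lines are skipped
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(prompts):
    frames = pipe(prompt, num_inference_steps=64, guidance_scale=4.5).frames[0]
    export_to_video(frames, f"out_{i:03d}.mp4", fps=30)
```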
Multi-GPU Inference
For faster generation with multiple GPUs:
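The simplest multi-GPU option with diffusers is `device_map="balanced"`, which places the pipeline's components (text encoder, transformer, VAE) on different GPUs; for true tensor-parallel inference, check the upstream genmoai/mochi README. A sketch:

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Spreads the pipeline's components across all visible GPUs
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

frames = pipe(
    "A drone shot over a coastline at sunset",
    num_inference_steps=64,
    guidance_scale=4.5,
).frames[0]
export_to_video(frames, "multi_gpu.mp4", fps=30)
```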
Clore.ai offers multi-GPU servers (2×, 4× RTX 4090 or A100). With 2× A100 80 GB, generation time drops to roughly 60–90 seconds for a 5-second clip.
Troubleshooting
CUDA Out of Memory
Solutions:
Add `--cpu_offload` to the gradio command
Enable VAE slicing: `pipe.enable_vae_slicing()`
Reduce `num_frames` (try 24 instead of 84)
Use fp8 quantized weights instead of bf16
Model Loading Slow
Solution: Ensure weights are on a fast NVMe drive, not HDD. Check storage speed:
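A rough sequential-write check with `dd` (writes and then deletes a 1 GiB test file in the current directory):

```shell
# NVMe typically reports well over 1 GB/s; a spinning disk closer to 100-200 MB/s
dd if=/dev/zero of=ddtest.bin bs=1M count=1024 conv=fdatasync
rm ddtest.bin
```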
Video Artifacts / Temporal Flickering
Solutions:
Increase inference steps (try 80–100)
Adjust guidance scale (3.5–5.0 range is usually best)
Use a specific seed for reproducibility and iteration
Port 7860 Not Accessible
Check that the port was correctly opened in Clore.ai and the Gradio server is binding to 0.0.0.0:
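Two quick checks; `GRADIO_SERVER_NAME` is a standard Gradio environment variable that controls the bind address:

```shell
# Confirm something is listening on 7860 on all interfaces (0.0.0.0)
ss -tlnp | grep 7860

# If it only shows 127.0.0.1:7860, relaunch bound to all interfaces
GRADIO_SERVER_NAME=0.0.0.0 python3 ./demos/gradio_ui.py --model_dir ../weights/
```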
Cost Estimation
Marketplace rates fluctuate by listing and over time; the figures below are indicative.

| GPU | VRAM | Approx. price | Time per 5-second clip |
| --- | --- | --- | --- |
| RTX 4090 | 24 GB | ~$0.35/hr | ~10–15 min |
| A100 40 GB | 40 GB | ~$0.70/hr | ~3–5 min |
| A100 80 GB | 80 GB | ~$1.20/hr | ~2–3 min |
| 2× A100 80 GB | 160 GB | ~$2.20/hr | ~60–90 sec |
Clore.ai GPU Recommendations
Mochi-1 is VRAM-hungry — the 10B parameter model requires careful GPU selection.
| GPU | VRAM | Approx. price | Mode | Time per clip |
| --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | ~$0.70/hr | fp8 quantized only | ~10–15 min |
| A100 40 GB | 40 GB | ~$1.20/hr | bf16 recommended | ~3–5 min |
| A100 80 GB | 80 GB | ~$2.00/hr | full bf16, fast | ~2–3 min |
| 2× A100 80 GB | 160 GB | ~$4.00/hr | tensor parallel, fastest | ~60–90 sec |
RTX 3090 (24 GB) is not recommended: Mochi-1 in fp8 mode uses nearly all of its 24 GB, leaving almost no headroom, and the card is considerably slower than the 4090. The RTX 4090 (24 GB) works in fp8 but OOMs frequently on longer sequences. Start with an A100 40 GB for reliable results.
Best value for quality: A100 40GB at ~$1.20/hr generates a 5-second clip in 3–5 minutes. That's ~$0.08–0.10 per video clip — significantly cheaper than Runway ML ($0.25–0.50/clip) or Pika Labs subscriptions.
Useful Resources
Last updated
Was this helpful?