# Wav2Lip

Sync lips to any audio with Wav2Lip.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Wav2Lip?

Wav2Lip provides:

* Accurate lip-sync for any face
* Works with arbitrary speech audio
* Accepts video or still-image input
* Fast enough for real-time use

## Requirements

| Mode         | VRAM | Recommended |
| ------------ | ---- | ----------- |
| Basic        | 4GB  | RTX 3060    |
| High Quality | 6GB  | RTX 3080    |
| HD           | 8GB  | RTX 4080    |
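To sanity-check a rented GPU before queuing a job, the table above can be encoded as a quick pre-flight check (the mode keys and helper name below are illustrative, not part of Wav2Lip):

```python
# VRAM thresholds in GB, mirroring the requirements table above.
MIN_VRAM_GB = {"basic": 4, "high_quality": 6, "hd": 8}

def vram_sufficient(mode: str, available_gb: float) -> bool:
    """True if the GPU has enough memory for the chosen mode."""
    return available_gb >= MIN_VRAM_GB[mode]
```

Pair it with `nvidia-smi --query-gpu=memory.total --format=csv` on the server to read the actual VRAM.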

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/Rudrabha/Wav2Lip.git && \
cd Wav2Lip && \
pip install -r requirements.txt && \
wget "https://huggingface.co/spaces/wav2lip/wav2lip/resolve/main/checkpoints/wav2lip_gan.pth" -P checkpoints/ && \
python app.py
```

Note: the upstream repository does not ship an `app.py`. Save the script from the Gradio Interface section below as `app.py` to serve a web UI on port 7860.

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
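In scripts, the substitution above is easy to get wrong; a tiny helper (the name is ours) normalizes a bare `http_pub` hostname into a full URL:

```python
def service_url(http_pub: str) -> str:
    """Build the full service URL from the http_pub value shown in My Orders."""
    if http_pub.startswith(("http://", "https://")):
        return http_pub
    return f"https://{http_pub}"
```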

## Installation

```bash
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip
pip install -r requirements.txt

# Download pretrained models
mkdir -p checkpoints
wget "https://huggingface.co/spaces/wav2lip/wav2lip/resolve/main/checkpoints/wav2lip.pth" -P checkpoints/
wget "https://huggingface.co/spaces/wav2lip/wav2lip/resolve/main/checkpoints/wav2lip_gan.pth" -P checkpoints/
```
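Before running inference, it helps to confirm both checkpoints actually downloaded; a minimal sketch using only the standard library (the helper name is ours):

```python
import os

def missing_checkpoints(checkpoint_dir: str = "checkpoints") -> list:
    """Return the names of expected checkpoint files that are not present."""
    expected = ["wav2lip.pth", "wav2lip_gan.pth"]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(checkpoint_dir, name))
    ]
```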

## Basic Usage

### Command Line

```bash
python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input_video.mp4 \
    --audio audio.wav \
    --outfile output.mp4
```

### With Image Input

```bash
python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face face_image.jpg \
    --audio speech.wav \
    --outfile talking.mp4
```

## Python API

```python
import subprocess

def wav2lip_sync(face_path, audio_path, output_path, checkpoint="checkpoints/wav2lip_gan.pth"):
    cmd = [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_path,
        "--audio", audio_path,
        "--outfile", output_path
    ]
    subprocess.run(cmd, check=True)
    return output_path

# Usage
result = wav2lip_sync(
    face_path="video.mp4",
    audio_path="new_audio.wav",
    output_path="synced.mp4"
)
```

## Quality Options

### Standard Quality (Faster)

```bash
python inference.py \
    --checkpoint_path checkpoints/wav2lip.pth \
    --face input.mp4 \
    --audio audio.wav \
    --outfile output.mp4
```

### High Quality (GAN)

```bash
python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input.mp4 \
    --audio audio.wav \
    --outfile output.mp4 \
    --pads 0 10 0 0 \
    --resize_factor 1
```
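In scripts that toggle between the two modes above, the checkpoint choice can be captured in one place (the mapping and helper name are ours):

```python
# Map the two quality modes to their checkpoints.
QUALITY_CHECKPOINTS = {
    "standard": "checkpoints/wav2lip.pth",   # faster
    "gan": "checkpoints/wav2lip_gan.pth",    # sharper mouth region
}

def pick_checkpoint(quality: str = "gan") -> str:
    """Return the checkpoint path for the requested quality mode."""
    return QUALITY_CHECKPOINTS[quality]
```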

## Parameters

```bash
python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face video.mp4 \
    --audio audio.wav \
    --outfile result.mp4 \
    --pads 0 10 0 0 \
    --resize_factor 1 \
    --crop 0 -1 0 -1 \
    --box -1 -1 -1 -1 \
    --nosmooth
```

* `--pads`: padding around the detected face (top, bottom, left, right); extra bottom padding helps include the chin
* `--resize_factor`: downscale the input by this factor before processing
* `--crop`: crop the video to a region (top, bottom, left, right); `-1` infers the value from the frame
* `--box`: fixed bounding box for the face; `-1 -1 -1 -1` enables auto-detection
* `--nosmooth`: disable temporal smoothing of face detections

### Padding Tips

| Face Position | Recommended Pads |
| ------------- | ---------------- |
| Centered      | 0 10 0 0         |
| Close-up      | 0 15 0 0         |
| Far           | 0 5 0 0          |
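The padding presets above can be turned into ready-made CLI arguments for `inference.py` (the preset keys and helper name are ours):

```python
# Presets taken from the padding table above.
PAD_PRESETS = {
    "centered": [0, 10, 0, 0],
    "close-up": [0, 15, 0, 0],
    "far": [0, 5, 0, 0],
}

def pads_args(face_position: str) -> list:
    """Return --pads arguments ready to append to a subprocess command."""
    return ["--pads"] + [str(v) for v in PAD_PRESETS[face_position]]
```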

## Batch Processing

```python
import os
import subprocess

def batch_wav2lip(faces_dir, audio_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(faces_dir):
        if filename.endswith(('.mp4', '.jpg', '.png')):
            face_path = os.path.join(faces_dir, filename)
            output_path = os.path.join(output_dir, f"synced_{filename}")

            if filename.endswith(('.jpg', '.png')):
                output_path = output_path.rsplit('.', 1)[0] + '.mp4'

            cmd = [
                "python", "inference.py",
                "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
                "--face", face_path,
                "--audio", audio_path,
                "--outfile", output_path
            ]

            try:
                subprocess.run(cmd, check=True)
                print(f"Processed: {filename}")
            except subprocess.CalledProcessError as e:
                print(f"Error processing {filename}: {e}")

# Usage
batch_wav2lip("./faces", "speech.wav", "./outputs")
```

## Gradio Interface

```python
import gradio as gr
import subprocess
import tempfile
import os

def lip_sync(face_video, audio, quality):
    checkpoint = "checkpoints/wav2lip_gan.pth" if quality == "High (GAN)" else "checkpoints/wav2lip.pth"

    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as out_file:
        output_path = out_file.name

    cmd = [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_video,
        "--audio", audio,
        "--outfile", output_path
    ]

    subprocess.run(cmd, check=True)
    return output_path

demo = gr.Interface(
    fn=lip_sync,
    inputs=[
        gr.Video(label="Face Video/Image"),
        gr.Audio(type="filepath", label="Audio"),
        gr.Radio(["Standard", "High (GAN)"], value="High (GAN)", label="Quality")
    ],
    outputs=gr.Video(label="Lip-Synced Video"),
    title="Wav2Lip - Lip Sync"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
from fastapi import FastAPI, UploadFile, File, Response
import tempfile
import subprocess
import os

app = FastAPI()

@app.post("/sync")
async def sync_lips(
    face: UploadFile = File(...),
    audio: UploadFile = File(...),
    quality: str = "gan"
):
    with tempfile.TemporaryDirectory() as tmpdir:
        # Save uploads
        face_ext = os.path.splitext(face.filename)[1]
        face_path = os.path.join(tmpdir, f"face{face_ext}")
        audio_path = os.path.join(tmpdir, "audio.wav")
        output_path = os.path.join(tmpdir, "output.mp4")

        with open(face_path, "wb") as f:
            f.write(await face.read())
        with open(audio_path, "wb") as f:
            f.write(await audio.read())

        # Run Wav2Lip
        checkpoint = "checkpoints/wav2lip_gan.pth" if quality == "gan" else "checkpoints/wav2lip.pth"

        cmd = [
            "python", "inference.py",
            "--checkpoint_path", checkpoint,
            "--face", face_path,
            "--audio", audio_path,
            "--outfile", output_path
        ]

        subprocess.run(cmd, check=True)

        # Read the result before the temporary directory is cleaned up;
        # returning a FileResponse here would point at a deleted file.
        with open(output_path, "rb") as f:
            video_bytes = f.read()

    return Response(content=video_bytes, media_type="video/mp4")

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```

## TTS + Wav2Lip Pipeline

Complete text-to-video:

```python
from TTS.api import TTS
import subprocess

def text_to_lipsync(text, face_path, output_path, language="en"):
    # Generate speech
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    audio_path = "temp_speech.wav"
    tts.tts_to_file(text=text, file_path=audio_path, language=language)

    # Lip sync
    cmd = [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
        "--face", face_path,
        "--audio", audio_path,
        "--outfile", output_path
    ]
    subprocess.run(cmd, check=True)

    return output_path

# Usage
text_to_lipsync(
    "Hello, welcome to our presentation.",
    "presenter.jpg",
    "talking_presenter.mp4"
)
```

## Post-Processing

### Upscale Result

```python
import subprocess

def upscale_video(input_path, output_path):
    # Assumes Real-ESRGAN is installed; the entry-point script and flags
    # vary by install (this sketch uses the official repo's video script),
    # so adjust the command to your setup.
    cmd = [
        "python", "inference_realesrgan_video.py",
        "-i", input_path,
        "-o", output_path,
        "-s", "2"
    ]
    subprocess.run(cmd, check=True)
```

### Add Audio Back

```bash
# If audio was lost, add it back
ffmpeg -i synced_video.mp4 -i original_audio.wav \
    -c:v copy -c:a aac \
    -map 0:v:0 -map 1:a:0 \
    final_output.mp4
```

## Troubleshooting

### Face Not Detected

* Make sure the face is clearly visible and well lit
* Front-facing shots work best
* Use a higher-resolution input

### Poor Sync Quality

* Use the GAN checkpoint (`checkpoints/wav2lip_gan.pth`)
* Adjust `--pads` so the mouth and chin are fully covered
* Check the audio sample rate (16 kHz recommended)
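The sample-rate check can be automated with Python's standard `wave` module; if the rate differs, resample with ffmpeg (e.g. `-ar 16000`) before syncing:

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sample rate of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate()
```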

### Choppy Output

* Increase `--resize_factor` to downscale the input
* Run without `--nosmooth` so face detections are smoothed across frames
* Use a higher-quality input video

## Performance

| Input             | GPU      | Processing Time |
| ----------------- | -------- | --------------- |
| 10s video         | RTX 3060 | \~30s           |
| 10s video         | RTX 4090 | \~15s           |
| 30s video         | RTX 4090 | \~45s           |
| Image + 10s audio | RTX 3090 | \~20s           |
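The timings above scale roughly linearly with video length; a rough estimator based only on this table (figures are approximate):

```python
# Seconds of processing per second of input video, derived from the
# performance table above (10s -> ~30s on a 3060, 10s -> ~15s on a 4090).
SECONDS_PER_VIDEO_SECOND = {
    "RTX 3060": 3.0,
    "RTX 4090": 1.5,
}

def estimate_processing_seconds(gpu: str, video_seconds: float) -> float:
    """Rough processing-time estimate for a given GPU and clip length."""
    return SECONDS_PER_VIDEO_SECOND[gpu] * video_seconds
```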

## Comparison with SadTalker

| Feature       | Wav2Lip   | SadTalker    |
| ------------- | --------- | ------------ |
| Lip accuracy  | Excellent | Good         |
| Head movement | None      | Natural      |
| Expression    | None      | Controllable |
| Speed         | Faster    | Slower       |
| Best for      | Dubbing   | Avatars      |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
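As a sanity check, the session costs in the table are simply hourly rate times hours; a small calculator using the table's approximate rates:

```python
# Approximate hourly rates (USD) from the cost table above.
HOURLY_RATES = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}

def estimate_cost(gpu: str, hours: float) -> float:
    """Approximate rental cost; check the marketplace for live prices."""
    return round(HOURLY_RATES[gpu] * hours, 2)
```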

## Next Steps

* [SadTalker](https://docs.clore.ai/guides/talking-heads/sadtalker) - Head movement + lips
* [XTTS](https://docs.clore.ai/guides/audio-and-voice/xtts-coqui) - Generate speech
* [RVC Voice Clone](https://docs.clore.ai/guides/audio-and-voice/rvc-voice-clone) - Voice conversion
