# SadTalker

Animate faces with audio to create realistic talking head videos.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is SadTalker?

SadTalker turns one portrait image plus an audio clip into a talking-head video:

* Lip-sync to any speech audio
* Natural, audio-driven head motion
* Works from a single still image
* Adjustable expression intensity

## Requirements

| Mode         | VRAM | Recommended |
| ------------ | ---- | ----------- |
| Basic        | 4GB  | RTX 3060    |
| High Quality | 6GB  | RTX 3080    |
| Full Face    | 8GB  | RTX 4080    |
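
Before renting, the table above can be encoded as a small lookup to check which modes a given card can run. The mode keys below are informal labels for the table rows, not SadTalker flags:

```python
# Minimum VRAM (GB) per mode, taken from the requirements table.
VRAM_REQUIREMENTS_GB = {"basic": 4, "high_quality": 6, "full_face": 8}

def modes_supported(vram_gb):
    """Return the modes a GPU with the given VRAM can run."""
    return [m for m, need in VRAM_REQUIREMENTS_GB.items() if vram_gb >= need]

# A 12 GB RTX 3060 covers every mode; a 4 GB card only the basic one.
print(modes_supported(12))
print(modes_supported(4))
```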

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/OpenTalker/SadTalker.git && \
cd SadTalker && \
pip install -r requirements.txt && \
bash scripts/download_models.sh && \
python app.py
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
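
Once the order is deployed, you can verify the web UI is reachable before opening it in a browser. A minimal stdlib check; the helper name and example URL are illustrative:

```python
import urllib.request

def service_ready(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, connection refused
        return False

# Replace with your own http_pub URL from My Orders:
# service_ready("https://abc123.clorecloud.net")
```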

## Installation

```bash
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker

pip install torch torchvision torchaudio
pip install -r requirements.txt

# Download pretrained models
bash scripts/download_models.sh
```
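
After `download_models.sh` finishes, it is worth confirming the checkpoints actually landed on disk. The filenames below are assumptions based on the V0.0.2 release layout and may differ between SadTalker versions:

```python
import os

# Assumed checkpoint files fetched by scripts/download_models.sh
# (names vary between SadTalker releases).
EXPECTED = [
    "checkpoints/SadTalker_V0.0.2_256.safetensors",
    "checkpoints/mapping_00109-model.pth.tar",
]

def missing_models(root="."):
    """Return the expected checkpoint files not found under root."""
    return [p for p in EXPECTED if not os.path.isfile(os.path.join(root, p))]

# An empty list means the download completed:
# print(missing_models("/workspace/SadTalker"))
```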

## Basic Usage

### Command Line

```bash
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --result_dir ./results \
    --enhancer gfpgan
```

### Python API

```python
# Direct use of the src.* modules requires loading several model
# checkpoints first (see inference.py for the full setup); wrapping
# the CLI in a small class is the simplest programmatic interface.
import glob
import os
import subprocess

class SadTalker:
    def __init__(self, result_dir="./results"):
        self.result_dir = result_dir

    def generate(self, source_image, driven_audio, enhancer="gfpgan"):
        """Run inference.py and return the path of the newest output video."""
        subprocess.run([
            "python", "inference.py",
            "--driven_audio", driven_audio,
            "--source_image", source_image,
            "--result_dir", self.result_dir,
            "--enhancer", enhancer,
        ], check=True)
        videos = glob.glob(os.path.join(self.result_dir, "**", "*.mp4"),
                           recursive=True)
        return max(videos, key=os.path.getmtime) if videos else None

# Usage
sadtalker = SadTalker()
video_path = sadtalker.generate(
    source_image="face.jpg",
    driven_audio="speech.wav"
)
```

## With Face Enhancement

```bash
# Using GFPGAN for face enhancement
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer gfpgan \
    --result_dir ./results

# Using Real-ESRGAN to enhance the full frame (background included)
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer gfpgan \
    --background_enhancer realesrgan \
    --result_dir ./results
```

## Parameters

```bash
# --pose_style:        0-46 head movement styles
# --expression_scale:  expression intensity
# --still:             minimal head movement
# --preprocess:        crop, resize, or full
# --size:              output resolution
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --pose_style 0 \
    --expression_scale 1.0 \
    --still \
    --preprocess crop \
    --size 256 \
    --enhancer gfpgan
```

### Pose Styles

| Range | Effect               |
| ----- | -------------------- |
| 0-5   | Subtle movements     |
| 6-20  | Normal movements     |
| 21-46 | Expressive movements |

## Batch Processing

```python
import os
import subprocess

def generate_talking_video(image_path, audio_path, output_dir):
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_dir,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Process multiple images with same audio
images = ["person1.jpg", "person2.jpg", "person3.jpg"]
audio = "speech.wav"

for i, img in enumerate(images):
    output = f"./results/video_{i}"
    generate_talking_video(img, audio, output)
```

## Gradio Interface

```python
import gradio as gr
import subprocess
import tempfile
import shutil
import glob
import os

def generate_video(image, audio, pose_style, expression_scale, enhancer):
    with tempfile.TemporaryDirectory() as tmpdir:
        # Save the input image
        image_path = os.path.join(tmpdir, "input.jpg")
        audio_path = os.path.join(tmpdir, "audio.wav")
        image.save(image_path)

        # gr.Audio delivers (sample_rate, data)
        import soundfile as sf
        sf.write(audio_path, audio[1], audio[0])

        # Generate
        cmd = [
            "python", "inference.py",
            "--driven_audio", audio_path,
            "--source_image", image_path,
            "--result_dir", tmpdir,
            "--pose_style", str(pose_style),
            "--expression_scale", str(expression_scale),
        ]
        if enhancer != "none":
            cmd += ["--enhancer", enhancer]
        subprocess.run(cmd, check=True)

        # Copy the result out before the temporary directory is deleted
        for video in glob.glob(os.path.join(tmpdir, "**", "*.mp4"),
                               recursive=True):
            out_path = os.path.join(tempfile.gettempdir(),
                                    os.path.basename(video))
            shutil.copy(video, out_path)
            return out_path

    return None

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Image(type="pil", label="Source Face"),
        gr.Audio(label="Driving Audio"),
        gr.Slider(0, 46, value=0, step=1, label="Pose Style"),
        gr.Slider(0.5, 1.5, value=1.0, step=0.1, label="Expression Scale"),
        gr.Dropdown(["gfpgan", "none"], value="gfpgan", label="Enhancer")
    ],
    outputs=gr.Video(label="Generated Video"),
    title="SadTalker - Talking Head Generation"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import FileResponse
import tempfile
import subprocess
import shutil
import glob
import os

app = FastAPI()

@app.post("/generate")
async def generate(
    image: UploadFile = File(...),
    audio: UploadFile = File(...),
    pose_style: int = 0,
    expression_scale: float = 1.0
):
    with tempfile.TemporaryDirectory() as tmpdir:
        # Save uploads
        image_path = os.path.join(tmpdir, "input.jpg")
        audio_path = os.path.join(tmpdir, "audio.wav")

        with open(image_path, "wb") as f:
            f.write(await image.read())
        with open(audio_path, "wb") as f:
            f.write(await audio.read())

        # Generate
        cmd = [
            "python", "inference.py",
            "--driven_audio", audio_path,
            "--source_image", image_path,
            "--result_dir", tmpdir,
            "--pose_style", str(pose_style),
            "--expression_scale", str(expression_scale),
            "--enhancer", "gfpgan"
        ]
        subprocess.run(cmd, check=True)

        # Copy the video out before the temporary directory is deleted
        videos = glob.glob(os.path.join(tmpdir, "**", "*.mp4"), recursive=True)
        if videos:
            out_path = os.path.join(tempfile.gettempdir(), "sadtalker_result.mp4")
            shutil.copy(videos[0], out_path)
            return FileResponse(out_path, media_type="video/mp4")

    raise HTTPException(status_code=500, detail="No video produced")

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
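
The server above can be exercised with `curl`. One detail worth noting: since `pose_style` and `expression_scale` are plain scalar parameters (not `Form` fields), FastAPI reads them from the query string, while the files travel as multipart form data. A sketch that assembles the call (`build_curl_command` is my own helper):

```python
import subprocess

def build_curl_command(server, image_path, audio_path, out_path,
                       pose_style=0, expression_scale=1.0):
    """Assemble a curl invocation for the /generate endpoint."""
    url = (f"{server}/generate"
           f"?pose_style={pose_style}&expression_scale={expression_scale}")
    return [
        "curl", "-f", url,
        "-F", f"image=@{image_path}",   # multipart file field
        "-F", f"audio=@{audio_path}",
        "-o", out_path,                 # save the returned mp4
    ]

# subprocess.run(build_curl_command("http://localhost:8000",
#                                   "face.jpg", "speech.wav", "out.mp4"),
#                check=True)
```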

## Text-to-Speech + SadTalker

Complete pipeline:

```python
import subprocess
from TTS.api import TTS

def text_to_talking_video(text, image_path, output_path):
    # Generate speech with TTS
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    audio_path = "temp_audio.wav"
    tts.tts_to_file(text=text, file_path=audio_path)

    # Generate talking video
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_path,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Usage
text_to_talking_video(
    "Hello, welcome to our presentation. Today we'll discuss AI.",
    "presenter.jpg",
    "./output"
)
```

## Expression Control

```python
import subprocess

# Minimal expression (news anchor style)
subprocess.run([
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "0.5",
    "--still"  # reduces head movement
], check=True)

# Expressive (animated character)
subprocess.run([
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "1.5",
    "--pose_style", "30"
], check=True)
```

## Quality Settings

| Setting            | Speed   | Quality |
| ------------------ | ------- | ------- |
| No enhancer, 256px | Fast    | Basic   |
| GFPGAN, 256px      | Medium  | Good    |
| GFPGAN, 512px      | Slow    | Better  |
| RealESRGAN, 512px  | Slowest | Best    |
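
The rows of this table map naturally onto CLI flag sets. A sketch of that mapping; the preset names are mine, not SadTalker's, and I assume the Real-ESRGAN row uses `--background_enhancer`:

```python
# Flag sets per quality preset, mirroring the table rows above.
QUALITY_PRESETS = {
    "fast":   ["--size", "256"],
    "good":   ["--size", "256", "--enhancer", "gfpgan"],
    "better": ["--size", "512", "--enhancer", "gfpgan"],
    "best":   ["--size", "512", "--enhancer", "gfpgan",
               "--background_enhancer", "realesrgan"],
}

def inference_command(image, audio, preset="good", result_dir="./results"):
    """Build an inference.py command for a named quality preset."""
    return [
        "python", "inference.py",
        "--driven_audio", audio,
        "--source_image", image,
        "--result_dir", result_dir,
    ] + QUALITY_PRESETS[preset]

# import subprocess
# subprocess.run(inference_command("face.jpg", "speech.wav", "best"),
#                check=True)
```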

## Preprocessing Options

```bash
# Crop - animate only the detected face region (recommended)
--preprocess crop

# Resize - scale the whole image to the target size
--preprocess resize

# Full - animate the face within the full, uncropped frame
--preprocess full
```

## Troubleshooting

### Face Not Detected

* Use clear, frontal face image
* Good lighting
* Avoid occlusions (glasses, hair)

### Audio Sync Issues

* Use 16kHz WAV files
* Avoid background music
* Clear speech only
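
Resampling to 16 kHz mono is easy to script with ffmpeg. A sketch that builds the conversion command (the helper name is my own):

```python
import subprocess

def resample_command(src, dst="audio_16k.wav"):
    """Build an ffmpeg command converting src to 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", "16000",   # resample to 16 kHz
            "-ac", "1",       # downmix to mono
            dst]

# subprocess.run(resample_command("speech.mp3"), check=True)
# Then pass audio_16k.wav to --driven_audio.
```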

### Choppy Movement

* Increase expression\_scale slightly
* Try different pose\_style
* Use longer audio

### Out of Memory

* Reduce output size
* Disable enhancer
* Use crop preprocessing

## Performance

| Resolution     | GPU      | Time (10s video) |
| -------------- | -------- | ---------------- |
| 256px          | RTX 3060 | \~30s            |
| 256px          | RTX 4090 | \~15s            |
| 512px + GFPGAN | RTX 4090 | \~45s            |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
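
Combining these rates with the Performance table gives a rough per-clip cost. A back-of-the-envelope sketch using the approximate 2024 figures above; note that at these numbers the cheaper 3060 actually works out less expensive per 256px clip than the faster 4090:

```python
# Approximate hourly rates (USD) and 10-second-clip render times (s)
# taken from the two tables above.
HOURLY_RATE = {"RTX 3060": 0.03, "RTX 4090": 0.10}
RENDER_SECONDS = {"RTX 3060": 30, "RTX 4090": 15}

def cost_per_video(gpu):
    """Approximate USD cost to render one 10-second 256px clip."""
    return HOURLY_RATE[gpu] * RENDER_SECONDS[gpu] / 3600

for gpu in HOURLY_RATE:
    print(f"{gpu}: ${cost_per_video(gpu):.6f} per clip")
```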

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Wav2Lip](https://docs.clore.ai/guides/talking-heads/wav2lip) - Alternative lip sync
* [Bark TTS](https://docs.clore.ai/guides/audio-and-voice/bark-tts) - Generate speech
* [XTTS](https://docs.clore.ai/guides/audio-and-voice/xtts-coqui) - Voice cloning + TTS
