# SadTalker

Animate faces with audio to create realistic talking head videos.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is SadTalker?

SadTalker generates talking videos:

* Lip-sync from any audio
* Natural head movements
* Works with single image
* Expression control

## Requirements

| Mode         | VRAM | Recommended |
| ------------ | ---- | ----------- |
| Basic        | 4GB  | RTX 3060    |
| High Quality | 6GB  | RTX 3080    |
| Full Face    | 8GB  | RTX 4080    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/OpenTalker/SadTalker.git && \
cd SadTalker && \
pip install -r requirements.txt && \
bash scripts/download_models.sh && \
python app.py
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker

pip install torch torchvision torchaudio
pip install -r requirements.txt

# Download pretrained models
bash scripts/download_models.sh
```

## Basic Usage

### Command Line

```bash
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --result_dir ./results \
    --enhancer gfpgan
```

### Python API

```python
from src.facerender.animate import AnimateFromCoeff
from src.generate_batch import get_data
from src.generate_facerender_batch import get_facerender_data
import torch

class SadTalker:
    def __init__(self):
        self.device = "cuda"
        # Initialize models...

    def generate(self, source_image, driven_audio, **kwargs):
        # Process audio and image
        # Generate animation
        # Return video path
        pass

# Usage
sadtalker = SadTalker()
video_path = sadtalker.generate(
    source_image="face.jpg",
    driven_audio="speech.wav"
)
```

## With Face Enhancement

```bash

# Using GFPGAN for face enhancement
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer gfpgan \
    --result_dir ./results

# Using Real-ESRGAN for full image
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer realesrgan \
    --result_dir ./results
```

## Parameters

```bash
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --pose_style 0 \           # 0-46 head movement styles
    --expression_scale 1.0 \   # Expression intensity
    --still \                  # Minimal head movement
    --preprocess crop \        # crop, resize, full
    --size 256 \               # Output size
    --enhancer gfpgan
```

### Pose Styles

| Range | Effect               |
| ----- | -------------------- |
| 0-5   | Subtle movements     |
| 6-20  | Normal movements     |
| 21-46 | Expressive movements |

## Batch Processing

```python
import os
import subprocess

def generate_talking_video(image_path, audio_path, output_dir):
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_dir,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Process multiple images with same audio
images = ["person1.jpg", "person2.jpg", "person3.jpg"]
audio = "speech.wav"

for i, img in enumerate(images):
    output = f"./results/video_{i}"
    generate_talking_video(img, audio, output)
```

## Gradio Interface

```python
import gradio as gr
import subprocess
import tempfile
import os

def generate_video(image, audio, pose_style, expression_scale, enhancer):
    with tempfile.TemporaryDirectory() as tmpdir:
        # Save inputs
        image_path = os.path.join(tmpdir, "input.jpg")
        audio_path = os.path.join(tmpdir, "audio.wav")
        image.save(image_path)

        # Save audio
        import soundfile as sf
        sf.write(audio_path, audio[1], audio[0])

        # Generate
        cmd = [
            "python", "inference.py",
            "--driven_audio", audio_path,
            "--source_image", image_path,
            "--result_dir", tmpdir,
            "--pose_style", str(pose_style),
            "--expression_scale", str(expression_scale),
            "--enhancer", enhancer
        ]
        subprocess.run(cmd, check=True)

        # Find output video
        for f in os.listdir(tmpdir):
            if f.endswith(".mp4"):
                return os.path.join(tmpdir, f)

    return None

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Image(type="pil", label="Source Face"),
        gr.Audio(label="Driving Audio"),
        gr.Slider(0, 46, value=0, step=1, label="Pose Style"),
        gr.Slider(0.5, 1.5, value=1.0, step=0.1, label="Expression Scale"),
        gr.Dropdown(["gfpgan", "realesrgan", "none"], value="gfpgan", label="Enhancer")
    ],
    outputs=gr.Video(label="Generated Video"),
    title="SadTalker - Talking Head Generation"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import FileResponse
import tempfile
import subprocess
import os

app = FastAPI()

@app.post("/generate")
async def generate(
    image: UploadFile = File(...),
    audio: UploadFile = File(...),
    pose_style: int = 0,
    expression_scale: float = 1.0
):
    with tempfile.TemporaryDirectory() as tmpdir:
        # Save uploads
        image_path = os.path.join(tmpdir, "input.jpg")
        audio_path = os.path.join(tmpdir, "audio.wav")

        with open(image_path, "wb") as f:
            f.write(await image.read())
        with open(audio_path, "wb") as f:
            f.write(await audio.read())

        # Generate
        cmd = [
            "python", "inference.py",
            "--driven_audio", audio_path,
            "--source_image", image_path,
            "--result_dir", tmpdir,
            "--pose_style", str(pose_style),
            "--expression_scale", str(expression_scale),
            "--enhancer", "gfpgan"
        ]
        subprocess.run(cmd, check=True)

        # Return video
        for f in os.listdir(tmpdir):
            if f.endswith(".mp4"):
                return FileResponse(os.path.join(tmpdir, f), media_type="video/mp4")

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```

## Text-to-Speech + SadTalker

Complete pipeline:

```python
import subprocess
from TTS.api import TTS

def text_to_talking_video(text, image_path, output_path):
    # Generate speech with TTS
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    audio_path = "temp_audio.wav"
    tts.tts_to_file(text=text, file_path=audio_path)

    # Generate talking video
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_path,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Usage
text_to_talking_video(
    "Hello, welcome to our presentation. Today we'll discuss AI.",
    "presenter.jpg",
    "./output"
)
```

## Expression Control

```python

# Minimal expression (news anchor style)
cmd = [
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "0.5",
    "--still"  # Reduces head movement
]

# Expressive (animated character)
cmd = [
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "1.5",
    "--pose_style", "30"
]
```

## Quality Settings

| Setting            | Speed   | Quality |
| ------------------ | ------- | ------- |
| No enhancer, 256px | Fast    | Basic   |
| GFPGAN, 256px      | Medium  | Good    |
| GFPGAN, 512px      | Slow    | Better  |
| RealESRGAN, 512px  | Slowest | Best    |

## Preprocessing Options

```bash

# Crop - focus on face (recommended)
--preprocess crop

# Resize - resize full image
--preprocess resize

# Full - use full image
--preprocess full
```

## Troubleshooting

### Face Not Detected

* Use clear, frontal face image
* Good lighting
* Avoid occlusions (glasses, hair)

### Audio Sync Issues

* Use 16kHz WAV files
* Avoid background music
* Clear speech only

### Choppy Movement

* Increase expression\_scale slightly
* Try different pose\_style
* Use longer audio

### Out of Memory

* Reduce output size
* Disable enhancer
* Use crop preprocessing

## Performance

| Resolution     | GPU      | Time (10s video) |
| -------------- | -------- | ---------------- |
| 256px          | RTX 3060 | \~30s            |
| 256px          | RTX 4090 | \~15s            |
| 512px + GFPGAN | RTX 4090 | \~45s            |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Wav2Lip](/guides/talking-heads/wav2lip.md) - Alternative lip sync
* [Bark TTS](/guides/audio-and-voice/bark-tts.md) - Generate speech
* [XTTS](/guides/audio-and-voice/xtts-coqui.md) - Voice cloning + TTS


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/talking-heads/sadtalker.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
