# SadTalker

Animate a face from audio to produce realistic talking-head videos.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

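## Renting on CLORE.AI

1. Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set the ports (TCP for SSH, HTTP for the web UI)
   * Add environment variables if needed
   * Enter the startup command
5. Choose a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment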

### Accessing your server

* Find the connection details under **My Orders**
* Web interface: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is SadTalker?

SadTalker generates talking-head videos:

* Lip sync from any audio
* Natural head motion
* Works from a single image
* Expression control

## Requirements

| Mode         | VRAM | Recommended GPU |
| ------------ | ---- | --------------- |
| Basic        | 4GB  | RTX 3060        |
| High quality | 6GB  | RTX 3080        |
| Full face    | 8GB  | RTX 4080        |
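
The table can double as a quick pre-flight check. The sketch below is a hypothetical helper (`pick_mode` is not part of SadTalker) that maps available VRAM to the most demanding mode listed above. The thresholds are copied from the table and should be read as rough guidance, not hard limits.

```python
# Hypothetical helper: thresholds are copied from the requirements table.
MODE_MIN_VRAM_GB = [
    ("full face", 8),
    ("high quality", 6),
    ("basic", 4),
]

def pick_mode(vram_gb):
    """Return the most demanding mode that fits in vram_gb, or None."""
    for mode, min_gb in MODE_MIN_VRAM_GB:
        if vram_gb >= min_gb:
            return mode
    return None

print(pick_mode(6))
```

On a machine with PyTorch installed, the total memory of the first GPU can be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3` and passed to the helper.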

## Quick deploy

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/OpenTalker/SadTalker.git && \
cd SadTalker && \
pip install -r requirements.txt && \
bash scripts/download_models.sh && \
GRADIO_SERVER_NAME=0.0.0.0 python app_sadtalker.py
```

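## Accessing your service

After deployment, find your `http_pub` URL under **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Find the `http_pub` URL (for example, `abc123.clorecloud.net`)

In the examples below, use `https://YOUR_HTTP_PUB_URL` in place of `localhost`.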
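
As a small illustration of that substitution, the sketch below builds a full HTTPS URL from the `http_pub` host. `service_url` is a hypothetical helper written for this guide, and the `/generate` path is only an example.

```python
def service_url(http_pub, path="/"):
    """Build an HTTPS URL from the http_pub host shown in My Orders."""
    host = http_pub.strip()
    # Tolerate a pasted scheme or a trailing slash.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    host = host.rstrip("/")
    if not path.startswith("/"):
        path = "/" + path
    return f"https://{host}{path}"

print(service_url("abc123.clorecloud.net"))
print(service_url("abc123.clorecloud.net", "generate"))
```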

## Installation

```bash
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker

pip install torch torchvision torchaudio
pip install -r requirements.txt

# Download the pretrained models
bash scripts/download_models.sh
```

## Basic usage

### Command line

```bash
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --result_dir ./results \
    --enhancer gfpgan
```

### Python API

```python
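import glob
import os
import subprocess

class SadTalker:
    """Thin wrapper that shells out to inference.py (run it from the repository root)."""

    def __init__(self, result_dir="./results"):
        self.result_dir = result_dir

    def generate(self, source_image, driven_audio, **kwargs):
        # Build the command; extra keyword arguments become "--flag value" pairs
        cmd = [
            "python", "inference.py",
            "--source_image", source_image,
            "--driven_audio", driven_audio,
            "--result_dir", self.result_dir,
        ]
        for flag, value in kwargs.items():
            cmd += [f"--{flag}", str(value)]
        subprocess.run(cmd, check=True)

        # Return the most recently written video, or None if nothing was produced
        videos = glob.glob(os.path.join(self.result_dir, "**", "*.mp4"), recursive=True)
        return max(videos, key=os.path.getmtime) if videos else None

# Usage
sadtalker = SadTalker()
video_path = sadtalker.generate(
    source_image="face.jpg",
    driven_audio="speech.wav"
)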
```

## Using face enhancement

```bash
# Enhance the face with GFPGAN
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer gfpgan \
    --result_dir ./results

# Enhance the whole frame with Real-ESRGAN (a background enhancer)
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --background_enhancer realesrgan \
    --result_dir ./results
```

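## Parameters

```bash
# --pose_style        head-motion style, 0-46
# --expression_scale  expression intensity
# --still             minimal head motion
# --preprocess        crop, resize, or full
# --size              output size
# (comments are kept above the command: text after a trailing "\" breaks line continuation)
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --pose_style 0 \
    --expression_scale 1.0 \
    --still \
    --preprocess crop \
    --size 256 \
    --enhancer gfpgan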
```

### Pose styles

| Range | Effect             |
| ----- | ------------------ |
| 0-5   | Subtle motion      |
| 6-20  | Normal motion      |
| 21-46 | Exaggerated motion |
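
When these flags are set from a script, it can help to validate them before launching a long run. The following is a minimal sketch, assuming the value ranges listed in this guide (pose styles 0-46, the three preprocess modes, and the 256/512 sizes used in the tables). `build_command` is a hypothetical helper, not part of SadTalker.

```python
def build_command(audio, image, result_dir="./results", pose_style=0,
                  expression_scale=1.0, still=False, preprocess="crop",
                  size=256, enhancer=None):
    """Assemble an inference.py argument list after basic validation."""
    if not 0 <= pose_style <= 46:
        raise ValueError("pose_style must be between 0 and 46")
    if preprocess not in ("crop", "resize", "full"):
        raise ValueError("preprocess must be crop, resize, or full")
    if size not in (256, 512):
        raise ValueError("size must be 256 or 512")
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio,
        "--source_image", image,
        "--result_dir", result_dir,
        "--pose_style", str(pose_style),
        "--expression_scale", str(expression_scale),
        "--preprocess", preprocess,
        "--size", str(size),
    ]
    if still:
        cmd.append("--still")          # flag without a value
    if enhancer:
        cmd += ["--enhancer", enhancer]
    return cmd

print(" ".join(build_command("audio.wav", "face.jpg", pose_style=30, enhancer="gfpgan")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`.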

## Batch processing

```python
import os
import subprocess

def generate_talking_video(image_path, audio_path, output_dir):
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_dir,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Process several images with the same audio
images = ["person1.jpg", "person2.jpg", "person3.jpg"]
audio = "speech.wav"

for i, img in enumerate(images):
    output = f"./results/video_{i}"
    generate_talking_video(img, audio, output)
```
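
The loop above stops at the first failed run because `check=True` raises an exception. A minimal sketch of a more tolerant batch runner follows. `plan_jobs` and `run_jobs` are hypothetical helpers, and the output-folder naming scheme is an assumption.

```python
import itertools
import os

def plan_jobs(images, audios, out_root="./results"):
    """Pair every image with every audio clip and assign an output folder."""
    jobs = []
    for image, audio in itertools.product(images, audios):
        img_name = os.path.splitext(os.path.basename(image))[0]
        aud_name = os.path.splitext(os.path.basename(audio))[0]
        jobs.append((image, audio, os.path.join(out_root, f"{img_name}__{aud_name}")))
    return jobs

def run_jobs(jobs, runner):
    """Call runner(image, audio, out_dir) per job and collect failures instead of stopping."""
    failed = []
    for image, audio, out_dir in jobs:
        try:
            runner(image, audio, out_dir)
        except Exception as exc:
            failed.append((image, audio, str(exc)))
    return failed

for job in plan_jobs(["person1.jpg", "person2.jpg"], ["speech.wav"]):
    print(job)
```

With the function defined above, a batch could be started as `run_jobs(plan_jobs(images, [audio]), generate_talking_video)`, and the returned list shows which pairs need a retry.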

## Gradio interface

```python
import gradio as gr
import subprocess
import tempfile
import os

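def generate_video(image, audio, pose_style, expression_scale, enhancer):
    # Use a directory that outlives this function: Gradio reads the returned
    # file after the call returns, so a TemporaryDirectory would delete it too early.
    workdir = tempfile.mkdtemp()

    # Save the inputs (JPEG cannot store an alpha channel, so convert to RGB)
    image_path = os.path.join(workdir, "input.jpg")
    audio_path = os.path.join(workdir, "audio.wav")
    image.convert("RGB").save(image_path)

    # Save the audio; Gradio passes a (sample_rate, samples) tuple
    import soundfile as sf
    sample_rate, samples = audio
    sf.write(audio_path, samples, sample_rate)

    # Generate
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", workdir,
        "--pose_style", str(pose_style),
        "--expression_scale", str(expression_scale),
    ]
    # Real-ESRGAN is a background enhancer; "none" means no enhancer flag at all
    if enhancer == "realesrgan":
        cmd += ["--background_enhancer", "realesrgan"]
    elif enhancer != "none":
        cmd += ["--enhancer", enhancer]
    subprocess.run(cmd, check=True)

    # Find the output video
    for f in os.listdir(workdir):
        if f.endswith(".mp4"):
            return os.path.join(workdir, f)

    return None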

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Image(type="pil", label="Source face"),
        gr.Audio(label="Driving audio"),
        gr.Slider(0, 46, value=0, step=1, label="Pose style"),
        gr.Slider(0.5, 1.5, value=1.0, step=0.1, label="Expression scale"),
        gr.Dropdown(["gfpgan", "realesrgan", "none"], value="gfpgan", label="Enhancer")
    ],
    outputs=gr.Video(label="Generated video"),
    title="SadTalker - Talking Head Generation"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API server

```python
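from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
import shutil
import tempfile
import subprocess
import os

app = FastAPI()

@app.post("/generate")
async def generate(
    image: UploadFile = File(...),
    audio: UploadFile = File(...),
    pose_style: int = 0,
    expression_scale: float = 1.0
):
    # The file is streamed after this function returns, so the working directory
    # is removed in a background task instead of by a context manager.
    workdir = tempfile.mkdtemp()

    # Save the uploaded files
    image_path = os.path.join(workdir, "input.jpg")
    audio_path = os.path.join(workdir, "audio.wav")

    with open(image_path, "wb") as f:
        f.write(await image.read())
    with open(audio_path, "wb") as f:
        f.write(await audio.read())

    # Generate (this call blocks until inference finishes)
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", workdir,
        "--pose_style", str(pose_style),
        "--expression_scale", str(expression_scale),
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

    # Return the video and clean up once the response has been sent
    for f in os.listdir(workdir):
        if f.endswith(".mp4"):
            return FileResponse(
                os.path.join(workdir, f),
                media_type="video/mp4",
                background=BackgroundTask(shutil.rmtree, workdir, ignore_errors=True),
            )

    shutil.rmtree(workdir, ignore_errors=True)
    raise HTTPException(status_code=500, detail="No video was produced")

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000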
```

## Text-to-speech + SadTalker

A complete pipeline:

```python
import subprocess
from TTS.api import TTS

def text_to_talking_video(text, image_path, output_path):
    # Generate speech with TTS
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    audio_path = "temp_audio.wav"
    tts.tts_to_file(text=text, file_path=audio_path)

    # Generate the talking video
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_path,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Usage
text_to_talking_video(
    "Hello, welcome to our presentation. Today we'll discuss AI.",
    "presenter.jpg",
    "./output"
)
```

## Expression control

```python
# Minimal expression (news-anchor style)
anchor_cmd = [
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "0.5",
    "--still"  # Reduce head motion
]

# Exaggerated expression (animated character)
cartoon_cmd = [
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "1.5",
    "--pose_style", "30"
]

# Run either list with subprocess.run(cmd, check=True)
```

## Quality settings

| Setting            | Speed   | Quality |
| ------------------ | ------- | ------- |
| No enhancer, 256px | Fast    | Basic   |
| GFPGAN, 256px      | Medium  | Good    |
| GFPGAN, 512px      | Slow    | Better  |
| RealESRGAN, 512px  | Slowest | Best    |

## Preprocessing options

```bash
# crop - focus on the face (recommended)
--preprocess crop

# resize - resize the whole image
--preprocess resize

# full - use the whole image
--preprocess full
```

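## Troubleshooting

### No face detected

* Use a clear, front-facing image of the face
* Make sure the lighting is good
* Avoid occlusions (glasses, hair)

### Audio sync issues

* Use 16kHz WAV files
* Avoid background music
* Use clean speech only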
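
The sample-rate tip can be checked before starting a run. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. `check_wav` is a hypothetical helper, and the demo file name is an assumption.

```python
import os
import tempfile
import wave

def check_wav(path, expected_rate=16000):
    """Return a list of problems found in a PCM WAV file (empty list means OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems

def write_silence(path, rate, seconds=1):
    """Write a mono 16-bit WAV of silence, used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)

demo_path = os.path.join(tempfile.mkdtemp(), "demo_44k.wav")
write_silence(demo_path, 44100)
print(check_wav(demo_path))
```

A file that fails the check can be resampled with any audio tool before it is passed to `--driven_audio`.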

### Jerky motion

* Increase expression\_scale slightly
* Try a different pose\_style
* Use longer audio

### Out of memory

* Reduce the output size
* Disable the enhancer
* Use crop preprocessing
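
These three tips can be applied one at a time, from least to most intrusive. The sketch below generates that sequence of settings. `fallback_settings` is a hypothetical helper, and the order in which the tips are applied is an assumption.

```python
def fallback_settings(size=512, enhancer="gfpgan", preprocess="full"):
    """Yield progressively lighter settings, following the tips above."""
    current = {"size": size, "enhancer": enhancer, "preprocess": preprocess}
    yield dict(current)
    if current["size"] > 256:
        current["size"] = 256            # reduce the output size
        yield dict(current)
    if current["enhancer"]:
        current["enhancer"] = None       # disable the enhancer
        yield dict(current)
    if current["preprocess"] != "crop":
        current["preprocess"] = "crop"   # use crop preprocessing
        yield dict(current)

for settings in fallback_settings():
    print(settings)
```

A caller could try each entry in turn and stop at the first run that completes without an out-of-memory error.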

## Performance

| Resolution     | GPU      | Time (10-second video) |
| -------------- | -------- | ---------------------- |
| 256px          | RTX 3060 | \~30s                  |
| 256px          | RTX 4090 | \~15s                  |
| 512px + GFPGAN | RTX 4090 | \~45s                  |
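
For clips of other lengths, a rough estimate can be extrapolated from these figures. The sketch below assumes processing time grows linearly with clip length, which the table does not confirm. `estimate_seconds` is a hypothetical helper.

```python
# Seconds of processing per 10 seconds of output, copied from the table above.
BENCHMARKS = {
    ("256px", "RTX 3060"): 30,
    ("256px", "RTX 4090"): 15,
    ("512px + GFPGAN", "RTX 4090"): 45,
}

def estimate_seconds(clip_seconds, resolution, gpu):
    """Linear extrapolation from the 10-second benchmark (an assumption)."""
    return BENCHMARKS[(resolution, gpu)] * clip_seconds / 10

print(estimate_seconds(60, "256px", "RTX 3060"))  # 180.0
```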

## Cost estimates

Typical CLORE.AI Marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
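
The session column is simply the hourly rate multiplied by the number of hours. A minimal sketch of that arithmetic follows, using the approximate hourly rates from the table. `session_cost` is a hypothetical helper, and live marketplace prices will differ.

```python
# Approximate hourly rates in USD, copied from the table above.
HOURLY_USD = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}

def session_cost(gpu, hours):
    """Estimated cost in USD of renting gpu for the given number of hours."""
    return round(HOURLY_USD[gpu] * hours, 2)

print(session_cost("RTX 4090", 4))
print(session_cost("RTX 3060", 4))
```

Because the table's figures are rounded approximations, results for longer periods may differ slightly from the daily-rate column.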

**Saving money:**

* Use the **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers

## Next steps

* [Wav2Lip](/guides/guides_v2-zh/hui-shuo-hua-de-tou-xiang/wav2lip.md) - An alternative lip-sync method
* [Bark TTS](/guides/guides_v2-zh/yin-pin-yu-yu-yin/bark-tts.md) - Speech generation
* [XTTS](/guides/guides_v2-zh/yin-pin-yu-yu-yin/xtts-coqui.md) - Voice cloning + TTS

