# SadTalker

Animate a face with audio to produce realistic talking-head videos.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set the ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Accessing Your Server

* Find the connection details under **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What Is SadTalker?

SadTalker generates talking-head videos:

* Lip sync driven by any audio
* Natural head movement
* Works from a single image
* Expression control

## Requirements

| Mode         | VRAM | Recommended GPU |
| ------------ | ---- | --------------- |
| Basic        | 4 GB | RTX 3060        |
| High quality | 6 GB | RTX 3080        |
| Full face    | 8 GB | RTX 4080        |
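
The requirements table can also be checked in code before renting a server. The following is a minimal sketch: the helper name `modes_for_vram` and the dictionary are illustrative and not part of SadTalker, and the numbers are simply copied from the table.

```python
# VRAM requirements in GB, copied from the requirements table.
# The helper is illustrative and not part of SadTalker.
MODE_VRAM_GB = {"Basic": 4, "High quality": 6, "Full face": 8}


def modes_for_vram(vram_gb):
    """Return the modes whose VRAM requirement fits into vram_gb."""
    return [mode for mode, need in MODE_VRAM_GB.items() if need <= vram_gb]


if __name__ == "__main__":
    for gb in (4, 6, 12):
        print(gb, "GB:", modes_for_vram(gb))
```

Passing the VRAM of a marketplace listing shows which modes that card should be able to run according to the table.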

## Quick Deployment

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
cd /workspace && \
git clone https://github.com/OpenTalker/SadTalker.git && \
cd SadTalker && \
pip install -r requirements.txt && \
bash scripts/download_models.sh && \
GRADIO_SERVER_NAME=0.0.0.0 python app_sadtalker.py
```

## Accessing Your Service

After deployment, find your `http_pub` URL under **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Locate the `http_pub` URL (e.g. `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
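
Swapping `localhost` for the `http_pub` host can be wrapped in a small helper so that client scripts only need the value copied from the order page. This is a sketch; `service_url` is a hypothetical name, and it only assumes that the service is reachable over HTTPS as described above.

```python
def service_url(http_pub, path="/"):
    """Build an HTTPS URL from the http_pub host shown in the order details."""
    host = http_pub.strip()
    # Accept values pasted with or without a scheme
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    host = host.rstrip("/")
    if not path.startswith("/"):
        path = "/" + path
    return f"https://{host}{path}"


if __name__ == "__main__":
    print(service_url("abc123.clorecloud.net"))
```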

## Installation

```bash
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker

pip install torch torchvision torchaudio
pip install -r requirements.txt

# Download the pretrained models
bash scripts/download_models.sh
```

## Basic Usage

### Command Line

```bash
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --result_dir ./results \
    --enhancer gfpgan
```

### Python API

```python
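import glob
import os
import subprocess
import sys


class SadTalker:
    """Thin wrapper around the inference.py CLI (not SadTalker internals).

    Run it from the SadTalker repository root so inference.py is found.
    """

    def __init__(self, result_dir="./results", enhancer="gfpgan"):
        self.result_dir = result_dir
        self.enhancer = enhancer

    def generate(self, source_image, driven_audio, **kwargs):
        # Build the CLI call; extra keyword arguments become "--key value" flags
        cmd = [sys.executable, "inference.py",
               "--driven_audio", driven_audio,
               "--source_image", source_image,
               "--result_dir", self.result_dir]
        if self.enhancer:
            cmd += ["--enhancer", self.enhancer]
        for key, value in kwargs.items():
            cmd += [f"--{key}", str(value)]
        subprocess.run(cmd, check=True)

        # Return the most recently written video, or None if nothing was produced
        videos = glob.glob(os.path.join(self.result_dir, "*.mp4"))
        return max(videos, key=os.path.getmtime) if videos else None

# Usage
sadtalker = SadTalker()
video_path = sadtalker.generate(
    source_image="face.jpg",
    driven_audio="speech.wav"
)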
```

## With Face Enhancement

```bash

# Use GFPGAN to enhance the face
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer gfpgan \
    --result_dir ./results

# Add Real-ESRGAN to upscale the rest of the frame
# (selected with --background_enhancer, alongside a face enhancer)
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --enhancer gfpgan \
    --background_enhancer realesrgan \
    --result_dir ./results
```

## Parameters

```bash
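# A comment after a trailing backslash ends the command early,
# so the flags are described here instead:
#   --pose_style        head-motion style, 0-46
#   --expression_scale  expression intensity
#   --still             minimal head movement
#   --preprocess        crop, resize, or full
#   --size              output size in pixels
python inference.py \
    --driven_audio audio.wav \
    --source_image face.jpg \
    --pose_style 0 \
    --expression_scale 1.0 \
    --still \
    --preprocess crop \
    --size 256 \
    --enhancer gfpgan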
```

### Pose Styles

| Range | Effect              |
| ----- | ------------------- |
| 0-5   | Subtle movement     |
| 6-20  | Normal movement     |
| 21-46 | Expressive movement |
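
When `--pose_style` comes from user input, a lookup like the one below can validate the value and say which group it falls into. It is a sketch: the ranges are copied from the table, while the function name and the labels are illustrative.

```python
# Ranges copied from the pose style table; the names are illustrative.
POSE_STYLE_RANGES = [
    (range(0, 6), "subtle"),
    (range(6, 21), "normal"),
    (range(21, 47), "expressive"),
]


def describe_pose_style(pose_style):
    """Return the movement category for a --pose_style value."""
    for values, label in POSE_STYLE_RANGES:
        if pose_style in values:
            return label
    raise ValueError(f"pose_style must be between 0 and 46, got {pose_style}")


if __name__ == "__main__":
    for style in (0, 10, 30):
        print(style, describe_pose_style(style))
```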

## Batch Processing

```python
import os
import subprocess

def generate_talking_video(image_path, audio_path, output_dir):
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_dir,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Process several images with the same audio
images = ["person1.jpg", "person2.jpg", "person3.jpg"]
audio = "speech.wav"

for i, img in enumerate(images):
    output = f"./results/video_{i}"
    generate_talking_video(img, audio, output)
```

## Gradio Interface

```python
import os
import subprocess
import sys
import tempfile

import gradio as gr
import soundfile as sf

# Run this script from the SadTalker repository root so inference.py is found.

def generate_video(image, audio, pose_style, expression_scale, enhancer):
    # mkdtemp keeps the directory after the function returns; a
    # TemporaryDirectory would delete the video before Gradio can serve it.
    tmpdir = tempfile.mkdtemp()

    # Save the inputs
    image_path = os.path.join(tmpdir, "input.jpg")
    audio_path = os.path.join(tmpdir, "audio.wav")
    image.convert("RGB").save(image_path)  # JPEG cannot store an alpha channel

    # Gradio passes audio as a (sample_rate, samples) tuple
    sample_rate, samples = audio
    sf.write(audio_path, samples, sample_rate)

    # Generate
    cmd = [
        sys.executable, "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", tmpdir,
        "--pose_style", str(int(pose_style)),
        "--expression_scale", str(expression_scale),
    ]
    # "none" means: do not pass an enhancer flag at all
    if enhancer != "none":
        cmd += ["--enhancer", enhancer]
    subprocess.run(cmd, check=True)

    # Find the output video
    for f in os.listdir(tmpdir):
        if f.endswith(".mp4"):
            return os.path.join(tmpdir, f)

    return None

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Image(type="pil", label="Source Face"),
        gr.Audio(label="Driving Audio"),
        gr.Slider(0, 46, value=0, step=1, label="Pose Style"),
        gr.Slider(0.5, 1.5, value=1.0, step=0.1, label="Expression Scale"),
        gr.Dropdown(["gfpgan", "RestoreFormer", "none"], value="gfpgan", label="Enhancer")
    ],
    outputs=gr.Video(label="Generated Video"),
    title="SadTalker - Talking Head Generation"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
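import os
import shutil
import subprocess
import sys
import tempfile

from fastapi import FastAPI, File, HTTPException, UploadFile
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask

app = FastAPI()

# Start the server from the SadTalker repository root so inference.py is found.

@app.post("/generate")
def generate(
    image: UploadFile = File(...),
    audio: UploadFile = File(...),
    pose_style: int = 0,
    expression_scale: float = 1.0
):
    # A plain `def` endpoint runs in a worker thread, so the blocking
    # subprocess call below does not stall the event loop.
    # mkdtemp keeps the files until the response has been sent.
    tmpdir = tempfile.mkdtemp()
    try:
        # Save the uploads
        image_path = os.path.join(tmpdir, "input.jpg")
        audio_path = os.path.join(tmpdir, "audio.wav")
        with open(image_path, "wb") as f:
            shutil.copyfileobj(image.file, f)
        with open(audio_path, "wb") as f:
            shutil.copyfileobj(audio.file, f)

        # Generate
        cmd = [
            sys.executable, "inference.py",
            "--driven_audio", audio_path,
            "--source_image", image_path,
            "--result_dir", tmpdir,
            "--pose_style", str(pose_style),
            "--expression_scale", str(expression_scale),
            "--enhancer", "gfpgan"
        ]
        subprocess.run(cmd, check=True)

        # Return the video; the temp directory is removed after the response is sent
        for name in os.listdir(tmpdir):
            if name.endswith(".mp4"):
                return FileResponse(
                    os.path.join(tmpdir, name),
                    media_type="video/mp4",
                    background=BackgroundTask(shutil.rmtree, tmpdir, ignore_errors=True),
                )
        raise HTTPException(status_code=500, detail="No video was produced")
    except Exception:
        # Clean up on any failure, then let FastAPI report the error
        shutil.rmtree(tmpdir, ignore_errors=True)
        raise

# Run: uvicorn server:app --host 0.0.0.0 --port 8000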
```
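
A client for the `/generate` endpoint could look like the sketch below. It assumes the server above is reachable on port 8000 and that the third-party `requests` package is installed. The function names are illustrative. `pose_style` and `expression_scale` are sent as query parameters because the endpoint declares them as plain scalars next to the file uploads.

```python
from urllib.parse import urlencode


def build_generate_url(base_url, pose_style=0, expression_scale=1.0):
    """Return the /generate URL with the two scalar options as query parameters."""
    query = urlencode({"pose_style": pose_style, "expression_scale": expression_scale})
    return f"{base_url.rstrip('/')}/generate?{query}"


def request_video(base_url, image_path, audio_path, output_path, **options):
    """Upload an image and an audio file, then save the returned MP4."""
    import requests  # third-party; imported lazily so the URL helper works without it

    with open(image_path, "rb") as image, open(audio_path, "rb") as audio:
        response = requests.post(
            build_generate_url(base_url, **options),
            files={"image": image, "audio": audio},
            timeout=600,  # generation can take a while
        )
    response.raise_for_status()
    with open(output_path, "wb") as out:
        out.write(response.content)
    return output_path


if __name__ == "__main__":
    print(build_generate_url("http://localhost:8000", pose_style=3))
```

A call such as `request_video("http://localhost:8000", "face.jpg", "speech.wav", "result.mp4", pose_style=3)` would then write the generated video to `result.mp4`.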

## Text-to-Speech + SadTalker

Complete pipeline:

```python
import subprocess
from TTS.api import TTS

def text_to_talking_video(text, image_path, output_path):
    # Generate speech with TTS
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    audio_path = "temp_audio.wav"
    tts.tts_to_file(text=text, file_path=audio_path)

    # Generate the talking-head video
    cmd = [
        "python", "inference.py",
        "--driven_audio", audio_path,
        "--source_image", image_path,
        "--result_dir", output_path,
        "--enhancer", "gfpgan"
    ]
    subprocess.run(cmd, check=True)

# Usage
text_to_talking_video(
    "Hello, welcome to our presentation. Today we'll discuss AI.",
    "presenter.jpg",
    "./output"
)
```

## Expression Control

```python

import subprocess

# Minimal expression (news-anchor style)
cmd = [
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "0.5",
    "--still"  # Reduces head movement
]
subprocess.run(cmd, check=True)

# Expressive (animated character)
cmd = [
    "python", "inference.py",
    "--driven_audio", "audio.wav",
    "--source_image", "face.jpg",
    "--expression_scale", "1.5",
    "--pose_style", "30"
]
subprocess.run(cmd, check=True)
```

## Quality Settings

| Setting            | Speed   | Quality |
| ------------------ | ------- | ------- |
| No enhancer, 256px | Fast    | Basic   |
| GFPGAN, 256px      | Medium  | Good    |
| GFPGAN, 512px      | Slow    | Better  |
| RealESRGAN, 512px  | Slowest | Best    |
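
The four rows can be turned into reusable flag sets. The sketch below rests on two assumptions that should be checked against your SadTalker checkout: the preset names are invented for this example, and Real-ESRGAN is assumed to be enabled through `--background_enhancer` together with a face enhancer.

```python
# Preset names are hypothetical; the values mirror the four table rows.
# Assumption: Real-ESRGAN is passed as --background_enhancer next to GFPGAN.
QUALITY_PRESETS = {
    "fast": {"size": 256, "enhancer": None, "background_enhancer": None},
    "good": {"size": 256, "enhancer": "gfpgan", "background_enhancer": None},
    "better": {"size": 512, "enhancer": "gfpgan", "background_enhancer": None},
    "best": {"size": 512, "enhancer": "gfpgan", "background_enhancer": "realesrgan"},
}


def quality_flags(preset):
    """Translate a preset name into inference.py command-line flags."""
    settings = QUALITY_PRESETS[preset]
    flags = ["--size", str(settings["size"])]
    if settings["enhancer"]:
        flags += ["--enhancer", settings["enhancer"]]
    if settings["background_enhancer"]:
        flags += ["--background_enhancer", settings["background_enhancer"]]
    return flags


if __name__ == "__main__":
    for name in QUALITY_PRESETS:
        print(name, quality_flags(name))
```

The returned list can be appended to the `cmd` list used in the subprocess examples on this page.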

## Preprocessing Options

```bash

# crop - focus on the face (recommended)
--preprocess crop

# resize - scale the whole image
--preprocess resize

# full - use the whole image
--preprocess full
```

## Troubleshooting

### Face Not Detected

* Use a clear, front-facing image of the face
* Make sure the face is well lit
* Avoid occlusions (glasses, hair)

### Audio Sync Issues

* Use 16 kHz WAV files
* Avoid background music
* Use clean speech only
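
Whether a file already meets the 16 kHz recommendation can be checked with Python's standard `wave` module before starting a run. This is a small sketch with illustrative function names. It only reads the header of uncompressed PCM WAV files and does not resample anything.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def is_16khz_wav(path):
    """True if the file is a readable PCM WAV sampled at 16 kHz."""
    try:
        return wav_sample_rate(path) == 16000
    except (wave.Error, EOFError):
        # Not a WAV file, or a truncated one
        return False
```

If the check fails, the audio can be converted first, for example with `ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav`.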

### Jerky Motion

* Increase `expression_scale` slightly
* Try a different `pose_style`
* Use a longer audio clip

### Out of Memory

* Reduce the output size
* Disable the enhancer
* Use `crop` preprocessing
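
These three measures can be applied one after another when a run fails with an out-of-memory error. The generator below is a sketch of such a fallback order. The function name and the default starting values are assumptions, and the order simply follows the list.

```python
def lighter_settings(size=512, enhancer="gfpgan", preprocess="full"):
    """Yield progressively lighter (size, enhancer, preprocess) combinations."""
    if size > 256:
        size = 256  # reduce the output size
        yield size, enhancer, preprocess
    if enhancer is not None:
        enhancer = None  # disable the enhancer
        yield size, enhancer, preprocess
    if preprocess != "crop":
        preprocess = "crop"  # fall back to crop preprocessing
        yield size, enhancer, preprocess


if __name__ == "__main__":
    for attempt in lighter_settings():
        print(attempt)
```

A retry loop would try each combination in turn and stop at the first one that completes.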

## Performance

| Resolution     | GPU      | Time (10 s video) |
| -------------- | -------- | ----------------- |
| 256px          | RTX 3060 | \~30s             |
| 256px          | RTX 4090 | \~15s             |
| 512px + GFPGAN | RTX 4090 | \~45s             |
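
For longer clips, a rough estimate can be extrapolated from these figures. The sketch assumes that processing time grows linearly with video length, which the table does not state. The names are illustrative, and the results are only ballpark values.

```python
# Approximate processing seconds per 10 s of video, copied from the table.
SECONDS_PER_10S_CLIP = {
    ("256px", "RTX 3060"): 30,
    ("256px", "RTX 4090"): 15,
    ("512px + GFPGAN", "RTX 4090"): 45,
}


def estimate_render_seconds(resolution, gpu, video_seconds):
    """Estimate processing time, assuming it scales linearly with video length."""
    per_10s = SECONDS_PER_10S_CLIP[(resolution, gpu)]
    return per_10s * video_seconds / 10


if __name__ == "__main__":
    # A 60-second clip at 256px on an RTX 3060
    print(estimate_render_seconds("256px", "RTX 3060", 60))
```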

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
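
The session and daily columns follow from the hourly rate, so other durations can be estimated the same way. The sketch below uses the approximate hourly rates from the table. The function name is illustrative, and because the table rounds its figures, results can differ by a few cents.

```python
# Approximate hourly rates in USD, copied from the table (2024 figures).
HOURLY_RATE_USD = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}


def rental_cost(gpu, hours):
    """Return the estimated cost in USD for renting gpu for the given hours."""
    return round(HOURLY_RATE_USD[gpu] * hours, 2)


if __name__ == "__main__":
    for gpu in HOURLY_RATE_USD:
        print(gpu, rental_cost(gpu, 4))
```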

**Saving money:**

* Use the **Spot** market for flexible workloads (often 30–50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers

## Next Steps

* [Wav2Lip](/guides/guides_v2-de/talking-heads/wav2lip.md) - Alternative lip-sync model
* [Bark TTS](/guides/guides_v2-de/audio-and-sprache/bark-tts.md) - Generate speech
* [XTTS](/guides/guides_v2-de/audio-and-sprache/xtts-coqui.md) - Voice cloning + TTS

