
# XTTS (Coqui)

Generate natural-sounding speech with voice cloning using Coqui XTTS.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed price) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set the ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for the server to be provisioned

### Accessing Your Server

* Find the connection details in **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What Is XTTS?

XTTS (by Coqui) offers:

* High-quality text-to-speech
* Voice cloning from as little as 6 seconds of audio
* 17 supported languages
* Emotional control
* Streaming support

## Requirements

| Mode           | VRAM | Recommended GPU |
| -------------- | ---- | --------------- |
| Inference      | 4 GB | RTX 3060        |
| Fast inference | 6 GB | RTX 3080        |
| Streaming      | 4 GB | RTX 3060        |

## Quick Deployment

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
pip install TTS && \
tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2 --port 8000 --use_cuda true
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
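As a rough sketch of that substitution from a client's point of view, the snippet below builds a request URL for the `/api/tts` route served by Coqui's `tts-server`. The host `abc123.clorecloud.net` is the placeholder from the steps above, `build_tts_url` and `download_speech` are hypothetical helper names, and the `speaker_id` and `language_id` values are assumptions that depend on the model being served. The download call is commented out because it needs a running server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(host: str, text: str, language_id: str = "en", speaker_id: str = "") -> str:
    """Build a GET URL for the /api/tts route (query parameter names are assumed)."""
    params = {"text": text, "language_id": language_id}
    if speaker_id:
        params["speaker_id"] = speaker_id
    return f"https://{host}/api/tts?{urlencode(params)}"


def download_speech(url: str, out_path: str) -> None:
    """Fetch the audio bytes returned by the server and write them to disk."""
    with urlopen(url, timeout=120) as response, open(out_path, "wb") as out_file:
        out_file.write(response.read())


# abc123.clorecloud.net is a placeholder; use the http_pub value from your order.
url = build_tts_url("abc123.clorecloud.net", "Hello from XTTS.", speaker_id="Ana Florence")
print(url)
# download_speech(url, "output.wav")  # uncomment once the server is reachable
```

Keeping URL construction separate from the network call makes it easy to check the request before spending GPU time on it.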

## Installation

```bash
pip install TTS
```

## Basic Usage

### Simple TTS

```python
from TTS.api import TTS

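# Load XTTS v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Generate speech. XTTS is a multi-speaker model, so it needs either a
# built-in speaker name or a speaker_wav reference. "Ana Florence" is assumed
# to be among the built-in voices of your installed TTS version.
tts.tts_to_file(
    text="Hello, this is a test of the XTTS text-to-speech system.",
    file_path="output.wav",
    speaker="Ana Florence",
    language="en"
)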
```

### Voice Cloning

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone a voice from reference audio (6+ seconds)
tts.tts_to_file(
    text="This is my cloned voice speaking new text.",
    file_path="cloned_output.wav",
    speaker_wav="reference_voice.wav",
    language="en"
)
```

## Multiple Languages

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# English
tts.tts_to_file(
    text="Hello, how are you today?",
    file_path="english.wav",
    speaker_wav="voice.wav",
    language="en"
)

# Spanish
tts.tts_to_file(
    text="Hola, ¿cómo estás hoy?",
    file_path="spanish.wav",
    speaker_wav="voice.wav",
    language="es"
)

# German
tts.tts_to_file(
    text="Hallo, wie geht es dir heute?",
    file_path="german.wav",
    speaker_wav="voice.wav",
    language="de"
)

# Russian
tts.tts_to_file(
    text="Привет, как дела?",
    file_path="russian.wav",
    speaker_wav="voice.wav",
    language="ru"
)
```

### Supported Languages

| Code  | Language   |
| ----- | ---------- |
| en    | English    |
| es    | Spanish    |
| fr    | French     |
| de    | German     |
| it    | Italian    |
| pt    | Portuguese |
| pl    | Polish     |
| tr    | Turkish    |
| ru    | Russian    |
| nl    | Dutch      |
| cs    | Czech      |
| ar    | Arabic     |
| zh-cn | Chinese    |
| ja    | Japanese   |
| hu    | Hungarian  |
| ko    | Korean     |
| hi    | Hindi      |
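Passing a code that is not in this table is an easy mistake to make, so it can help to validate user input before calling the model. The following is a minimal sketch; `normalize_language` is a hypothetical helper, and the set simply mirrors the table above.

```python
# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def normalize_language(code: str) -> str:
    """Return a cleaned-up language code, or raise ValueError if it is unsupported."""
    normalized = code.strip().lower().replace("_", "-")
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return normalized


print(normalize_language(" DE "))   # de
print(normalize_language("zh_CN"))  # zh-cn
```

The returned value can be passed directly as the `language` argument in the examples above.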

## Streaming TTS

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch
import sounddevice as sd

# Load the model
config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Compute the speaker embedding from the reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav"
)

# Streaming generation
chunks = model.inference_stream(
    text="This is a streaming test of the XTTS system.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20
)

# Play each chunk as soon as it has been generated
for chunk in chunks:
    audio = chunk.squeeze().cpu().numpy()
    sd.play(audio, samplerate=24000)
    sd.wait()
```
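On a rented server there is usually no audio device, so writing the streamed chunks to a file is often more practical than playing them. The sketch below uses only the standard library; `write_chunks_to_wav` is a hypothetical helper, and it assumes each chunk is a sequence of floats in the range -1 to 1 at the 24000 Hz rate used in the example above.

```python
import os
import tempfile
import wave
from array import array
from typing import Iterable, Sequence


def write_chunks_to_wav(chunks: Iterable[Sequence[float]], path: str, sample_rate: int = 24000) -> int:
    """Append float chunks to a 16-bit mono WAV file and return the sample count."""
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] and scale to the signed 16-bit range.
            pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk))
            wav_file.writeframes(pcm.tobytes())
            total += len(pcm)
    return total


# Stand-in data; with XTTS you would pass something like
# (chunk.squeeze().cpu().numpy() for chunk in chunks) instead.
demo_chunks = [[0.0, 0.5, -0.5], [1.0, -1.0]]
demo_path = os.path.join(tempfile.gettempdir(), "xtts_stream_demo.wav")
written = write_chunks_to_wav(demo_chunks, demo_path)
print(f"Wrote {written} samples to {demo_path}")
```

The per-sample Python loop is slow for long clips; converting with NumPy would be faster, but this keeps the sketch dependency-free.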

## Gradio Interface

```python
import gradio as gr
from TTS.api import TTS
import tempfile

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def generate_speech(text, reference_audio, language):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        if reference_audio:
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                speaker_wav=reference_audio,
                language=language
            )
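        else:
            # No reference uploaded: fall back to a built-in speaker
            # ("Ana Florence" is assumed to exist in your TTS version)
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                speaker="Ana Florence",
                language=language
            )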
        return f.name

demo = gr.Interface(
    fn=generate_speech,
    inputs=[
        gr.Textbox(label="Text to speak", lines=5),
        gr.Audio(type="filepath", label="Reference voice (optional)"),
        gr.Dropdown(
            ["en", "es", "fr", "de", "it", "pt", "ru", "zh-cn", "ja"],
            value="en",
            label="Language"
        )
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="XTTS Voice Cloning"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
from TTS.api import TTS
import tempfile
import os

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    language: str = Form(default="en"),
    speaker: UploadFile = File(default=None)
):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_file:
        if speaker:
            # Save the uploaded reference to a temporary file
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as ref_file:
                ref_file.write(await speaker.read())
                ref_path = ref_file.name

            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                speaker_wav=ref_path,
                language=language
            )
            os.unlink(ref_path)
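        else:
            # No reference uploaded: fall back to a built-in speaker
            # ("Ana Florence" is assumed to exist in your TTS version)
            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                speaker="Ana Florence",
                language=language
            )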

        return FileResponse(out_file.name, media_type="audio/wav")

# Requires: pip install fastapi uvicorn python-multipart
# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
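A client for the `/synthesize` route above might look like the following sketch. `build_synthesize_request` and `synthesize` are hypothetical helpers, the third-party `requests` package is an assumption (install it with `pip install requests`), and the base URL is whatever address your server is reachable at.

```python
from typing import Optional


def build_synthesize_request(base_url: str, text: str, language: str = "en") -> tuple[str, dict[str, str]]:
    """Return the endpoint URL and form fields expected by the /synthesize route."""
    return f"{base_url.rstrip('/')}/synthesize", {"text": text, "language": language}


def synthesize(base_url: str, text: str, out_path: str,
               language: str = "en", speaker_path: Optional[str] = None) -> None:
    """POST the text (and an optional reference clip) and save the returned WAV."""
    import requests  # third-party; imported lazily so the helper above has no dependencies

    url, data = build_synthesize_request(base_url, text, language)
    if speaker_path:
        with open(speaker_path, "rb") as speaker_file:
            files = {"speaker": ("reference.wav", speaker_file, "audio/wav")}
            response = requests.post(url, data=data, files=files, timeout=300)
    else:
        response = requests.post(url, data=data, timeout=300)
    response.raise_for_status()
    with open(out_path, "wb") as out_file:
        out_file.write(response.content)


url, fields = build_synthesize_request("http://localhost:8000/", "Hello from the API.")
print(url, fields)
# synthesize("http://localhost:8000", "Hello from the API.", "api_output.wav")
```

The form field names (`text`, `language`, `speaker`) match the parameters declared in the server code above.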

## Batch Processing

```python
from TTS.api import TTS
import os

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

texts = [
    "Welcome to our platform.",
    "Please review the following information.",
    "Thank you for your attention.",
    "Have a nice day!"
]

reference_voice = "speaker.wav"
output_dir = "./audio_files"
os.makedirs(output_dir, exist_ok=True)

for i, text in enumerate(texts):
    output_path = f"{output_dir}/audio_{i:03d}.wav"

    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_voice,
        language="en"
    )

    print(f"Generated: {output_path}")
```

## Refining the Cloned Voice

For better voice cloning, condition the model on several reference clips:

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

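config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
# Load the pretrained weights; without them the model cannot synthesize speech
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Use several reference samples for better quality
reference_files = [
    "sample1.wav",
    "sample2.wav",
    "sample3.wav"
]

# Extract conditioning latents and a speaker embedding from all samples
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=reference_files
)

# Generate with the combined embedding; the result is a dict whose "wav"
# entry holds the waveform
output = model.inference(
    text="High-quality cloned speech.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding
)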
```

## Audio Preprocessing

```python
import librosa
import soundfile as sf

def prepare_reference(input_path, output_path, target_sr=22050):
    # Load and resample to the target rate (mono by default)
    audio, sr = librosa.load(input_path, sr=target_sr)

    # Trim leading and trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Normalize the peak amplitude
    audio = librosa.util.normalize(audio)

    # Save the cleaned clip
    sf.write(output_path, audio, target_sr)

# Prepare the reference audio
prepare_reference("raw_voice.wav", "clean_voice.wav")
```

## Performance

| Mode      | GPU      | Speed            |
| --------- | -------- | ---------------- |
| Standard  | RTX 3060 | \~0.5x real time |
| Standard  | RTX 4090 | \~2x real time   |
| Streaming | RTX 3060 | \~1x real time   |
| Streaming | RTX 4090 | \~3x real time   |
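To compare your own server against these figures, you can time a generation call and divide the length of the audio by the elapsed time. The sketch below assumes that "x real time" means seconds of audio produced per second of wall-clock time, which the table does not state explicitly; `realtime_speed` and `timed` are hypothetical helpers.

```python
import time
from typing import Any, Callable, Tuple


def realtime_speed(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: 10 s of audio generated in 5 s of wall-clock time
print(f"{realtime_speed(10.0, 5.0):.1f}x real time")
```

In practice you would wrap the `tts.tts_to_file(...)` call with `timed` and read the clip length from the generated file.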

## Quality Tips

* Use 6–15 seconds of clean reference audio
* Avoid background noise in the reference
* The reference voice and the text should be in the same language
* Use multiple reference samples for better results
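The length recommendation is easy to check automatically before a clip is used. The following sketch relies on the standard-library `wave` module, so it only handles uncompressed PCM WAV files; `check_reference_length` is a hypothetical helper, and the 6 and 15 second bounds come from the tips above.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path: str, min_seconds: float = 6.0, max_seconds: float = 15.0) -> str:
    """Classify a reference clip as 'too short', 'too long', or 'ok'."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        return "too short"
    if duration > max_seconds:
        return "too long"
    return "ok"


# Demo: write 8 seconds of 16-bit mono silence at 8000 Hz, then check it.
demo_reference = os.path.join(tempfile.gettempdir(), "xtts_reference_demo.wav")
with wave.open(demo_reference, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(8000)
    wav_file.writeframes(b"\x00\x00" * 8000 * 8)

print(check_reference_length(demo_reference))
```

This checks length only; background noise still has to be judged by ear or with a separate tool.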

## Troubleshooting

### Poor Speech Quality

* Use clean reference audio
* Use a longer reference (10+ seconds)
* Match the speaking style of the reference to the desired output

### Incorrect Pronunciation

* Make sure the language code is correct
* Use a reference from a native speaker

### Slow Generation

* Enable GPU inference
* Use streaming mode
* Reduce the text length per call
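One way to reduce the text length per call is to split long input at sentence boundaries and synthesize the pieces one by one. The sketch below is a simple illustration; `split_text` is a hypothetical helper, and the 200-character default is an arbitrary assumption rather than a documented XTTS limit.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for piece in split_text("One. Two. Three.", max_chars=9):
    print(piece)
```

Each chunk can then be passed to `tts.tts_to_file(...)` in a loop, as in the batch processing example.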

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | --------- | ----------------- |
| RTX 3060  | \~$0.03     | \~$0.70   | \~$0.12           |
| RTX 3090  | \~$0.06     | \~$1.50   | \~$0.25           |
| RTX 4090  | \~$0.10     | \~$2.30   | \~$0.40           |
| A100 40GB | \~$0.17     | \~$4.00   | \~$0.70           |
| A100 80GB | \~$0.25     | \~$6.00   | \~$1.00           |

*Prices vary by provider and demand. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current prices.*

**Saving money:**

* Use the **Spot** market for flexible workloads (often 30–50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers
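The session and daily figures in the table are the hourly rate multiplied by the rental time, rounded. A short sketch of that arithmetic follows, with an optional Spot discount; `estimate_cost` is a hypothetical helper, and the rates are the approximate values from the table above, not live prices.

```python
def estimate_cost(hourly_rate: float, hours: float, spot_discount: float = 0.0) -> float:
    """Estimate the rental cost in USD, optionally applying a fractional Spot discount."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in the range [0, 1)")
    return round(hourly_rate * hours * (1.0 - spot_discount), 2)


# RTX 4090 at roughly $0.10 per hour (approximate rate from the table)
print(estimate_cost(0.10, 4))                       # 4-hour on-demand session
print(estimate_cost(0.10, 24, spot_discount=0.30))  # one day with a 30% Spot discount
```

Actual Spot savings depend on current bids, so treat the discount as a planning figure.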

## Next Steps

* [Bark TTS](/guides/guides_v2-de/audio-and-sprache/bark-tts.md) - Expressive TTS
* [SadTalker](/guides/guides_v2-de/talking-heads/sadtalker.md) - Talking heads
* [RVC Voice Clone](/guides/guides_v2-de/audio-and-sprache/rvc-voice-clone.md) - Voice conversion

