
# Fish Speech

Fish Speech is a state-of-the-art multilingual text-to-speech (TTS) system with zero-shot voice cloning. With more than 15,000 GitHub stars, it supports English, Chinese, Japanese, Korean, French, German, Arabic, Spanish, and more, all from a single model. From just 10–15 seconds of reference audio, Fish Speech can clone a voice with remarkable fidelity, which makes it well suited to audiobook production, dubbing, virtual assistants, and large-scale content creation.

Fish Speech uses a transformer-based architecture with a VQGAN vocoder and achieves near-human naturalness scores on standard TTS benchmarks. The Gradio WebUI makes it usable without writing a single line of code, while the REST API allows integration into production pipelines.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## Server Requirements

| Parameter        | Minimum                 | Recommended             |
| ---------------- | ----------------------- | ----------------------- |
| GPU              | NVIDIA RTX 3080 (10 GB) | NVIDIA RTX 4090 (24 GB) |
| VRAM             | 8 GB                    | 16–24 GB                |
| RAM              | 16 GB                   | 32 GB                   |
| CPU              | 4 cores                 | 8+ cores                |
| Disk             | 20 GB                   | 40 GB                   |
| Operating system | Ubuntu 20.04+           | Ubuntu 22.04            |
| CUDA             | 11.8+                   | 12.1+                   |
| Ports            | 22, 7860                | 22, 7860                |

{% hint style="info" %}
Fish Speech runs efficiently on mid-range GPUs (RTX 3080/3090). For batch inference or serving multiple concurrent users, an RTX 4090 or A100 is recommended.
{% endhint %}

***

## Quick Deployment on CLORE.AI

The fastest way to get Fish Speech running is the official Docker image from Docker Hub.

### 1. Find a suitable server

Go to the [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter by:

* **VRAM**: ≥ 8 GB
* **GPU**: RTX 3080, 3090, 4080, 4090, A100, H100
* **Disk**: ≥ 20 GB

### 2. Configure your deployment

Enter the following in the CLORE.AI order form:

**Docker image:**

```
fishaudio/fish-speech:latest
```

**Port mappings:**

```
22   → SSH access
7860 → Gradio WebUI
```

**Environment variables:**

```
NVIDIA_VISIBLE_DEVICES=all
CUDA_VISIBLE_DEVICES=0
```

**Startup command (optional; starts the WebUI automatically):**

```bash
python -m tools.webui --listen 0.0.0.0 --port 7860
```

### 3. Access the interface

Once the server is deployed, open your browser and navigate to:

```
http://<your-clore-server-ip>:7860
```

The Gradio WebUI loads with the full Fish Speech interface, ready to use.

***

## Step-by-Step Setup

### Step 1: SSH into your server

```bash
ssh root@<your-clore-server-ip> -p <ssh-port>
```

### Step 2: Pull and run the Docker image

```bash
docker pull fishaudio/fish-speech:latest

docker run -d \
  --name fish-speech \
  --gpus all \
  -p 7860:7860 \
  -p 22:22 \
  -v /workspace/fish-speech:/workspace \
  -e NVIDIA_VISIBLE_DEVICES=all \
  fishaudio/fish-speech:latest \
  python -m tools.webui --listen 0.0.0.0 --port 7860
```

### Step 3: Verify GPU access

```bash
docker exec fish-speech nvidia-smi
```

You should see your GPU listed along with its available VRAM.

### Step 4: Check the model download

Fish Speech downloads the model weights automatically on first start (\~3–5 GB). Monitor the progress:

```bash
docker logs -f fish-speech
```

Wait until you see:

```
Running on local URL:  http://0.0.0.0:7860
```
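
Instead of watching the logs by hand, a script can poll the port until something is listening. The following is a minimal sketch using only the Python standard library; the function name `wait_for_port` and its default timings are illustrative assumptions, not part of Fish Speech. An open port only shows that the server process is accepting connections, so the first generation may still be slow while weights load.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 600.0,
                  interval: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=3.0):
                return True
        except OSError:
            pass  # not reachable yet; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_port("<your-clore-server-ip>", 7860)` returns `True` once the WebUI port accepts connections, or `False` after ten minutes.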

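### Step 5: Access the WebUI

Navigate to `http://<server-ip>:7860` in your browser.

### Step 6: (Optional) Enable the API server

The `docker run` command in Step 2 does not publish port 8080; add `-p 8080:8080` there if the API must be reachable from outside the container.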

```bash
docker exec -d fish-speech \
  python -m tools.api_server --listen 0.0.0.0 --port 8080
```

***

## Example Applications

### Example 1: Basic text-to-speech via the WebUI

1. Open the WebUI at `http://<server-ip>:7860`
2. Enter text in the **"Text"** field:

   ```
   Welcome to Clore.ai, the GPU cloud marketplace for AI workloads.
   ```
3. Select the language: **English**
4. Click **"Generate"**
5. Download the resulting `.wav` file

***

### Example 2: Zero-shot voice cloning

Clone a voice from just 10–15 seconds of reference audio:

1. In the WebUI, go to the **"Voice Clone"** tab
2. Upload your reference audio file (`.wav` or `.mp3`, 10–30 seconds)
3. Enter the transcript of the reference audio (optional, but improves quality)
4. Enter the target text to synthesize
5. Click **"Clone & Generate"**

The model analyzes the voice characteristics and synthesizes speech in that voice.

***

### Example 3: API-based TTS (Python)

```python
import requests

# Fish Speech API endpoint
API_URL = "http://<your-clore-server-ip>:8080/v1/tts"

payload = {
    "text": "Hello, this is a test of Fish Speech running on Clore.ai GPU infrastructure.",
    "reference_id": None,  # use the default voice
    "format": "wav",
    "streaming": False
}

# A timeout keeps the script from hanging if the server is unreachable
response = requests.post(API_URL, json=payload, timeout=60)

if response.status_code == 200:
    # The response body is the raw audio
    with open("output.wav", "wb") as f:
        f.write(response.content)
    print("Audio saved to output.wav")
else:
    print(f"Error: {response.status_code} - {response.text}")
```
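
A status code of 200 does not by itself prove that the body is audio. As a hedged addition, the helper below checks the RIFF/WAVE magic bytes before the data is written to disk; `looks_like_wav` is a hypothetical name introduced here, and the check inspects the container header only, not the audio content.

```python
def looks_like_wav(data: bytes) -> bool:
    """Return True if data begins with a RIFF container header tagged WAVE."""
    # Bytes 0-3 hold "RIFF", bytes 4-7 the chunk size, bytes 8-11 "WAVE".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
```

In the example above, this could guard the write: call `looks_like_wav(response.content)` and report an error instead of saving when it returns `False`.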

***

### Example 4: Multilingual TTS

```python
import requests

API_URL = "http://<your-clore-server-ip>:8080/v1/tts"

texts = {
    "en": "Clore.ai provides affordable GPU cloud computing for AI researchers.",
    "zh": "Clore.ai 为 AI 研究人员提供经济实惠的 GPU 云计算服务。",
    "ja": "Clore.aiはAI研究者向けの手頃なGPUクラウドコンピューティングを提供します。",
    "ko": "Clore.ai는 AI 연구자들을 위한 저렴한 GPU 클라우드 컴퓨팅을 제공합니다.",
    "fr": "Clore.ai fournit un calcul GPU cloud abordable pour les chercheurs en IA.",
}

for lang, text in texts.items():
    payload = {"text": text, "format": "wav"}
    response = requests.post(API_URL, json=payload, timeout=60)
    if response.status_code == 200:
        filename = f"output_{lang}.wav"
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"Saved {filename}")
    else:
        print(f"Error for {lang}: {response.status_code}")
```

***

### Example 5: Batch generation of audio files

```python
import requests
from pathlib import Path

API_URL = "http://<your-clore-server-ip>:8080/v1/tts"
OUTPUT_DIR = Path("./tts_outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

# Batch of texts to convert
texts = [
    "Chapter one: The beginning of a new era in artificial intelligence.",
    "Chapter two: How GPU computing transformed machine learning.",
    "Chapter three: The rise of speech synthesis technologies.",
    "Chapter four: Building the future on Clore.ai infrastructure.",
    "Chapter five: Conclusion and next steps.",
]

for i, text in enumerate(texts):
    payload = {
        "text": text,
        "format": "wav",
        "streaming": False
    }
    response = requests.post(API_URL, json=payload, timeout=60)
    if response.status_code == 200:
        # Zero-padded names keep the chapters sorted on disk
        output_path = OUTPUT_DIR / f"chapter_{i+1:02d}.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"✓ Generated: {output_path}")
    else:
        print(f"✗ Chapter {i+1} failed: {response.status_code}")

print(f"\nAll files saved to {OUTPUT_DIR}")
```
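
In a long batch, a single timeout or transient server error would otherwise leave a gap in the output. One common remedy is to retry with exponential backoff. The sketch below is generic and not specific to Fish Speech: `with_retries` is a hypothetical helper, and `call` stands for any zero-argument function that performs the request and raises on failure (for instance by calling `response.raise_for_status()`).

```python
import time
from typing import Callable


def with_retries(call: Callable[[], bytes], attempts: int = 3,
                 base_delay: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Run call(); after each failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.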

***

## Configuration

### Docker Compose (production setup)

```yaml
version: '3.8'

services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    container_name: fish-speech
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "7860:7860"
      - "8080:8080"
    volumes:
      - ./models:/workspace/models
      - ./outputs:/workspace/outputs
      - ./references:/workspace/references
    command: >
      bash -c "python -m tools.webui --listen 0.0.0.0 --port 7860 &
               python -m tools.api_server --listen 0.0.0.0 --port 8080 &
               wait"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

### Key configuration options

| Option             | Default   | Description                               |
| ------------------ | --------- | ----------------------------------------- |
| `--listen`         | `0.0.0.0` | Interface the server binds to             |
| `--port`           | `7860`    | Port for the Gradio WebUI                 |
| `--compile`        | `false`   | Enable torch.compile for faster inference |
| `--device`         | `cuda`    | Device to use (`cuda`, `cpu`, `mps`)      |
| `--half`           | `true`    | Use FP16 half precision (saves VRAM)      |
| `--num_samples`    | `1`       | Number of audio samples to generate       |
| `--max_new_tokens` | `1024`    | Maximum number of new tokens to generate  |
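
When the startup command is generated by a deployment script, it can help to assemble it from the options above rather than concatenating strings. This is a small sketch under the assumption that the flags behave as listed in the table; `build_webui_command` is a hypothetical helper, and the flags have not been checked here against any particular Fish Speech release.

```python
import shlex


def build_webui_command(listen: str = "0.0.0.0", port: int = 7860,
                        device: str = "cuda", use_compile: bool = False,
                        half: bool = True) -> list:
    """Assemble the WebUI command line from the options in the table above."""
    cmd = ["python", "-m", "tools.webui",
           "--listen", listen, "--port", str(port), "--device", device]
    if use_compile:
        cmd.append("--compile")  # boolean flag, only present when enabled
    if half:
        cmd.append("--half")
    return cmd


def as_shell_string(cmd: list) -> str:
    """Render the argument list as a single shell-safe string."""
    return shlex.join(cmd)
```

The list form can be passed directly to `subprocess.run`, while `as_shell_string` produces the text to paste into a startup command field.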

### Model variants

| Model                 | Size     | Languages   | Notes                 |
| --------------------- | -------- | ----------- | --------------------- |
| `fish-speech-1.4`     | \~3 GB   | 8 languages | Latest stable version |
| `fish-speech-1.2-sft` | \~2.5 GB | 8 languages | Fine-tuned variant    |
| `fish-speech-1.2`     | \~2.5 GB | 8 languages | Base model            |

***

## Performance Tips

### 1. Enable torch.compile for faster inference

```bash
# Add the --compile option at startup
python -m tools.webui --listen 0.0.0.0 --port 7860 --compile
```

The first run is slower (compilation takes 2–5 minutes), but subsequent inference is 20–40% faster.
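
Whether compilation pays off depends on how many requests the server handles before it is restarted. The arithmetic can be sketched as follows, assuming that "20–40% faster" means each request takes that much less time; the function name and the 5-second request time in the usage note are illustrative assumptions, not measurements.

```python
import math


def break_even_requests(compile_seconds: float, seconds_per_request: float,
                        speedup_percent: float) -> int:
    """Number of requests after which the one-time compile cost is recovered."""
    # Time saved on every request once the compiled model is in use.
    saved_per_request = seconds_per_request * speedup_percent / 100
    if saved_per_request <= 0:
        raise ValueError("request time and speedup must be positive")
    return math.ceil(compile_seconds / saved_per_request)
```

With a 3-minute compile, a hypothetical 5-second request, and a 30% speedup, `break_even_requests(180, 5, 30)` gives 120 requests. For short-lived containers that serve fewer requests than that, leaving `--compile` off is likely the faster choice.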

### 2. Use half precision (FP16)

FP16 reduces VRAM usage by \~50% with minimal quality loss:

```bash
python -m tools.webui --listen 0.0.0.0 --port 7860 --half
```

### 3. Preload reference voices

Store frequently used reference voices in the container's reference directory to avoid reprocessing them:

```bash
# Copy reference audio into the container
docker cp my_voice.wav fish-speech:/workspace/references/my_voice.wav
```

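### 4. GPU memory optimization

```bash
# Cap the size of blocks the CUDA caching allocator may split, which reduces fragmentation.
# Set this in the environment of the server process (for example with `docker run -e`).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Clear the GPU cache between large batches.
# Note: empty_cache() only affects the process that calls it, so this command
# does not release memory held by an already running server.
docker exec fish-speech python -c "import torch; torch.cuda.empty_cache()"
```

### 5. Tune the batch size

Suggested batch sizes for batch API requests:

* **RTX 3080 (10 GB)**: batch\_size = 1–2
* **RTX 3090/4090 (24 GB)**: batch\_size = 4–8
* **A100 (40/80 GB)**: batch\_size = 16–32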
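
The guide does not say where `batch_size` is configured. One way to apply these numbers on the client side, shown here as an assumption rather than the documented mechanism, is to treat them as the maximum number of requests in flight. In this sketch `suggested_workers` returns the lower bound of each range above, and `synthesize` stands for any function that sends one text to the API and returns the audio bytes; both names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def suggested_workers(vram_gb: float) -> int:
    """Conservative concurrency: the lower bound of each range listed above."""
    if vram_gb >= 40:
        return 16
    if vram_gb >= 24:
        return 4
    return 1


def synthesize_all(texts: List[str], synthesize: Callable[[str], bytes],
                   workers: int) -> List[bytes]:
    """Run synthesize over texts with at most `workers` calls in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, regardless of completion order.
        return list(pool.map(synthesize, texts))
```

Starting at the lower bound and raising the value while watching VRAM usage is safer than starting at the upper bound and recovering from out-of-memory errors.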

***

## Troubleshooting

### Problem: Container fails to start (CUDA not found)

```bash
# Check the NVIDIA driver inside the container
docker exec fish-speech nvidia-smi

# If that fails, check the host driver
nvidia-smi

# Run again with explicit GPU flags
docker run --gpus all --rm fishaudio/fish-speech:latest nvidia-smi
```

### Problem: Out-of-memory (OOM) errors

```bash
# Check VRAM usage
docker exec fish-speech nvidia-smi

# Use FP16 to roughly halve VRAM usage:
# stop and remove the old container (the name must be free), then restart with --half
docker stop fish-speech
docker rm fish-speech
docker run -d --name fish-speech --gpus all -p 7860:7860 \
  fishaudio/fish-speech:latest \
  python -m tools.webui --listen 0.0.0.0 --port 7860 --half
```
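
To track memory pressure over time rather than reading the `nvidia-smi` table by eye, the query mode of `nvidia-smi` can be parsed from a script. The sketch below assumes the container is named `fish-speech` as in this guide and that the GPU reports numeric memory values; the helper names are illustrative.

```python
import subprocess


def parse_memory_csv(output: str) -> list:
    """Parse 'used, total' MiB pairs (one GPU per line) into (used, total, fraction)."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append((used, total, used / total))
    return rows


def query_gpu_memory(container: str = "fish-speech") -> list:
    """Run nvidia-smi inside the container and return per-GPU memory usage."""
    result = subprocess.run(
        ["docker", "exec", container, "nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_memory_csv(result.stdout)
```

A fraction that stays close to 1.0 during generation suggests switching to FP16 or reducing the number of concurrent requests.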

### Problem: Port 7860 not reachable

```bash
# Check that the container is running
docker ps | grep fish-speech

# Check the port binding
docker port fish-speech

# Check the firewall (on the Clore server)
# Make sure port 7860 is mapped in your CLORE.AI order configuration
```

### Problem: Model download fails or is slow

```bash
# Check internet connectivity from inside the container
docker exec fish-speech curl -I https://huggingface.co

# Pre-download the models manually
docker exec fish-speech python -c "
from huggingface_hub import snapshot_download
snapshot_download('fishaudio/fish-speech-1.4')
"
```

### Problem: Poor audio quality

* Make sure the reference audio is clean (no background noise, sample rate of 16 kHz or higher)
* Keep the reference audio between 10 and 30 seconds long
* Provide the transcript of the reference audio for better alignment
* Try increasing `--num_samples` to generate several candidates and pick the best one
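
The duration and sample-rate points in the checklist above can be checked automatically before a file is uploaded. This is a minimal sketch using the standard-library `wave` module, so it handles uncompressed PCM `.wav` files only (not `.mp3`) and cannot judge background noise; `check_reference_wav` is a hypothetical helper.

```python
import wave


def check_reference_wav(path: str, min_seconds: float = 10.0,
                        max_seconds: float = 30.0, min_rate: int = 16000) -> list:
    """Return a list of problems found; an empty list means the file passes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    problems = []
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if not min_seconds <= duration <= max_seconds:
        problems.append(
            f"duration {duration:.1f} s is outside "
            f"{min_seconds:.0f}-{max_seconds:.0f} s"
        )
    return problems
```

Running it over a directory of reference clips before a cloning session gives a quick list of files to re-record or trim.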

### Problem: WebUI loads but generation hangs

```bash
# Watch GPU utilization during generation (-it gives watch the terminal it needs)
docker exec -it fish-speech watch -n1 nvidia-smi

# Check the logs for errors
docker logs fish-speech --tail 50
```

***

## Links

* **GitHub**: <https://github.com/fishaudio/fish-speech>
* **Docker Hub**: <https://hub.docker.com/r/fishaudio/fish-speech>
* **Official documentation**: <https://speech.fish.audio>
* **Hugging Face models**: <https://huggingface.co/fishaudio/fish-speech-1.4>
* **CLORE.AI Marketplace**: <https://clore.ai/marketplace>
* **Discord community**: <https://discord.gg/Es5qTB9BcN>

***

## Clore.ai GPU Recommendations

| Use case                  | Recommended GPU  | Estimated cost on Clore.ai |
| ------------------------- | ---------------- | -------------------------- |
| Development/testing       | RTX 3090 (24 GB) | \~$0.12/gpu/hr             |
| Production TTS            | RTX 4090 (24 GB) | \~$0.70/gpu/hr             |
| High-throughput inference | A100 80 GB       | \~$1.20/gpu/hr             |
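
To turn the hourly figures into a rough budget, multiply the rate by the rental hours and the number of GPUs. The sketch below hard-codes the approximate rates from the table above; actual marketplace prices vary, and the dictionary keys and function name are illustrative.

```python
# Approximate hourly rates in USD per GPU, copied from the table above.
HOURLY_RATE_USD = {
    "RTX 3090": 0.12,
    "RTX 4090": 0.70,
    "A100 80GB": 1.20,
}


def estimate_cost(gpu: str, hours: float, gpus: int = 1) -> float:
    """Estimated rental cost in USD, rounded to cents."""
    return round(HOURLY_RATE_USD[gpu] * hours * gpus, 2)
```

For instance, `estimate_cost("RTX 3090", 24)` estimates one day of development on a single RTX 3090.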

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour: no commitments, full root access.

