Fish Speech

Run Fish Speech multilingual TTS and zero-shot voice cloning on Clore.ai GPUs

Fish Speech is a state-of-the-art multilingual text-to-speech (TTS) system with zero-shot voice cloning capabilities. With over 15,000 GitHub stars, it supports English, Chinese, Japanese, Korean, French, German, Arabic, Spanish, and more — all from a single model. Using only 10–15 seconds of reference audio, Fish Speech can clone any voice with remarkable fidelity, making it ideal for audiobook production, dubbing, virtual assistants, and content creation at scale.

Fish Speech uses a transformer-based architecture with a VQGAN vocoder, achieving near-human naturalness scores on standard TTS benchmarks. The WebUI (Gradio) makes it accessible without writing a single line of code, while the REST API enables seamless integration into production pipelines.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA RTX 3080 (10 GB) | NVIDIA RTX 4090 (24 GB) |
| VRAM | 8 GB | 16–24 GB |
| RAM | 16 GB | 32 GB |
| CPU | 4 cores | 8+ cores |
| Disk | 20 GB | 40 GB |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 |
| CUDA | 11.8+ | 12.1+ |
| Ports | 22, 7860 | 22, 7860 |


Fish Speech runs efficiently on mid-range GPUs (RTX 3080/3090). For batch inference or serving multiple concurrent users, an RTX 4090 or A100 is recommended.


Quick Deploy on CLORE.AI

The fastest way to get Fish Speech running is via the official Docker image directly from Docker Hub.

1. Find a suitable server

Go to the CLORE.AI Marketplace and filter by:

  • VRAM: ≥ 8 GB

  • GPU: RTX 3080, 3090, 4080, 4090, A100, H100

  • Disk: ≥ 20 GB

2. Configure your deployment

In the CLORE.AI order form, set the following:

Docker Image:
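The official image is published on Docker Hub under the `fishaudio` organization; verify the current tag there before ordering:

```
fishaudio/fish-speech:latest
```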

Port Mappings:
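Map the Gradio WebUI port, plus SSH if you want shell access:

```
7860:7860
22:22
```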

Environment Variables:
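Expose the GPU inside the container using the standard NVIDIA runtime variable:

```
NVIDIA_VISIBLE_DEVICES=all
```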

Startup Command (optional — auto-starts WebUI):
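A typical launch command. The WebUI entrypoint path has varied between releases, so treat `tools/run_webui.py` as an assumption and check the repository for your version:

```shell
python tools/run_webui.py --listen 0.0.0.0 --port 7860
```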

3. Access the interface

Once deployed, open your browser and navigate to http://<server-ip>:7860.

The Gradio WebUI will load with the full Fish Speech interface ready to use.


Step-by-Step Setup

Step 1: SSH into your server
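Use the connection details shown on your Clore.ai order page (host and SSH port are placeholders here):

```shell
ssh root@<server-ip> -p <ssh-port>
```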

Step 2: Pull and run the Docker container
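A minimal sketch of the pull-and-run step. The image name and tag are assumptions based on the official Docker Hub listing; verify them before running:

```shell
# Pull the official image (check Docker Hub for the current tag)
docker pull fishaudio/fish-speech:latest

# Run detached, with GPU access and the WebUI port exposed
docker run -d \
  --name fish-speech \
  --gpus all \
  -p 7860:7860 \
  fishaudio/fish-speech:latest
```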

Step 3: Verify GPU access
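Check that the container (named `fish-speech` in the run command above) can see the GPU:

```shell
docker exec fish-speech nvidia-smi
```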

You should see your GPU listed with available VRAM.

Step 4: Check model download

Fish Speech automatically downloads model weights on first run (~3–5 GB). Monitor progress:
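Follow the container logs to watch the download:

```shell
docker logs -f fish-speech
```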

Wait until Gradio reports that the server is up, typically a line such as "Running on local URL: http://0.0.0.0:7860".

Step 5: Access the WebUI

Navigate to http://<server-ip>:7860 in your browser.

Step 6: (Optional) Enable API server
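Fish Speech ships an HTTP API server alongside the WebUI. The entrypoint name and flags below are assumptions (script names have changed between releases, so check the `tools/` directory in the repository), and the API port must also be mapped in your Clore.ai port settings:

```shell
docker exec -d fish-speech \
  python tools/api_server.py --listen 0.0.0.0:8080
```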


Usage Examples

Example 1: Basic Text-to-Speech via WebUI

  1. Open the WebUI at http://<server-ip>:7860

  2. Enter text in the "Text" field, for example: "Welcome to Fish Speech, a multilingual text-to-speech system."

  3. Select language: English

  4. Click "Generate"

  5. Download the resulting .wav file


Example 2: Zero-Shot Voice Cloning

Clone any voice using just 10–15 seconds of reference audio:

  1. In the WebUI, navigate to the "Voice Clone" tab

  2. Upload your reference audio file (.wav or .mp3, 10–30 seconds)

  3. Enter the transcript of the reference audio (optional but improves quality)

  4. Enter the target text to synthesize

  5. Click "Clone & Generate"

The model will analyze the voice characteristics and synthesize speech in that voice.


Example 3: API-Based TTS (Python)
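A minimal sketch using only Python's standard library. The endpoint path (`/v1/tts`), port `8080`, and the JSON field names (`text`, `format`) are assumptions; check the Fish Speech repository's API documentation for the exact request schema of your release:

```python
import json
import urllib.request

# Assumed endpoint -- adjust host, port, and path to match your deployment.
API_URL = "http://<server-ip>:8080/v1/tts"

def build_payload(text: str, fmt: str = "wav") -> dict:
    """Assemble the (assumed) JSON body for a single TTS request."""
    return {"text": text, "format": fmt}

def synthesize(text: str, out_path: str = "output.wav") -> None:
    """POST the text to the Fish Speech API and save the returned audio."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Calling `synthesize("Hello from Fish Speech.")` writes the synthesized audio to `output.wav`.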


Example 4: Multilingual TTS
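Because a single model covers all supported languages, multilingual synthesis is just one request per text; Fish Speech infers the language from the text itself. The payload shape below reuses the assumed JSON fields from the API example above:

```python
# One sample sentence per language; the same payload shape works for all.
SAMPLES = {
    "English": "Hello, welcome to Fish Speech.",
    "Chinese": "你好，欢迎使用 Fish Speech。",
    "Japanese": "こんにちは、Fish Speech へようこそ。",
    "French": "Bonjour, bienvenue sur Fish Speech.",
}

def build_requests(samples: dict) -> list:
    """Return one (output filename, request payload) pair per language."""
    return [
        (f"{lang.lower()}.wav", {"text": text, "format": "wav"})
        for lang, text in samples.items()
    ]
```

Each payload can then be sent to the API exactly as in the single-request example.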


Example 5: Batch Processing Audio Files
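A sketch of a batch driver: read one line of text per output file and hand each line to a single-utterance synthesis callable (such as the hypothetical `synthesize` function from the API example). The script format and file naming are illustrative choices, not part of Fish Speech itself:

```python
from pathlib import Path
from typing import Callable

def batch_synthesize(script_path: str, out_dir: str,
                     synth: Callable[[str, str], None]) -> list:
    """Synthesize each non-empty line of script_path into
    out_dir/line_0001.wav, line_0002.wav, ... using synth(text, path)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    lines = Path(script_path).read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, but keep numbering stable
        target = out / f"line_{i:04d}.wav"
        synth(line, str(target))
        written.append(str(target))
    return written
```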


Configuration

Docker Compose (Production Setup)
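A compose file equivalent to the single `docker run` deployment, kept restarting across reboots. The image tag and the `/app/references` path are assumptions to verify against your release:

```yaml
services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    restart: unless-stopped
    ports:
      - "7860:7860"
    volumes:
      - ./references:/app/references   # persist reference voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with `docker compose up -d`.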

Key Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| `--listen` | `0.0.0.0` | Interface to bind the server |
| `--port` | `7860` | Port for the Gradio WebUI |
| `--compile` | `false` | Enable `torch.compile` for faster inference |
| `--device` | `cuda` | Device to use (`cuda`, `cpu`, `mps`) |
| `--half` | `true` | Use FP16 half-precision (saves VRAM) |
| `--num_samples` | `1` | Number of audio samples to generate |
| `--max_new_tokens` | `1024` | Maximum new tokens for generation |

Model Variants

| Model | Size | Languages | Notes |
| --- | --- | --- | --- |
| fish-speech-1.4 | ~3 GB | 8 languages | Latest stable release |
| fish-speech-1.2-sft | ~2.5 GB | 8 languages | Fine-tuned variant |
| fish-speech-1.2 | ~2.5 GB | 8 languages | Base model |


Performance Tips

1. Enable torch.compile for Faster Inference

First run will be slower (compilation takes 2–5 minutes), but subsequent inference will be 20–40% faster.
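Pass the `--compile` flag from the options table when launching (the entrypoint path is an assumption; check your release):

```shell
python tools/run_webui.py --compile
```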

2. Use Half-Precision (FP16)

FP16 reduces VRAM usage by ~50% with minimal quality loss:
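Enable it with the `--half` flag from the options table (entrypoint path assumed, as above):

```shell
python tools/run_webui.py --half
```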

3. Pre-load Reference Voices

Store frequently used reference voices in the container's reference directory to avoid re-processing:
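One way to do this is to copy clips into the running container; the `/app/references` directory is an assumption, so check where your release looks for reference audio:

```shell
docker cp my_voice.wav fish-speech:/app/references/my_voice.wav
```

With the Docker Compose setup, mounting a local `references/` directory as a volume achieves the same thing persistently.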

4. GPU Memory Optimization
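If you hit fragmentation-related out-of-memory errors on long texts, PyTorch's CUDA allocator can be tuned via an environment variable set before launch:

```shell
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```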

5. Batch Size Tuning

For batch API requests, optimal batch sizes:

  • RTX 3080 (10 GB): batch_size = 1–2

  • RTX 3090/4090 (24 GB): batch_size = 4–8

  • A100 (40/80 GB): batch_size = 16–32


Troubleshooting

Issue: Container won't start — CUDA not found
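First confirm that GPU passthrough works at all, independent of Fish Speech:

```shell
# Should print the nvidia-smi table; if it fails, (re)install
# nvidia-container-toolkit and restart the Docker daemon.
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```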

Issue: Out of Memory (OOM) Error
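Reduce VRAM pressure with half precision and a smaller generation budget, using the flags from the options table (entrypoint path assumed):

```shell
python tools/run_webui.py --half --max_new_tokens 512
```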

Issue: Port 7860 not accessible
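Check from the server itself whether the port is mapped and the WebUI is answering:

```shell
# Is 7860 mapped on the container?
docker ps --format '{{.Names}}\t{{.Ports}}'

# Does the WebUI respond locally?
curl -I http://localhost:7860
```

If this works locally, check the port mapping in your Clore.ai order and any firewall in front of the server.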

Issue: Model download fails / slow download
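Model weights are fetched from Hugging Face on first run. On hosts with slow access to huggingface.co, routing downloads through a mirror often helps; the repo id below is an assumption to verify on the Hub:

```shell
# Route Hugging Face downloads through a mirror, then restart the container
export HF_ENDPOINT=https://hf-mirror.com

# Or pre-fetch the weights manually
huggingface-cli download fishaudio/fish-speech-1.4
```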

Issue: Audio quality is poor

  • Ensure reference audio is clean (no background noise, 16kHz+ sample rate)

  • Keep reference audio between 10–30 seconds

  • Provide the transcript of reference audio for better alignment

  • Try increasing --num_samples to generate multiple options and pick the best

Issue: WebUI loads but generation hangs
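Inspect the logs for a stack trace or a stalled model download, then restart the container if nothing is progressing:

```shell
docker logs --tail 100 fish-speech
docker restart fish-speech
```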



Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24 GB) | ~$0.12/gpu/hr |
| Production TTS | RTX 4090 (24 GB) | ~$0.70/gpu/hr |
| High-throughput Inference | A100 80 GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
