Fish Speech

Run Fish Speech multilingual TTS and zero-shot voice cloning on Clore.ai GPUs

Fish Speech is a state-of-the-art multilingual text-to-speech (TTS) system with zero-shot voice cloning capabilities. With over 15,000 GitHub stars, it supports English, Chinese, Japanese, Korean, French, German, Arabic, Spanish, and more — all from a single model. Using only 10–15 seconds of reference audio, Fish Speech can clone any voice with remarkable fidelity, making it ideal for audiobook production, dubbing, virtual assistants, and content creation at scale.

Fish Speech uses a transformer-based architecture with a VQGAN vocoder, achieving near-human naturalness scores on standard TTS benchmarks. The WebUI (Gradio) makes it accessible without writing a single line of code, while the REST API enables seamless integration into production pipelines.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA RTX 3080 (10 GB) | NVIDIA RTX 4090 (24 GB) |
| VRAM | 8 GB | 16–24 GB |
| RAM | 16 GB | 32 GB |
| CPU | 4 cores | 8+ cores |
| Disk | 20 GB | 40 GB |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 |
| CUDA | 11.8+ | 12.1+ |
| Ports | 22, 7860 | 22, 7860 |


Fish Speech runs efficiently on mid-range GPUs (RTX 3080/3090). For batch inference or serving multiple concurrent users, an RTX 4090 or A100 is recommended.


Quick Deploy on CLORE.AI

The fastest way to get Fish Speech running is via the official Docker image directly from Docker Hub.

1. Find a suitable server

Go to the CLORE.AI Marketplace and filter by:

  • VRAM: ≥ 8 GB

  • GPU: RTX 3080, 3090, 4080, 4090, A100, H100

  • Disk: ≥ 20 GB

2. Configure your deployment

In the CLORE.AI order form, set the following:

Docker Image:
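The official image is published on Docker Hub under the `fishaudio` organization; verify the current tag there before ordering:

```
fishaudio/fish-speech:latest
```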

Port Mappings:
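Map the Gradio WebUI port, plus SSH if you want shell access:

```
7860:7860
22:22
```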

Environment Variables:
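Expose the GPU inside the container using the standard NVIDIA runtime variable:

```
NVIDIA_VISIBLE_DEVICES=all
```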

Startup Command (optional — auto-starts WebUI):
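A typical launch command. The WebUI entrypoint path has varied between releases, so treat `tools/run_webui.py` as an assumption and check the repository for your version:

```shell
python tools/run_webui.py --listen 0.0.0.0 --port 7860
```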

3. Access the interface

Once deployed, open your browser and navigate to http://<server-ip>:7860.

The Gradio WebUI will load with the full Fish Speech interface ready to use.


Step-by-Step Setup

Step 1: SSH into your server
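Use the connection details shown on your Clore.ai order page (host and SSH port are placeholders here):

```shell
ssh root@<server-ip> -p <ssh-port>
```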

Step 2: Pull and run the Docker container
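A minimal sketch of the pull-and-run step. The image name and tag are assumptions based on the official Docker Hub listing; verify them before running:

```shell
# Pull the official image (check Docker Hub for the current tag)
docker pull fishaudio/fish-speech:latest

# Run detached, with GPU access and the WebUI port exposed
docker run -d \
  --name fish-speech \
  --gpus all \
  -p 7860:7860 \
  fishaudio/fish-speech:latest
```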

Step 3: Verify GPU access
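Check that the container (named `fish-speech` in the run command above) can see the GPU:

```shell
docker exec fish-speech nvidia-smi
```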

You should see your GPU listed with available VRAM.

Step 4: Check model download

Fish Speech automatically downloads model weights on first run (~3–5 GB). Monitor progress:
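Follow the container logs to watch the download:

```shell
docker logs -f fish-speech
```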

Wait until Gradio reports that the server is up, typically a line such as "Running on local URL: http://0.0.0.0:7860".

Step 5: Access the WebUI

Navigate to http://<server-ip>:7860 in your browser.

Step 6: (Optional) Enable API server
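Fish Speech ships an HTTP API server alongside the WebUI. The entrypoint name and flags below are assumptions (script names have changed between releases, so check the `tools/` directory in the repository), and the API port must also be mapped in your Clore.ai port settings:

```shell
docker exec -d fish-speech \
  python tools/api_server.py --listen 0.0.0.0:8080
```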


Usage Examples

Example 1: Basic Text-to-Speech via WebUI

  1. Open the WebUI at http://<server-ip>:7860

  2. Enter text in the "Text" field, for example: "Welcome to Fish Speech, a multilingual text-to-speech system."

  3. Select language: English

  4. Click "Generate"

  5. Download the resulting .wav file


Example 2: Zero-Shot Voice Cloning

Clone any voice using just 10–15 seconds of reference audio:

  1. In the WebUI, navigate to the "Voice Clone" tab

  2. Upload your reference audio file (.wav or .mp3, 10–30 seconds)

  3. Enter the transcript of the reference audio (optional but improves quality)

  4. Enter the target text to synthesize

  5. Click "Clone & Generate"

The model will analyze the voice characteristics and synthesize speech in that voice.


Example 3: API-Based TTS (Python)
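A minimal sketch using only Python's standard library. The endpoint path (`/v1/tts`), port `8080`, and the JSON field names (`text`, `format`) are assumptions; check the Fish Speech repository's API documentation for the exact request schema of your release:

```python
import json
import urllib.request

# Assumed endpoint -- adjust host, port, and path to match your deployment.
API_URL = "http://<server-ip>:8080/v1/tts"

def build_payload(text: str, fmt: str = "wav") -> dict:
    """Assemble the (assumed) JSON body for a single TTS request."""
    return {"text": text, "format": fmt}

def synthesize(text: str, out_path: str = "output.wav") -> None:
    """POST the text to the Fish Speech API and save the returned audio."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Calling `synthesize("Hello from Fish Speech.")` writes the synthesized audio to `output.wav`.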


Example 4: Multilingual TTS
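Because a single model covers all supported languages, multilingual synthesis is just one request per text; Fish Speech infers the language from the text itself. The payload shape below reuses the assumed JSON fields from the API example above:

```python
# One sample sentence per language; the same payload shape works for all.
SAMPLES = {
    "English": "Hello, welcome to Fish Speech.",
    "Chinese": "你好，欢迎使用 Fish Speech。",
    "Japanese": "こんにちは、Fish Speech へようこそ。",
    "French": "Bonjour, bienvenue sur Fish Speech.",
}

def build_requests(samples: dict) -> list:
    """Return one (output filename, request payload) pair per language."""
    return [
        (f"{lang.lower()}.wav", {"text": text, "format": "wav"})
        for lang, text in samples.items()
    ]
```

Each payload can then be sent to the API exactly as in the single-request example.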


Example 5: Batch Processing Audio Files
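A sketch of a batch driver: read one line of text per output file and hand each line to a single-utterance synthesis callable (such as the hypothetical `synthesize` function from the API example). The script format and file naming are illustrative choices, not part of Fish Speech itself:

```python
from pathlib import Path
from typing import Callable

def batch_synthesize(script_path: str, out_dir: str,
                     synth: Callable[[str, str], None]) -> list:
    """Synthesize each non-empty line of script_path into
    out_dir/line_0001.wav, line_0002.wav, ... using synth(text, path)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    lines = Path(script_path).read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, but keep numbering stable
        target = out / f"line_{i:04d}.wav"
        synth(line, str(target))
        written.append(str(target))
    return written
```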


Configuration

Docker Compose (Production Setup)
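A compose file equivalent to the single `docker run` deployment, kept restarting across reboots. The image tag and the `/app/references` path are assumptions to verify against your release:

```yaml
services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    restart: unless-stopped
    ports:
      - "7860:7860"
    volumes:
      - ./references:/app/references   # persist reference voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with `docker compose up -d`.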

Key Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| `--listen` | `0.0.0.0` | Interface to bind the server |
| `--port` | `7860` | Port for the Gradio WebUI |
| `--compile` | `false` | Enable `torch.compile` for faster inference |
| `--device` | `cuda` | Device to use (`cuda`, `cpu`, `mps`) |
| `--half` | `true` | Use FP16 half-precision (saves VRAM) |
| `--num_samples` | `1` | Number of audio samples to generate |
| `--max_new_tokens` | `1024` | Maximum new tokens for generation |

Model Variants

| Model | Size | Languages | Notes |
| --- | --- | --- | --- |
| fish-speech-1.4 | ~3 GB | 8 languages | Latest stable release |
| fish-speech-1.2-sft | ~2.5 GB | 8 languages | Fine-tuned variant |
| fish-speech-1.2 | ~2.5 GB | 8 languages | Base model |


Performance Tips

1. Enable torch.compile for Faster Inference

First run will be slower (compilation takes 2–5 minutes), but subsequent inference will be 20–40% faster.
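Pass the `--compile` flag from the options table when launching (the entrypoint path is an assumption; check your release):

```shell
python tools/run_webui.py --compile
```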

2. Use Half-Precision (FP16)

FP16 reduces VRAM usage by ~50% with minimal quality loss:
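Enable it with the `--half` flag from the options table (entrypoint path assumed, as above):

```shell
python tools/run_webui.py --half
```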

3. Pre-load Reference Voices

Store frequently used reference voices in the container's reference directory to avoid re-processing:
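One way to do this is to copy clips into the running container; the `/app/references` directory is an assumption, so check where your release looks for reference audio:

```shell
docker cp my_voice.wav fish-speech:/app/references/my_voice.wav
```

With the Docker Compose setup, mounting a local `references/` directory as a volume achieves the same thing persistently.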

4. GPU Memory Optimization
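If you hit fragmentation-related out-of-memory errors on long texts, PyTorch's CUDA allocator can be tuned via an environment variable set before launch:

```shell
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```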

5. Batch Size Tuning

For batch API requests, optimal batch sizes:

  • RTX 3080 (10 GB): batch_size = 1–2

  • RTX 3090/4090 (24 GB): batch_size = 4–8

  • A100 (40/80 GB): batch_size = 16–32


Troubleshooting

Issue: Container won't start — CUDA not found
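First confirm that GPU passthrough works at all, independent of Fish Speech:

```shell
# Should print the nvidia-smi table; if it fails, (re)install
# nvidia-container-toolkit and restart the Docker daemon.
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```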

Issue: Out of Memory (OOM) Error
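Reduce VRAM pressure with half precision and a smaller generation budget, using the flags from the options table (entrypoint path assumed):

```shell
python tools/run_webui.py --half --max_new_tokens 512
```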

Issue: Port 7860 not accessible
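Check from the server itself whether the port is mapped and the WebUI is answering:

```shell
# Is 7860 mapped on the container?
docker ps --format '{{.Names}}\t{{.Ports}}'

# Does the WebUI respond locally?
curl -I http://localhost:7860
```

If this works locally, check the port mapping in your Clore.ai order and any firewall in front of the server.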

Issue: Model download fails / slow download
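Model weights are fetched from Hugging Face on first run. On hosts with slow access to huggingface.co, routing downloads through a mirror often helps; the repo id below is an assumption to verify on the Hub:

```shell
# Route Hugging Face downloads through a mirror, then restart the container
export HF_ENDPOINT=https://hf-mirror.com

# Or pre-fetch the weights manually
huggingface-cli download fishaudio/fish-speech-1.4
```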

Issue: Audio quality is poor

  • Ensure reference audio is clean (no background noise, 16kHz+ sample rate)

  • Keep reference audio between 10–30 seconds

  • Provide the transcript of reference audio for better alignment

  • Try increasing --num_samples to generate multiple options and pick the best

Issue: WebUI loads but generation hangs
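Inspect the logs for a stack trace or a stalled model download, then restart the container if nothing is progressing:

```shell
docker logs --tail 100 fish-speech
docker restart fish-speech
```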



Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24 GB) | ~$0.12/gpu/hr |
| Production TTS | RTX 4090 (24 GB) | ~$0.70/gpu/hr |
| High-throughput Inference | A100 80 GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
