# Fish Speech

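Fish Speech is a state-of-the-art multilingual text-to-speech (TTS) system with zero-shot voice cloning. It has more than 15,000 stars on GitHub and supports English, Chinese, Japanese, Korean, French, German, Arabic, Spanish, and other languages, all from a single model. With just 10–15 seconds of reference audio, Fish Speech can clone a voice with high fidelity, which makes it well suited to audiobook production, dubbing, virtual assistants, and large-scale content creation.

Fish Speech uses a Transformer-based architecture with a VQGAN vocoder and achieves near-human naturalness scores on standard TTS benchmarks. Its Gradio WebUI lets users work without writing any code, while the REST API allows integration into production pipelines.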

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## Server Requirements

| Parameter | Minimum                 | Recommended             |
| --------- | ----------------------- | ----------------------- |
| GPU       | NVIDIA RTX 3080 (10 GB) | NVIDIA RTX 4090 (24 GB) |
| VRAM      | 8 GB                    | 16–24 GB                |
| RAM       | 16 GB                   | 32 GB                   |
| CPU       | 4 cores                 | 8+ cores                |
| Disk      | 20 GB                   | 40 GB                   |
| OS        | Ubuntu 20.04+           | Ubuntu 22.04            |
| CUDA      | 11.8+                   | 12.1+                   |
| Ports     | 22, 7860                | 22, 7860                |

{% hint style="info" %}
Fish Speech runs efficiently on mid-range GPUs such as the RTX 3080/3090. For batch inference or for serving multiple concurrent users, an RTX 4090 or A100 is recommended.
{% endhint %}
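The minimum values from the requirements table can be turned into a quick check before renting a server. The sketch below is illustrative only: `MINIMUM` and `unmet_requirements` are hypothetical names, and the server specs are typed in by hand rather than fetched from any CLORE.AI API.

```python
# Minimum values copied from the requirements table (assumed to be hard limits).
MINIMUM = {"vram_gb": 8, "ram_gb": 16, "cpu_cores": 4, "disk_gb": 20}


def unmet_requirements(server: dict) -> list[str]:
    """Return the names of the minimum requirements the server does not meet."""
    return [
        name
        for name, required in MINIMUM.items()
        if server.get(name, 0) < required
    ]


# Hypothetical listing: an RTX 3080 machine with a disk that is too small.
candidate = {"vram_gb": 10, "ram_gb": 32, "cpu_cores": 8, "disk_gb": 15}
print(unmet_requirements(candidate))
```

An empty list means the listing satisfies every minimum in the table.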

***

## Quick Deploy on CLORE.AI

The fastest way to run Fish Speech is to use the official Docker image straight from Docker Hub.

### 1. Find a Suitable Server

Go to the [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter by:

* **VRAM**: ≥ 8 GB
* **GPU**: RTX 3080, 3090, 4080, 4090, A100, H100
* **Disk**: ≥ 20 GB

### 2. Configure Your Deployment

In the CLORE.AI order form, set the following:

**Docker image:**

```
fishaudio/fish-speech:latest
```

**Port mapping:**

```
22   → SSH access
7860 → Gradio web interface
```

**Environment variables:**

```
NVIDIA_VISIBLE_DEVICES=all
CUDA_VISIBLE_DEVICES=0
```

**Startup command (optional, auto-starts the WebUI):**

```bash
python -m tools.webui --listen 0.0.0.0 --port 7860
```

### 3. Access the Interface

Once the deployment is complete, open a browser and go to:

```
http://<your-clore-server-ip>:7860
```

The Gradio WebUI loads and shows the full Fish Speech interface, ready to use.
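Because the model weights are downloaded on first start, the page may not respond right away. A minimal polling sketch, assuming the WebUI answers a plain HTTP GET on its root path with status 200 once it is up; `http_ok` and `wait_until_ready` are hypothetical helper names, not part of Fish Speech:

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def wait_until_ready(url, probe=http_ok, attempts=30, delay=10.0, sleep=time.sleep):
    """Call probe(url) up to `attempts` times, pausing `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready("http://<your-clore-server-ip>:7860")` with the real server address blocks for up to about five minutes with the default settings.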

***

## Step-by-Step Setup

### Step 1: SSH into Your Server

```bash
ssh root@<your-clore-server-ip> -p <ssh-port>
```

### Step 2: Pull and Run the Docker Container

```bash
docker pull fishaudio/fish-speech:latest

docker run -d \
  --name fish-speech \
  --gpus all \
  -p 7860:7860 \
  -p 22:22 \
  -v /workspace/fish-speech:/workspace \
  -e NVIDIA_VISIBLE_DEVICES=all \
  fishaudio/fish-speech:latest \
  python -m tools.webui --listen 0.0.0.0 --port 7860
```

### Step 3: Verify GPU Access

```bash
docker exec fish-speech nvidia-smi
```

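You should see the GPU listed along with its available VRAM.

### Step 4: Check the Model Download

Fish Speech downloads the model weights (about 3–5 GB) automatically on first run. Monitor the download progress: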

```bash
docker logs -f fish-speech
```

Wait until you see:

```
Running on local URL:  http://0.0.0.0:7860
```
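Waiting for that line can also be scripted. The sketch below scans log lines for the startup message and extracts the URL. `find_ready_url` is a hypothetical helper, and the pattern assumes the message appears exactly in the form shown above.

```python
import re
from typing import Iterable, Optional

# Matches the Gradio startup line, e.g. "Running on local URL:  http://0.0.0.0:7860".
READY_PATTERN = re.compile(r"Running on local URL:\s+(\S+)")


def find_ready_url(log_lines: Iterable[str]) -> Optional[str]:
    """Return the URL from the first startup line, or None if no line matches."""
    for line in log_lines:
        match = READY_PATTERN.search(line)
        if match:
            return match.group(1)
    return None
```

The function accepts any iterable of strings, so it can be fed the lines of a saved log file or the text output stream of a `docker logs -f fish-speech` subprocess.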

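### Step 5: Access the WebUI

In your browser, navigate to `http://<server-ip>:7860`.

### Step 6: (Optional) Enable the API Server

The API server listens on port 8080. For it to be reachable from outside the container, that port must also be published (for example with `-p 8080:8080` in Step 2) and mapped in your CLORE.AI order.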

```bash
docker exec -d fish-speech \
  python -m tools.api_server --listen 0.0.0.0 --port 8080
```

***

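## Usage Examples

### Example 1: Basic Text-to-Speech via the WebUI

1. Open the WebUI at `http://<server-ip>:7860`
2. Enter your text in the **"Text"** field:

   ```
   Welcome to Clore.ai, the GPU cloud marketplace for AI workloads.
   ```
3. Select the language: **English**
4. Click **"Generate"**
5. Download the generated `.wav` file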

***

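### Example 2: Zero-Shot Voice Cloning

Clone a voice from as little as 10–15 seconds of reference audio:

1. In the WebUI, go to the **"Voice Clone"** tab
2. Upload your reference audio file (`.wav` or `.mp3`, 10–30 seconds)
3. Enter a transcript of the reference audio (optional, but it improves quality)
4. Enter the target text to synthesize
5. Click **"Clone & Generate"**

The model analyzes the voice characteristics and synthesizes speech in that voice.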

***

### Example 3: API-Based TTS (Python)

```python
import requests

# Fish Speech API endpoint (replace the host with your server IP or CLORE.AI http_pub URL)
API_URL = "http://<your-clore-server-ip>:8080/v1/tts"

# Request body sent to the TTS endpoint
payload = {
    "text": "Hello, this is a test of Fish Speech running on Clore.ai GPU infrastructure.",
    "reference_id": None,  # use the default voice
    "format": "wav",
    "streaming": False,
}

response = requests.post(API_URL, json=payload)

if response.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(response.content)
    print("Audio saved to output.wav")
else:
    print(f"Error: {response.status_code} - {response.text}")
```

***

### Example 4: Multilingual TTS

```python
import requests

API_URL = "http://<your-clore-server-ip>:8080/v1/tts"

texts = {
    "en": "Clore.ai provides affordable GPU cloud computing for AI researchers.",
    "zh": "Clore.ai 为 AI 研究人员提供经济实惠的 GPU 云计算服务。",
    "ja": "Clore.aiはAI研究者向けの手頃なGPUクラウドコンピューティングを提供します。",
    "ko": "Clore.ai는 AI 연구자들을 위한 저렴한 GPU 클라우드 컴퓨팅을 제공합니다.",
    "fr": "Clore.ai fournit un calcul GPU cloud abordable pour les chercheurs en IA.",
}

for lang, text in texts.items():
    payload = {"text": text, "format": "wav"}
    response = requests.post(API_URL, json=payload)
    if response.status_code == 200:
        filename = f"output_{lang}.wav"
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"Saved {filename}")
```

***

### Example 5: Batch Processing Audio Files

```python
import requests
import os
from pathlib import Path

API_URL = "http://<your-clore-server-ip>:8080/v1/tts"
OUTPUT_DIR = Path("./tts_outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

# Batch of texts to convert
texts = [
    "Chapter one: The beginning of a new era in artificial intelligence.",
    "Chapter two: How GPU computing transformed machine learning.",
    "Chapter three: The rise of voice synthesis technologies.",
    "Chapter four: Building the future with Clore.ai infrastructure.",
    "Chapter five: Conclusion and next steps.",
]

for i, text in enumerate(texts):
    payload = {
        "text": text,
        "format": "wav",
        "streaming": False
    }
    response = requests.post(API_URL, json=payload, timeout=60)
    if response.status_code == 200:
        output_path = OUTPUT_DIR / f"chapter_{i+1:02d}.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"✓ Generated: {output_path}")
    else:
        print(f"✗ Failed chapter {i+1}: {response.status_code}")

print(f"\nAll files saved to {OUTPUT_DIR}")
```

***

## Configuration

### Docker Compose (Production Deployment)

```yaml
version: '3.8'

services:
  fish-speech:
    image: fishaudio/fish-speech:latest
    container_name: fish-speech
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "7860:7860"
      - "8080:8080"
    volumes:
      - ./models:/workspace/models
      - ./outputs:/workspace/outputs
      - ./references:/workspace/references
    command: >
      bash -c "python -m tools.webui --listen 0.0.0.0 --port 7860 &
               python -m tools.api_server --listen 0.0.0.0 --port 8080 &
               wait"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

### Key Configuration Options

| Option             | Default   | Description                                |
| ------------------ | --------- | ------------------------------------------ |
| `--listen`         | `0.0.0.0` | Interface the server binds to              |
| `--port`           | `7860`    | Port for the Gradio WebUI                  |
| `--compile`        | `false`   | Enable torch.compile to speed up inference |
| `--device`         | `cuda`    | Device to use (`cuda`, `cpu`, `mps`)       |
| `--half`           | `true`    | Use FP16 half precision (saves VRAM)       |
| `--num_samples`    | `1`       | Number of audio samples to generate        |
| `--max_new_tokens` | `1024`    | Maximum number of new tokens to generate   |
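To keep launch commands consistent across deployments, the options can be assembled programmatically. This is a sketch under stated assumptions: the flag names are taken from the table and have not been checked against any particular Fish Speech release, `--compile` and `--half` are treated as on/off switches as in this guide's own commands, and `build_webui_command` is a hypothetical helper.

```python
import shlex


def build_webui_command(listen="0.0.0.0", port=7860, device="cuda",
                        compile_model=False, half=True):
    """Assemble the WebUI launch command as an argument list."""
    args = ["python", "-m", "tools.webui",
            "--listen", listen, "--port", str(port), "--device", device]
    # Assumption: --compile and --half are boolean switches without a value.
    if compile_model:
        args.append("--compile")
    if half:
        args.append("--half")
    return args


# Print a shell-safe command line for copy and paste.
print(shlex.join(build_webui_command(compile_model=True)))
```

Returning a list rather than a string lets the same value be passed directly to `subprocess.run` without shell quoting issues.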

### Model Variants

| Model                 | Size     | Languages   | Notes                 |
| --------------------- | -------- | ----------- | --------------------- |
| `fish-speech-1.4`     | \~3 GB   | 8 languages | Latest stable release |
| `fish-speech-1.2-sft` | \~2.5 GB | 8 languages | Fine-tuned variant    |
| `fish-speech-1.2`     | \~2.5 GB | 8 languages | Base model            |

***

## Performance Tips

### 1. Enable torch.compile to Speed Up Inference

```bash
# Add the --compile flag at startup
python -m tools.webui --listen 0.0.0.0 --port 7860 --compile
```

The first run is slower (compilation takes 2–5 minutes), but subsequent inference is 20–40% faster.
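For a short rental it helps to estimate when the compilation time is recovered. The sketch below assumes that "20–40% faster" means each request takes 20–40% less time, and the 6-second baseline per request is an invented illustrative figure, not a measurement.

```python
import math


def break_even_requests(compile_seconds: float, baseline_seconds: float,
                        time_saved_fraction: float) -> int:
    """Number of requests after which the compile time has been recovered."""
    # Assumption: every request saves the same fraction of the baseline time.
    saved_per_request = baseline_seconds * time_saved_fraction
    return math.ceil(compile_seconds / saved_per_request)


# Pessimistic case from the text: 5 minutes of compilation, 20% saved per request.
print(break_even_requests(300, baseline_seconds=6.0, time_saved_fraction=0.20))
# Optimistic case: 2 minutes of compilation, 40% saved per request.
print(break_even_requests(120, baseline_seconds=6.0, time_saved_fraction=0.40))
```

Under these assumptions the break-even point lies between about 50 and 250 requests, so `--compile` mainly pays off for long-running or high-volume deployments.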

### 2. Use Half Precision (FP16)

FP16 cuts VRAM usage by about 50% with minimal quality loss:

```bash
python -m tools.webui --listen 0.0.0.0 --port 7860 --half
```
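The roughly 50% figure follows from storage size: an FP32 value occupies 4 bytes and an FP16 value occupies 2. A minimal sketch of that arithmetic, counting model weights only and using a hypothetical parameter count (this guide does not state the model's size in parameters):

```python
# Bytes needed to store one value in each precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2}


def weights_gib(num_params: int, dtype: str) -> float:
    """Memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_VALUE[dtype] / 1024**3


# Hypothetical parameter count, chosen only to illustrate the ratio.
HYPOTHETICAL_PARAMS = 500_000_000
for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weights_gib(HYPOTHETICAL_PARAMS, dtype):.2f} GiB")
```

Activations and caches also consume VRAM, so the saving observed in `nvidia-smi` may differ from an exact halving.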

### 3. Preload Reference Voices

Store frequently used reference voices in the container's references directory to avoid reprocessing them:

```bash
# Copy the reference audio into the container
docker cp my_voice.wav fish-speech:/workspace/references/my_voice.wav
```

### 4. GPU Memory Optimization

```bash
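# Tune the CUDA allocator's split size. The variable must be set in the environment
# of the Fish Speech process itself (for example with -e on docker run).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Clear the GPU cache between large batches.
# Note: docker exec starts a separate Python process, so this call does not
# release memory held by the already running server.
docker exec fish-speech python -c "import torch; torch.cuda.empty_cache()"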
```

### 5. Batch Size Tuning

For batched API requests, the suggested batch sizes are:

* **RTX 3080 (10 GB)**: batch\_size = 1–2
* **RTX 3090/4090 (24 GB)**: batch\_size = 4–8
* **A100 (40/80 GB)**: batch\_size = 16–32
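The list above can be expressed as a small lookup keyed on VRAM. This sketch assumes the suggestions depend only on the amount of VRAM, which is a simplification because the list names specific GPUs. `suggested_batch_size` is a hypothetical helper, and it returns `None` below 10 GB because the list gives no figure for that case.

```python
# (minimum VRAM in GB, (smallest, largest) suggested batch size), largest tier first.
BATCH_TIERS = [
    (40, (16, 32)),  # A100 (40/80 GB)
    (24, (4, 8)),    # RTX 3090/4090 (24 GB)
    (10, (1, 2)),    # RTX 3080 (10 GB)
]


def suggested_batch_size(vram_gb: float):
    """Return the suggested (min, max) batch size, or None if no tier applies."""
    for min_vram, size_range in BATCH_TIERS:
        if vram_gb >= min_vram:
            return size_range
    return None


print(suggested_batch_size(24))
```

Starting at the lower end of the returned range and increasing it while watching `nvidia-smi` is the safer way to approach the upper bound.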

***

## Troubleshooting

### Problem: Container Fails to Start (CUDA Not Found)

```bash
# Verify the NVIDIA driver inside the container
docker exec fish-speech nvidia-smi

# If that fails, check the host driver
nvidia-smi

# Re-run with an explicit GPU flag
docker run --gpus all --rm fishaudio/fish-speech:latest nvidia-smi
```

### Problem: Out-of-Memory (OOM) Errors

```bash
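# Check VRAM usage
docker exec fish-speech nvidia-smi

# Use FP16 to roughly halve VRAM usage:
# stop and remove the old container (the name must be free), then start it with --half
docker stop fish-speech
docker rm fish-speech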
docker run -d --name fish-speech --gpus all -p 7860:7860 \
  fishaudio/fish-speech:latest \
  python -m tools.webui --listen 0.0.0.0 --port 7860 --half
```

### Problem: Port 7860 Is Unreachable

```bash
# Check that the container is running
docker ps | grep fish-speech

# Check the port bindings
docker port fish-speech

# Verify the firewall (on the Clore server)
# Make sure port 7860 is mapped in your CLORE.AI order configuration
```
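From your own machine, a plain TCP connection attempt shows whether the port is reachable at all, independently of the browser. A minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

If `port_is_open("<your-clore-server-ip>", 7860)` returns `False` with the real address while `docker port fish-speech` shows the binding, the port mapping or firewall is the likely cause rather than Fish Speech itself.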

### Problem: Model Download Fails or Is Slow

```bash
# Check the container's network connectivity
docker exec fish-speech curl -I https://huggingface.co

# Manually pre-download the model
docker exec fish-speech python -c "
from huggingface_hub import snapshot_download
snapshot_download('fishaudio/fish-speech-1.4')
"
```

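### Problem: Poor Audio Quality

* Make sure the reference audio is clean (no background noise, sample rate >= 16 kHz)
* Keep the reference audio between 10 and 30 seconds long
* Provide a transcript of the reference audio for better alignment
* Try increasing `--num_samples` to generate several candidates and pick the best one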
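The sample-rate and duration guidelines above can be verified before uploading a reference clip. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM `.wav` files (not `.mp3`). `check_reference_wav` is a hypothetical helper and cannot judge background noise.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000
MIN_SECONDS, MAX_SECONDS = 10, 30


def check_reference_wav(path_or_file) -> list[str]:
    """Return a list of problems found in a PCM WAV file (empty if none)."""
    with wave.open(path_or_file, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    problems = []
    if sample_rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {sample_rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s is outside {MIN_SECONDS}-{MAX_SECONDS} s")
    return problems
```

The argument can be a file path given as a string or an open binary file object, which is what `wave.open` accepts.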

### Problem: WebUI Loads but Generation Hangs

```bash
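# Check GPU utilization during generation (-it gives watch the terminal it needs)
docker exec -it fish-speech watch -n1 nvidia-smi

# Check the most recent container logs for errors
docker logs fish-speech --tail 50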
```

***

## Resources

* **GitHub**: <https://github.com/fishaudio/fish-speech>
* **Docker Hub**: <https://hub.docker.com/r/fishaudio/fish-speech>
* **Official documentation**: <https://speech.fish.audio>
* **Hugging Face model**: <https://huggingface.co/fishaudio/fish-speech-1.4>
* **CLORE.AI Marketplace**: <https://clore.ai/marketplace>
* **Discord community**: <https://discord.gg/Es5qTB9BcN>

***

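## Clore.ai GPU Recommendations

| Use case            | Recommended GPU  | Estimated cost on Clore.ai | Notes                     |
| ------------------- | ---------------- | -------------------------- | ------------------------- |
| Development/testing | RTX 3090 (24GB)  | \~$0.12 per GPU per hour   |                           |
| Production          | RTX 4090 (24GB)  |                            | Production-grade TTS      |
| Large scale         | A100 80GB        |                            | High-throughput inference |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse the available GPUs and rent by the hour, with no commitment and full root access.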

