# F5-TTS

Generate natural-sounding speech with F5-TTS, a fast and fluent text-to-speech (TTS) system.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

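## Renting on CLORE.AI

1. Visit the [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Accessing Your Server

* Find connection details in **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`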

## What is F5-TTS?

F5-TTS offers:

* Fast inference (faster than real time)
* Natural prosody and intonation
* Zero-shot voice cloning
* Multilingual support

## Resources

* **GitHub:** [SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)
* **HuggingFace:** [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS)
* **Paper:** [F5-TTS paper](https://arxiv.org/abs/2410.06885)
* **Demo:** [HuggingFace Space](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)

## Recommended Hardware

| Component | Minimum       | Recommended   | Optimal       |
| --------- | ------------- | ------------- | ------------- |
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 6GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 20GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Network   | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## Quick Deploy on CLORE.AI

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install f5-tts && \
f5-tts_infer-gradio --host 0.0.0.0 --port 7860
```

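## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Look for the `http_pub` URL (for example, `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.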
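
When scripting against the service, it can help to build the public URL in one place. The snippet below is a minimal sketch; `service_url` is a hypothetical helper (not part of CLORE.AI or F5-TTS), and the host is the placeholder from the example above.

```python
from urllib.parse import urlunsplit


def service_url(http_pub: str, path: str = "/") -> str:
    """Build an HTTPS URL from the http_pub host shown under My Orders."""
    # Accept the bare host as well as a value pasted with a scheme or trailing slash
    host = http_pub.strip().removeprefix("https://").removeprefix("http://").rstrip("/")
    if not path.startswith("/"):
        path = "/" + path
    return urlunsplit(("https", host, path, "", ""))


# Placeholder host; replace it with the http_pub value of your own order
print(service_url("abc123.clorecloud.net"))
print(service_url("abc123.clorecloud.net", "synthesize"))
```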

## Installation

```bash
pip install f5-tts

# Or install from source
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .
```

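## What You Can Create

### Voice Content

* Podcast production
* Audiobook narration
* Video voice-overs

### Accessibility

* Screen readers
* Document readers
* Learning materials

### Interactive Applications

* Voice assistants
* Game NPCs
* Customer service bots

### Creative Projects

* Character voices
* Audio dramas
* Music vocals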

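## Basic Usage

### Simple TTS

```python
from importlib.resources import files

from f5_tts.api import F5TTS

# Initialize (the default checkpoint is downloaded on first use)
tts = F5TTS(device="cuda")

# F5-TTS always conditions on a reference clip; this one ships with the package
ref_audio = str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav"))

# Generate speech and write it to output.wav
wav, sr, _ = tts.infer(
    ref_file=ref_audio,
    ref_text="Some call me nature, others call me mother nature.",
    gen_text="Hello! This is F5-TTS generating natural speech.",
    file_wave="output.wav",
)
```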

### Voice Cloning

```python
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Clone the voice from a reference clip; ref_text is the transcript of that clip
wav, sr, _ = tts.infer(
    ref_file="reference_voice.wav",
    ref_text="This is the reference text spoken in the audio.",
    gen_text="This is my cloned voice speaking new text.",
    file_wave="cloned_output.wav",
)
```

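## Multilingual Support

```python
from f5_tts.api import F5TTS

# The default checkpoint is trained on English and Chinese. Other languages
# (such as French below) need a fine-tuned checkpoint, which can be passed
# through the ckpt_file and vocab_file arguments of F5TTS().
tts = F5TTS(device="cuda")

# English (an empty ref_text lets F5-TTS transcribe the reference clip itself)
tts.infer(
    ref_file="english_speaker.wav",
    ref_text="",
    gen_text="Hello, how are you today?",
    file_wave="english.wav",
)

# Chinese
tts.infer(
    ref_file="chinese_speaker.wav",
    ref_text="",
    gen_text="你好，今天怎么样？",
    file_wave="chinese.wav",
)

# French
tts.infer(
    ref_file="french_speaker.wav",
    ref_text="",
    gen_text="Bonjour, comment allez-vous?",
    file_wave="french.wav",
)
```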

## Batch Processing

```python
import os

from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

texts = [
    "Welcome to our product demonstration.",
    "Today we'll show you the key features.",
    "Let's start with the main dashboard.",
    "As you can see, the interface is intuitive.",
    "Thank you for watching!"
]

ref_audio = "narrator_voice.wav"
ref_text = "Sample text from the reference audio."
output_dir = "./narration"
os.makedirs(output_dir, exist_ok=True)  # make sure the output folder exists

for i, text in enumerate(texts):
    print(f"Generating {i+1}/{len(texts)}: {text[:50]}...")

    # One output file per text segment, all in the same reference voice
    tts.infer(
        ref_file=ref_audio,
        ref_text=ref_text,
        gen_text=text,
        file_wave=f"{output_dir}/segment_{i:03d}.wav",
    )
```
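
Once the segments exist, they often need to be joined into a single narration file. The following is a minimal sketch using only the standard-library `wave` module; `concat_wavs` is a hypothetical helper, and it assumes every segment is an uncompressed PCM WAV with the same sample rate, channel count, and sample width.

```python
import wave
from pathlib import Path


def concat_wavs(segment_dir: str, output_path: str) -> int:
    """Concatenate segment_*.wav files in name order; return how many were joined."""
    paths = sorted(Path(segment_dir).glob("segment_*.wav"))
    if not paths:
        raise FileNotFoundError(f"no segment_*.wav files in {segment_dir}")

    params = None
    frames = []
    for path in paths:
        with wave.open(str(path), "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path.name} has a different audio format")
            frames.append(src.readframes(src.getnframes()))

    # Write all frames back to back using the shared format
    with wave.open(output_path, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        for chunk in frames:
            dst.writeframes(chunk)
    return len(paths)
```

For example, `concat_wavs("./narration", "full_narration.wav")` would join the files written by the loop above in numeric order, because the zero-padded names sort correctly.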

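## Long-Form Audio

```python
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

long_text = """
Welcome to this comprehensive guide to machine learning.
In this chapter, we will explore the fundamentals of neural networks.
Neural networks are computing systems inspired by biological neural networks.
They consist of interconnected nodes that process information.
Let's start with the basic concepts.
"""

# infer() splits long text into chunks internally and cross-fades the pieces,
# so the whole passage can be passed in one call
wav, sr, _ = tts.infer(
    ref_file="narrator.wav",
    ref_text="",  # empty: the reference clip is transcribed automatically
    gen_text=long_text,
    file_wave="long_narration.wav",
    cross_fade_duration=0.15,  # seconds of overlap between chunks
)
```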
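
If you want explicit control over chunk boundaries, one option is to split the text on sentence ends yourself before synthesis. This is a minimal sketch in plain Python; `split_into_chunks` and its 200-character default are illustrative and not part of F5-TTS.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [s.strip() for s in parts if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for chunk in split_into_chunks("One two. Three four! Five?", max_chars=20):
    print(chunk)
```

Each chunk can then be synthesized in its own call, which keeps the amount of text per call bounded.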

## Gradio Interface

```python
import tempfile

import gradio as gr
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

def generate_speech(text, ref_audio, ref_text):
    # Reserve a temp path, then close the handle before infer() writes to it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        out_path = f.name
    tts.infer(
        ref_file=ref_audio,
        ref_text=ref_text or "",  # empty: the clip is transcribed automatically
        gen_text=text,
        file_wave=out_path,
    )
    return out_path

demo = gr.Interface(
    fn=generate_speech,
    inputs=[
        gr.Textbox(label="Text to Speak", lines=5),
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Reference Text", lines=2)
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="F5-TTS Voice Cloning",
    description="Clone any voice with F5-TTS on a CLORE.AI server"
)

# Bind to all interfaces so the CLORE.AI HTTP port (7860) can reach the app
demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
# Requires: pip install fastapi uvicorn python-multipart
import tempfile

from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
from f5_tts.api import F5TTS

app = FastAPI()
tts = F5TTS(device="cuda")

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    ref_audio: UploadFile = File(...),
    ref_text: str = Form(...)
):
    # Save the uploaded reference clip to disk
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as ref_file:
        ref_file.write(await ref_audio.read())
        ref_path = ref_file.name

    # Reserve an output path, then synthesize into it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_file:
        out_path = out_file.name
    tts.infer(ref_file=ref_path, ref_text=ref_text, gen_text=text, file_wave=out_path)
    return FileResponse(out_path, media_type="audio/wav")

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
# (expose port 8000 in your order if you want to reach it from outside)
```

## Performance

| Text length     | GPU       | Generation time | Real-time factor |
| --------------- | --------- | --------------- | ---------------- |
| 100 characters  | (unknown) | 0.5 s           | 5x               |
| 100 characters  | (unknown) | 0.3 s           | 8x               |
| 500 characters  | (unknown) | 1.2 s           | 10x              |
| 1000 characters | (unknown) | 2.0 s           | 12x              |
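
To compare your own server against figures like these, you can measure the factor directly. The sketch below assumes the real-time factor means audio duration divided by generation time (so higher is faster); `wav_duration` and `realtime_speedup` are hypothetical helper names, and the WAV file is assumed to be uncompressed PCM.

```python
import wave


def wav_duration(path: str) -> float:
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def realtime_speedup(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# 2.5 s of audio generated in 0.5 s is a 5x speed-up over real time
print(realtime_speedup(2.5, 0.5))
```

In practice you would wrap the synthesis call with `time.perf_counter()` and pass the elapsed time together with `wav_duration("output.wav")`.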

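## Troubleshooting

### Poor Voice Match

**Problem:** The generated voice does not match the reference

**Solution:**

* Use 5-15 seconds of clear reference audio
* Provide an accurate transcript of the reference audio
* Avoid background noise in the reference audio
* Keep the text language consistent with the reference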
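
The duration guideline above is easy to check before synthesis. This is a minimal sketch using the standard-library `wave` module; `check_reference` is a hypothetical helper, the 24 kHz floor is taken from the audio-quality advice on this page, and the clip is assumed to be an uncompressed PCM WAV.

```python
import wave


def check_reference(path, min_seconds=5.0, max_seconds=15.0, min_rate=24000):
    """Return a list of warnings for a reference clip (an empty list means OK)."""
    with wave.open(path, "rb") as clip:
        rate = clip.getframerate()
        seconds = clip.getnframes() / rate

    warnings = []
    if not min_seconds <= seconds <= max_seconds:
        warnings.append(
            f"duration {seconds:.1f}s is outside {min_seconds:g}-{max_seconds:g}s"
        )
    if rate < min_rate:
        warnings.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    return warnings
```

Calling `check_reference("reference_voice.wav")` before a long batch run catches clips that are too short, too long, or recorded at a low sample rate. It cannot detect background noise or a wrong transcript.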

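### Pronunciation Issues

**Problem:** Words or names are mispronounced

**Solution:**

```python

# Respell hard words phonetically in the text itself.
# Parenthetical hints are ordinary text and would be read aloud.
text = "Welcome to the Klor AI platform."  # instead of "CLORE"

# The same approach works for names
text = "The CEO, John Smith, will speak."
```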

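### Audio Quality Issues

**Problem:** The output sounds robotic or distorted

**Solution:**

* Use high-quality reference audio (24kHz or higher)
* Remove noise from the reference audio
* Try different reference samples
* Raise the generation quality settings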

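### Memory Issues

**Problem:** Out of memory on long text

**Solution:**

```python

# Synthesize paragraph by paragraph so each call handles less text
for i, part in enumerate(p for p in long_text.split("\n") if p.strip()):
    tts.infer(
        ref_file="narrator.wav",
        ref_text="",
        gen_text=part,
        file_wave=f"part_{i:03d}.wav",
    )
```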

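### Slow Generation

**Problem:** Generation takes too long

**Solution:**

* Use GPU inference (CUDA)
* Lower the number of sampling steps (`nfe_step` in `infer()`) to trade some quality for speed
* Use an RTX 4090 or better
* Enable half precision (fp16)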

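## Common Issues

### Voice Does Not Match the Reference

* Use 5-15 seconds of clear reference audio
* Transcribe the reference text accurately
* Avoid background noise in the reference audio

### Audio Quality Issues

* Use a reference with a high sample rate (24kHz or higher)
* Remove noise from the reference audio
* Try different reference samples

### Slow Generation

* Use CUDA (not CPU)
* Shorten the text or split it into chunks
* Use a smaller batch size

### Language Mismatch

* Match the text language to the language of the reference audio
* Some languages require a specific model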

## Cost Estimate

Typical rates on the CLORE.AI marketplace (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| (unknown) | \~$0.06     | \~$1.50    | \~$0.25        |
| (unknown) | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check the* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use the **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE**
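
The session and daily figures follow from the hourly rate, so a rough budget is simple arithmetic. A minimal sketch, assuming a flat hourly rate and an optional Spot discount expressed as a fraction; `rental_cost` is a hypothetical helper, and real marketplace prices vary.

```python
def rental_cost(hourly_rate_usd: float, hours: float, spot_discount: float = 0.0) -> float:
    """Estimate the cost of a rental in USD, rounded to cents."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return round(hourly_rate_usd * hours * (1.0 - spot_discount), 2)


# A 4-hour session at about $0.03/hour
print(rental_cost(0.03, 4))

# The same session length at $0.25/hour with a 30% Spot discount
print(rental_cost(0.25, 4, spot_discount=0.30))
```

The table values are rounded estimates, so a computed daily cost can differ slightly from the listed one.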

## Related Guides

* [XTTS](/guides/guides_v2-zh/yin-pin-yu-yu-yin/xtts-coqui.md) - Alternative TTS
* [Bark TTS](/guides/guides_v2-zh/yin-pin-yu-yu-yin/bark-tts.md) - Expressive TTS
* [SadTalker](/guides/guides_v2-zh/shuo-hua-tou-xiang/sadtalker.md) - Talking head avatars

