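# Zonos TTS Voice Cloning

Zonos, from [Zyphra](https://www.zyphra.com/), is a 0.4B-parameter open-weight text-to-speech model trained on more than 200K hours of multilingual speech. It performs zero-shot voice cloning from as little as 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is high-fidelity 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).

**GitHub:** [Zyphra/Zonos](https://github.com/Zyphra/Zonos) **HuggingFace:** [Zyphra/Zonos-v0.1-transformer](https://huggingface.co/Zyphra/Zonos-v0.1-transformer) **License:** Apache 2.0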

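## Key Features

* **Voice cloning from 2–30 seconds of audio**: no fine-tuning required
* **44 kHz high-fidelity output**: studio-grade audio quality
* **Emotion control**: an 8-dimensional vector covering happiness, sadness, anger, fear, surprise, and disgust
* **Speaking rate and pitch**: independent, fine-grained controls
* **Audio prefix input**: enables behaviors that are hard to clone, such as whispering
* **Multilingual**: English, Japanese, Chinese, French, German
* **Two architectures**: Transformer (quality) and Hybrid/Mamba (speed, about 2× real time on an RTX 4090)
* **Apache 2.0**: free for personal and commercial use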

## Requirements

| Component | Minimum            | Recommended    |
| --------- | ------------------ | -------------- |
| GPU       | RTX 3080 10 GB     | RTX 4090 24 GB |
| VRAM      | 6 GB (Transformer) | 10 GB+         |
| RAM       | 16 GB              | 32 GB          |
| Disk      | 10 GB              | 20 GB          |
| Python    | 3.10+              | 3.11           |
| CUDA      | 11.8+              | 12.4           |
| System    | espeak-ng          | -              |

**Clore.ai recommendation:** An RTX 3090 (~$0.30–1.00/day) gives ample headroom. An RTX 4090 (~$0.50–2.00/day) suits the Hybrid model and the fastest inference.

## Installation

```bash
# Install system dependencies
apt-get install -y espeak-ng

# Clone and install
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
pip install -e .

# For the Hybrid model (requires an Ampere or newer GPU, i.e. RTX 3000 series or later)
pip install -e ".[compile]"

# Verify
python -c "from zonos.model import Zonos; print('Zonos ready')"
```
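If the one-line verification fails, it can help to see which piece is missing. The following is a minimal sketch, not part of Zonos: `check_environment` is a hypothetical helper that uses only the Python standard library to look for the `espeak-ng` binary on `PATH` and for importable `zonos` and `torch` packages.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report whether the pieces named in the install steps can be found.

    Hypothetical helper for illustration; it only checks presence and
    does not load the model or touch the GPU.
    """
    return {
        # espeak-ng is the system package installed with apt-get
        "espeak-ng": shutil.which("espeak-ng") is not None,
        # zonos becomes importable after `pip install -e .`
        "zonos": importlib.util.find_spec("zonos") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```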

## Quick Start

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the model (weights are downloaded on first run)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Load the reference audio used for voice cloning
wav, sr = torchaudio.load("reference_speaker.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Build the conditioning inputs
cond_dict = make_cond_dict(
    text="Hello from Clore.ai! This is a voice cloning demonstration.",
    speaker=speaker,
    language="en-us",
)
conditioning = model.prepare_conditioning(cond_dict)

# Generate
torch.manual_seed(42)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

torchaudio.save("output.wav", wavs[0], model.autoencoder.sampling_rate)
print(f"Saved output.wav at {model.autoencoder.sampling_rate} Hz")
```

## Usage Examples

### Emotion Control

Zonos accepts an 8-dimensional emotion vector: `[happy, sad, disgust, fear, surprise, anger, other, neutral]`.

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

text = "I can't believe what just happened today!"

emotions = {
    "happy":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "sad":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "angry":   [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "fearful": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

for name, emo_vec in emotions.items():
    cond_dict = make_cond_dict(
        text=text,
        speaker=speaker,
        language="en-us",
        emotion=torch.tensor(emo_vec).unsqueeze(0),
    )
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)
    audio = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"emotion_{name}.wav", audio[0], model.autoencoder.sampling_rate)
```
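Writing the eight positions by hand is easy to get wrong, since anger sits at index 5 rather than index 2. A minimal sketch of a name-based builder follows. `EMOTION_ORDER` and `emotion_vector` are hypothetical names introduced here, and the slot order is copied from the layout described in this section. Whether mixed weights need to sum to 1 is not stated here, so the helper leaves the values as given.

```python
# Slot order copied from the 8-dimensional layout described in this section.
EMOTION_ORDER = [
    "happy", "sad", "disgust", "fear", "surprise", "anger", "other", "neutral",
]


def emotion_vector(**weights: float) -> list[float]:
    """Place named weights into the 8-slot layout; unnamed slots stay 0.0.

    Hypothetical helper for illustration, not part of the Zonos API.
    """
    unknown = set(weights) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotion name(s): {sorted(unknown)}")
    return [float(weights.get(name, 0.0)) for name in EMOTION_ORDER]


if __name__ == "__main__":
    print(emotion_vector(anger=1.0))
    print(emotion_vector(happy=0.7, surprise=0.3))
```

The resulting list can be wrapped the same way as in the loop above, for example `torch.tensor(emotion_vector(happy=0.7, surprise=0.3)).unsqueeze(0)`, and passed as `emotion=`.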

### Speaking Rate and Pitch Control

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Slow and calm
cond_slow = make_cond_dict(
    text="Take your time. There is no rush at all.",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([8.0]),   # lower value = slower
    pitch_std=torch.tensor([20.0]),      # lower value = more monotone
)
codes = model.generate(model.prepare_conditioning(cond_slow))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("slow_calm.wav", audio[0], model.autoencoder.sampling_rate)

# Fast and energetic
cond_fast = make_cond_dict(
    text="Hurry up! We need to go right now!",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([22.0]),  # higher value = faster
    pitch_std=torch.tensor([80.0]),      # higher value = more expressive
)
codes = model.generate(model.prepare_conditioning(cond_fast))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("fast_energetic.wav", audio[0], model.autoencoder.sampling_rate)
```

### Gradio Web Interface

```bash
cd Zonos
python gradio_interface.py
# Or with uv:
# uv run gradio_interface.py
```

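Open port `7860/http` in your Clore.ai order, then open the `http_pub` URL to access the UI.

## Tips for Clore.ai Users

* **Model choice**: Transformer gives the best quality; Hybrid gives roughly 2× faster inference (requires an RTX 3000 series or newer GPU)
* **Reference audio**: 10–30 seconds of clean speech works best; shorter clips (2–5 seconds) work but with lower fidelity
* **Docker setup**: use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` and add `apt-get install -y espeak-ng` to the startup commands
* **Port mapping**: expose `7860/http` for the Gradio UI and `8000/http` for an API server
* **Seed control**: call `torch.manual_seed()` before generation to get reproducible output
* **Audio quality parameter**: experiment with the `audio_quality` conditioning field for cleaner output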
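
The reference-audio guidance above can be checked before a clip is embedded. The sketch below is an illustration only: `reference_quality` is a hypothetical helper, and grouping the unspecified 5–10 second range with the shorter "usable" clips is an assumption, as is flagging clips over 30 seconds.

```python
def reference_quality(num_samples: int, sample_rate: int) -> str:
    """Classify a reference clip by duration.

    Thresholds follow the 2-30 s and 10-30 s guidance on this page;
    the handling of 5-10 s and of clips over 30 s is an assumption.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    seconds = num_samples / sample_rate
    if seconds < 2:
        return "too short"
    if seconds < 10:
        return "usable"       # works, but expect lower fidelity
    if seconds <= 30:
        return "recommended"  # 10-30 s of clean speech
    return "long"             # beyond the range described here


if __name__ == "__main__":
    # 15 seconds of audio sampled at 44.1 kHz
    print(reference_quality(44_100 * 15, 44_100))
```

With a tensor loaded by `torchaudio.load`, the call would be `reference_quality(wav.shape[-1], sr)`, since the last dimension holds the samples.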

## Troubleshooting

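| Problem                      | Solution                                                                                          |
| ---------------------------- | ------------------------------------------------------------------------------------------------- |
| `espeak-ng not found`        | Run `apt-get install -y espeak-ng` (required for phonemization)                                   |
| `CUDA out of memory`         | Use the Transformer model (smaller than Hybrid); reduce the text length per call                  |
| Hybrid model fails           | Requires an Ampere or newer GPU (RTX 3000 series or later) and `pip install -e ".[compile]"`      |
| Cloned voice sounds wrong    | Use a longer reference clip (15–30 seconds) with clear speech and minimal background noise        |
| Generation is slow           | About 0.5× real time is normal for Transformer; Hybrid reaches about 2× real time on an RTX 4090  |
| `ModuleNotFoundError: zonos` | Make sure you installed from source: `cd Zonos && pip install -e .`                               |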
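
Reducing the text length per call, as the out-of-memory row suggests, usually means splitting long input at sentence boundaries and generating each piece separately. A minimal sketch follows. `split_text` is a hypothetical helper, and the 200-character default is an arbitrary assumption, not a documented Zonos limit.

```python
import re


def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    cut mid-sentence. Hypothetical helper for illustration.
    """
    # Split after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_text("One. Two! Three?", max_chars=9):
        print(chunk)
```

Each chunk can then be passed as `text=` to `make_cond_dict` in its own generation call, and the decoded audio segments joined afterwards.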

