# Chatterbox Voice Cloning

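Chatterbox is a family of state-of-the-art open-source text-to-speech models developed by [Resemble AI](https://resemble.ai). It performs zero-shot voice cloning from a short reference clip (about 10 seconds), supports paralinguistic tags such as `[laugh]` and `[cough]`, and offers a multilingual variant covering 23+ languages. Three model variants are available: Turbo (350M, low latency), Original (500M, with creative controls), and Multilingual (500M, 23+ languages).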

**GitHub:** [resemble-ai/chatterbox](https://github.com/resemble-ai/chatterbox) **PyPI:** [chatterbox-tts](https://pypi.org/project/chatterbox-tts/) **License:** MIT

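## Key Features

* **Zero-shot voice cloning**: clone any voice from about 10 seconds of reference audio
* **Paralinguistic tags** (Turbo): `[laugh]`, `[cough]`, `[chuckle]`, `[sigh]` for more natural-sounding speech
* **23+ languages** (Multilingual): Arabic, Chinese, French, German, Japanese, Korean, Russian, Spanish, and more
* **CFG and exaggeration tuning** (Original): creative control over expressiveness
* **Three model sizes**: Turbo (350M), Original (500M), Multilingual (500M)
* **MIT license**: commercial use is permitted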

## Requirements

| Component | Minimum        | Recommended         |
| --------- | -------------- | ------------------- |
| GPU       | RTX 3060 12 GB | RTX 3090 / RTX 4090 |
| VRAM      | 6 GB           | 10 GB+              |
| RAM       | 8 GB           | 16 GB               |
| Disk      | 5 GB           | 15 GB               |
| Python    | 3.10+          | 3.11                |
| CUDA      | 11.8+          | 12.1+               |

**Clore.ai recommendation:** An RTX 3090 (about $0.30–1.00/day) gives comfortable VRAM headroom. An RTX 3060 is enough for the Turbo model. For the Multilingual model on long texts, consider an RTX 4090 (about $0.50–2.00/day).
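
Before installing anything on a rented machine, it can help to compare the GPU's VRAM against the table above. The following is a minimal sketch: the thresholds are copied from the table, while the helper names `classify_vram` and `detect_vram_gb` are hypothetical and not part of Chatterbox. The reported figure is usually slightly below the card's nominal size, so treat a borderline result as approximate.

```python
from typing import Optional

# Thresholds copied from the requirements table (in GB).
MIN_VRAM_GB = 6
RECOMMENDED_VRAM_GB = 10


def classify_vram(total_gb: float) -> str:
    """Compare total VRAM against the table's minimum and recommended values."""
    if total_gb < MIN_VRAM_GB:
        return "below minimum"
    if total_gb < RECOMMENDED_VRAM_GB:
        return "minimum"
    return "recommended"


def detect_vram_gb() -> Optional[float]:
    """Return total VRAM of GPU 0 in GiB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    vram = detect_vram_gb()
    if vram is None:
        print("No CUDA GPU detected")
    else:
        print(f"{vram:.1f} GB VRAM: {classify_vram(vram)}")
```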

## Installation

```bash
# Install from PyPI
pip install chatterbox-tts

# Or install from source
git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

# Verify the installation
python -c "from chatterbox.tts import ChatterboxTTS; print('Chatterbox ready')"
```

## Quick Start

### Turbo model (lowest latency)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Basic TTS with a paralinguistic tag
text = "Hey, welcome back! [chuckle] I've got some great news for you today."

# Voice cloning: pass a reference clip of at least 10 seconds
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

# Save at the model's native sample rate
ta.save("output_turbo.wav", wav, model.sr)
print(f"Saved at {model.sr} Hz")
```

### Original model (English, creative controls)

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "The quick brown fox jumps over the lazy dog. It was a beautiful morning."

# Generate without voice cloning (uses the default voice)
wav = model.generate(text)
ta.save("output_default.wav", wav, model.sr)

# Generate with voice cloning
wav = model.generate(text, audio_prompt_path="my_voice_sample.wav")
ta.save("output_cloned.wav", wav, model.sr)
```
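
Both quick-start examples depend on a reference clip of suitable length. As a rough pre-flight check, the sketch below measures a WAV file's duration with Python's standard `wave` module and compares it with the 10–30 second range this guide recommends. The function names are hypothetical, and `wave` only reads uncompressed PCM WAV files, so other formats would need a different loader.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def classify_duration(duration_s: float, min_s: float = 10.0, max_s: float = 30.0) -> str:
    """Classify a clip length against the recommended range for reference audio."""
    if duration_s < min_s:
        return "too short"
    if duration_s > max_s:
        return "longer than needed"
    return "ok"


def check_reference_clip(path: str) -> str:
    """Measure a reference clip on disk and classify its length."""
    return classify_duration(wav_duration_seconds(path))
```

For example, `check_reference_clip("my_voice_sample.wav")` could be called before `model.generate(...)` to catch a clip that is too short to clone from reliably.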

## Usage Examples

### Multilingual voice cloning

```python
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# French
french_text = "Bonjour, comment allez-vous? Bienvenue dans notre démonstration."
wav_fr = model.generate(french_text, language_id="fr")
ta.save("output_french.wav", wav_fr, model.sr)

# Japanese
japanese_text = "こんにちは、テキスト読み上げのデモンストレーションです。"
wav_ja = model.generate(japanese_text, language_id="ja")
ta.save("output_japanese.wav", wav_ja, model.sr)

# Russian, with voice cloning
russian_text = "Привет! Это демонстрация синтеза речи на русском языке."
wav_ru = model.generate(
    russian_text,
    language_id="ru",
    audio_prompt_path="russian_speaker.wav"
)
ta.save("output_russian.wav", wav_ru, model.sr)

print("Multilingual generation complete")
```

### Paralinguistic tags (Turbo)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# (name, text) pairs; each text embeds one paralinguistic tag
samples = [
    ("greeting", "Hi there! [laugh] It's so good to see you again."),
    ("nervous", "Um, well [cough] I'm not really sure about that."),
    ("excited", "Oh my gosh! [chuckle] That's absolutely incredible news!"),
]

for name, text in samples:
    wav = model.generate(text, audio_prompt_path="speaker_ref.wav")
    ta.save(f"para_{name}.wav", wav, model.sr)
    print(f"Saved para_{name}.wav")
```
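
A mistyped tag is easy to miss when reviewing a script by eye. The sketch below scans a script for bracketed tags and reports any that are not among the four tags named in this guide. `KNOWN_TAGS` and `find_unknown_tags` are hypothetical names, and the model may well accept tags beyond these four, so treat a hit as a prompt to double-check rather than as an error.

```python
import re

# Tags named in this guide for the Turbo model; the model may accept others.
KNOWN_TAGS = {"laugh", "cough", "chuckle", "sigh"}

# Matches a single bracketed word such as [laugh]
TAG_PATTERN = re.compile(r"\[([A-Za-z_]+)\]")


def find_unknown_tags(text: str) -> list:
    """Return bracketed tags not listed in KNOWN_TAGS, in order of appearance."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag.lower() not in KNOWN_TAGS]
```

Running `find_unknown_tags` over each line before calling `model.generate` makes it possible to flag a typo such as `[chukle]` before any GPU time is spent.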

### Batch processing script

```python
import os

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Process a series of lines (for example, an audiobook chapter)
lines = [
    "Chapter one. The adventure begins.",
    "It was a dark and stormy night.",
    "The hero stood at the crossroads, uncertain of the path ahead.",
]

os.makedirs("output_batch", exist_ok=True)

for i, line in enumerate(lines):
    wav = model.generate(line, audio_prompt_path="narrator_voice.wav")
    ta.save(f"output_batch/line_{i:03d}.wav", wav, model.sr)
    print(f"[{i+1}/{len(lines)}] {line[:40]}...")

print("Batch processing complete")
```

## Tips for Clore.ai Users

* **Model selection**: use Turbo for low-latency voice agents, Original for English creative work, and Multilingual for non-English content
* **Reference audio quality**: use a clean, noise-free clip of 10–30 seconds for the best cloning results
* **Docker setup**: base image `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, with port `7860/http` exposed for Gradio
* **Memory management**: call `torch.cuda.empty_cache()` between large batches to free VRAM
* **Supported languages**: ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh
* **HuggingFace Space**: try the model before renting at [huggingface.co/spaces/ResembleAI/Chatterbox](https://huggingface.co/spaces/ResembleAI/Chatterbox)
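
The model-selection and supported-languages tips can be combined into a small lookup. This is only a sketch: `pick_variant` is a hypothetical helper, the language codes are copied from the list above, and it assumes Turbo is used for English only, which this guide implies but does not state outright.

```python
# Language codes copied from the "Supported languages" tip.
SUPPORTED_LANGUAGES = {
    "ar", "da", "de", "el", "en", "es", "fi", "fr", "he", "hi", "it", "ja",
    "ko", "ms", "nl", "no", "pl", "pt", "ru", "sv", "sw", "tr", "zh",
}


def pick_variant(language_id: str, low_latency: bool = False) -> str:
    """Suggest a Chatterbox variant for a language code and latency preference."""
    lang = language_id.lower()
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language_id!r}")
    if lang != "en":
        # Non-English content goes to the Multilingual model.
        return "Multilingual"
    # Assumption: Turbo is the low-latency choice for English.
    return "Turbo" if low_latency else "Original"
```

Validating the code up front gives a clearer error than whatever a model raises when it receives an unexpected `language_id`.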

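## Troubleshooting

| Problem                    | Solution                                                                                                 |
| -------------------------- | -------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory`       | Use Turbo (350M) instead of Original/Multilingual (500M), or rent a larger GPU                           |
| Cloned voice doesn't match | Use a longer (15–30 seconds), cleaner reference clip with minimal background noise                       |
| `numpy` version conflict   | Run `pip install numpy==1.26.4 --force-reinstall`                                                        |
| Slow model download        | Models are fetched from HuggingFace on first run (about 2 GB); pre-download them with `huggingface-cli`  |
| Audio artifacts            | Shorten the text passed to each generation; very long texts degrade quality                              |
| `ModuleNotFoundError`      | Make sure `pip install chatterbox-tts` completed without errors; check compatibility with Python 3.11    |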
