# Chatterbox Voice Cloning

Chatterbox is a family of state-of-the-art open-source text-to-speech models by [Resemble AI](https://resemble.ai). It performs zero-shot voice cloning from a short reference clip (~10 seconds), supports paralinguistic tags like `[laugh]` and `[cough]`, and offers a multilingual variant covering 23+ languages. Three model variants are available: Turbo (350M, low-latency), Original (500M, creative controls), and Multilingual (500M, 23+ languages).

**GitHub:** [resemble-ai/chatterbox](https://github.com/resemble-ai/chatterbox) **PyPI:** [chatterbox-tts](https://pypi.org/project/chatterbox-tts/) **License:** MIT

## Key Features

* **Zero-shot voice cloning** — clone any voice from ~10 seconds of reference audio
* **Paralinguistic tags** (Turbo) — `[laugh]`, `[cough]`, `[chuckle]`, `[sigh]` for realistic speech
* **23+ languages** (Multilingual) — Arabic, Chinese, French, German, Japanese, Korean, Russian, Spanish, and more
* **CFG & Exaggeration tuning** (Original) — creative control over expressiveness
* **Three model sizes** — Turbo (350M), Original (500M), Multilingual (500M)
* **MIT license** — fully open for commercial use

## Requirements

| Component | Minimum        | Recommended         |
| --------- | -------------- | ------------------- |
| GPU       | RTX 3060 12 GB | RTX 3090 / RTX 4090 |
| VRAM      | 6 GB           | 10 GB+              |
| RAM       | 8 GB           | 16 GB               |
| Disk      | 5 GB           | 15 GB               |
| Python    | 3.10+          | 3.11                |
| CUDA      | 11.8+          | 12.1+               |

**Clore.ai recommendation:** RTX 3090 (~$0.30–1.00/day) for comfortable VRAM headroom. An RTX 3060 works for the Turbo model. For the Multilingual model with long texts, consider an RTX 4090 (~$0.50–2.00/day).

## Installation

```bash
# Install from PyPI
pip install chatterbox-tts

# Or install from source
git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

# Verify
python -c "from chatterbox.tts import ChatterboxTTS; print('Chatterbox ready')"
```

## Quick Start

### Turbo Model (Lowest Latency)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Basic TTS with paralinguistic tags
text = "Hey, welcome back! [chuckle] I've got some great news for you today."

# Voice cloning — provide a 10+ second reference clip
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

ta.save("output_turbo.wav", wav, model.sr)
print(f"Saved at {model.sr} Hz")
```

### Original Model (English, Creative Controls)

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "The quick brown fox jumps over the lazy dog. It was a beautiful morning."

# Generate without voice cloning (uses default voice)
wav = model.generate(text)
ta.save("output_default.wav", wav, model.sr)

# Generate with voice cloning
wav = model.generate(text, audio_prompt_path="my_voice_sample.wav")
ta.save("output_cloned.wav", wav, model.sr)
```

## Usage Examples

### Multilingual Voice Cloning

```python
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# French
french_text = "Bonjour, comment allez-vous? Bienvenue dans notre démonstration."
wav_fr = model.generate(french_text, language_id="fr")
ta.save("output_french.wav", wav_fr, model.sr)

# Japanese
japanese_text = "こんにちは、テキスト読み上げのデモンストレーションです。"
wav_ja = model.generate(japanese_text, language_id="ja")
ta.save("output_japanese.wav", wav_ja, model.sr)

# Russian with voice cloning
russian_text = "Привет! Это демонстрация синтеза речи на русском языке."
wav_ru = model.generate(
    russian_text,
    language_id="ru",
    audio_prompt_path="russian_speaker.wav"
)
ta.save("output_russian.wav", wav_ru, model.sr)

print("Multilingual generation complete")
```

### Paralinguistic Tags (Turbo)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

samples = [
    ("greeting", "Hi there! [laugh] It's so good to see you again."),
    ("nervous", "Um, well [cough] I'm not really sure about that."),
    ("excited", "Oh my gosh! [chuckle] That's absolutely incredible news!"),
]

for name, text in samples:
    wav = model.generate(text, audio_prompt_path="speaker_ref.wav")
    ta.save(f"para_{name}.wav", wav, model.sr)
    print(f"Generated: {name}")
```
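
Before sending text to the model, it can help to catch bracketed tags the Turbo model may not recognize. The sketch below assumes the tag set listed in this guide (`[laugh]`, `[cough]`, `[chuckle]`, `[sigh]`); `KNOWN_TAGS` and `unknown_tags` are illustrative helpers, not part of the Chatterbox API.

```python
import re

# Tags listed in this guide as supported by the Turbo model;
# treat this set as an assumption and extend it as needed.
KNOWN_TAGS = {"laugh", "cough", "chuckle", "sigh"}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in KNOWN_TAGS."""
    found = re.findall(r"\[([a-z]+)\]", text)
    return [t for t in found if t not in KNOWN_TAGS]

print(unknown_tags("Well [cough] that's [giggle] odd."))  # ['giggle']
```

Running this check before `model.generate()` lets you flag typos like `[giggle]` instead of hearing them spoken literally in the output.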

### Batch Processing Script

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
import os

model = ChatterboxTTS.from_pretrained(device="cuda")

# Process a list of lines (e.g., for audiobook chapters)
lines = [
    "Chapter one. The adventure begins.",
    "It was a dark and stormy night.",
    "The hero stood at the crossroads, uncertain of the path ahead.",
]

os.makedirs("output_batch", exist_ok=True)

for i, line in enumerate(lines):
    wav = model.generate(line, audio_prompt_path="narrator_voice.wav")
    ta.save(f"output_batch/line_{i:03d}.wav", wav, model.sr)
    print(f"[{i+1}/{len(lines)}] {line[:40]}...")

print("Batch processing complete")
```
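
Since very long texts can degrade quality (see Troubleshooting below), a sentence-aware chunker is useful when batch-processing whole chapters. This is a minimal sketch: `chunk_text` and the 300-character default are assumptions for illustration, not part of the Chatterbox API; you would feed each chunk to `model.generate()` in turn.

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

story = ("It was a dark and stormy night. The hero stood at the "
         "crossroads. The adventure begins.")
print(chunk_text(story, max_chars=60))
```

Each chunk can then be synthesized separately and the resulting WAV files concatenated, which keeps individual generations short and artifact-free.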

## Tips for Clore.ai Users

* **Model choice** — use Turbo for low-latency voice agents, Original for English creative work, Multilingual for non-English content
* **Reference audio quality** — use a clean, noise-free 10–30 second clip for best voice cloning results
* **Docker setup** — base image `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, expose port `7860/http` for Gradio
* **Memory management** — call `torch.cuda.empty_cache()` between large batches to free VRAM
* **Supported languages** — ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh
* **HuggingFace Space** — try before renting at [huggingface.co/spaces/ResembleAI/Chatterbox](https://huggingface.co/spaces/ResembleAI/Chatterbox)
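
The Multilingual model expects one of the language codes listed above as `language_id`. A small validation helper, sketched below, fails fast with a clear message instead of erroring inside generation; the human-readable names are an informal mapping added here for convenience and are not taken from the library.

```python
# Language codes listed in this guide for the Multilingual model;
# the names are an assumption added for readability.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "sw": "Swahili", "tr": "Turkish", "zh": "Chinese",
}

def check_language(language_id: str) -> str:
    """Raise early with a clear message instead of failing in generate()."""
    if language_id not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language_id {language_id!r}; "
            f"choose one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return SUPPORTED_LANGUAGES[language_id]

print(check_language("ru"))  # Russian
```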

## Troubleshooting

| Problem                    | Solution                                                                                       |
| -------------------------- | ---------------------------------------------------------------------------------------------- |
| `CUDA out of memory`       | Use Turbo (350M) instead of Original/Multilingual (500M), or rent a larger GPU                 |
| Cloned voice doesn't match | Use a longer (15–30s), cleaner reference clip with minimal background noise                    |
| `numpy` version conflict   | Run `pip install numpy==1.26.4 --force-reinstall`                                              |
| Slow model download        | Models are fetched from HuggingFace on first run (~2 GB); pre-download with `huggingface-cli`  |
| Audio has artifacts        | Reduce text length per generation; very long texts can degrade quality                         |
| `ModuleNotFoundError`      | Ensure `pip install chatterbox-tts` completed without errors; check Python 3.11 compatibility  |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/audio-and-voice/chatterbox-tts.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
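
As a minimal sketch of the mechanism above, the question just needs to be percent-encoded into the `ask` query parameter; `ask_url` is an illustrative helper, and the base URL is taken from this page.

```python
from urllib.parse import urlencode

# Base URL of this documentation page.
BASE = "https://docs.clore.ai/guides/audio-and-voice/chatterbox-tts.md"

def ask_url(question: str) -> str:
    """Build the GET URL for querying this documentation page."""
    return f"{BASE}?{urlencode({'ask': question})}"

print(ask_url("What sample rate does Chatterbox output?"))
```

Any HTTP client can then perform a GET on the resulting URL to receive the answer and supporting excerpts.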
