# Chatterbox Voice Cloning

Chatterbox is a family of state-of-the-art open-source text-to-speech models by [Resemble AI](https://resemble.ai). It performs zero-shot voice cloning from a short reference clip (\~10 seconds), supports paralinguistic tags like `[laugh]` and `[cough]`, and offers a multilingual variant covering 23+ languages. Three model variants are available: Turbo (350M, low-latency), Original (500M, creative controls), and Multilingual (500M, 23+ languages).

**GitHub:** [resemble-ai/chatterbox](https://github.com/resemble-ai/chatterbox) **PyPI:** [chatterbox-tts](https://pypi.org/project/chatterbox-tts/) **License:** MIT

## Key Features

* **Zero-shot voice cloning** — clone any voice from \~10 seconds of reference audio
* **Paralinguistic tags** (Turbo) — `[laugh]`, `[cough]`, `[chuckle]`, `[sigh]` for realistic speech
* **23+ languages** (Multilingual) — Arabic, Chinese, French, German, Japanese, Korean, Russian, Spanish, and more
* **CFG & Exaggeration tuning** (Original) — creative control over expressiveness
* **Three model sizes** — Turbo (350M), Original (500M), Multilingual (500M)
* **MIT license** — fully open for commercial use

## Requirements

| Component | Minimum        | Recommended         |
| --------- | -------------- | ------------------- |
| GPU       | RTX 3060 12 GB | RTX 3090 / RTX 4090 |
| VRAM      | 6 GB           | 10 GB+              |
| RAM       | 8 GB           | 16 GB               |
| Disk      | 5 GB           | 15 GB               |
| Python    | 3.10+          | 3.11                |
| CUDA      | 11.8+          | 12.1+               |

**Clore.ai recommendation:** RTX 3090 (\~$0.30–1.00/day) for comfortable VRAM headroom. An RTX 3060 works for the Turbo model. For the Multilingual model with long texts, consider an RTX 4090 (\~$0.50–2.00/day).
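Before committing to a rental, a quick environment check can confirm you meet the table above. A minimal sketch (the 5 GB disk threshold mirrors the minimum listed; the GPU check is optional since `torch` may not be installed yet):

```python
# Sanity-check free disk space and, if torch is present, GPU VRAM.
import shutil

def check_disk(path=".", min_gb=5):
    """Return (meets_minimum, free_gb) for the given path."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_gb, round(free_gb, 1)

ok, free = check_disk()
print(f"Disk: {free} GB free ({'OK' if ok else 'below 5 GB minimum'})")

try:
    import torch
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
    else:
        print("torch installed, but CUDA is not available")
except ImportError:
    print("torch not installed yet — skipping GPU check")
```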

## Installation

```bash
# Install from PyPI
pip install chatterbox-tts

# Or install from source
git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

# Verify
python -c "from chatterbox.tts import ChatterboxTTS; print('Chatterbox ready')"
```

## Quick Start

### Turbo Model (Lowest Latency)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Basic TTS with paralinguistic tags
text = "Hey, welcome back! [chuckle] I've got some great news for you today."

# Voice cloning — provide a 10+ second reference clip
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

ta.save("output_turbo.wav", wav, model.sr)
print(f"Saved at {model.sr} Hz")
```

### Original Model (English, Creative Controls)

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "The quick brown fox jumps over the lazy dog. It was a beautiful morning."

# Generate without voice cloning (uses default voice)
wav = model.generate(text)
ta.save("output_default.wav", wav, model.sr)

# Generate with voice cloning
wav = model.generate(text, audio_prompt_path="my_voice_sample.wav")
ta.save("output_cloned.wav", wav, model.sr)
```

## Usage Examples

### Multilingual Voice Cloning

```python
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# French
french_text = "Bonjour, comment allez-vous? Bienvenue dans notre démonstration."
wav_fr = model.generate(french_text, language_id="fr")
ta.save("output_french.wav", wav_fr, model.sr)

# Japanese
japanese_text = "こんにちは、テキスト読み上げのデモンストレーションです。"
wav_ja = model.generate(japanese_text, language_id="ja")
ta.save("output_japanese.wav", wav_ja, model.sr)

# Russian with voice cloning
russian_text = "Привет! Это демонстрация синтеза речи на русском языке."
wav_ru = model.generate(
    russian_text,
    language_id="ru",
    audio_prompt_path="russian_speaker.wav"
)
ta.save("output_russian.wav", wav_ru, model.sr)

print("Multilingual generation complete")
```

### Paralinguistic Tags (Turbo)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

samples = [
    ("greeting", "Hi there! [laugh] It's so good to see you again."),
    ("nervous", "Um, well [cough] I'm not really sure about that."),
    ("excited", "Oh my gosh! [chuckle] That's absolutely incredible news!"),
]

for name, text in samples:
    wav = model.generate(text, audio_prompt_path="speaker_ref.wav")
    ta.save(f"para_{name}.wav", wav, model.sr)
    print(f"Generated: {name}")
```

### Batch Processing Script

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
import os

model = ChatterboxTTS.from_pretrained(device="cuda")

# Process a list of lines (e.g., for audiobook chapters)
lines = [
    "Chapter one. The adventure begins.",
    "It was a dark and stormy night.",
    "The hero stood at the crossroads, uncertain of the path ahead.",
]

os.makedirs("output_batch", exist_ok=True)

for i, line in enumerate(lines):
    wav = model.generate(line, audio_prompt_path="narrator_voice.wav")
    ta.save(f"output_batch/line_{i:03d}.wav", wav, model.sr)
    print(f"[{i+1}/{len(lines)}] {line[:40]}...")

print("Batch processing complete")
```

## Tips for Clore.ai Users

* **Model choice** — use Turbo for low-latency voice agents, Original for English creative work, Multilingual for non-English content
* **Reference audio quality** — use a clean, noise-free 10–30 second clip for best voice cloning results
* **Docker setup** — base image `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, expose port `7860/http` for Gradio
* **Memory management** — call `torch.cuda.empty_cache()` between large batches to free VRAM
* **Supported languages** — ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh
* **HuggingFace Space** — try before renting at [huggingface.co/spaces/ResembleAI/Chatterbox](https://huggingface.co/spaces/ResembleAI/Chatterbox)
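For the Docker setup above, a minimal Dockerfile sketch (the `app.py` Gradio wrapper is a hypothetical script you supply; only the base image and port come from the tips list):

```dockerfile
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

RUN pip install chatterbox-tts gradio

# app.py is your own Gradio UI around ChatterboxTTS (not shipped with the package)
WORKDIR /workspace
COPY app.py /workspace/app.py

EXPOSE 7860
CMD ["python", "app.py"]
```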

## Troubleshooting

| Problem                    | Solution                                                                                       |
| -------------------------- | ---------------------------------------------------------------------------------------------- |
| `CUDA out of memory`       | Use Turbo (350M) instead of Original/Multilingual (500M), or rent a larger GPU                 |
| Cloned voice doesn't match | Use a longer (15–30s), cleaner reference clip with minimal background noise                    |
| `numpy` version conflict   | Run `pip install numpy==1.26.4 --force-reinstall`                                              |
| Slow model download        | Models are fetched from HuggingFace on first run (\~2 GB); pre-download with `huggingface-cli` |
| Audio has artifacts        | Reduce text length per generation; very long texts can degrade quality                         |
| `ModuleNotFoundError`      | Ensure `pip install chatterbox-tts` completed without errors; check Python 3.11 compatibility  |
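Two of the fixes above boil down to "send less text per call." A small, dependency-free chunker that splits long input at sentence boundaries before generation (the 300-character default is an assumption for illustration, not a documented model limit):

```python
import re

def chunk_text(text, max_chars=300):
    """Split text into sentence-aligned chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the limit
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to `model.generate()` separately and concatenate the resulting waveforms, which also keeps peak VRAM usage flat.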
