# Chatterbox Voice Cloning

Chatterbox is a family of state-of-the-art open-source text-to-speech models by [Resemble AI](https://resemble.ai). It performs zero-shot voice cloning from a short reference clip (\~10 seconds), supports paralinguistic tags like `[laugh]` and `[cough]`, and offers a multilingual variant covering 23+ languages. Three model variants are available: Turbo (350M, low-latency), Original (500M, creative controls), and Multilingual (500M, 23+ languages).

**GitHub:** [resemble-ai/chatterbox](https://github.com/resemble-ai/chatterbox) **PyPI:** [chatterbox-tts](https://pypi.org/project/chatterbox-tts/) **License:** MIT

## Key Features

* **Zero-shot voice cloning** — clone any voice from \~10 seconds of reference audio
* **Paralinguistic tags** (Turbo) — `[laugh]`, `[cough]`, `[chuckle]`, `[sigh]` for realistic speech
* **23+ languages** (Multilingual) — Arabic, Chinese, French, German, Japanese, Korean, Russian, Spanish, and more
* **CFG & Exaggeration tuning** (Original) — creative control over expressiveness
* **Three model sizes** — Turbo (350M), Original (500M), Multilingual (500M)
* **MIT license** — fully open for commercial use

## Requirements

| Component | Minimum        | Recommended         |
| --------- | -------------- | ------------------- |
| GPU       | RTX 3060 12 GB | RTX 3090 / RTX 4090 |
| VRAM      | 6 GB           | 10 GB+              |
| RAM       | 8 GB           | 16 GB               |
| Disk      | 5 GB           | 15 GB               |
| Python    | 3.10+          | 3.11                |
| CUDA      | 11.8+          | 12.1+               |

**Clore.ai recommendation:** RTX 3090 (\~$0.30–1.00/day) for comfortable VRAM headroom. An RTX 3060 works for the Turbo model. For the Multilingual model with long texts, consider an RTX 4090 (\~$0.50–2.00/day).
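Before committing to a rental, a quick environment check can confirm you meet the table above. A minimal sketch (the 5 GB disk threshold mirrors the minimum listed; the GPU check is optional since `torch` may not be installed yet):

```python
# Sanity-check free disk space and, if torch is present, GPU VRAM.
import shutil

def check_disk(path=".", min_gb=5):
    """Return (meets_minimum, free_gb) for the given path."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_gb, round(free_gb, 1)

ok, free = check_disk()
print(f"Disk: {free} GB free ({'OK' if ok else 'below 5 GB minimum'})")

try:
    import torch
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
    else:
        print("torch installed, but CUDA is not available")
except ImportError:
    print("torch not installed yet — skipping GPU check")
```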

## Installation

```bash
# Install from PyPI
pip install chatterbox-tts

# Or install from source
git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

# Verify
python -c "from chatterbox.tts import ChatterboxTTS; print('Chatterbox ready')"
```

## Quick Start

### Turbo Model (Lowest Latency)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Basic TTS with paralinguistic tags
text = "Hey, welcome back! [chuckle] I've got some great news for you today."

# Voice cloning — provide a 10+ second reference clip
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

ta.save("output_turbo.wav", wav, model.sr)
print(f"Saved at {model.sr} Hz")
```

### Original Model (English, Creative Controls)

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "The quick brown fox jumps over the lazy dog. It was a beautiful morning."

# Generate without voice cloning (uses default voice)
wav = model.generate(text)
ta.save("output_default.wav", wav, model.sr)

# Generate with voice cloning
wav = model.generate(text, audio_prompt_path="my_voice_sample.wav")
ta.save("output_cloned.wav", wav, model.sr)
```

## Usage Examples

### Multilingual Voice Cloning

```python
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# French
french_text = "Bonjour, comment allez-vous? Bienvenue dans notre démonstration."
wav_fr = model.generate(french_text, language_id="fr")
ta.save("output_french.wav", wav_fr, model.sr)

# Japanese
japanese_text = "こんにちは、テキスト読み上げのデモンストレーションです。"
wav_ja = model.generate(japanese_text, language_id="ja")
ta.save("output_japanese.wav", wav_ja, model.sr)

# Russian with voice cloning
russian_text = "Привет! Это демонстрация синтеза речи на русском языке."
wav_ru = model.generate(
    russian_text,
    language_id="ru",
    audio_prompt_path="russian_speaker.wav"
)
ta.save("output_russian.wav", wav_ru, model.sr)

print("Multilingual generation complete")
```

### Paralinguistic Tags (Turbo)

```python
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

samples = [
    ("greeting", "Hi there! [laugh] It's so good to see you again."),
    ("nervous", "Um, well [cough] I'm not really sure about that."),
    ("excited", "Oh my gosh! [chuckle] That's absolutely incredible news!"),
]

for name, text in samples:
    wav = model.generate(text, audio_prompt_path="speaker_ref.wav")
    ta.save(f"para_{name}.wav", wav, model.sr)
    print(f"Generated: {name}")
```

### Batch Processing Script

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
import os

model = ChatterboxTTS.from_pretrained(device="cuda")

# Process a list of lines (e.g., for audiobook chapters)
lines = [
    "Chapter one. The adventure begins.",
    "It was a dark and stormy night.",
    "The hero stood at the crossroads, uncertain of the path ahead.",
]

os.makedirs("output_batch", exist_ok=True)

for i, line in enumerate(lines):
    wav = model.generate(line, audio_prompt_path="narrator_voice.wav")
    ta.save(f"output_batch/line_{i:03d}.wav", wav, model.sr)
    print(f"[{i+1}/{len(lines)}] {line[:40]}...")

print("Batch processing complete")
```

## Tips for Clore.ai Users

* **Model choice** — use Turbo for low-latency voice agents, Original for English creative work, Multilingual for non-English content
* **Reference audio quality** — use a clean, noise-free 10–30 second clip for best voice cloning results
* **Docker setup** — base image `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, expose port `7860/http` for Gradio
* **Memory management** — call `torch.cuda.empty_cache()` between large batches to free VRAM
* **Supported languages** — ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh
* **HuggingFace Space** — try before renting at [huggingface.co/spaces/ResembleAI/Chatterbox](https://huggingface.co/spaces/ResembleAI/Chatterbox)
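For the Docker setup above, a minimal Dockerfile sketch (the `app.py` Gradio wrapper is a hypothetical script you supply; only the base image and port come from the tips list):

```dockerfile
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

RUN pip install chatterbox-tts gradio

# app.py is your own Gradio UI around ChatterboxTTS (not shipped with the package)
WORKDIR /workspace
COPY app.py /workspace/app.py

EXPOSE 7860
CMD ["python", "app.py"]
```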

## Troubleshooting

| Problem                    | Solution                                                                                       |
| -------------------------- | ---------------------------------------------------------------------------------------------- |
| `CUDA out of memory`       | Use Turbo (350M) instead of Original/Multilingual (500M), or rent a larger GPU                 |
| Cloned voice doesn't match | Use a longer (15–30s), cleaner reference clip with minimal background noise                    |
| `numpy` version conflict   | Run `pip install numpy==1.26.4 --force-reinstall`                                              |
| Slow model download        | Models are fetched from HuggingFace on first run (\~2 GB); pre-download with `huggingface-cli` |
| Audio has artifacts        | Reduce text length per generation; very long texts can degrade quality                         |
| `ModuleNotFoundError`      | Ensure `pip install chatterbox-tts` completed without errors; check Python 3.11 compatibility  |
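Two of the fixes above boil down to "send less text per call." A small, dependency-free chunker that splits long input at sentence boundaries before generation (the 300-character default is an assumption for illustration, not a documented model limit):

```python
import re

def chunk_text(text, max_chars=300):
    """Split text into sentence-aligned chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the limit
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to `model.generate()` separately and concatenate the resulting waveforms, which also keeps peak VRAM usage flat.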
