Chatterbox Voice Cloning

Run Chatterbox TTS by Resemble AI for zero-shot voice cloning and multilingual speech synthesis on Clore.ai GPUs.

Chatterbox is a family of state-of-the-art open-source text-to-speech models by Resemble AI. It performs zero-shot voice cloning from a short reference clip (~10 seconds), supports paralinguistic tags like [laugh] and [cough], and offers a multilingual variant covering 23+ languages. Three model variants are available: Turbo (350M, low-latency), Original (500M, creative controls), and Multilingual (500M, 23+ languages).

GitHub: resemble-ai/chatterbox · PyPI: chatterbox-tts · License: MIT

Key Features

  • Zero-shot voice cloning — clone any voice from ~10 seconds of reference audio

  • Paralinguistic tags (Turbo) — [laugh], [cough], [chuckle], [sigh] for realistic speech

  • 23+ languages (Multilingual) — Arabic, Chinese, French, German, Japanese, Korean, Russian, Spanish, and more

  • CFG & Exaggeration tuning (Original) — creative control over expressiveness

  • Three model sizes — Turbo (350M), Original (500M), Multilingual (500M)

  • MIT license — fully open for commercial use

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3060 12 GB | RTX 3090 / RTX 4090 |
| VRAM | 6 GB | 10 GB+ |
| RAM | 8 GB | 16 GB |
| Disk | 5 GB | 15 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 11.8+ | 12.1+ |

Clore.ai recommendation: RTX 3090 ($0.30–1.00/day) for comfortable VRAM headroom. An RTX 3060 is sufficient for the Turbo model. For the Multilingual model with long texts, consider an RTX 4090 ($0.50–2.00/day).

Installation
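A minimal install sketch using the PyPI package named above. The numpy pin mirrors the fix listed under Troubleshooting; apply it only if you hit that conflict.

```shell
# Install Chatterbox TTS from PyPI (pulls in torch and torchaudio)
pip install chatterbox-tts

# Optional: pin numpy if you hit the version conflict noted in Troubleshooting
pip install numpy==1.26.4 --force-reinstall
```

Model weights are not bundled; they are downloaded from HuggingFace (~2 GB) on first run.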

Quick Start

Turbo Model (Lowest Latency)
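A minimal quick-start sketch. It assumes the Turbo checkpoint loads through the same `ChatterboxTTS.from_pretrained` entry point as the original model; the Turbo variant may ship under a dedicated class, so verify the exact import in the repo README. The reference clip path is a placeholder.

```python
import torch
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS  # Turbo may use a dedicated class; check the repo

# Chatterbox needs a GPU for low latency; CPU works only for smoke tests.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ChatterboxTTS.from_pretrained(device=device)

# Default voice synthesis.
wav = model.generate("Welcome to your Clore.ai instance.")
ta.save("turbo-default.wav", wav, model.sr)

# Zero-shot cloning: pass ~10 s of clean reference audio.
wav = model.generate(
    "Welcome to your Clore.ai instance.",
    audio_prompt_path="reference.wav",  # placeholder: your reference clip
)
ta.save("turbo-cloned.wav", wav, model.sr)
```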

Original Model (English, Creative Controls)
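A sketch of the Original model's creative controls. The `exaggeration` and `cfg_weight` parameters come from the upstream README; the specific values here are illustrative starting points, not tuned settings.

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Raising exaggeration increases expressiveness; lowering cfg_weight slows
# pacing, which tends to pair well with high exaggeration.
wav = model.generate(
    "You're not going to believe what happened next!",
    audio_prompt_path="reference.wav",  # placeholder: your reference clip
    exaggeration=0.7,
    cfg_weight=0.3,
)
ta.save("original-dramatic.wav", wav, model.sr)
```

For neutral narration, the defaults (roughly 0.5 for both) are a reasonable baseline.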

Usage Examples

Multilingual Voice Cloning
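A sketch of cross-lingual cloning, assuming the multilingual variant loads from the `chatterbox.mtl_tts` module as in the upstream examples; double-check the import path against the repo. `language_id` must be one of the codes listed under Tips below.

```python
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# The cloned voice identity carries across languages: an English reference
# clip can drive French output.
wav = model.generate(
    "Bonjour, comment allez-vous aujourd'hui ?",
    language_id="fr",
    audio_prompt_path="reference.wav",  # placeholder: your reference clip
)
ta.save("clone-fr.wav", wav, model.sr)
```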

Paralinguistic Tags (Turbo)
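A sketch of tag usage with Turbo. The `check_tags` helper is illustrative (not part of the library) and only validates that a script uses the four tags this guide documents; the model call assumes the same `ChatterboxTTS` entry point as the quick start.

```python
import re

# The four tags documented for Turbo in this guide.
SUPPORTED_TAGS = {"laugh", "cough", "chuckle", "sigh"}

def check_tags(text: str) -> list[str]:
    """Return any [tag] in the text that is not a documented Turbo tag."""
    return [t for t in re.findall(r"\[([a-z]+)\]", text) if t not in SUPPORTED_TAGS]

text = "That joke was terrible [chuckle] but I can't stop laughing [laugh]."
assert check_tags(text) == []  # all tags in this script are supported

if __name__ == "__main__":
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS  # Turbo entry point; verify in the repo

    model = ChatterboxTTS.from_pretrained(device="cuda")
    wav = model.generate(text, audio_prompt_path="reference.wav")  # placeholder clip
    ta.save("tags-demo.wav", wav, model.sr)
```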

Batch Processing Script
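A sketch of batch synthesis for long scripts. The chunking helper keeps each generation short (long texts can produce artifacts; see Troubleshooting), and `torch.cuda.empty_cache()` between chunks follows the memory tip below. The input/output paths are placeholders.

```python
import pathlib
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries so each chunk stays under max_chars."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

if __name__ == "__main__":
    import torch
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device="cuda")
    out_dir = pathlib.Path("batch_out")
    out_dir.mkdir(exist_ok=True)

    script = pathlib.Path("script.txt").read_text()  # placeholder input file
    for i, chunk in enumerate(chunk_text(script)):
        wav = model.generate(chunk, audio_prompt_path="reference.wav")
        ta.save(str(out_dir / f"part-{i:03d}.wav"), wav, model.sr)
        torch.cuda.empty_cache()  # free VRAM between chunks (see Tips below)
```

The numbered output files can then be concatenated with any audio tool (e.g. ffmpeg).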

Tips for Clore.ai Users

  • Model choice — use Turbo for low-latency voice agents, Original for English creative work, Multilingual for non-English content

  • Reference audio quality — use a clean, noise-free 10–30 second clip for best voice cloning results

  • Docker setup — base image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime, expose port 7860/http for Gradio

  • Memory management — call torch.cuda.empty_cache() between large batches to free VRAM

  • Supported languages — ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh

  • HuggingFace Space — try before renting at huggingface.co/spaces/ResembleAI/Chatterbox

Troubleshooting

| Problem | Solution |
| --- | --- |
| CUDA out of memory | Use Turbo (350M) instead of Original/Multilingual (500M), or rent a larger GPU |
| Cloned voice doesn't match | Use a longer (15–30 s), cleaner reference clip with minimal background noise |
| numpy version conflict | Run `pip install numpy==1.26.4 --force-reinstall` |
| Slow model download | Models are fetched from HuggingFace on first run (~2 GB); pre-download with huggingface-cli |
| Audio has artifacts | Reduce text length per generation; very long texts can degrade quality |
| ModuleNotFoundError | Ensure `pip install chatterbox-tts` completed without errors; check Python 3.11 compatibility |
