ChatTTS Conversational Speech

Run ChatTTS conversational text-to-speech with fine-grained prosody control on Clore.ai GPUs.

ChatTTS is a 300M-parameter generative speech model optimized for dialogue scenarios such as LLM assistants, chatbots, and interactive voice applications. It produces natural-sounding speech with realistic pauses, laughter, fillers, and intonation — characteristics that most TTS systems struggle to reproduce. The model supports English and Chinese and generates audio at 24 kHz.

GitHub: 2noise/ChatTTS (30K+ stars). License: AGPLv3+ (code), CC BY-NC 4.0 (model weights — non-commercial).

Key Features

  • Conversational prosody — natural pauses, fillers, and intonation tuned for dialogue

  • Fine-grained control tags — [oral_0-9], [laugh_0-2], [break_0-7], [uv_break], [lbreak]

  • Multi-speaker — sample random speakers or reuse speaker embeddings for consistency

  • Temperature / top-P / top-K — control generation diversity

  • Batch inference — synthesize multiple texts in a single call

  • Lightweight — ~300M parameters, runs on 4 GB VRAM

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3060 (4 GB free) | RTX 3090 / RTX 4090 |
| VRAM | 4 GB | 8 GB+ |
| RAM | 8 GB | 16 GB |
| Disk | 5 GB | 10 GB |
| Python | 3.9+ | 3.11 |
| CUDA | 11.8+ | 12.1+ |

Clore.ai recommendation: An RTX 3060 ($0.15–0.30/day) handles ChatTTS comfortably. For batch production or lower latency, pick an RTX 3090 ($0.30–1.00/day).

Installation
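A minimal setup on a fresh instance, following the upstream README (the PyPI package is the stable route; installing from source tracks the latest fixes):

```bash
# Option A: stable release from PyPI
pip install ChatTTS

# Option B: latest source
git clone https://github.com/2noise/ChatTTS
cd ChatTTS
pip install --upgrade -r requirements.txt
```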

Quick Start
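A minimal synthesis script, mirroring the upstream README's basic example (recent versions expose chat.load(); the filenames here are illustrative):

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)  # compile=True trades slower startup for faster repeated inference

texts = [
    "Hello from a Clore.ai GPU instance.",
    "ChatTTS is tuned for dialogue, with natural pauses and fillers.",
]

# infer() accepts a list of texts and batches them in a single call
wavs = chat.infer(texts)

for i, wav in enumerate(wavs):
    # Output is 24 kHz; torchaudio.save expects a 2-D (channels, samples)
    # tensor, so add a channel dimension if the returned array is 1-D
    tensor = torch.from_numpy(wav)
    if tensor.dim() == 1:
        tensor = tensor.unsqueeze(0)
    torchaudio.save(f"output_{i}.wav", tensor, 24000)
```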

Usage Examples

Consistent Speaker Voice

Sample a random speaker embedding and reuse it across multiple generations for a consistent voice:
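A sketch using the upstream API (sample_random_speaker() returns a string that encodes the voice; the sampling values shown are the README's illustrative defaults):

```python
# Assumes `chat` is loaded as in Quick Start
rand_spk = chat.sample_random_speaker()
print(rand_spk)  # persist this string to recover the same timbre later

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,   # reuse the same embedding for a consistent voice
    temperature=0.3,    # lower values give steadier prosody
    top_P=0.7,
    top_K=20,
)

wavs = chat.infer(
    ["First line of dialogue.", "Second line, same voice."],
    params_infer_code=params_infer_code,
)
```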

Word-Level Control Tags

Insert control tags directly into text for precise prosody:
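Tags can go inline in the text (bypassing the refine stage so they pass through verbatim) or into a sentence-level prompt via RefineTextParams, as in the upstream README; the example text is illustrative:

```python
# Inline tags: skip text refinement so the tags are honored verbatim
wavs = chat.infer(
    "What is [uv_break]your favorite food?[laugh][lbreak]",
    skip_refine_text=True,
)

# Sentence-level prompt: the refiner inserts oral fillers, laughs, and breaks
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt="[oral_2][laugh_0][break_6]",
)
wavs = chat.infer(["Plain input text."], params_refine_text=params_refine_text)
```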

Batch Processing with WebUI

ChatTTS ships with a Gradio web interface for interactive use:
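Launch it from a source checkout (script path as in recent versions of the repository); this assumes the Gradio default port of 7860, so check the script's CLI flags if your version binds elsewhere:

```bash
python examples/web/webui.py
```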

Open the http_pub URL from your Clore.ai order dashboard to access the UI.

Tips for Clore.ai Users

  • Use compile=True after initial testing — PyTorch compilation adds startup time but speeds up repeated inference significantly

  • Port mapping — expose port 7860/http when deploying with the WebUI

  • Docker image — use pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime as a base

  • Speaker persistence — save rand_spk strings to a file so you can reuse voices across sessions without re-sampling (see the sketch after this list)

  • Batch your requests — chat.infer() accepts a list of texts and processes them together, which is more efficient than one-by-one calls

  • Non-commercial license — the model weights are CC BY-NC 4.0; check licensing requirements for your use case
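A small sketch of the speaker-persistence tip above (the file path is a hypothetical choice; assumes `chat` is loaded as in Quick Start):

```python
from pathlib import Path

SPK_FILE = Path("speaker.txt")  # hypothetical location for the saved voice

if SPK_FILE.exists():
    spk = SPK_FILE.read_text()          # reuse a previously sampled voice
else:
    spk = chat.sample_random_speaker()  # sample once, persist for later sessions
    SPK_FILE.write_text(spk)

params = ChatTTS.Chat.InferCodeParams(spk_emb=spk)
wavs = chat.infer(["Same voice across sessions."], params_infer_code=params)
```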

Troubleshooting

| Problem | Solution |
| --- | --- |
| CUDA out of memory | Reduce batch size or use a GPU with ≥ 6 GB VRAM |
| Model downloads slowly | Pre-download from HuggingFace: huggingface-cli download 2Noise/ChatTTS |
| Audio has static/noise | This is intentional in the open-source model (anti-abuse measure); use compile=True for cleaner output |
| torchaudio.save dimension error | Ensure the tensor is 2-D: audio.unsqueeze(0) if needed |
| Garbled Chinese output | Make sure input text is UTF-8 encoded; install WeTextProcessing for better normalization |
| Slow first inference | Normal — model compilation and weight loading happen on the first call; subsequent calls are faster |
