ChatTTS Conversational Speech
Run ChatTTS conversational text-to-speech with fine-grained prosody control on Clore.ai GPUs.
ChatTTS is a 300M-parameter generative speech model optimized for dialogue scenarios such as LLM assistants, chatbots, and interactive voice applications. It produces natural-sounding speech with realistic pauses, laughter, fillers, and intonation — characteristics that most TTS systems struggle to reproduce. The model supports English and Chinese and generates audio at 24 kHz.
GitHub: 2noise/ChatTTS (30K+ stars)
License: AGPLv3+ (code), CC BY-NC 4.0 (model weights, non-commercial)
Key Features
- Conversational prosody — natural pauses, fillers, and intonation tuned for dialogue
- Fine-grained control tags — `[oral_0-9]`, `[laugh_0-2]`, `[break_0-7]`, `[uv_break]`, `[lbreak]`
- Multi-speaker — sample random speakers or reuse speaker embeddings for consistency
- Temperature / top-P / top-K — control generation diversity
- Batch inference — synthesize multiple texts in a single call
- Lightweight — ~300M parameters, runs on 4 GB VRAM
Requirements
| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3060 (4 GB free) | RTX 3090 / RTX 4090 |
| VRAM | 4 GB | 8 GB+ |
| RAM | 8 GB | 16 GB |
| Disk | 5 GB | 10 GB |
| Python | 3.9+ | 3.11 |
| CUDA | 11.8+ | 12.1+ |
Clore.ai recommendation: An RTX 3060 ($0.15–0.30/day) handles ChatTTS comfortably. For batch production or lower latency, pick an RTX 3090 ($0.30–1.00/day).
Installation
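A minimal setup sketch: the PyPI package name and repository layout follow the upstream README, and the `pytorch/pytorch` Docker base from the tips below is assumed to provide CUDA-enabled PyTorch.

```bash
# Option 1: install the published package from PyPI
pip install ChatTTS

# Option 2: install the latest code from GitHub
git clone https://github.com/2noise/ChatTTS
cd ChatTTS
pip install -r requirements.txt

# torchaudio is used in the examples below to save WAV files;
# install it if your base image does not already include it
pip install torchaudio
```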
Quick Start
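A minimal synthesis script following the API shown in the upstream README; the sample text is a placeholder.

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)  # set compile=True after initial testing (see Tips below)

# infer() accepts a list of texts and returns 24 kHz waveforms
texts = ["Hello! Welcome to conversational text to speech on Clore."]
wavs = chat.infer(texts)

# torchaudio.save expects a 2D tensor; drop the unsqueeze if your
# version of ChatTTS already returns 2D arrays (see Troubleshooting)
torchaudio.save("output.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
```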
Usage Examples
Consistent Speaker Voice
Sample a random speaker embedding and reuse it across multiple generations for a consistent voice:
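A sketch using the `sample_random_speaker` / `InferCodeParams` API from the upstream README; the texts and sampling settings are illustrative.

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

# The speaker embedding is returned as a string; print or save it for reuse
rand_spk = chat.sample_random_speaker()

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,  # same embedding = same voice across calls
    temperature=0.3,   # lower temperature gives a more stable delivery
    top_P=0.7,
    top_K=20,
)

texts = ["First line of dialogue.", "Second line, same voice."]
wavs = chat.infer(texts, params_infer_code=params_infer_code)

for i, wav in enumerate(wavs):
    torchaudio.save(f"speaker_{i}.wav", torch.from_numpy(wav).unsqueeze(0), 24000)
```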
Word-Level Control Tags
Insert control tags directly into text for precise prosody:
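Two patterns from the upstream README, with illustrative sentences: a sentence-level refine prompt that lets the model decide where to place fillers and breaks, and inline word-level tags with the refiner skipped so your tags are used verbatim.

```python
import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

# Sentence-level: a refine prompt asks the model to insert
# oral fillers, laughter, and breaks at plausible positions
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt="[oral_2][laugh_0][break_6]",
)
wavs = chat.infer(
    ["This is a test of sentence level prosody control."],
    params_refine_text=params_refine_text,
)

# Word-level: place tags yourself and skip the text refiner
text = "Well [uv_break] that was unexpected [laugh] was it not? [lbreak]"
wavs = chat.infer(text, skip_refine_text=True)
```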
Batch Processing with WebUI
ChatTTS ships with a Gradio web interface for interactive use:
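From a clone of the repository, the bundled app can be launched roughly as below; the script path follows the current repository layout, and the exact host/port flags may differ by version, so check `--help` first.

```bash
cd ChatTTS
pip install -r requirements.txt

# Launch the Gradio WebUI (Gradio serves on port 7860 by default);
# check `python examples/web/webui.py --help` for host/port flags so the
# server binds an address reachable through Clore.ai port forwarding
python examples/web/webui.py
```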
Open the http_pub URL from your Clore.ai order dashboard to access the UI.
Tips for Clore.ai Users
- Use `compile=True` after initial testing — PyTorch compilation adds startup time but speeds up repeated inference significantly
- Port mapping — expose port `7860/http` when deploying with the WebUI
- Docker image — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` as a base
- Speaker persistence — save `rand_spk` strings to a file so you can reuse voices across sessions without re-sampling (see the sketch after this list)
- Batch your requests — `chat.infer()` accepts a list of texts and processes them together, which is more efficient than one-by-one calls
- Non-commercial license — the model weights are CC BY-NC 4.0; check licensing requirements for your use case
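A sketch combining the speaker-persistence and batching tips; `speaker.txt` is a hypothetical path, and the texts are placeholders.

```python
import os

import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

SPK_FILE = "speaker.txt"  # hypothetical location for the saved voice

# Reuse a previously saved speaker string if present; otherwise sample and save one
if os.path.exists(SPK_FILE):
    with open(SPK_FILE) as f:
        rand_spk = f.read().strip()
else:
    rand_spk = chat.sample_random_speaker()
    with open(SPK_FILE, "w") as f:
        f.write(rand_spk)

params_infer_code = ChatTTS.Chat.InferCodeParams(spk_emb=rand_spk)

# Batch: one infer() call over many texts beats one call per text
texts = [f"This is line number {i}." for i in range(4)]
wavs = chat.infer(texts, params_infer_code=params_infer_code)

for i, wav in enumerate(wavs):
    torchaudio.save(f"batch_{i}.wav", torch.from_numpy(wav).unsqueeze(0), 24000)
```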
Troubleshooting
| Problem | Solution |
| --- | --- |
| CUDA out of memory | Reduce batch size or use a GPU with ≥ 6 GB VRAM |
| Model downloads slowly | Pre-download from Hugging Face: `huggingface-cli download 2Noise/ChatTTS` |
| Audio has static/noise | This is intentional in the open-source model (anti-abuse measure); use `compile=True` for cleaner output |
| `torchaudio.save` dimension error | Ensure the tensor is 2D: `audio.unsqueeze(0)` if needed |
| Garbled Chinese output | Make sure input text is UTF-8 encoded; install WeTextProcessing for better normalization |
| Slow first inference | Normal — model compilation and weight loading happen on the first call; subsequent calls are faster |