ChatTTS Conversational Speech

Run ChatTTS conversational text-to-speech with fine-grained prosody control on Clore.ai GPUs.

ChatTTS is a 300M-parameter generative speech model optimized for dialogue scenarios such as LLM assistants, chatbots, and interactive voice applications. It produces natural-sounding speech with realistic pauses, laughter, fillers, and intonation — characteristics that most TTS systems struggle to reproduce. The model supports English and Chinese and generates audio at 24 kHz.

GitHub: 2noise/ChatTTS (30K+ stars). License: AGPLv3+ (code), CC BY-NC 4.0 (model weights — non-commercial).

Key Features

  • Conversational prosody — natural pauses, fillers, and intonation tuned for dialogue

  • Fine-grained control tags — [oral_0-9], [laugh_0-2], [break_0-7], [uv_break], [lbreak]

  • Multi-speaker — sample random speakers or reuse speaker embeddings for consistency

  • Temperature / top-P / top-K — control generation diversity

  • Batch inference — synthesize multiple texts in a single call

  • Lightweight — ~300M parameters, runs on 4 GB VRAM

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3060 (4 GB free) | RTX 3090 / RTX 4090 |
| VRAM | 4 GB | 8 GB+ |
| RAM | 8 GB | 16 GB |
| Disk | 5 GB | 10 GB |
| Python | 3.9+ | 3.11 |
| CUDA | 11.8+ | 12.1+ |

Clore.ai recommendation: An RTX 3060 ($0.15–0.30/day) handles ChatTTS comfortably. For batch production or lower latency, pick an RTX 3090 ($0.30–1.00/day).

Installation
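A minimal setup on a fresh instance, following the upstream README (the PyPI package is the stable route; installing from source tracks the latest fixes):

```bash
# Option A: stable release from PyPI
pip install ChatTTS

# Option B: latest source
git clone https://github.com/2noise/ChatTTS
cd ChatTTS
pip install --upgrade -r requirements.txt
```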

Quick Start
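A minimal synthesis script, mirroring the upstream README's basic example (recent versions expose chat.load(); the filenames here are illustrative):

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)  # compile=True trades slower startup for faster repeated inference

texts = [
    "Hello from a Clore.ai GPU instance.",
    "ChatTTS is tuned for dialogue, with natural pauses and fillers.",
]

# infer() accepts a list of texts and batches them in a single call
wavs = chat.infer(texts)

for i, wav in enumerate(wavs):
    # Output is 24 kHz; torchaudio.save expects a 2-D (channels, samples)
    # tensor, so add a channel dimension if the returned array is 1-D
    tensor = torch.from_numpy(wav)
    if tensor.dim() == 1:
        tensor = tensor.unsqueeze(0)
    torchaudio.save(f"output_{i}.wav", tensor, 24000)
```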

Usage Examples

Consistent Speaker Voice

Sample a random speaker embedding and reuse it across multiple generations for a consistent voice:
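A sketch using the upstream API (sample_random_speaker() returns a string that encodes the voice; the sampling values shown are the README's illustrative defaults):

```python
# Assumes `chat` is loaded as in Quick Start
rand_spk = chat.sample_random_speaker()
print(rand_spk)  # persist this string to recover the same timbre later

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,   # reuse the same embedding for a consistent voice
    temperature=0.3,    # lower values give steadier prosody
    top_P=0.7,
    top_K=20,
)

wavs = chat.infer(
    ["First line of dialogue.", "Second line, same voice."],
    params_infer_code=params_infer_code,
)
```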

Word-Level Control Tags

Insert control tags directly into text for precise prosody:
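Tags can go inline in the text (bypassing the refine stage so they pass through verbatim) or into a sentence-level prompt via RefineTextParams, as in the upstream README; the example text is illustrative:

```python
# Inline tags: skip text refinement so the tags are honored verbatim
wavs = chat.infer(
    "What is [uv_break]your favorite food?[laugh][lbreak]",
    skip_refine_text=True,
)

# Sentence-level prompt: the refiner inserts oral fillers, laughs, and breaks
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt="[oral_2][laugh_0][break_6]",
)
wavs = chat.infer(["Plain input text."], params_refine_text=params_refine_text)
```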

Batch Processing with WebUI

ChatTTS ships with a Gradio web interface for interactive use:
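Launch it from a source checkout (script path as in recent versions of the repository); this assumes the Gradio default port of 7860, so check the script's CLI flags if your version binds elsewhere:

```bash
python examples/web/webui.py
```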

Open the http_pub URL from your Clore.ai order dashboard to access the UI.

Tips for Clore.ai Users

  • Use compile=True after initial testing — PyTorch compilation adds startup time but speeds up repeated inference significantly

  • Port mapping — expose port 7860/http when deploying with the WebUI

  • Docker image — use pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime as a base

  • Speaker persistence — save rand_spk strings to a file so you can reuse voices across sessions without re-sampling (see the sketch after this list)

  • Batch your requests — chat.infer() accepts a list of texts and processes them together, which is more efficient than one-by-one calls

  • Non-commercial license — the model weights are CC BY-NC 4.0; check licensing requirements for your use case
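A small sketch of the speaker-persistence tip above (the file path is a hypothetical choice; assumes `chat` is loaded as in Quick Start):

```python
from pathlib import Path

SPK_FILE = Path("speaker.txt")  # hypothetical location for the saved voice

if SPK_FILE.exists():
    spk = SPK_FILE.read_text()          # reuse a previously sampled voice
else:
    spk = chat.sample_random_speaker()  # sample once, persist for later sessions
    SPK_FILE.write_text(spk)

params = ChatTTS.Chat.InferCodeParams(spk_emb=spk)
wavs = chat.infer(["Same voice across sessions."], params_infer_code=params)
```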

Troubleshooting

| Problem | Solution |
| --- | --- |
| CUDA out of memory | Reduce batch size or use a GPU with ≥ 6 GB VRAM |
| Model downloads slowly | Pre-download from HuggingFace: huggingface-cli download 2Noise/ChatTTS |
| Audio has static/noise | This is intentional in the open-source model (anti-abuse measure); use compile=True for cleaner output |
| torchaudio.save dimension error | Ensure the tensor is 2-D: audio.unsqueeze(0) if needed |
| Garbled Chinese output | Make sure input text is UTF-8 encoded; install WeTextProcessing for better normalization |
| Slow first inference | Normal — model compilation and weight loading happen on the first call; subsequent calls are faster |
