# ChatTTS Conversational Speech

ChatTTS is a 300M-parameter generative speech model optimized for dialogue scenarios such as LLM assistants, chatbots, and interactive voice applications. It produces natural-sounding speech with realistic pauses, laughter, fillers, and intonation — characteristics that most TTS systems struggle to reproduce. The model supports English and Chinese and generates audio at 24 kHz.

**GitHub:** [2noise/ChatTTS](https://github.com/2noise/ChatTTS) (30K+ stars)

**License:** AGPLv3+ (code), CC BY-NC 4.0 (model weights — non-commercial)

## Key Features

* **Conversational prosody** — natural pauses, fillers, and intonation tuned for dialogue
* **Fine-grained control tags** — `[oral_0-9]`, `[laugh_0-2]`, `[break_0-7]`, `[uv_break]`, `[lbreak]`
* **Multi-speaker** — sample random speakers or reuse speaker embeddings for consistency
* **Temperature / top-P / top-K** — control generation diversity
* **Batch inference** — synthesize multiple texts in a single call
* **Lightweight** — \~300M parameters, runs on 4 GB VRAM

## Requirements

| Component | Minimum              | Recommended         |
| --------- | -------------------- | ------------------- |
| GPU       | RTX 3060 (4 GB free) | RTX 3090 / RTX 4090 |
| VRAM      | 4 GB                 | 8 GB+               |
| RAM       | 8 GB                 | 16 GB               |
| Disk      | 5 GB                 | 10 GB               |
| Python    | 3.9+                 | 3.11                |
| CUDA      | 11.8+                | 12.1+               |

**Clore.ai recommendation:** An RTX 3060 (~$0.15–0.30/day) handles ChatTTS comfortably. For batch production or lower latency, pick an RTX 3090 (~$0.30–1.00/day).

## Installation

```bash
# Install from PyPI
pip install ChatTTS torch torchaudio

# Or install from source for the latest features
git clone https://github.com/2noise/ChatTTS.git
cd ChatTTS
pip install -r requirements.txt

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

## Quick Start

```python
import ChatTTS
import torch
import torchaudio

# Initialize and load model (downloads weights on first run)
chat = ChatTTS.Chat()
chat.load(compile=False)  # Set compile=True for faster inference after warmup

texts = [
    "Hey there! How's your day going so far?",
    "I've been working on this project all morning. It's coming along nicely.",
]

wavs = chat.infer(texts)

for i, wav in enumerate(wavs):
    audio_tensor = torch.from_numpy(wav)
    if audio_tensor.dim() == 1:
        audio_tensor = audio_tensor.unsqueeze(0)
    torchaudio.save(f"output_{i}.wav", audio_tensor, 24000)
    print(f"Saved output_{i}.wav")
```

## Usage Examples

### Consistent Speaker Voice

Sample a random speaker embedding and reuse it across multiple generations for a consistent voice:

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

# Sample a speaker — save this string to reuse later
rand_spk = chat.sample_random_speaker()

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,
    temperature=0.3,
    top_P=0.7,
    top_K=20,
)

params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_2][laugh_0][break_4]',
)

texts = ["Welcome to today's episode. Let me tell you about something exciting."]

wavs = chat.infer(
    texts,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)

audio = torch.from_numpy(wavs[0])
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("consistent_speaker.wav", audio, 24000)
```

### Word-Level Control Tags

Insert control tags directly into text for precise prosody:

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

# Tags: [uv_break] = short pause, [laugh] = laughter, [lbreak] = long break
text = 'What is [uv_break]your favorite food?[laugh][lbreak]'

rand_spk = chat.sample_random_speaker()
params = ChatTTS.Chat.InferCodeParams(spk_emb=rand_spk, temperature=0.3)

# skip_refine_text=True preserves your manual control tags
wavs = chat.infer(text, skip_refine_text=True, params_infer_code=params)

audio = torch.from_numpy(wavs[0])
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("controlled_output.wav", audio, 24000)
```

### Batch Processing with WebUI

ChatTTS ships with a Gradio web interface for interactive use:

```bash
cd ChatTTS
python examples/web/webui.py --server_name 0.0.0.0 --server_port 7860
```

Open the `http_pub` URL from your Clore.ai order dashboard to access the UI.

## Tips for Clore.ai Users

* **Use `compile=True`** after initial testing — PyTorch compilation adds startup time but speeds up repeated inference significantly
* **Port mapping** — expose port `7860/http` when deploying with the WebUI
* **Docker image** — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` as a base
* **Speaker persistence** — save `rand_spk` strings to a file so you can reuse voices across sessions without re-sampling
* **Batch your requests** — `chat.infer()` accepts a list of texts and processes them together, which is more efficient than one-by-one calls
* **Non-commercial license** — the model weights are CC BY-NC 4.0; check licensing requirements for your use case
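The speaker-persistence tip above can be sketched as a pair of small helpers. Since `chat.sample_random_speaker()` returns a plain string, a text file is enough; the placeholder string below stands in for a real embedding:

```python
# Sketch: persist a ChatTTS speaker embedding string between sessions.
# Assumes `rand_spk` came from chat.sample_random_speaker(), which
# returns a plain string; the value below is a placeholder.
from pathlib import Path

def save_speaker(spk: str, path: str = "speaker.txt") -> None:
    """Write the speaker embedding string to disk."""
    Path(path).write_text(spk, encoding="utf-8")

def load_speaker(path: str = "speaker.txt") -> str:
    """Read a previously saved speaker embedding string."""
    return Path(path).read_text(encoding="utf-8")

save_speaker("placeholder_embedding_string")
print(load_speaker())  # same string round-trips intact
```

The loaded string can then be passed as `spk_emb` in `ChatTTS.Chat.InferCodeParams`, exactly as in the "Consistent Speaker Voice" example above.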

## Troubleshooting

| Problem                           | Solution                                                                                                 |
| --------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory`              | Reduce batch size or use a GPU with ≥ 6 GB VRAM                                                          |
| Model downloads slowly            | Pre-download from HuggingFace: `huggingface-cli download 2Noise/ChatTTS`                                 |
| Audio has static/noise            | Slight high-frequency noise is intentional in the open-source release (an anti-abuse measure); lowering `temperature` can reduce additional artifacts |
| `torchaudio.save` dimension error | Ensure tensor is 2D: `audio.unsqueeze(0)` if needed                                                      |
| Garbled Chinese output            | Make sure input text is UTF-8 encoded; install `WeTextProcessing` for better normalization               |
| Slow first inference              | Normal — model compilation and weight loading happen on first call; subsequent calls are faster          |
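For the `CUDA out of memory` row above, splitting a long text list into smaller batches keeps peak VRAM bounded while still benefiting from batched inference. A minimal sketch (the `chat.infer` call is commented out because it needs a loaded model; the batch size of 4 is an assumption to tune for your GPU):

```python
# Sketch: chunk a list of texts into small batches to limit VRAM use.
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"Sentence number {n}." for n in range(10)]
all_wavs = []
for batch in batched(texts, 4):
    # wavs = chat.infer(batch)   # one GPU call per small batch
    # all_wavs.extend(wavs)
    print(len(batch))  # prints 4, 4, 2
```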
