# ChatTTS Conversational Speech

ChatTTS is a 300M-parameter generative speech model optimized for dialogue scenarios such as LLM assistants, chatbots, and interactive voice applications. It produces natural-sounding speech with realistic pauses, laughter, fillers, and intonation — characteristics that most TTS systems struggle to reproduce. The model supports English and Chinese and generates audio at 24 kHz.

**GitHub:** [2noise/ChatTTS](https://github.com/2noise/ChatTTS) (30K+ stars)

**License:** AGPLv3+ (code), CC BY-NC 4.0 (model weights — non-commercial)

## Key Features

* **Conversational prosody** — natural pauses, fillers, and intonation tuned for dialogue
* **Fine-grained control tags** — `[oral_0-9]`, `[laugh_0-2]`, `[break_0-7]`, `[uv_break]`, `[lbreak]`
* **Multi-speaker** — sample random speakers or reuse speaker embeddings for consistency
* **Temperature / top-P / top-K** — control generation diversity
* **Batch inference** — synthesize multiple texts in a single call
* **Lightweight** — \~300M parameters, runs on 4 GB VRAM

## Requirements

| Component | Minimum              | Recommended         |
| --------- | -------------------- | ------------------- |
| GPU       | RTX 3060 (4 GB free) | RTX 3090 / RTX 4090 |
| VRAM      | 4 GB                 | 8 GB+               |
| RAM       | 8 GB                 | 16 GB               |
| Disk      | 5 GB                 | 10 GB               |
| Python    | 3.9+                 | 3.11                |
| CUDA      | 11.8+                | 12.1+               |

**Clore.ai recommendation:** An RTX 3060 (~$0.15–0.30/day) handles ChatTTS comfortably. For batch production or lower latency, pick an RTX 3090 (~$0.30–1.00/day).

## Installation

```bash
# Install from PyPI
pip install ChatTTS torch torchaudio

# Or install from source for the latest features
git clone https://github.com/2noise/ChatTTS.git
cd ChatTTS
pip install -r requirements.txt

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

## Quick Start

```python
import ChatTTS
import torch
import torchaudio

# Initialize and load model (downloads weights on first run)
chat = ChatTTS.Chat()
chat.load(compile=False)  # Set compile=True for faster inference after warmup

texts = [
    "Hey there! How's your day going so far?",
    "I've been working on this project all morning. It's coming along nicely.",
]

wavs = chat.infer(texts)

for i, wav in enumerate(wavs):
    audio_tensor = torch.from_numpy(wav)
    if audio_tensor.dim() == 1:
        audio_tensor = audio_tensor.unsqueeze(0)
    torchaudio.save(f"output_{i}.wav", audio_tensor, 24000)
    print(f"Saved output_{i}.wav")
```

## Usage Examples

### Consistent Speaker Voice

Sample a random speaker embedding and reuse it across multiple generations for a consistent voice:

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

# Sample a speaker — save this string to reuse later
rand_spk = chat.sample_random_speaker()

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,
    temperature=0.3,
    top_P=0.7,
    top_K=20,
)

params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_2][laugh_0][break_4]',
)

texts = ["Welcome to today's episode. Let me tell you about something exciting."]

wavs = chat.infer(
    texts,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)

audio = torch.from_numpy(wavs[0])
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("consistent_speaker.wav", audio, 24000)
```

### Word-Level Control Tags

Insert control tags directly into text for precise prosody:

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

# Tags: [uv_break] = short pause, [laugh] = laughter, [lbreak] = long break
text = 'What is [uv_break]your favorite food?[laugh][lbreak]'

rand_spk = chat.sample_random_speaker()
params = ChatTTS.Chat.InferCodeParams(spk_emb=rand_spk, temperature=0.3)

# skip_refine_text=True preserves your manual control tags
wavs = chat.infer(text, skip_refine_text=True, params_infer_code=params)

audio = torch.from_numpy(wavs[0])
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("controlled_output.wav", audio, 24000)
```

### Batch Processing with WebUI

ChatTTS ships with a Gradio web interface for interactive use:

```bash
cd ChatTTS
python examples/web/webui.py --server_name 0.0.0.0 --server_port 7860
```

Open the `http_pub` URL from your Clore.ai order dashboard to access the UI.

## Tips for Clore.ai Users

* **Use `compile=True`** after initial testing — PyTorch compilation adds startup time but speeds up repeated inference significantly
* **Port mapping** — expose port `7860/http` when deploying with the WebUI
* **Docker image** — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` as a base
* **Speaker persistence** — save `rand_spk` strings to a file so you can reuse voices across sessions without re-sampling (see the sketch after this list)
* **Batch your requests** — `chat.infer()` accepts a list of texts and processes them together, which is more efficient than one-by-one calls
* **Non-commercial license** — the model weights are CC BY-NC 4.0; check licensing requirements for your use case
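
A minimal sketch of the speaker-persistence tip, using the same API as the examples above. The `speaker.txt` path is an arbitrary choice for illustration:

```python
import os

import ChatTTS
import torch
import torchaudio

SPEAKER_FILE = "speaker.txt"  # illustrative path; any writable location works

chat = ChatTTS.Chat()
chat.load(compile=False)

# Reuse a previously saved speaker string if one exists;
# otherwise sample a new one and save it for future sessions.
if os.path.exists(SPEAKER_FILE):
    with open(SPEAKER_FILE) as f:
        spk = f.read().strip()
else:
    spk = chat.sample_random_speaker()
    with open(SPEAKER_FILE, "w") as f:
        f.write(spk)

params = ChatTTS.Chat.InferCodeParams(spk_emb=spk, temperature=0.3)
wavs = chat.infer(["Same voice, every session."], params_infer_code=params)

audio = torch.from_numpy(wavs[0])
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("persistent_voice.wav", audio, 24000)
```

Run it twice: the first run samples and saves a speaker, and every later run reloads the same string, so the generated voice stays identical across sessions.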

## Troubleshooting

| Problem                           | Solution                                                                                                 |
| --------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory`              | Reduce batch size or use a GPU with ≥ 6 GB VRAM                                                          |
| Model downloads slowly            | Pre-download from HuggingFace: `huggingface-cli download 2Noise/ChatTTS`                                 |
| Audio has static/noise            | Expected: the authors deliberately added high-frequency noise to the released weights as an anti-abuse measure |
| `torchaudio.save` dimension error | Ensure tensor is 2D: `audio.unsqueeze(0)` if needed                                                      |
| Garbled Chinese output            | Make sure input text is UTF-8 encoded; install `WeTextProcessing` for better normalization               |
| Slow first inference              | Normal — model compilation and weight loading happen on first call; subsequent calls are faster          |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/audio-and-voice/chattts.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
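
As an illustration, the same request issued from Python with only the standard library. The question text is an arbitrary example, and the raw response body is printed as-is:

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical question; URL-encode it before appending as the `ask` parameter.
question = "How do I keep the same ChatTTS voice across sessions?"
url = f"https://docs.clore.ai/guides/audio-and-voice/chattts.md?ask={quote(question)}"

with urlopen(url) as resp:
    print(resp.read().decode("utf-8"))
```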
