# Dia TTS (Nari Labs)

Dia by Nari Labs is an advanced text-to-speech model that specializes in **realistic multi-speaker dialogue**. Unlike traditional TTS models, which voice one speaker at a time, Dia generates natural conversations between multiple speakers, complete with emotion, laughter, hesitation, and other non-verbal cues. At 1.6B parameters, it fits comfortably on consumer GPUs with 8GB+ of VRAM.

## Key Features

* **Multi-speaker dialog**: Generate conversations between 2+ speakers in one pass
* **Non-verbal cues**: Laughter `(laughs)`, hesitation `(sighs)`, pauses — automatically embedded
* **Emotional speech**: Natural intonation without explicit emotion tags
* **1.6B parameters**: Fits on RTX 3070/3080 (8-10GB VRAM)
* **Apache 2.0 license**: Full commercial use
* **HuggingFace integration**: Works with Transformers library
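
Dia reads a plain-text script in which every turn is prefixed with a speaker tag (`[S1]`, `[S2]`, …). A small helper like the following (an illustration, not part of the Dia API) builds that format from structured data:

```python
def to_dia_script(turns):
    """Format (speaker_index, line) pairs into Dia's [S1]/[S2] tag syntax.

    Speaker indices are 1-based; non-verbal cues such as (laughs) or
    (sighs) can be included inline in any line.
    """
    return "\n".join(f"[S{speaker}] {line}" for speaker, line in turns)


script = to_dia_script([
    (1, "Hey, did the render finish?"),
    (2, "(laughs) Hours ago. Check your inbox."),
])
print(script)
# [S1] Hey, did the render finish?
# [S2] (laughs) Hours ago. Check your inbox.
```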

## Requirements

| Component | Minimum        | Recommended     |
| --------- | -------------- | --------------- |
| GPU       | RTX 3070 (8GB) | RTX 3080 (10GB) |
| VRAM      | 8GB            | 10GB+           |
| RAM       | 16GB           | 32GB            |
| Disk      | 10GB           | 15GB            |
| Python    | 3.9+           | 3.11            |

**Recommended Clore.ai GPU**: RTX 3080 10GB (\~$0.2–0.5/day)
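
The 8GB floor follows from simple arithmetic: the 1.6B weights alone take roughly 3 GiB in fp16 (about 6 GiB in fp32), leaving headroom for activations and the audio codec. A quick sanity check:

```python
# Back-of-envelope VRAM needed just for the weights of a 1.6B-parameter model.
PARAMS = 1.6e9

def weights_gib(bytes_per_param: int) -> float:
    """Weight memory in GiB at the given precision (4 = fp32, 2 = fp16)."""
    return PARAMS * bytes_per_param / 1024**3

print(f"fp32: {weights_gib(4):.1f} GiB")  # ~6.0 GiB
print(f"fp16: {weights_gib(2):.1f} GiB")  # ~3.0 GiB
```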

## Installation

```bash
# Option 1: pip install
pip install dia-tts

# Option 2: From source
git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .
```

## Quick Start

### Basic Multi-Speaker Dialog

```python
from dia.model import Dia

# Load model
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Generate multi-speaker conversation
# [S1] = Speaker 1, [S2] = Speaker 2
text = """[S1] Hey, have you tried the new GPU rental platform?
[S2] You mean Clore? Yeah, I rented an RTX 4090 yesterday.
[S1] How was it?
[S2] (laughs) Honestly? Way cheaper than I expected. Like two bucks a day.
[S1] No way. That's... that's actually insane."""

audio = model.generate(text)

# Save to file (Dia outputs 44.1 kHz audio)
import soundfile as sf
sf.write("dialog.wav", audio, samplerate=44100)
```
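
Very long scripts are best generated in pieces. The helper below (an illustration, not part of the Dia API) splits a tagged script only at speaker-turn boundaries, so no `[S#]` tag is cut in half; generate each chunk separately and join the audio afterwards. The character budget is an assumption to tune for your hardware:

```python
def split_script(script: str, max_chars: int = 500) -> list[str]:
    """Split a [S1]/[S2]-tagged script into chunks of at most max_chars,
    breaking only between speaker turns so tags stay intact."""
    chunks, current = [], []
    for turn in script.strip().splitlines():
        candidate = "\n".join(current + [turn])
        if current and len(candidate) > max_chars:
            chunks.append("\n".join(current))
            current = [turn]
        else:
            current.append(turn)
    if current:
        chunks.append("\n".join(current))
    return chunks

# chunks = split_script(text)
# audio_parts = [model.generate(chunk) for chunk in chunks]
```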

### With Emotion and Non-Verbal Cues

```python
# Dia automatically handles natural speech patterns
text = """[S1] I just got the results back...
[S2] And? Don't keep me in suspense!
[S1] (sighs) We passed. We actually passed all the tests.
[S2] (laughs) I told you! I told you we'd make it!
[S1] I can't believe it... (laughs) okay, okay, let's celebrate."""

audio = model.generate(text, temperature=0.8)
sf.write("emotional_dialog.wav", audio, samplerate=44100)
```

### Single Speaker

```python
# Works for single speaker too
text = "[S1] Welcome to the Clore AI documentation. In this guide, we'll walk through setting up your first GPU rental and deploying a machine learning model."

audio = model.generate(text)
sf.write("narration.wav", audio, samplerate=44100)
```
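
If you generate a script in several passes (per scene, or per chunk), the clips can be stitched together with a short pause between them. This is a generic NumPy sketch, assuming mono float audio; pass the same sample rate you use when saving:

```python
import numpy as np

def join_clips(clips, sr, pause_s=0.4):
    """Concatenate mono clips with pause_s seconds of silence between them."""
    silence = np.zeros(int(pause_s * sr), dtype=np.float32)
    parts = []
    for i, clip in enumerate(clips):
        if i:  # insert a pause before every clip except the first
            parts.append(silence)
        parts.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(parts)

# combined = join_clips([intro_audio, dialog_audio], sr=sample_rate)
# sf.write("combined.wav", combined, samplerate=sample_rate)
```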

## Gradio Web UI

Launch the bundled Gradio demo from the cloned repository (it serves on port 7860 by default):

```bash
python app.py
```

Or build a minimal interface manually:

```python
import gradio as gr
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

def generate_speech(text):
    audio = model.generate(text)
    return (44100, audio)

demo = gr.Interface(
    fn=generate_speech,
    inputs=gr.Textbox(label="Dialog (use [S1], [S2] tags)", lines=10),
    outputs=gr.Audio(label="Generated Speech"),
    title="Dia TTS — Multi-Speaker Dialog"
)
demo.launch(server_port=7860)
```

## Use Cases

* **Podcast generation**: Create conversational podcasts from scripts
* **Audiobook dialogs**: Generate character conversations with distinct voices
* **Game dialogue**: NPC conversations with natural speech patterns
* **Training data**: Generate diverse speech datasets for ASR training
* **Chatbot voices**: Multi-turn dialog with emotional responses

## Tips for Clore.ai Users

* **RTX 3080 is ideal**: 10GB VRAM handles Dia easily at \~$0.2–0.5/day
* **Batch generation**: Process multiple dialogs in a loop to maximize your rental time
* **Save models to persistent storage**: If your Clore instance has persistent disk, cache the model to avoid re-downloading
* **Temperature 0.7–0.9**: Lower = more consistent, higher = more expressive/varied
* **English only**: Dia currently focuses on English — for multilingual, see Qwen3-TTS guide
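
Dia's weights are fetched from the HuggingFace Hub, so redirecting the Hub cache to a persistent mount means restarts skip the download. The path below is illustrative; use whatever persistent directory your Clore instance provides:

```shell
# Point the HuggingFace cache at persistent storage (path is illustrative)
export HF_HOME=/workspace/hf-cache
```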

## Troubleshooting

| Issue                   | Solution                                                              |
| ----------------------- | --------------------------------------------------------------------- |
| CUDA out of memory      | Cast to half precision, e.g. `model.to("cuda", torch.float16)`        |
| Speakers sound similar  | Add more text/context per speaker; try higher temperature             |
| Non-verbal cues ignored | Ensure correct format: `(laughs)`, `(sighs)` in parentheses           |
| Audio quality low       | Increase `num_steps` parameter if available; save at the model's 44.1 kHz output rate |
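
Several of the issues above come down to malformed scripts. A small pre-flight check (a hypothetical helper, not part of Dia) catches lines that are missing their speaker tag before you spend GPU time:

```python
import re

def lint_script(script: str) -> list[str]:
    """Return a warning for each line that does not start with an [S<n>] tag."""
    warnings = []
    for i, line in enumerate(script.strip().splitlines(), start=1):
        if not re.match(r"\[S\d+\]\s", line):
            warnings.append(f"line {i}: missing [S#] speaker tag")
    return warnings

# lint_script("[S1] Hello\nNo tag here")  ->  ["line 2: missing [S#] speaker tag"]
```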

## Further Reading

* [Nari Labs GitHub](https://github.com/nari-labs/dia)
* [HuggingFace Model](https://huggingface.co/nari-labs/Dia-1.6B)
* [Comparison: Dia vs ElevenLabs](https://nari-labs.github.io/dia/) — official demo page
