Dia TTS (Nari Labs)

Generate multi-speaker dialog with emotion using Dia TTS by Nari Labs

Dia by Nari Labs is an advanced text-to-speech model that specializes in realistic multi-speaker dialogue. Unlike traditional TTS models that voice one speaker at a time, Dia generates natural conversations between multiple speakers, complete with emotion, laughter, hesitation, and other non-verbal cues. At 1.6B parameters, it fits on consumer GPUs with 8GB or more of VRAM.

Key Features

  • Multi-speaker dialog: Generate conversations between 2+ speakers in one pass

  • Non-verbal cues: laughter `(laughs)`, sighs `(sighs)`, pauses, and more, written inline in the script

  • Emotional speech: Natural intonation without explicit emotion tags

  • 1.6B parameters: Fits on RTX 3070/3080 (8-10GB VRAM)

  • Apache 2.0 license: Full commercial use

  • HuggingFace integration: Works with Transformers library

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3070 (8GB) | RTX 3080 (10GB) |
| VRAM | 8GB | 10GB+ |
| RAM | 16GB | 32GB |
| Disk | 10GB | 15GB |
| Python | 3.9+ | 3.11 |

Recommended Clore.ai GPU: RTX 3080 10GB (~$0.2–0.5/day)

Installation
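One way to set up Dia is from source. This sketch assumes the public nari-labs/dia GitHub repository; a PyPI package name is not guaranteed, so installing from the repo is the safer default.

```shell
# Python 3.9+ and a CUDA-capable GPU are assumed (see Requirements).
git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .

# Alternatively, install straight from GitHub without a local checkout:
# pip install git+https://github.com/nari-labs/dia.git
```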

Quick Start
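A minimal sketch, assuming the `dia` package from the nari-labs/dia repo and the `nari-labs/Dia-1.6B` checkpoint on HuggingFace. The import is guarded so the snippet is a no-op if the package is not installed yet.

```python
# Dia marks speakers with inline [S1]/[S2] tags in the input text.
transcript = "[S1] Welcome to the show. [S2] Thanks for having me. (laughs)"

try:
    import soundfile as sf
    from dia.model import Dia  # assumed import path from the upstream repo

    model = Dia.from_pretrained("nari-labs/Dia-1.6B")
    audio = model.generate(transcript)
    # 44100 Hz matches the upstream examples; adjust if your build differs.
    sf.write("dialog.wav", audio, 44100)
except ModuleNotFoundError:
    print("Install the dia package first (see Installation).")
```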

Basic Multi-Speaker Dialog
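Dia reads speaker turns from the inline `[S1]`/`[S2]` tags, so a dialog script is just one tagged string. A small helper (hypothetical, purely for illustration) keeps longer scripts manageable:

```python
def make_transcript(turns):
    """Join (speaker_number, line) pairs into Dia's inline-tag format."""
    return " ".join(f"[S{speaker}] {line}" for speaker, line in turns)

turns = [
    (1, "Did you catch the launch this morning?"),
    (2, "I did. The booster landing was incredible."),
    (1, "Right? I rewatched it three times."),
]
transcript = make_transcript(turns)
print(transcript)
# [S1] Did you catch the launch this morning? [S2] I did. The booster landing was incredible. [S1] Right? I rewatched it three times.
```

The resulting string is passed to the model's generate call in one pass; both speakers are synthesized with distinct voices.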

With Emotion and Non-Verbal Cues
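Non-verbal cues go directly into the script, in parentheses, at the exact point where they should occur in the audio. The cue names `(laughs)` and `(sighs)` come from this guide; the script below is an illustrative example.

```python
# Cues are written inline, in parentheses, where they should be heard.
transcript = (
    "[S1] I finally finished the migration. (sighs) That took all week. "
    "[S2] (laughs) I told you the schema was a mess. "
    "[S1] Fair. Next time you're doing it."
)
# Pass `transcript` to the model's generate call just like any other dialog.
```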

Single Speaker
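Single-speaker narration uses the same format: simply tag every line with the same speaker.

```python
# For narration, keep the whole script under one [S1] tag.
transcript = (
    "[S1] Welcome back. In this episode we look at open-source "
    "text-to-speech, and why dialogue-focused models are different."
)
# Generate exactly as before; Dia produces a single consistent voice.
```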

Gradio Web UI
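The upstream repo ships a Gradio demo. A typical launch sequence, assuming you have cloned and installed the repo as described under Installation and that the demo's entry point is `app.py` (an assumption about the repo layout), looks like:

```shell
# From the cloned repo, launch the bundled Gradio demo.
cd dia
python app.py
# Gradio prints a local URL, by default http://127.0.0.1:7860.
# On a Clore.ai instance, forward or expose that port to reach the UI.
```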

Use Cases

  • Podcast generation: Create conversational podcasts from scripts

  • Audiobook dialogs: Generate character conversations with distinct voices

  • Game dialogue: NPC conversations with natural speech patterns

  • Training data: Generate diverse speech datasets for ASR training

  • Chatbot voices: Multi-turn dialog with emotional responses

Tips for Clore.ai Users

  • RTX 3080 is ideal: 10GB VRAM handles Dia easily at ~$0.2–0.5/day

  • Batch generation: Process multiple dialogs in a loop to maximize your rental time

  • Save models to persistent storage: If your Clore instance has persistent disk, cache the model to avoid re-downloading

  • Temperature 0.7–0.9: Lower = more consistent, higher = more expressive/varied

  • English only: Dia currently focuses on English — for multilingual, see Qwen3-TTS guide
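The batch-generation tip above can be sketched as a simple loop that loads the model once and reuses it for every script, which is where the rental time savings come from. This assumes the `dia` API; the import is guarded so the snippet is inert without the package.

```python
# Map output filenames to dialog scripts (illustrative content).
scripts = {
    "episode1.wav": "[S1] Welcome back. [S2] Glad to be here.",
    "episode2.wav": "[S1] Today: GPU rentals. [S2] (laughs) Again?",
}

try:
    import soundfile as sf
    from dia.model import Dia

    model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # load once, reuse
    for path, transcript in scripts.items():
        sf.write(path, model.generate(transcript), 44100)
except ModuleNotFoundError:
    print("Install the dia package first (see Installation).")
```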

Troubleshooting

| Issue | Solution |
| --- | --- |
| CUDA out of memory | Run in half precision: `model.to("cuda", dtype=torch.float16)` |
| Speakers sound similar | Give each speaker more text/context; try a higher temperature |
| Non-verbal cues ignored | Use the exact inline format, in parentheses: `(laughs)`, `(sighs)` |
| Audio quality low | Increase the `num_steps` parameter if available; save at the model's native sample rate |

Further Reading
