# Zonos TTS Voice Cloning

Zonos by [Zyphra](https://www.zyphra.com/) is a 1.6B-parameter open-weight text-to-speech model trained on more than 200,000 hours of multilingual speech. It performs zero-shot voice cloning from 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).

**GitHub:** [Zyphra/Zonos](https://github.com/Zyphra/Zonos) · **HuggingFace:** [Zyphra/Zonos-v0.1-transformer](https://huggingface.co/Zyphra/Zonos-v0.1-transformer) · **License:** Apache 2.0

## Key Features

* **Voice cloning from 2–30 seconds** — no fine-tuning required
* **44 kHz high-fidelity output** — studio-grade audio quality
* **Emotion control** — happiness, sadness, anger, fear, surprise, disgust via 8D vector
* **Speaking rate & pitch** — independent fine-grained control
* **Audio prefix inputs** — enables whispering and other hard-to-clone behaviors
* **Multilingual** — English, Japanese, Chinese, French, German
* **Two architectures** — Transformer (quality) and Hybrid/Mamba (speed, \~2× real-time on RTX 4090)
* **Apache 2.0** — free for personal and commercial use

## Requirements

| Component | Minimum            | Recommended    |
| --------- | ------------------ | -------------- |
| GPU       | RTX 3080 10 GB     | RTX 4090 24 GB |
| VRAM      | 6 GB (Transformer) | 10 GB+         |
| RAM       | 16 GB              | 32 GB          |
| Disk      | 10 GB              | 20 GB          |
| Python    | 3.10+              | 3.11           |
| CUDA      | 11.8+              | 12.4           |
| System    | espeak-ng          | —              |

**Clore.ai recommendation:** RTX 3090 (\~$0.30–1.00/day) for comfortable headroom. RTX 4090 (\~$0.50–2.00/day) for the Hybrid model and fastest inference.

## Installation

```bash
# Install system dependency (espeak-ng is required for phonemization)
apt-get update && apt-get install -y espeak-ng

# Clone and install
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
pip install -e .

# For the Hybrid model (requires Ampere+ GPU, i.e., RTX 3000 series or newer)
pip install -e ".[compile]"

# Verify
python -c "from zonos.model import Zonos; print('Zonos ready')"
```
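
Before loading the model, it can help to confirm that the basics from the requirements table are in place. The following is a minimal sketch using only the standard library; `preflight` is a hypothetical helper, not part of Zonos, and the `nvidia-smi` lookup is only a rough proxy for a working NVIDIA driver.

```python
import shutil
import sys


def preflight(min_python=(3, 10)):
    """Return basic environment checks as a dict of booleans (hypothetical helper)."""
    return {
        # The espeak-ng binary must be on PATH for phonemization.
        "espeak-ng": shutil.which("espeak-ng") is not None,
        # The requirements table lists Python 3.10 as the minimum.
        "python": sys.version_info[:2] >= min_python,
        # Assumption: nvidia-smi on PATH suggests the GPU driver is installed.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


for name, ok in preflight().items():
    print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```

A `MISSING` entry for `espeak-ng` usually means the `apt-get` step was skipped or ran in a different container.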

## Quick Start

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load model (downloads weights on first run)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Load reference audio for voice cloning
wav, sr = torchaudio.load("reference_speaker.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Build conditioning
cond_dict = make_cond_dict(
    text="Hello from Clore.ai! This is a voice cloning demonstration.",
    speaker=speaker,
    language="en-us",
)
conditioning = model.prepare_conditioning(cond_dict)

# Generate
torch.manual_seed(42)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

torchaudio.save("output.wav", wavs[0], model.autoencoder.sampling_rate)
print(f"Saved output.wav at {model.autoencoder.sampling_rate} Hz")
```

## Usage Examples

### Emotion Control

Zonos accepts an 8-dimensional emotion vector: `[happiness, sadness, disgust, fear, surprise, anger, other, neutral]`.

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

text = "I can't believe what just happened today!"

emotions = {
    "happy":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "sad":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "angry":   [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "fearful": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

for name, emo_vec in emotions.items():
    cond_dict = make_cond_dict(
        text=text,
        speaker=speaker,
        language="en-us",
        emotion=torch.tensor(emo_vec).unsqueeze(0),
    )
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)
    audio = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"emotion_{name}.wav", audio[0], model.autoencoder.sampling_rate)
    print(f"Generated: {name}")
```
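
Hand-writing eight-element lists makes it easy to put a weight in the wrong slot. The sketch below builds the vector from named weights using the order listed at the top of this section. `emotion_vector` is a hypothetical helper, and rescaling the weights to sum to 1 is a choice made here for readability, not a documented Zonos requirement.

```python
# Order taken from the emotion vector description in this guide.
EMOTION_ORDER = [
    "happiness", "sadness", "disgust", "fear",
    "surprise", "anger", "other", "neutral",
]


def emotion_vector(**weights):
    """Build an 8-element emotion list from named weights (hypothetical helper)."""
    unknown = set(weights) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotion(s): {sorted(unknown)}")
    vec = [float(weights.get(name, 0.0)) for name in EMOTION_ORDER]
    if any(w < 0 for w in vec):
        raise ValueError("emotion weights must be non-negative")
    total = sum(vec)
    if total == 0:
        raise ValueError("at least one emotion weight must be positive")
    # Assumption: rescale so the weights sum to 1, like the one-hot examples.
    return [w / total for w in vec]


print(emotion_vector(happiness=3, surprise=1))
```

The result can be passed the same way as the hard-coded lists above, for example `emotion=torch.tensor(emotion_vector(anger=1.0)).unsqueeze(0)`.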

### Speaking Rate and Pitch Control

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Slow and calm
cond_slow = make_cond_dict(
    text="Take your time. There is no rush at all.",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([8.0]),   # lower = slower
    pitch_std=torch.tensor([20.0]),      # lower = more monotone
)
codes = model.generate(model.prepare_conditioning(cond_slow))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("slow_calm.wav", audio[0], model.autoencoder.sampling_rate)

# Fast and energetic
cond_fast = make_cond_dict(
    text="Hurry up! We need to go right now!",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([22.0]),  # higher = faster
    pitch_std=torch.tensor([80.0]),      # higher = more expressive
)
codes = model.generate(model.prepare_conditioning(cond_fast))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("fast_energetic.wav", audio[0], model.autoencoder.sampling_rate)
```
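
To explore settings between the two extremes above, one option is to interpolate linearly between them. This is only a sketch: `style_kwargs` and the `energy` knob are hypothetical, the endpoints are simply the values used in the two examples, and nothing here is claimed about the valid ranges of `speaking_rate` or `pitch_std`.

```python
# Endpoints copied from the "slow and calm" and "fast and energetic" examples.
SLOW_CALM = {"speaking_rate": 8.0, "pitch_std": 20.0}
FAST_ENERGETIC = {"speaking_rate": 22.0, "pitch_std": 80.0}


def style_kwargs(energy):
    """Interpolate between the two example styles; energy is in [0, 1]."""
    if not 0.0 <= energy <= 1.0:
        raise ValueError("energy must be between 0 and 1")
    return {
        key: SLOW_CALM[key] + energy * (FAST_ENERGETIC[key] - SLOW_CALM[key])
        for key in SLOW_CALM
    }


print(style_kwargs(0.5))
```

Each value can then be wrapped as in the examples above, for instance `speaking_rate=torch.tensor([style_kwargs(0.5)["speaking_rate"]])`.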

### Gradio Web Interface

```bash
cd Zonos
python gradio_interface.py
# Or with uv:
# uv run gradio_interface.py
```

Expose port `7860/http` in your Clore.ai order and open the `http_pub` URL to access the UI.
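
If the public URL does not load, a quick first step is to check from inside the instance whether anything is listening on the port at all. The sketch below is a plain TCP probe using the standard library; `port_is_open` is a hypothetical helper, and it only confirms that a socket accepts connections, not that the Gradio app is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Run this on the instance while gradio_interface.py is running.
print("Gradio port reachable:", port_is_open("127.0.0.1", 7860))
```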

## Tips for Clore.ai Users

* **Model choice** — Transformer for best quality, Hybrid for \~2× faster inference (requires RTX 3000+ GPU)
* **Reference audio** — 10–30 seconds of clean speech gives best results; shorter clips (2–5s) work but with lower fidelity
* **Docker setup** — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, add `apt-get install -y espeak-ng` to startup
* **Port mapping** — expose `7860/http` for Gradio UI, `8000/http` for API server
* **Seed control** — set `torch.manual_seed()` before generation for reproducible output
* **Audio quality parameter** — experiment with the `audio_quality` conditioning field for cleaner output
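
The reference-audio guidance above can be turned into a quick check before cloning. This sketch uses the standard-library `wave` module, so it only handles uncompressed PCM WAV files; `check_reference` and its labels are hypothetical, and the thresholds are the 2, 10, and 30 second marks mentioned in this guide.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference(path, min_s=2.0, ideal_s=10.0, max_s=30.0):
    """Classify a reference clip by length (hypothetical helper)."""
    duration = wav_duration_seconds(path)
    if duration < min_s:
        return "too short"
    if duration > max_s:
        return "longer than the documented range"
    if duration < ideal_s:
        return "usable, lower fidelity likely"
    return "good"


# Example: print(check_reference("reference_speaker.wav"))
```

Duration is only one factor; the clip should also contain clean speech with little background noise.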

## Troubleshooting

| Problem                      | Solution                                                                              |
| ---------------------------- | ------------------------------------------------------------------------------------- |
| `espeak-ng not found`        | Run `apt-get install -y espeak-ng` (required for phonemization)                       |
| `CUDA out of memory`         | Use the Transformer model (smaller than Hybrid); reduce text length per call          |
| Hybrid model fails           | Requires Ampere+ GPU (RTX 3000 series or newer) and `pip install -e ".[compile]"`     |
| Cloned voice sounds off      | Use a longer reference clip (15–30s) with clear speech and minimal background noise   |
| Slow generation              | Normal for Transformer (\~0.5× real-time); Hybrid achieves \~2× real-time on RTX 4090 |
| `ModuleNotFoundError: zonos` | Ensure you installed from source: `cd Zonos && pip install -e .`                      |
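
For the out-of-memory case, "reduce text length per call" can be done by splitting long input at sentence boundaries and generating each piece separately. The following is a minimal sketch; `split_text` is a hypothetical helper, the 200-character default is an arbitrary assumption rather than a Zonos limit, and the regex-based sentence split is deliberately simple.

```python
import re


def split_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars where possible."""
    # Split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole.
    return chunks


print(split_text("One. Two. Three.", max_chars=9))
```

Each chunk can be passed as `text=` to `make_cond_dict` in turn, with the decoded audio tensors concatenated afterwards (for example with `torch.cat` along the time dimension).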


