# Zonos TTS Voice Cloning

Zonos by [Zyphra](https://www.zyphra.com/) is a 1.6B-parameter open-weight text-to-speech model trained on 200K+ hours of multilingual speech. It performs zero-shot voice cloning from just 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is high-fidelity 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).

**GitHub:** [Zyphra/Zonos](https://github.com/Zyphra/Zonos) **HuggingFace:** [Zyphra/Zonos-v0.1-transformer](https://huggingface.co/Zyphra/Zonos-v0.1-transformer) **License:** Apache 2.0

## Key Features

* **Voice cloning from 2–30 seconds** — no fine-tuning required
* **44 kHz high-fidelity output** — studio-grade audio quality
* **Emotion control** — happiness, sadness, anger, fear, surprise, disgust via 8D vector
* **Speaking rate & pitch** — independent fine-grained control
* **Audio prefix inputs** — enables whispering and other hard-to-clone behaviors
* **Multilingual** — English, Japanese, Chinese, French, German
* **Two architectures** — Transformer (quality) and Hybrid/Mamba (speed, \~2× real-time on RTX 4090)
* **Apache 2.0** — free for personal and commercial use

## Requirements

| Component | Minimum            | Recommended    |
| --------- | ------------------ | -------------- |
| GPU       | RTX 3080 10 GB     | RTX 4090 24 GB |
| VRAM      | 6 GB (Transformer) | 10 GB+         |
| RAM       | 16 GB              | 32 GB          |
| Disk      | 10 GB              | 20 GB          |
| Python    | 3.10+              | 3.11           |
| CUDA      | 11.8+              | 12.4           |
| System    | espeak-ng          | —              |

**Clore.ai recommendation:** RTX 3090 (~$0.30–1.00/day) for comfortable headroom. RTX 4090 (~$0.50–2.00/day) for the Hybrid model and fastest inference.
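As a rule of thumb, the requirements above can be encoded as a small helper that suggests a variant from available VRAM. This is an illustration only; the thresholds and the `recommend_variant` helper are our own, not part of Zonos:

```python
def recommend_variant(vram_gb: float) -> str:
    """Suggest a Zonos variant from available VRAM (rule of thumb, not official guidance)."""
    if vram_gb < 6:
        return "insufficient VRAM: 6 GB is the practical minimum"
    if vram_gb < 10:
        return "transformer"  # fits in ~6 GB; best quality
    return "hybrid"  # needs more headroom, ~2x faster on Ampere+ GPUs

# On a rented instance, query the actual VRAM when a CUDA GPU is present:
# import torch
# vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
# print(recommend_variant(vram))
```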

## Installation

```bash
# Install system dependency (refresh package lists first on a bare container)
apt-get update && apt-get install -y espeak-ng

# Clone and install
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
pip install -e .

# For the Hybrid model (requires Ampere+ GPU, i.e., RTX 3000 series or newer)
pip install -e ".[compile]"

# Verify
python -c "from zonos.model import Zonos; print('Zonos ready')"
```

## Quick Start

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load model (downloads weights on first run)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Load reference audio for voice cloning
wav, sr = torchaudio.load("reference_speaker.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Build conditioning
cond_dict = make_cond_dict(
    text="Hello from Clore.ai! This is a voice cloning demonstration.",
    speaker=speaker,
    language="en-us",
)
conditioning = model.prepare_conditioning(cond_dict)

# Generate
torch.manual_seed(42)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

torchaudio.save("output.wav", wavs[0], model.autoencoder.sampling_rate)
print(f"Saved output.wav at {model.autoencoder.sampling_rate} Hz")
```
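Generation works one passage at a time, so long scripts are best split into sentence-sized chunks and synthesized in a loop. A minimal splitter is sketched below; the `chunk_text` helper is our own illustration, not part of the Zonos API:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes through the same `make_cond_dict` → `prepare_conditioning` → `generate` pipeline, and the decoded waveforms can be concatenated with `torch.cat` before saving.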

## Usage Examples

### Emotion Control

Zonos accepts an 8-dimensional emotion vector: `[happiness, sadness, disgust, fear, surprise, anger, other, neutral]`.

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

text = "I can't believe what just happened today!"

emotions = {
    "happy":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "sad":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "angry":   [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "fearful": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

for name, emo_vec in emotions.items():
    cond_dict = make_cond_dict(
        text=text,
        speaker=speaker,
        language="en-us",
        emotion=torch.tensor(emo_vec).unsqueeze(0),
    )
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)
    audio = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"emotion_{name}.wav", audio[0], model.autoencoder.sampling_rate)
    print(f"Generated: {name}")
```
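The one-hot vectors above can also be blended, e.g. mostly happiness with a touch of surprise. A small hypothetical helper that builds a normalized mix (the index order follows the 8D layout above; the `blend_emotions` function is our own, and how natural a blend sounds depends on the weights, so experiment):

```python
# Index order of the 8D emotion vector, as documented above
EMOTIONS = ["happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral"]

def blend_emotions(weights: dict[str, float]) -> list[float]:
    """Build a normalized 8D emotion vector from named weights."""
    vec = [weights.get(name, 0.0) for name in EMOTIONS]
    total = sum(vec)
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return [w / total for w in vec]

# e.g. mostly happy with a touch of surprise:
# emo = blend_emotions({"happiness": 0.7, "surprise": 0.3})
# cond_dict = make_cond_dict(text=text, speaker=speaker, language="en-us",
#                            emotion=torch.tensor(emo).unsqueeze(0))
```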

### Speaking Rate and Pitch Control

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Slow and calm
cond_slow = make_cond_dict(
    text="Take your time. There is no rush at all.",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([8.0]),   # lower = slower
    pitch_std=torch.tensor([20.0]),      # lower = more monotone
)
codes = model.generate(model.prepare_conditioning(cond_slow))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("slow_calm.wav", audio[0], model.autoencoder.sampling_rate)

# Fast and energetic
cond_fast = make_cond_dict(
    text="Hurry up! We need to go right now!",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([22.0]),  # higher = faster
    pitch_std=torch.tensor([80.0]),      # higher = more expressive
)
codes = model.generate(model.prepare_conditioning(cond_fast))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("fast_energetic.wav", audio[0], model.autoencoder.sampling_rate)
```
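The two parameter pairs above generalize into named delivery presets. The `DELIVERY_PRESETS` table and `delivery` helper below are hypothetical; the values are starting points taken from the examples above, not official defaults, so tune them per voice:

```python
# (speaking_rate, pitch_std) starting points; tune per voice
DELIVERY_PRESETS = {
    "calm":      (8.0, 20.0),   # slow and near-monotone
    "neutral":   (15.0, 45.0),  # middle-of-the-road delivery
    "energetic": (22.0, 80.0),  # fast and expressive
}

def delivery(style: str) -> tuple[float, float]:
    """Look up a (speaking_rate, pitch_std) pair for a named style."""
    try:
        return DELIVERY_PRESETS[style]
    except KeyError:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(DELIVERY_PRESETS)}")

# rate, std = delivery("calm")
# cond = make_cond_dict(text=..., speaker=speaker, language="en-us",
#                       speaking_rate=torch.tensor([rate]),
#                       pitch_std=torch.tensor([std]))
```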

### Gradio Web Interface

```bash
cd Zonos
python gradio_interface.py
# Or with uv:
# uv run gradio_interface.py
```

Expose port `7860/http` in your Clore.ai order and open the `http_pub` URL to access the UI.

## Tips for Clore.ai Users

* **Model choice** — Transformer for best quality, Hybrid for \~2× faster inference (requires RTX 3000+ GPU)
* **Reference audio** — 10–30 seconds of clean speech gives best results; shorter clips (2–5s) work but with lower fidelity
* **Docker setup** — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, add `apt-get install -y espeak-ng` to startup
* **Port mapping** — expose `7860/http` for Gradio UI, `8000/http` for API server
* **Seed control** — set `torch.manual_seed()` before generation for reproducible output
* **Audio quality parameter** — experiment with the `audio_quality` conditioning field for cleaner output

## Troubleshooting

| Problem                      | Solution                                                                              |
| ---------------------------- | ------------------------------------------------------------------------------------- |
| `espeak-ng not found`        | Run `apt-get install -y espeak-ng` (required for phonemization)                       |
| `CUDA out of memory`         | Use the Transformer model (smaller than Hybrid); reduce text length per call          |
| Hybrid model fails           | Requires Ampere+ GPU (RTX 3000 series or newer) and `pip install -e ".[compile]"`     |
| Cloned voice sounds off      | Use a longer reference clip (15–30s) with clear speech and minimal background noise   |
| Slow generation              | Normal for Transformer (\~0.5× real-time); Hybrid achieves \~2× real-time on RTX 4090 |
| `ModuleNotFoundError: zonos` | Ensure you installed from source: `cd Zonos && pip install -e .`                      |
