# Zonos TTS Voice Cloning

Zonos by [Zyphra](https://www.zyphra.com/) is a 1.6B-parameter open-weight text-to-speech model trained on 200K+ hours of multilingual speech. It performs zero-shot voice cloning from just 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is high-fidelity 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).

**GitHub:** [Zyphra/Zonos](https://github.com/Zyphra/Zonos) **HuggingFace:** [Zyphra/Zonos-v0.1-transformer](https://huggingface.co/Zyphra/Zonos-v0.1-transformer) **License:** Apache 2.0

## Key Features

* **Voice cloning from 2–30 seconds** — no fine-tuning required
* **44 kHz high-fidelity output** — studio-grade audio quality
* **Emotion control** — happiness, sadness, anger, fear, surprise, disgust via 8D vector
* **Speaking rate & pitch** — independent fine-grained control
* **Audio prefix inputs** — enables whispering and other hard-to-clone behaviors
* **Multilingual** — English, Japanese, Chinese, French, German
* **Two architectures** — Transformer (quality) and Hybrid/Mamba (speed, \~2× real-time on RTX 4090)
* **Apache 2.0** — free for personal and commercial use

## Requirements

| Component | Minimum            | Recommended    |
| --------- | ------------------ | -------------- |
| GPU       | RTX 3080 10 GB     | RTX 4090 24 GB |
| VRAM      | 6 GB (Transformer) | 10 GB+         |
| RAM       | 16 GB              | 32 GB          |
| Disk      | 10 GB              | 20 GB          |
| Python    | 3.10+              | 3.11           |
| CUDA      | 11.8+              | 12.4           |
| System    | espeak-ng          | —              |

**Clore.ai recommendation:** RTX 3090 (~$0.30–1.00/day) for comfortable headroom. RTX 4090 (~$0.50–2.00/day) for the Hybrid model and fastest inference.
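As a rule of thumb, the requirements above can be encoded as a small helper that suggests a variant from available VRAM. This is an illustration only; the thresholds and the `recommend_variant` helper are our own, not part of Zonos:

```python
def recommend_variant(vram_gb: float) -> str:
    """Suggest a Zonos variant from available VRAM (rule of thumb, not official guidance)."""
    if vram_gb < 6:
        return "insufficient VRAM: 6 GB is the practical minimum"
    if vram_gb < 10:
        return "transformer"  # fits in ~6 GB; best quality
    return "hybrid"  # needs more headroom, ~2x faster on Ampere+ GPUs

# On a rented instance, query the actual VRAM when a CUDA GPU is present:
# import torch
# vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
# print(recommend_variant(vram))
```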

## Installation

```bash
# Install system dependency (refresh package lists first on a bare container)
apt-get update && apt-get install -y espeak-ng

# Clone and install
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
pip install -e .

# For the Hybrid model (requires Ampere+ GPU, i.e., RTX 3000 series or newer)
pip install -e ".[compile]"

# Verify
python -c "from zonos.model import Zonos; print('Zonos ready')"
```

## Quick Start

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load model (downloads weights on first run)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Load reference audio for voice cloning
wav, sr = torchaudio.load("reference_speaker.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Build conditioning
cond_dict = make_cond_dict(
    text="Hello from Clore.ai! This is a voice cloning demonstration.",
    speaker=speaker,
    language="en-us",
)
conditioning = model.prepare_conditioning(cond_dict)

# Generate
torch.manual_seed(42)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

torchaudio.save("output.wav", wavs[0], model.autoencoder.sampling_rate)
print(f"Saved output.wav at {model.autoencoder.sampling_rate} Hz")
```
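Generation works one passage at a time, so long scripts are best split into sentence-sized chunks and synthesized in a loop. A minimal splitter is sketched below; the `chunk_text` helper is our own illustration, not part of the Zonos API:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes through the same `make_cond_dict` → `prepare_conditioning` → `generate` pipeline, and the decoded waveforms can be concatenated with `torch.cat` before saving.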

## Usage Examples

### Emotion Control

Zonos accepts an 8-dimensional emotion vector: `[happiness, sadness, disgust, fear, surprise, anger, other, neutral]`.

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

text = "I can't believe what just happened today!"

emotions = {
    "happy":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "sad":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "angry":   [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "fearful": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

for name, emo_vec in emotions.items():
    cond_dict = make_cond_dict(
        text=text,
        speaker=speaker,
        language="en-us",
        emotion=torch.tensor(emo_vec).unsqueeze(0),
    )
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)
    audio = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"emotion_{name}.wav", audio[0], model.autoencoder.sampling_rate)
    print(f"Generated: {name}")
```
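The one-hot vectors above can also be blended, e.g. mostly happiness with a touch of surprise. A small hypothetical helper that builds a normalized mix (the index order follows the 8D layout above; the `blend_emotions` function is our own, and how natural a blend sounds depends on the weights, so experiment):

```python
# Index order of the 8D emotion vector, as documented above
EMOTIONS = ["happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral"]

def blend_emotions(weights: dict[str, float]) -> list[float]:
    """Build a normalized 8D emotion vector from named weights."""
    vec = [weights.get(name, 0.0) for name in EMOTIONS]
    total = sum(vec)
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return [w / total for w in vec]

# e.g. mostly happy with a touch of surprise:
# emo = blend_emotions({"happiness": 0.7, "surprise": 0.3})
# cond_dict = make_cond_dict(text=text, speaker=speaker, language="en-us",
#                            emotion=torch.tensor(emo).unsqueeze(0))
```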

### Speaking Rate and Pitch Control

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sr = torchaudio.load("speaker_ref.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Slow and calm
cond_slow = make_cond_dict(
    text="Take your time. There is no rush at all.",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([8.0]),   # lower = slower
    pitch_std=torch.tensor([20.0]),      # lower = more monotone
)
codes = model.generate(model.prepare_conditioning(cond_slow))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("slow_calm.wav", audio[0], model.autoencoder.sampling_rate)

# Fast and energetic
cond_fast = make_cond_dict(
    text="Hurry up! We need to go right now!",
    speaker=speaker,
    language="en-us",
    speaking_rate=torch.tensor([22.0]),  # higher = faster
    pitch_std=torch.tensor([80.0]),      # higher = more expressive
)
codes = model.generate(model.prepare_conditioning(cond_fast))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("fast_energetic.wav", audio[0], model.autoencoder.sampling_rate)
```
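The two parameter pairs above generalize into named delivery presets. The `DELIVERY_PRESETS` table and `delivery` helper below are hypothetical; the values are starting points taken from the examples above, not official defaults, so tune them per voice:

```python
# (speaking_rate, pitch_std) starting points; tune per voice
DELIVERY_PRESETS = {
    "calm":      (8.0, 20.0),   # slow and near-monotone
    "neutral":   (15.0, 45.0),  # middle-of-the-road delivery
    "energetic": (22.0, 80.0),  # fast and expressive
}

def delivery(style: str) -> tuple[float, float]:
    """Look up a (speaking_rate, pitch_std) pair for a named style."""
    try:
        return DELIVERY_PRESETS[style]
    except KeyError:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(DELIVERY_PRESETS)}")

# rate, std = delivery("calm")
# cond = make_cond_dict(text=..., speaker=speaker, language="en-us",
#                       speaking_rate=torch.tensor([rate]),
#                       pitch_std=torch.tensor([std]))
```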

### Gradio Web Interface

```bash
cd Zonos
python gradio_interface.py
# Or with uv:
# uv run gradio_interface.py
```

Expose port `7860/http` in your Clore.ai order and open the `http_pub` URL to access the UI.

## Tips for Clore.ai Users

* **Model choice** — Transformer for best quality, Hybrid for \~2× faster inference (requires RTX 3000+ GPU)
* **Reference audio** — 10–30 seconds of clean speech gives best results; shorter clips (2–5s) work but with lower fidelity
* **Docker setup** — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`, add `apt-get install -y espeak-ng` to startup
* **Port mapping** — expose `7860/http` for Gradio UI, `8000/http` for API server
* **Seed control** — set `torch.manual_seed()` before generation for reproducible output
* **Audio quality parameter** — experiment with the `audio_quality` conditioning field for cleaner output

## Troubleshooting

| Problem                      | Solution                                                                              |
| ---------------------------- | ------------------------------------------------------------------------------------- |
| `espeak-ng not found`        | Run `apt-get install -y espeak-ng` (required for phonemization)                       |
| `CUDA out of memory`         | Use the Transformer model (smaller than Hybrid); reduce text length per call          |
| Hybrid model fails           | Requires Ampere+ GPU (RTX 3000 series or newer) and `pip install -e ".[compile]"`     |
| Cloned voice sounds off      | Use a longer reference clip (15–30s) with clear speech and minimal background noise   |
| Slow generation              | Normal for Transformer (\~0.5× real-time); Hybrid achieves \~2× real-time on RTX 4090 |
| `ModuleNotFoundError: zonos` | Ensure you installed from source: `cd Zonos && pip install -e .`                      |
