Zonos TTS Voice Cloning
Run Zonos TTS by Zyphra for voice cloning with emotion and pitch control on Clore.ai GPUs.
Zonos by Zyphra is a 0.4B-parameter open-weight text-to-speech model trained on 200K+ hours of multilingual speech. It performs zero-shot voice cloning from just 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is high-fidelity 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).
GitHub: Zyphra/Zonos · HuggingFace: Zyphra/Zonos-v0.1-transformer · License: Apache 2.0
Key Features
Voice cloning from 2–30 seconds — no fine-tuning required
44 kHz high-fidelity output — studio-grade audio quality
Emotion control — happiness, sadness, anger, fear, surprise, disgust via 8D vector
Speaking rate & pitch — independent fine-grained control
Audio prefix inputs — enables whispering and other hard-to-clone behaviors
Multilingual — English, Japanese, Chinese, French, German
Two architectures — Transformer (quality) and Hybrid/Mamba (speed, ~2× real-time on RTX 4090)
Apache 2.0 — free for personal and commercial use
Requirements
| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3080 10 GB | RTX 4090 24 GB |
| VRAM | 6 GB (Transformer) | 10 GB+ |
| RAM | 16 GB | 32 GB |
| Disk | 10 GB | 20 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 11.8+ | 12.4 |
| System | espeak-ng | — |
Clore.ai recommendation: RTX 3090 ($0.30–1.00/day) for comfortable headroom. RTX 4090 ($0.50–2.00/day) for the Hybrid model and fastest inference.
Installation
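A typical from-source install on a fresh Clore.ai container; `espeak-ng` is a hard requirement for phonemization:

```bash
# System dependency for phonemization
apt-get update && apt-get install -y espeak-ng git

# Install Zonos from source
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
pip install -e .

# Optional: extras for the Hybrid/Mamba variant (Ampere or newer GPUs only)
pip install -e ".[compile]"
```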
Quick Start
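A minimal end-to-end sketch following the upstream README: load the Transformer variant, embed a reference voice, and synthesize 44 kHz audio. The path `reference.wav` is a placeholder for your own clip.

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the Transformer variant (weights download from HuggingFace on first run)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Build a speaker embedding from 2-30 seconds of reference audio
wav, sampling_rate = torchaudio.load("reference.wav")  # placeholder path
speaker = model.make_speaker_embedding(wav, sampling_rate)

torch.manual_seed(421)  # fixed seed for reproducible output

# Condition on text + speaker, then generate audio codes
cond_dict = make_cond_dict(text="Hello from Zonos!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)

# Decode to a waveform and save at the model's native 44 kHz rate
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("output.wav", wavs[0], model.autoencoder.sampling_rate)
```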
Usage Examples
Emotion Control
Zonos accepts an 8-dimensional emotion vector: [happiness, sadness, disgust, fear, surprise, anger, other, neutral].
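A sketch of emotion conditioning, assuming the `emotion` keyword of `make_cond_dict` and reusing `model` and `speaker` from the Quick Start; the weights below are illustrative, not calibrated values:

```python
# Vector order: [happiness, sadness, disgust, fear, surprise, anger, other, neutral]
mostly_happy = [0.8, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1]  # illustrative weights

cond_dict = make_cond_dict(
    text="I can't believe we actually won!",
    speaker=speaker,
    language="en-us",
    emotion=mostly_happy,
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
```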
Speaking Rate and Pitch Control
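Both controls are conditioning fields of `make_cond_dict`. A sketch, assuming the upstream parameter names `speaking_rate` (default around 15) and `pitch_std` (default around 20; higher means more pitch variation):

```python
cond_dict = make_cond_dict(
    text="This sentence is read slowly, with a flat, even pitch.",
    speaker=speaker,
    language="en-us",
    speaking_rate=10.0,  # lower = slower delivery
    pitch_std=10.0,      # lower = flatter pitch; raise for more expressive speech
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
```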
Gradio Web Interface
Expose port 7860/http in your Clore.ai order and open the http_pub URL to access the UI.
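Then launch the bundled app, binding Gradio to all interfaces so the forwarded port is reachable (this assumes the upstream `gradio_interface.py` entry point):

```bash
cd Zonos
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python gradio_interface.py
```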
Tips for Clore.ai Users
Model choice — Transformer for best quality, Hybrid for ~2× faster inference (requires RTX 3000+ GPU)
Reference audio — 10–30 seconds of clean speech gives best results; shorter clips (2–5s) work but with lower fidelity
Docker setup — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` and add `apt-get install -y espeak-ng` to startup (see the sketch after this list)
Port mapping — expose `7860/http` for the Gradio UI, `8000/http` for an API server
Seed control — set `torch.manual_seed()` before generation for reproducible output
Audio quality parameter — experiment with the `audio_quality` conditioning field for cleaner output
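A minimal on-start script for a Clore.ai order built on the image above; paths and ports are illustrative:

```bash
#!/bin/bash
# On-start script for pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
apt-get update && apt-get install -y espeak-ng git
git clone https://github.com/Zyphra/Zonos.git /root/Zonos
cd /root/Zonos && pip install -e .
# Serve the Gradio UI on the exposed 7860/http port
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python gradio_interface.py
```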
Troubleshooting
| Problem | Fix |
| --- | --- |
| `espeak-ng` not found | Run `apt-get install -y espeak-ng` (required for phonemization) |
| CUDA out of memory | Use the Transformer model (smaller than Hybrid); reduce text length per call |
| Hybrid model fails | Requires an Ampere or newer GPU (RTX 3000 series+) and `pip install -e ".[compile]"` |
| Cloned voice sounds off | Use a longer reference clip (15–30 s) with clear speech and minimal background noise |
| Slow generation | Normal for the Transformer (~0.5× real-time); Hybrid achieves ~2× real-time on an RTX 4090 |
| `ModuleNotFoundError: zonos` | Ensure you installed from source: `cd Zonos && pip install -e .` |