Zonos TTS Voice Cloning

Run Zonos TTS by Zyphra for voice cloning with emotion and pitch control on Clore.ai GPUs.

Zonos by Zyphra is a 1.6B-parameter open-weight text-to-speech model trained on 200K+ hours of multilingual speech. It performs zero-shot voice cloning from just 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is high-fidelity 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).

GitHub: Zyphra/Zonos · HuggingFace: Zyphra/Zonos-v0.1-transformer · License: Apache 2.0

Key Features

  • Voice cloning from 2–30 seconds — no fine-tuning required

  • 44 kHz high-fidelity output — studio-grade audio quality

  • Emotion control — happiness, sadness, anger, fear, surprise, disgust via 8D vector

  • Speaking rate & pitch — independent fine-grained control

  • Audio prefix inputs — enables whispering and other hard-to-clone behaviors

  • Multilingual — English, Japanese, Chinese, French, German

  • Two architectures — Transformer (quality) and Hybrid/Mamba (speed, ~2× real-time on RTX 4090)

  • Apache 2.0 — free for personal and commercial use

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | RTX 3080 10 GB | RTX 4090 24 GB |
| VRAM | 6 GB (Transformer) | 10 GB+ |
| RAM | 16 GB | 32 GB |
| Disk | 10 GB | 20 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 11.8+ | 12.4 |
| System | espeak-ng | espeak-ng |

Clore.ai recommendation: RTX 3090 ($0.30–1.00/day) for comfortable headroom. RTX 4090 ($0.50–2.00/day) for the Hybrid model and fastest inference.

Installation
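
The steps below are a minimal sketch assembled from the commands referenced elsewhere on this page (espeak-ng for phonemization, an editable install from source); adjust paths for your base image:

```bash
# System dependency required for phonemization
apt-get update && apt-get install -y espeak-ng

# Install Zonos from source
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
pip install -e .

# Optional extras for the Hybrid model (Ampere or newer GPUs only)
pip install -e ".[compile]"
```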

Quick Start
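
A minimal cloning script following the upstream repo's Python API (Zonos.from_pretrained, make_cond_dict, prepare_conditioning, generate); treat it as a sketch and check the current README if signatures have changed:

```python
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the Transformer variant (weights are fetched from HuggingFace).
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# 2-30 seconds of clean reference speech -> speaker embedding.
wav, sr = torchaudio.load("reference.wav")
speaker = model.make_speaker_embedding(wav, sr)

# Condition on text + speaker, generate codes, decode to 44 kHz audio.
cond_dict = make_cond_dict(text="Hello from a rented GPU!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)

wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("output.wav", wavs[0], model.autoencoder.sampling_rate)
```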

Usage Examples

Emotion Control

Zonos accepts an 8-dimensional emotion vector: [happiness, sadness, disgust, fear, surprise, anger, other, neutral].
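
A hedged sketch of passing that vector through make_cond_dict (the emotion keyword follows the upstream conditioning helper; the weights here are illustrative, not calibrated):

```python
from zonos.conditioning import make_cond_dict

# [happiness, sadness, disgust, fear, surprise, anger, other, neutral]
mostly_happy = [0.75, 0.0, 0.0, 0.0, 0.15, 0.0, 0.0, 0.10]

cond_dict = make_cond_dict(
    text="We actually pulled it off!",
    speaker=speaker,         # embedding from make_speaker_embedding() above
    language="en-us",
    emotion=mostly_happy,
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
```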

Speaking Rate and Pitch Control
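
Both controls are plain keyword arguments to make_cond_dict in the upstream API; the defaults noted in the comments are assumptions worth verifying against the repo:

```python
cond_dict = make_cond_dict(
    text="This line is read slowly, with a wide pitch range.",
    speaker=speaker,
    language="en-us",
    speaking_rate=10.0,  # roughly phonemes per second; upstream default is ~15
    pitch_std=45.0,      # pitch standard deviation; higher = more expressive, default ~20
)
```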

Gradio Web Interface

Expose port 7860/http in your Clore.ai order and open the http_pub URL to access the UI.
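
The upstream repo ships a gradio_interface.py launcher; binding it to all interfaces via Gradio's standard environment variables lets Clore's port mapping reach it:

```bash
cd Zonos
# Listen on all interfaces so the http_pub URL can reach the UI
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python gradio_interface.py
```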

Tips for Clore.ai Users

  • Model choice — Transformer for best quality, Hybrid for ~2× faster inference (requires RTX 3000+ GPU)

  • Reference audio — 10–30 seconds of clean speech gives best results; shorter clips (2–5s) work but with lower fidelity

  • Docker setup — use pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime, add apt-get install -y espeak-ng to startup

  • Port mapping — expose 7860/http for Gradio UI, 8000/http for API server

  • Seed control — set torch.manual_seed() before generation for reproducible output (see the sketch after this list)

  • Audio quality parameter — experiment with the audio_quality conditioning field for cleaner output
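
For the seed-control tip above, a minimal sketch: fixing the sampling seed immediately before generation makes the otherwise stochastic output repeatable.

```python
import torch

torch.manual_seed(421)             # same seed + same conditioning = same audio
codes = model.generate(conditioning)
```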

Troubleshooting

| Problem | Solution |
| --- | --- |
| espeak-ng not found | Run `apt-get install -y espeak-ng` (required for phonemization) |
| CUDA out of memory | Use the Transformer model (smaller than Hybrid); reduce text length per call |
| Hybrid model fails | Requires an Ampere+ GPU (RTX 3000 series or newer) and `pip install -e ".[compile]"` |
| Cloned voice sounds off | Use a longer reference clip (15–30 s) with clear speech and minimal background noise |
| Slow generation | Normal for the Transformer (~0.5× real-time); Hybrid achieves ~2× real-time on an RTX 4090 |
| `ModuleNotFoundError: zonos` | Ensure you installed from source: `cd Zonos && pip install -e .` |
