
Zonos TTS Voice Cloning

Run Zonos TTS by Zyphra for voice cloning with emotion and pitch control on Clore.ai GPUs.

Zonos by Zyphra is a 1.6B-parameter open-weight text-to-speech model trained on more than 200K hours of multilingual speech. It performs zero-shot voice cloning from 2–30 seconds of reference audio and offers fine-grained control over emotion, speaking rate, pitch variation, and audio quality. Output is high-fidelity 44 kHz audio. Two model variants are available: Transformer (best quality) and Hybrid/Mamba (faster inference).

GitHub: Zyphra/Zonos | HuggingFace: Zyphra/Zonos-v0.1-transformer | License: Apache 2.0

Key Features

  • Voice cloning from 2–30 seconds — no fine-tuning required

  • 44 kHz high-fidelity output — studio-grade audio quality

  • Emotion control — happiness, sadness, anger, fear, surprise, disgust via 8D vector

  • Speaking rate & pitch — independent fine-grained control

  • Audio prefix inputs — enables whispering and other hard-to-clone behaviors

  • Multilingual — English, Japanese, Chinese, French, German

  • Two architectures — Transformer (quality) and Hybrid/Mamba (speed, ~2× real-time on RTX 4090)

  • Apache 2.0 — free for personal and commercial use

Requirements

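| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | RTX 3080 10 GB | RTX 4090 24 GB |
| VRAM | 6 GB (Transformer) | 10 GB+ |
| RAM | 16 GB | 32 GB |
| Disk | 10 GB | 20 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 11.8+ | 12.4 |
| System | espeak-ng | |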

Clore.ai recommendation: RTX 3090 ($0.30–1.00/day) for comfortable headroom. RTX 4090 ($0.50–2.00/day) for the Hybrid model and fastest inference.

Installation
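
Before installing, it can help to confirm the prerequisites listed under Requirements. The following is a minimal sketch using only the Python standard library. The function name `preflight` is hypothetical, and the defaults simply restate the table above (Python 3.10+ and `espeak-ng` available on `PATH`).

```python
import shutil
import sys


def preflight(min_python=(3, 10), required_tools=("espeak-ng",)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["All checks passed"]:
        print(line)
```

If `espeak-ng` is reported missing, `apt-get install -y espeak-ng` (see Troubleshooting) is the fix on Debian-based images.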

Quick Start
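
A minimal cloning sketch, assuming Zonos has been installed from source and a reference clip is available locally. The Zonos calls follow the usage shown in the upstream README for v0.1 and should be checked against the installed version. The helper names `resolve_model_id` and `clone_voice`, and the file paths, are hypothetical.

```python
MODEL_IDS = {
    "transformer": "Zyphra/Zonos-v0.1-transformer",
    "hybrid": "Zyphra/Zonos-v0.1-hybrid",
}


def resolve_model_id(variant="transformer"):
    """Map a variant name to its Hugging Face repository ID."""
    try:
        return MODEL_IDS[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; choose from {sorted(MODEL_IDS)}"
        ) from None


def clone_voice(text, reference_path, out_path="sample.wav",
                variant="transformer", language="en-us"):
    """Synthesize `text` in the voice of the speaker in `reference_path`."""
    # Heavy imports live inside the function so this module loads
    # even where torch and zonos are not installed.
    import torchaudio
    from zonos.conditioning import make_cond_dict
    from zonos.model import Zonos
    from zonos.utils import DEFAULT_DEVICE as device

    model = Zonos.from_pretrained(resolve_model_id(variant), device=device)

    # Build a speaker embedding from the reference clip.
    wav, sampling_rate = torchaudio.load(reference_path)
    speaker = model.make_speaker_embedding(wav, sampling_rate)

    cond_dict = make_cond_dict(text=text, speaker=speaker, language=language)
    conditioning = model.prepare_conditioning(cond_dict)

    # Generate audio codes, then decode them to a waveform and save it.
    codes = model.generate(conditioning)
    wavs = model.autoencoder.decode(codes).cpu()
    torchaudio.save(out_path, wavs[0], model.autoencoder.sampling_rate)
    return out_path
```

A call such as `clone_voice("Hello from Clore.ai!", "reference.wav")` would then write `sample.wav` in the working directory; the first run downloads the model weights.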

Usage Examples

Emotion Control

Zonos accepts an 8-dimensional emotion vector: [happiness, sadness, disgust, fear, surprise, anger, other, neutral].
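
Because the vector is positional, it is easy to put a weight in the wrong slot. The sketch below builds it from named weights in the order given above. The helper `emotion_vector` is hypothetical, and scaling the weights to sum to 1 is an assumption of this sketch rather than a documented requirement.

```python
EMOTIONS = ("happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral")


def emotion_vector(**weights):
    """Build the 8-slot emotion list from named, non-negative weights."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    raw = [float(weights.get(name, 0.0)) for name in EMOTIONS]
    if any(w < 0 for w in raw):
        raise ValueError("weights must be non-negative")
    total = sum(raw)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    # Assumption: scale so the weights sum to 1.
    return [w / total for w in raw]


mostly_happy = emotion_vector(happiness=3, neutral=1)
print(mostly_happy)  # [0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
```

The resulting list would typically be passed as the `emotion` argument when building the conditioning, for example `make_cond_dict(..., emotion=mostly_happy)`; verify that keyword against the installed Zonos version.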

Speaking Rate and Pitch Control
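
Speaking rate and pitch variation are set independently. The sketch below prepares both values as keyword arguments and clamps them to a range. The parameter names `speaking_rate` and `pitch_std` follow the upstream `make_cond_dict` signature as understood at v0.1, while the defaults and ranges used here are assumptions to verify against the installed version. The helper `prosody_kwargs` is hypothetical.

```python
def prosody_kwargs(speaking_rate=15.0, pitch_std=20.0,
                   rate_range=(0.0, 40.0), pitch_range=(0.0, 400.0)):
    """Return clamped prosody settings as a keyword-argument dict."""

    def clamp(value, low, high):
        return max(low, min(high, float(value)))

    return {
        # Assumed meaning: higher values produce faster speech.
        "speaking_rate": clamp(speaking_rate, *rate_range),
        # Assumed meaning: higher values produce more pitch variation.
        "pitch_std": clamp(pitch_std, *pitch_range),
    }


calm = prosody_kwargs(speaking_rate=12, pitch_std=20)
lively = prosody_kwargs(speaking_rate=22, pitch_std=90)
print(calm)
print(lively)
```

Either dict can then be unpacked into the conditioning call, for example `make_cond_dict(text=..., speaker=..., language="en-us", **lively)`.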

Gradio Web Interface

Expose port 7860/http in your Clore.ai order and open the http_pub URL to access the UI.
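
For the UI to be reachable through the forwarded port, the server has to listen on all interfaces rather than on localhost only. The following is a minimal sketch of a custom Gradio wrapper, assuming `gradio` is installed. The `synthesize` callable is hypothetical: any function that takes text and a reference-audio path and returns the path of the generated audio file.

```python
# Bind to all interfaces on the port exposed in the Clore.ai order.
LAUNCH_KWARGS = {"server_name": "0.0.0.0", "server_port": 7860}


def build_demo(synthesize):
    """Wrap synthesize(text, reference_path) -> output_path in a Gradio UI."""
    import gradio as gr  # imported lazily; requires `pip install gradio`

    return gr.Interface(
        fn=synthesize,
        inputs=[
            gr.Textbox(label="Text"),
            gr.Audio(type="filepath", label="Reference audio"),
        ],
        outputs=gr.Audio(type="filepath", label="Cloned speech"),
        title="Zonos voice cloning",
    )


# Usage (not executed here because it starts a blocking web server):
#   build_demo(my_synthesize).launch(**LAUNCH_KWARGS)
```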

Tips for Clore.ai Users

  • Model choice — Transformer for best quality, Hybrid for ~2× faster inference (requires RTX 3000+ GPU)

  • Reference audio — 10–30 seconds of clean speech gives best results; shorter clips (2–5s) work but with lower fidelity

  • Docker setup — use pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime, add apt-get install -y espeak-ng to startup

  • Port mapping — expose 7860/http for Gradio UI, 8000/http for API server

  • Seed control — set torch.manual_seed() before generation for reproducible output

  • Audio quality parameter — experiment with the audio_quality conditioning field for cleaner output
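
The reference-audio tip above can be checked before spending GPU time. The sketch below uses the standard-library `wave` module, so it only handles uncompressed WAV files. The helper names are hypothetical, and treating everything between 2 and 10 seconds as "usable, lower fidelity" is an assumption that extends the 2–5 s guidance.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def rate_reference_clip(seconds, minimum=2.0, good=10.0, maximum=30.0):
    """Classify a reference clip length against the guidance above."""
    if seconds < minimum:
        return "too short"
    if seconds < good:
        return "usable, lower fidelity"
    if seconds <= maximum:
        return "good"
    return "longer than needed; consider trimming"
```

For example, `rate_reference_clip(wav_duration_seconds("reference.wav"))` returns `"good"` for a 15-second clip.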

Troubleshooting

| Problem | Solution |
|---------|----------|
| espeak-ng not found | Run apt-get install -y espeak-ng (required for phonemization) |
| CUDA out of memory | Use the Transformer model (smaller than Hybrid); reduce text length per call |
| Hybrid model fails | Requires Ampere+ GPU (RTX 3000 series or newer) and pip install -e ".[compile]" |
| Cloned voice sounds off | Use a longer reference clip (15–30s) with clear speech and minimal background noise |
| Slow generation | Normal for Transformer (~0.5× real-time); Hybrid achieves ~2× real-time on RTX 4090 |
| ModuleNotFoundError: zonos | Ensure you installed from source: cd Zonos && pip install -e . |
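
One of the out-of-memory remedies above is to reduce text length per call. The sketch below shows one way to do that by splitting long input at sentence boundaries and synthesizing each chunk separately. The helper `split_text` is hypothetical, and the 200-character default is an arbitrary starting point, not a documented limit.

```python
import re


def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single over-long sentence is hard-wrapped on word boundaries.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each chunk can be generated in its own call and the resulting waveforms concatenated afterwards.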
