Kokoro TTS

Run Kokoro TTS, an ultra-lightweight 82M-parameter text-to-speech model, on Clore.ai GPUs.

Kokoro is an 82M-parameter text-to-speech model that punches far above its weight class. Despite its tiny size (under 2 GB VRAM), it produces remarkably natural English speech and runs at real-time or faster speeds even on budget hardware. With Apache 2.0 licensing, multiple built-in voice styles, and CPU inference support, Kokoro is ideal for real-time applications, chatbots, and edge deployments.

HuggingFace: hexgrad/Kokoro-82M · PyPI: kokoro · License: Apache 2.0

Key Features

  • 82M parameters — one of the smallest high-quality TTS models available

  • < 2 GB VRAM — runs on virtually any GPU, and even on CPU

  • Multiple voice styles — American English, British English; male and female voices

  • Real-time or faster — low-latency inference suitable for streaming

  • Streaming generation — yields audio chunks as they are produced

  • Multi-language support — English (primary), Japanese (misaki[ja]), Chinese (misaki[zh]); see the Japanese example after this list

  • Apache 2.0 — free for personal and commercial use
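
As a concrete illustration of the multi-language bullet above: a minimal sketch, assuming the misaki[ja] extra is installed and that lang_code='j' selects Japanese (jf_alpha is an assumed Japanese voice ID, which may vary by release):

```python
from kokoro import KPipeline
import soundfile as sf

# 'j' selects Japanese; requires: pip install "misaki[ja]"
pipeline = KPipeline(lang_code='j')

# Each yielded chunk is (graphemes, phonemes, audio); write each as a WAV file
for i, (_, _, audio) in enumerate(pipeline("こんにちは、世界。", voice='jf_alpha')):
    sf.write(f'ja_{i}.wav', audio, 24000)
```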

Requirements

| Component | Minimum             | Recommended |
|-----------|---------------------|-------------|
| GPU       | Any with 2 GB VRAM  | RTX 3060    |
| VRAM      | 2 GB                | 4 GB        |
| RAM       | 4 GB                | 8 GB        |
| Disk      | 500 MB              | 1 GB        |
| Python    | 3.9+                | 3.11        |
| System    | espeak-ng installed |             |

Clore.ai recommendation: An RTX 3060 (~$0.15–0.30/day) is more than enough. Kokoro can even run on CPU-only instances for extremely cost-effective TTS.

Installation
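
A minimal install sketch assembled from the dependencies this guide already lists (espeak-ng, kokoro>=0.9.4, soundfile, and the optional misaki language extras); adjust for your base image:

```bash
# System dependency required by Kokoro's phonemizer
apt-get update && apt-get install -y espeak-ng

# Kokoro itself, plus soundfile for writing WAV output
pip install "kokoro>=0.9.4" soundfile

# Optional extras for non-English synthesis
pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Chinese
```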

Quick Start
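
A minimal sketch using the kokoro package's KPipeline interface; the voice ID (af_heart) and the 24 kHz output rate follow the published model card, so treat them as assumptions if your installed version differs:

```python
from kokoro import KPipeline
import soundfile as sf

# lang_code 'a' selects American English ('b' = British English)
pipeline = KPipeline(lang_code='a')

text = "Kokoro is an open-weight text-to-speech model with 82 million parameters."

# The pipeline is a generator: it yields (graphemes, phonemes, audio) per chunk
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'output_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```

The first call downloads the model weights (~200 MB, see Troubleshooting); after that, synthesis starts immediately.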

Usage Examples

Multiple Voices Comparison

Generate the same text with different voices to compare:
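
One possible sketch; the voice IDs follow the model card's naming scheme (af_/am_ for American female/male, bf_/bm_ for British), and availability may vary by release:

```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipeline = KPipeline(lang_code='a')
text = "The same sentence, rendered by four different narrators."

# American English voices from the published voice list (assumed available)
for voice in ['af_heart', 'af_bella', 'am_adam', 'am_michael']:
    # Concatenate the streamed chunks into one clip per voice
    audio = np.concatenate([np.asarray(a) for _, _, a in pipeline(text, voice=voice)])
    sf.write(f'compare_{voice}.wav', audio, 24000)
```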

British English with Speed Control
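
A sketch assuming lang_code='b' selects British English and that the pipeline call accepts a speed multiplier, as in the model card example (bf_emma is an assumed British voice ID):

```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

# 'b' selects British English; pair it with a British voice
pipeline = KPipeline(lang_code='b')
text = "Mind the gap between the train and the platform."

# speed > 1.0 speaks faster, < 1.0 slower
for label, speed in [('slow', 0.8), ('normal', 1.0), ('fast', 1.3)]:
    audio = np.concatenate(
        [np.asarray(a) for _, _, a in pipeline(text, voice='bf_emma', speed=speed)]
    )
    sf.write(f'british_{label}.wav', audio, 24000)
```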

Batch File Processing

Process multiple texts and concatenate into a single audiobook-style file:
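
A sketch of one way to do this, inserting half a second of silence between segments (the file names and pause length are arbitrary choices, not part of Kokoro):

```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipeline = KPipeline(lang_code='a')

texts = [
    "Chapter one. The model loads in seconds.",
    "Chapter two. Each text becomes one audio segment.",
    "Chapter three. Segments are joined with a short pause.",
]

pause = np.zeros(int(0.5 * 24000), dtype=np.float32)  # 0.5 s of silence at 24 kHz

segments = []
for text in texts:
    # Synthesize one text, concatenating its streamed chunks
    audio = np.concatenate([np.asarray(a) for _, _, a in pipeline(text, voice='af_heart')])
    segments.extend([audio, pause])

sf.write('audiobook.wav', np.concatenate(segments), 24000)
```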

Tips for Clore.ai Users

  • CPU inference — Kokoro is small enough to run on CPU; useful for cost-sensitive workloads or when GPUs are unavailable

  • Streaming — the generator yields audio chunks as they are produced, enabling real-time playback in web apps (see the sketch after this list)

  • Combine with WhisperX — use WhisperX for transcription and Kokoro for re-synthesis in voice pipelines

  • Docker — use pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime and add apt-get install -y espeak-ng to your startup script

  • Voice consistency — stick to one voice ID per project for a consistent narrator experience

  • Cost efficiency — at $0.15/day on an RTX 3060, Kokoro is one of the cheapest TTS solutions to self-host
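
To illustrate the streaming tip above: a sketch that plays each chunk as it arrives, using the third-party sounddevice package for local playback (an assumption; it is not part of kokoro):

```python
from kokoro import KPipeline
import numpy as np
import sounddevice as sd  # assumed extra dependency: pip install sounddevice

pipeline = KPipeline(lang_code='a')
text = "Each sentence can start playing before the next one is synthesized."

# Play each chunk as soon as the generator yields it
for _, _, audio in pipeline(text, voice='af_heart'):
    sd.play(np.asarray(audio), samplerate=24000)
    sd.wait()  # block until this chunk finishes before starting the next
```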

Troubleshooting

| Problem | Solution |
|---------|----------|
| espeak-ng not found | Run `apt-get install -y espeak-ng` (required system dependency) |
| ModuleNotFoundError: kokoro | Install with `pip install "kokoro>=0.9.4" soundfile` |
| Audio sounds robotic | Try a different voice (e.g., af_heart tends to sound most natural) |
| Japanese/Chinese not working | Install language extras: `pip install "misaki[ja]"` or `pip install "misaki[zh]"` |
| Out of memory on CPU | Reduce text length per call; Kokoro streams chunks, so memory stays bounded |
| Slow first run | Model weights download on first use (~200 MB); subsequent runs load from the local cache |
