Kokoro TTS

Run Kokoro TTS, an ultra-lightweight 82M-parameter text-to-speech model, on Clore.ai GPUs.

Kokoro is an 82M-parameter text-to-speech model that punches far above its weight class. Despite its tiny size (under 2 GB VRAM), it produces remarkably natural English speech and runs at real-time or faster speeds even on budget hardware. With Apache 2.0 licensing, multiple built-in voice styles, and CPU inference support, Kokoro is ideal for real-time applications, chatbots, and edge deployments.

HuggingFace: hexgrad/Kokoro-82M · PyPI: kokoro · License: Apache 2.0

Key Features

  • 82M parameters — one of the smallest high-quality TTS models available

  • < 2 GB VRAM — runs on virtually any GPU, and even on CPU

  • Multiple voice styles — American English, British English; male and female voices

  • Real-time or faster — low-latency inference suitable for streaming

  • Streaming generation — yields audio chunks as they are produced

  • Multi-language support — English (primary), Japanese (misaki[ja]), Chinese (misaki[zh]); see the Japanese example after this list

  • Apache 2.0 — free for personal and commercial use
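
As a concrete illustration of the multi-language bullet above: a minimal sketch, assuming the misaki[ja] extra is installed and that lang_code='j' selects Japanese (jf_alpha is an assumed Japanese voice ID, which may vary by release):

```python
from kokoro import KPipeline
import soundfile as sf

# 'j' selects Japanese; requires: pip install "misaki[ja]"
pipeline = KPipeline(lang_code='j')

# Each yielded chunk is (graphemes, phonemes, audio); write each as a WAV file
for i, (_, _, audio) in enumerate(pipeline("こんにちは、世界。", voice='jf_alpha')):
    sf.write(f'ja_{i}.wav', audio, 24000)
```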

Requirements

| Component | Minimum             | Recommended |
|-----------|---------------------|-------------|
| GPU       | Any with 2 GB VRAM  | RTX 3060    |
| VRAM      | 2 GB                | 4 GB        |
| RAM       | 4 GB                | 8 GB        |
| Disk      | 500 MB              | 1 GB        |
| Python    | 3.9+                | 3.11        |
| System    | espeak-ng installed |             |

Clore.ai recommendation: An RTX 3060 (~$0.15–0.30/day) is more than enough. Kokoro can even run on CPU-only instances for extremely cost-effective TTS.

Installation
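
A minimal install sketch assembled from the dependencies this guide already lists (espeak-ng, kokoro>=0.9.4, soundfile, and the optional misaki language extras); adjust for your base image:

```bash
# System dependency required by Kokoro's phonemizer
apt-get update && apt-get install -y espeak-ng

# Kokoro itself, plus soundfile for writing WAV output
pip install "kokoro>=0.9.4" soundfile

# Optional extras for non-English synthesis
pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Chinese
```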

Quick Start
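
A minimal sketch using the kokoro package's KPipeline interface; the voice ID (af_heart) and the 24 kHz output rate follow the published model card, so treat them as assumptions if your installed version differs:

```python
from kokoro import KPipeline
import soundfile as sf

# lang_code 'a' selects American English ('b' = British English)
pipeline = KPipeline(lang_code='a')

text = "Kokoro is an open-weight text-to-speech model with 82 million parameters."

# The pipeline is a generator: it yields (graphemes, phonemes, audio) per chunk
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'output_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```

The first call downloads the model weights (~200 MB, see Troubleshooting); after that, synthesis starts immediately.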

Usage Examples

Multiple Voices Comparison

Generate the same text with different voices to compare:
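
One possible sketch; the voice IDs follow the model card's naming scheme (af_/am_ for American female/male, bf_/bm_ for British), and availability may vary by release:

```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipeline = KPipeline(lang_code='a')
text = "The same sentence, rendered by four different narrators."

# American English voices from the published voice list (assumed available)
for voice in ['af_heart', 'af_bella', 'am_adam', 'am_michael']:
    # Concatenate the streamed chunks into one clip per voice
    audio = np.concatenate([np.asarray(a) for _, _, a in pipeline(text, voice=voice)])
    sf.write(f'compare_{voice}.wav', audio, 24000)
```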

British English with Speed Control
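
A sketch assuming lang_code='b' selects British English and that the pipeline call accepts a speed multiplier, as in the model card example (bf_emma is an assumed British voice ID):

```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

# 'b' selects British English; pair it with a British voice
pipeline = KPipeline(lang_code='b')
text = "Mind the gap between the train and the platform."

# speed > 1.0 speaks faster, < 1.0 slower
for label, speed in [('slow', 0.8), ('normal', 1.0), ('fast', 1.3)]:
    audio = np.concatenate(
        [np.asarray(a) for _, _, a in pipeline(text, voice='bf_emma', speed=speed)]
    )
    sf.write(f'british_{label}.wav', audio, 24000)
```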

Batch File Processing

Process multiple texts and concatenate into a single audiobook-style file:
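
A sketch of one way to do this, inserting half a second of silence between segments (the file names and pause length are arbitrary choices, not part of Kokoro):

```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipeline = KPipeline(lang_code='a')

texts = [
    "Chapter one. The model loads in seconds.",
    "Chapter two. Each text becomes one audio segment.",
    "Chapter three. Segments are joined with a short pause.",
]

pause = np.zeros(int(0.5 * 24000), dtype=np.float32)  # 0.5 s of silence at 24 kHz

segments = []
for text in texts:
    # Synthesize one text, concatenating its streamed chunks
    audio = np.concatenate([np.asarray(a) for _, _, a in pipeline(text, voice='af_heart')])
    segments.extend([audio, pause])

sf.write('audiobook.wav', np.concatenate(segments), 24000)
```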

Tips for Clore.ai Users

  • CPU inference — Kokoro is small enough to run on CPU; useful for cost-sensitive workloads or when GPUs are unavailable

  • Streaming — the generator yields audio chunks as they are produced, enabling real-time playback in web apps (see the sketch after this list)

  • Combine with WhisperX — use WhisperX for transcription and Kokoro for re-synthesis in voice pipelines

  • Docker — use pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime and add apt-get install -y espeak-ng to your startup script

  • Voice consistency — stick to one voice ID per project for a consistent narrator experience

  • Cost efficiency — at $0.15/day on an RTX 3060, Kokoro is one of the cheapest TTS solutions to self-host
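
To illustrate the streaming tip above: a sketch that plays each chunk as it arrives, using the third-party sounddevice package for local playback (an assumption; it is not part of kokoro):

```python
from kokoro import KPipeline
import numpy as np
import sounddevice as sd  # assumed extra dependency: pip install sounddevice

pipeline = KPipeline(lang_code='a')
text = "Each sentence can start playing before the next one is synthesized."

# Play each chunk as soon as the generator yields it
for _, _, audio in pipeline(text, voice='af_heart'):
    sd.play(np.asarray(audio), samplerate=24000)
    sd.wait()  # block until this chunk finishes before starting the next
```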

Troubleshooting

| Problem | Solution |
|---------|----------|
| espeak-ng not found | Run `apt-get install -y espeak-ng` (required system dependency) |
| ModuleNotFoundError: kokoro | Install with `pip install "kokoro>=0.9.4" soundfile` |
| Audio sounds robotic | Try a different voice (e.g., af_heart tends to sound most natural) |
| Japanese/Chinese not working | Install language extras: `pip install "misaki[ja]"` or `pip install "misaki[zh]"` |
| Out of memory on CPU | Reduce text length per call; Kokoro streams chunks, so memory stays bounded |
| Slow first run | Model weights download on first use (~200 MB); subsequent runs load from the local cache |
