# Kokoro TTS

Kokoro is an 82M-parameter text-to-speech model that punches far above its weight class. Despite its tiny size (under 2 GB VRAM), it produces remarkably natural English speech and runs at real-time or faster speeds even on budget hardware. With Apache 2.0 licensing, multiple built-in voice styles, and CPU inference support, Kokoro is ideal for real-time applications, chatbots, and edge deployments.

**HuggingFace:** [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) **PyPI:** [kokoro](https://pypi.org/project/kokoro/) **License:** Apache 2.0

## Key Features

* **82M parameters** — one of the smallest high-quality TTS models available
* **< 2 GB VRAM** — runs on virtually any GPU, and even on CPU
* **Multiple voice styles** — American English, British English; male and female voices
* **Real-time or faster** — low-latency inference suitable for streaming
* **Streaming generation** — yields audio chunks as they are produced
* **Multi-language support** — English (primary), Japanese (`misaki[ja]`), Chinese (`misaki[zh]`)
* **Apache 2.0** — free for personal and commercial use
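
The language variants above map to the `lang_code` argument of `KPipeline`. As a quick reference sketch (`'a'` and `'b'` are used throughout this guide; `'j'` and `'z'` follow the `misaki` extras named above but are assumptions worth double-checking against the model card):

```python
# Hedged lang_code reference for KPipeline(lang_code=...).
LANG_CODES = {
    'a': 'American English',
    'b': 'British English',
    'j': 'Japanese (needs misaki[ja])',
    'z': 'Mandarin Chinese (needs misaki[zh])',
}

for code, name in LANG_CODES.items():
    print(f"{code}: {name}")
```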

## Requirements

| Component | Minimum             | Recommended |
| --------- | ------------------- | ----------- |
| GPU       | Any with 2 GB VRAM  | RTX 3060    |
| VRAM      | 2 GB                | 4 GB        |
| RAM       | 4 GB                | 8 GB        |
| Disk      | 500 MB              | 1 GB        |
| Python    | 3.9+                | 3.11        |
| System    | espeak-ng installed | —           |

**Clore.ai recommendation:** An RTX 3060 (\~$0.15–0.30/day) is more than enough. Kokoro can even run on CPU-only instances for extremely cost-effective TTS.

## Installation

```bash
# Install system dependency
apt-get install -y espeak-ng

# Install Kokoro and audio I/O (quote the spec so the shell
# doesn't treat ">=" as a redirect)
pip install "kokoro>=0.9.4" soundfile torch

# For Japanese support (optional; quotes keep zsh from globbing the brackets)
pip install "misaki[ja]"

# For Chinese support (optional)
pip install "misaki[zh]"

# Verify
python -c "from kokoro import KPipeline; print('Kokoro ready')"
```

## Quick Start

```python
from kokoro import KPipeline
import soundfile as sf

# Initialize pipeline
# 'a' = American English, 'b' = British English
pipeline = KPipeline(lang_code='a')

text = """
Kokoro is a lightweight text-to-speech model with only eighty-two million
parameters. Despite its small size, it produces natural and expressive speech.
"""

# Generate audio — voice options: af_heart, af_bella, af_nicole, af_sarah, af_sky,
#                                  am_adam, am_michael, bf_emma, bf_isabella, bm_george, bm_lewis
generator = pipeline(text, voice='af_heart', speed=1.0)

for i, (graphemes, phonemes, audio) in enumerate(generator):
    sf.write(f'output_{i}.wav', audio, 24000)
    print(f"Chunk {i}: {graphemes[:50]}...")

print("Done!")
```

## Usage Examples

### Multiple Voices Comparison

Generate the same text with different voices to compare:

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')

text = "Welcome to Clore.ai, the peer-to-peer GPU marketplace."

voices = ['af_heart', 'af_bella', 'am_adam', 'am_michael']

for voice in voices:
    generator = pipeline(text, voice=voice, speed=1.0)
    for i, (gs, ps, audio) in enumerate(generator):
        sf.write(f'{voice}_{i}.wav', audio, 24000)
    print(f"Generated: {voice}")
```

### British English with Speed Control

```python
from kokoro import KPipeline
import soundfile as sf
import numpy as np

# 'b' = British English
pipeline = KPipeline(lang_code='b')

text = "Good afternoon. This is a demonstration of British English synthesis."

# speed < 1.0 = slower, speed > 1.0 = faster
generator = pipeline(text, voice='bf_emma', speed=0.85)

all_audio = [audio for gs, ps, audio in generator]

combined = np.concatenate(all_audio)
sf.write('british_slow.wav', combined, 24000)
print(f"Total duration: {len(combined)/24000:.1f}s")
```

### Batch File Processing

Process multiple texts and concatenate into a single audiobook-style file:

```python
from kokoro import KPipeline
import soundfile as sf
import numpy as np

pipeline = KPipeline(lang_code='a')

chapters = [
    "Chapter one. The beginning of our journey starts here.",
    "The sun rose over the mountains, casting long shadows across the valley.",
    "She opened the door and stepped into the unknown.",
]

all_audio = []
silence = np.zeros(int(24000 * 0.5))  # 0.5s silence between chapters

for idx, text in enumerate(chapters):
    for gs, ps, audio in pipeline(text, voice='af_bella', speed=1.0):
        all_audio.append(audio)
    all_audio.append(silence)
    print(f"Chapter {idx+1} done")

combined = np.concatenate(all_audio)
sf.write('audiobook.wav', combined, 24000)
print(f"Total: {len(combined)/24000:.1f}s")
```
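
Chapters synthesized separately can differ slightly in loudness. A minimal peak-normalization sketch (plain NumPy, not part of the Kokoro API) that could be applied to each chunk before concatenation:

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale audio so its loudest sample sits at `peak`, just under full scale."""
    max_abs = np.max(np.abs(audio))
    if max_abs == 0:
        return audio  # silence: nothing to scale
    return audio * (peak / max_abs)

quiet = np.array([0.1, -0.2, 0.15], dtype=np.float32)
loud = peak_normalize(quiet)
print(np.abs(loud).max())  # ≈ 0.95
```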

## Tips for Clore.ai Users

* **CPU inference** — Kokoro is small enough to run on CPU; useful for cost-sensitive workloads or when GPUs are unavailable
* **Streaming** — the generator yields audio chunks as they are produced, enabling real-time playback in web apps
* **Combine with WhisperX** — use WhisperX for transcription and Kokoro for re-synthesis in voice pipelines
* **Docker** — use `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` and add `apt-get install -y espeak-ng` to your startup
* **Voice consistency** — stick to one voice ID per project for a consistent narrator experience
* **Cost efficiency** — at $0.15/day on an RTX 3060, Kokoro is one of the cheapest TTS solutions to self-host
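
The streaming tip above can be sketched as incremental WAV writing with the standard library: the file grows as each chunk arrives instead of buffering the whole clip. The `fake_chunks` generator below stands in for the `pipeline(...)` generator from the earlier examples, so the sketch stays self-contained:

```python
import wave
import numpy as np

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

def fake_chunks():
    # Stand-in for: (audio for gs, ps, audio in pipeline(text, voice='af_heart'))
    for _ in range(3):
        yield np.zeros(SAMPLE_RATE // 2, dtype=np.float32)  # 0.5 s of silence

with wave.open('streamed.wav', 'wb') as wav:
    wav.setnchannels(1)            # mono
    wav.setsampwidth(2)            # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    for audio in fake_chunks():
        # float32 in [-1, 1] -> int16 PCM, appended as soon as it arrives
        pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
        wav.writeframes(pcm.tobytes())

print("wrote streamed.wav")
```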

## Troubleshooting

| Problem                       | Solution                                                                    |
| ----------------------------- | --------------------------------------------------------------------------- |
| `espeak-ng not found`         | Run `apt-get install -y espeak-ng` (required system dependency)             |
| `ModuleNotFoundError: kokoro` | Install with `pip install "kokoro>=0.9.4" soundfile`                        |
| Audio sounds robotic          | Try a different voice (e.g., `af_heart` tends to sound most natural)        |
| Japanese/Chinese not working  | Install language extras: `pip install "misaki[ja]"` or `"misaki[zh]"`       |
| Out of memory on CPU          | Reduce text length per call; Kokoro streams chunks so memory stays bounded  |
| Slow first run                | Model weights download on first use (\~200 MB); later runs load from cache |
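
The "reduce text length per call" advice can be sketched as a simple sentence-packing helper (plain Python; the 400-character limit is an arbitrary illustration, not a Kokoro constraint):

```python
import re

def split_sentences(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f'{current} {s}'.strip() if current else s
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to pipeline(...) separately
print(split_sentences("One. Two. Three.", max_chars=8))
```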


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/audio-and-voice/kokoro-tts.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
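
Building the query URL is plain URL encoding; for example (standard-library sketch, using only the endpoint shown above):

```python
from urllib.parse import urlencode

BASE = "https://docs.clore.ai/guides/audio-and-voice/kokoro-tts.md"
question = "Which voices support British English?"

# urlencode percent-escapes the question so it is safe in a query string
url = f"{BASE}?{urlencode({'ask': question})}"
print(url)
# https://docs.clore.ai/guides/audio-and-voice/kokoro-tts.md?ask=Which+voices+support+British+English%3F
```

The resulting URL can then be fetched with any HTTP client (e.g. `urllib.request.urlopen` or `curl`).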
