Kani-TTS-2 Voice Cloning

Run Kani-TTS-2 — an ultra-efficient 400M parameter text-to-speech model with voice cloning on Clore.ai GPUs

Kani-TTS-2 by nineninesix.ai (released February 15, 2026) is a 400M-parameter open-source text-to-speech model that delivers high-fidelity speech synthesis in only 3GB of VRAM. Built on LiquidAI's LFM2 architecture with NVIDIA NanoCodec, it treats audio as a language, generating natural-sounding speech with zero-shot voice cloning from a short reference audio clip. At less than half the size of comparable models, and at a fraction of their compute cost, Kani-TTS-2 is well suited to real-time conversational AI, audiobook generation, and voice cloning on budget hardware.

HuggingFace: nineninesix/kani-tts-2-en | GitHub: nineninesix-ai/kani-tts-2 | PyPI: kani-tts-2 | License: Apache 2.0

Key Features

  • 400M parameters, 3GB VRAM — runs on virtually any modern GPU, including RTX 3060

  • Zero-shot voice cloning — clone any voice from a 3–30 second reference audio sample

  • Speaker embeddings — WavLM-based 128-dim speaker representations for precise voice control

  • Up to 40 seconds of continuous audio — suitable for longer passages and dialogue

  • Real-time or faster — real-time factor (RTF) of ~0.2 on an RTX 5080 (about 5× faster than real time), and still real-time on budget GPUs

  • Apache 2.0 — fully open for personal and commercial use

  • Pretraining framework included — train your own TTS model from scratch on any language

Comparison with Other TTS Models

| Model | Parameters | Min VRAM | Voice Cloning | Language | License |
|---|---|---|---|---|---|
| Kani-TTS-2 | 400M | 3GB | ✅ Zero-shot | English (extensible) | Apache 2.0 |
| Kokoro | 82M | 2GB | ❌ Preset voices | EN, JP, CN | Apache 2.0 |
| Zonos | 400M | 8GB | ✅ Zero-shot | Multi | Apache 2.0 |
| ChatTTS | 300M | 4GB | ❌ Random seeds | Chinese, English | AGPL 3.0 |
| Chatterbox | 500M | 6GB | ✅ Zero-shot | English | Apache 2.0 |
| XTTS (Coqui) | 467M | 6GB | ✅ Zero-shot | Multi | MPL 2.0 |
| F5-TTS | 335M | 4GB | ✅ Zero-shot | Multi | CC-BY-NC 4.0 |

Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | Any with 3GB VRAM | RTX 3060 or better |
| VRAM | 3GB | 6GB |
| RAM | 8GB | 16GB |
| Disk | 2GB | 5GB |
| Python | 3.9+ | 3.11+ |
| CUDA | 11.8+ | 12.0+ |

Clore.ai recommendation: An RTX 3060 ($0.15–0.30/day) is more than enough. Even the cheapest GPU instances on Clore.ai will run Kani-TTS-2 comfortably. For batch processing (audiobooks, datasets), an RTX 4090 ($0.5–2/day) provides excellent throughput.

Installation
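
Assuming the PyPI package linked above, installation is a standard pip install in a fresh environment. The transformers version pin comes from the Troubleshooting section below:

```bash
# Create an isolated environment and install from PyPI
python -m venv venv && source venv/bin/activate
pip install kani-tts-2
pip install -U "transformers==4.56.0"  # version pin from the Troubleshooting section
```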

Quick Start

Three lines to generate speech:
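
The snippet below is a minimal sketch of that flow. The import path, the KaniTTS class, and the from_pretrained/generate signatures are assumptions modeled on typical HuggingFace-style TTS wrappers; check the kani-tts-2 README for the exact names.

```python
# Minimal sketch -- class and method names are assumptions; see the repo README
from kani_tts import KaniTTS

model = KaniTTS.from_pretrained("nineninesix/kani-tts-2-en")  # downloads the ~400M weights
model.generate("Hello from Kani-TTS-2 on a Clore.ai GPU!", output_path="hello.wav")
```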

Usage Examples

1. Basic Text-to-Speech
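
A simple one-shot generation that writes the output to a WAV file. This sketch assumes the same hypothetical KaniTTS interface as the quick start, and that generate returns an audio array plus a sample rate; adjust to the actual API.

```python
import soundfile as sf
from kani_tts import KaniTTS  # hypothetical import path, as in the quick start

model = KaniTTS.from_pretrained("nineninesix/kani-tts-2-en", device="cuda")

audio, sample_rate = model.generate(
    "Renting a GPU on Clore.ai takes just a few minutes.",
    temperature=0.8,  # the Troubleshooting section suggests 0.8-0.9 for natural output
)
sf.write("output.wav", audio, sample_rate)
```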

2. Voice Cloning

Clone any voice from a short reference audio sample:
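
A hedged sketch: it assumes generate accepts a reference_audio argument for zero-shot cloning; the actual parameter name may differ in the released package.

```python
import soundfile as sf
from kani_tts import KaniTTS  # hypothetical import path, as in the quick start

model = KaniTTS.from_pretrained("nineninesix/kani-tts-2-en", device="cuda")

# Zero-shot cloning from a reference clip (5-15 s of clean, single-speaker audio)
audio, sr = model.generate(
    "This sentence should sound like the reference speaker.",
    reference_audio="reference.wav",  # assumed parameter name
)
sf.write("cloned.wav", audio, sr)
```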

3. Batch Generation for Audiobooks

Generate multiple chapters efficiently:
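
A sketch of a batch loop that also respects the ~40-second generation limit by synthesizing sentence by sentence and concatenating the audio (see Troubleshooting). The model interface is the same assumption as in the earlier examples.

```python
import re
import numpy as np
import soundfile as sf
from kani_tts import KaniTTS  # hypothetical import path, as in the quick start

model = KaniTTS.from_pretrained("nineninesix/kani-tts-2-en", device="cuda")

chapters = {
    "chapter_01": "Chapter one text goes here. Split across sentences as needed.",
    "chapter_02": "Chapter two text goes here. Each chapter becomes one WAV file.",
}

for name, text in chapters.items():
    # Stay under the ~40 s per-generation limit by generating one sentence at a time
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pieces = []
    for sentence in sentences:
        audio, sr = model.generate(sentence, reference_audio="narrator.wav")
        pieces.append(audio)
    sf.write(f"{name}.wav", np.concatenate(pieces), sr)
```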

4. OpenAI-Compatible Streaming API

For real-time applications, use the OpenAI-compatible server:
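
The commands below are an illustrative launch sequence; the repository name comes from the tips section further down, and the flags are assumptions, so check that repo's README for the real ones.

```bash
# Illustrative launch -- check the kani-tts-2-openai-server README for actual flags
git clone https://github.com/nineninesix-ai/kani-tts-2-openai-server
cd kani-tts-2-openai-server
pip install -r requirements.txt
python server.py --host 0.0.0.0 --port 8000
```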

Then use it with any OpenAI TTS client:
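
Because the server mimics OpenAI's /v1/audio/speech endpoint, the official openai Python client can target it by overriding base_url. The model and voice values here are placeholders; use whatever identifiers the server actually registers.

```python
from openai import OpenAI

# Point the client at your Clore.ai instance instead of api.openai.com
client = OpenAI(base_url="http://<instance-ip>:8000/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="kani-tts-2",  # placeholder model id
    voice="default",     # placeholder voice name
    input="Streaming speech from a rented GPU.",
)
response.write_to_file("speech.wav")
```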

Tips for Clore.ai Users

  1. This is the cheapest model to run — At 3GB VRAM, Kani-TTS-2 runs on literally any GPU instance on Clore.ai. An RTX 3060 at $0.15/day is more than sufficient for production TTS.

  2. Combine with a language model — Rent one GPU instance and run both a small LLM (e.g., Mistral 3 8B) and Kani-TTS-2 simultaneously for a complete voice assistant. They'll share the GPU with room to spare.

  3. Pre-compute speaker embeddings — Extract speaker embeddings once and save them; this avoids loading the WavLM embedder model on every request (see the sketch after this list).

  4. Use the OpenAI-compatible server — The kani-tts-2-openai-server provides a drop-in replacement for OpenAI's TTS API, making it easy to integrate with existing applications.

  5. Train on custom languages — Kani-TTS-2 includes a full pretraining framework (kani-tts-2-pretrain). Train the model on your own language dataset; a full pretraining run takes only about 6 hours on 8× H100s.
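
As referenced in tip 3, here is a sketch of pre-computing and caching speaker embeddings. The extract_speaker_embedding method and the speaker_embedding keyword are hypothetical stand-ins for whatever embedder API the package exposes; the 128-dim WavLM detail comes from the feature list above.

```python
import numpy as np
from kani_tts import KaniTTS  # hypothetical import path, as in the quick start

model = KaniTTS.from_pretrained("nineninesix/kani-tts-2-en", device="cuda")

# One-time: extract and cache the 128-dim WavLM-based speaker embedding
# (method and keyword names below are hypothetical stand-ins)
embedding = model.extract_speaker_embedding("reference.wav")
np.save("narrator_embedding.npy", embedding)

# Per request: load the cached embedding and skip the WavLM embedder entirely
embedding = np.load("narrator_embedding.npy")
audio, sr = model.generate("Fast path with a cached voice.", speaker_embedding=embedding)
```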

Troubleshooting

| Issue | Solution |
|---|---|
| ImportError: cannot import LFM2 | Install the correct transformers version: pip install -U "transformers==4.56.0" |
| Audio quality is poor / robotic | Increase temperature to 0.8–0.9; ensure reference audio for cloning is clean (no background noise) |
| Voice cloning doesn't sound like the reference | Use 5–15 seconds of clear, single-speaker audio; avoid music or background noise in the reference |
| CUDA out of memory | Shouldn't happen with a 3GB model; check whether other processes are using GPU memory (nvidia-smi) |
| Audio cuts off mid-sentence | Kani-TTS-2 supports up to ~40 seconds per generation; split longer texts into sentences and concatenate the outputs |
| Slow on CPU | GPU inference is strongly recommended; even a basic GPU is 10–50× faster than CPU |
