Kani-TTS-2 Voice Cloning
Run Kani-TTS-2, an ultra-efficient 400M-parameter text-to-speech model with voice cloning, on Clore.ai GPUs
Kani-TTS-2 by nineninesix.ai (released February 15, 2026) is a 400M-parameter open-source text-to-speech model that achieves high-fidelity speech synthesis using only 3GB of VRAM. Built on LiquidAI's LFM2 architecture with NVIDIA NanoCodec, it treats audio as a language — generating natural-sounding speech with zero-shot voice cloning from a short reference audio clip. At under half the size of competing models and a fraction of the compute, Kani-TTS-2 is perfect for real-time conversational AI, audiobook generation, and voice cloning on budget hardware.
HuggingFace: nineninesix/kani-tts-2-en · GitHub: nineninesix-ai/kani-tts-2 · PyPI: kani-tts-2 · License: Apache 2.0
Key Features
400M parameters, 3GB VRAM — runs on virtually any modern GPU, including RTX 3060
Zero-shot voice cloning — clone any voice from a 3–30 second reference audio sample
Speaker embeddings — WavLM-based 128-dim speaker representations for precise voice control
Up to 40 seconds of continuous audio — suitable for longer passages and dialogue
Real-time or faster — RTF ~0.2 on RTX 5080, real-time even on budget GPUs
Apache 2.0 — fully open for personal and commercial use
Pretraining framework included — train your own TTS model from scratch on any language
Comparison with Other TTS Models
| Model | Parameters | VRAM | Voice Cloning | Languages | License |
|---|---|---|---|---|---|
| Kani-TTS-2 | 400M | 3GB | ✅ Zero-shot | English (extensible) | Apache 2.0 |
| Kokoro | 82M | 2GB | ❌ Preset voices | EN, JP, CN | Apache 2.0 |
| Zonos | 400M | 8GB | ✅ | Multilingual | Apache 2.0 |
| ChatTTS | 300M | 4GB | ❌ Random seeds | Chinese, English | AGPL 3.0 |
| Chatterbox | 500M | 6GB | ✅ | English | Apache 2.0 |
| XTTS (Coqui) | 467M | 6GB | ✅ | Multilingual | MPL 2.0 |
| F5-TTS | 335M | 4GB | ✅ | Multilingual | CC-BY-NC 4.0 |
Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | Any with 3GB VRAM | RTX 3060 or better |
| VRAM | 3GB | 6GB |
| RAM | 8GB | 16GB |
| Disk | 2GB | 5GB |
| Python | 3.9+ | 3.11+ |
| CUDA | 11.8+ | 12.0+ |
Clore.ai recommendation: An RTX 3060 ($0.15–0.30/day) is more than enough. Even the cheapest GPU instances on Clore.ai will run Kani-TTS-2 comfortably. For batch processing (audiobooks, datasets), an RTX 4090 ($0.5–2/day) provides excellent throughput.
Installation
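Package names are taken from the links at the top of this page; a minimal sketch assuming a CUDA build of PyTorch is already on the instance (most Clore.ai images ship one):

```bash
pip install kani-tts-2

# The LFM2 backbone needs a specific transformers release (see Troubleshooting)
pip install "transformers==4.56.0"
```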
Quick Start
Three lines to generate speech:
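The exact API may differ from this sketch: the `KaniTTS` class and `save_audio` helper are assumptions, while the model ID comes from the HuggingFace link above.

```python
from kani_tts import KaniTTS  # assumed package layout; see the GitHub README

model = KaniTTS("nineninesix/kani-tts-2-en")  # first run downloads ~2GB of weights
model.save_audio(model("Hello from Kani-TTS-2 on a Clore.ai GPU!"), "hello.wav")
```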
Usage Examples
1. Basic Text-to-Speech
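A slightly fuller sketch under the same assumed API, with an explicit sampling temperature. The parameter name is an assumption; the 0.8–0.9 range comes from the Troubleshooting section below.

```python
from kani_tts import KaniTTS  # assumed import, as in the Quick Start sketch

model = KaniTTS("nineninesix/kani-tts-2-en")

audio = model(
    "Clore.ai lets you rent a GPU by the day, and Kani-TTS-2 only needs "
    "three gigabytes of its VRAM.",
    temperature=0.8,  # assumed parameter name; 0.8-0.9 helps with robotic output
)
model.save_audio(audio, "basic.wav")
```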
2. Voice Cloning
Clone any voice from a short reference audio sample:
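A sketch under the same assumed API; the `reference_audio` parameter name is an assumption, and the 3–30 second guideline comes from Key Features.

```python
from kani_tts import KaniTTS  # assumed import

model = KaniTTS("nineninesix/kani-tts-2-en")

# reference.wav: 3-30 seconds of clean, single-speaker audio
audio = model(
    "This sentence is spoken in the cloned voice.",
    reference_audio="reference.wav",  # assumed parameter name
)
model.save_audio(audio, "cloned.wav")
```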
3. Batch Generation for Audiobooks
Generate multiple chapters efficiently:
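A sketch assuming each call returns a NumPy waveform at a fixed sample rate (both assumptions). Because the model generates up to ~40 seconds per call (see Key Features), it splits each chapter into sentences, synthesizes them in one process so the model loads only once, and concatenates the clips:

```python
import re
from pathlib import Path

import numpy as np
import soundfile as sf
from kani_tts import KaniTTS  # assumed import

SAMPLE_RATE = 22050  # assumption; use whatever rate the model actually reports

model = KaniTTS("nineninesix/kani-tts-2-en")
out_dir = Path("audio")
out_dir.mkdir(exist_ok=True)

for chapter in sorted(Path("chapters").glob("*.txt")):
    # Go sentence by sentence to stay under the ~40 s per-call limit
    sentences = re.split(r"(?<=[.!?])\s+", chapter.read_text())
    clips = [
        model(s, reference_audio="narrator.wav")  # assumed parameter name
        for s in sentences
        if s.strip()
    ]
    sf.write(out_dir / f"{chapter.stem}.wav", np.concatenate(clips), SAMPLE_RATE)
```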
4. OpenAI-Compatible Streaming API
For real-time applications, use the OpenAI-compatible server:
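The launch command below is a sketch; the entry point and flags are assumptions, so check the kani-tts-2-openai-server repository linked under Further Reading for the real invocation.

```bash
pip install kani-tts-2-openai-server
# Assumed entry point and flags; bind to 0.0.0.0 so Clore.ai port mapping can reach it
python -m kani_tts_2_openai_server --host 0.0.0.0 --port 8000
```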
Then use it with any OpenAI TTS client:
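For example, with the official openai Python SDK pointed at the local server. The model and voice names the server expects are assumptions:

```python
from openai import OpenAI

# Any non-empty API key works if the local server does not check it (assumption)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kani-tts-2-en",  # assumed model id
    voice="default",        # assumed voice id
    input="Streaming speech straight from a Clore.ai GPU instance.",
) as response:
    response.stream_to_file("stream.wav")
```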
Tips for Clore.ai Users
This is the cheapest model to run — At 3GB VRAM, Kani-TTS-2 runs on literally any GPU instance on Clore.ai. An RTX 3060 at $0.15/day is more than sufficient for production TTS.
Combine with a language model — Rent one GPU instance and run both a small LLM (e.g., Mistral 3 8B) and Kani-TTS-2 simultaneously for a complete voice assistant. They'll share the GPU with room to spare.
Pre-compute speaker embeddings — Extract speaker embeddings once and save them to disk. This avoids loading the WavLM embedder model on every request; see the sketch after this list.
Use the OpenAI-compatible server — The kani-tts-2-openai-server provides a drop-in replacement for OpenAI's TTS API, making it easy to integrate with existing applications.
Train on custom languages — Kani-TTS-2 includes a full pretraining framework (kani-tts-2-pretrain). Training the model on your own language dataset takes only about 6 hours on 8× H100s.
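A minimal sketch of that pre-compute pattern. The extraction and synthesis method names are assumptions; the 128-dim WavLM embedding itself is described under Key Features.

```python
import numpy as np
from kani_tts import KaniTTS  # assumed import

model = KaniTTS("nineninesix/kani-tts-2-en")

# Run the WavLM embedder once and cache the 128-dim vector to disk
embedding = model.extract_speaker_embedding("reference.wav")  # assumed method
np.save("narrator.npy", embedding)

# Later requests reuse the cached vector instead of reloading the embedder
audio = model(
    "Served with a cached speaker embedding.",
    speaker_embedding=np.load("narrator.npy"),  # assumed parameter name
)
model.save_audio(audio, "cached.wav")
```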
Troubleshooting
| Problem | Fix |
|---|---|
| ImportError: cannot import LFM2 | Install the correct transformers version: `pip install -U "transformers==4.56.0"` |
| Audio quality is poor / robotic | Increase temperature to 0.8–0.9; for cloning, make sure the reference audio is clean (no background noise) |
| Voice cloning doesn't sound like the reference | Use 5–15 seconds of clear, single-speaker audio; avoid music or background noise in the reference |
| CUDA out of memory | Shouldn't happen with a 3GB model; check whether other processes are using GPU memory (`nvidia-smi`) |
| Audio cuts off mid-sentence | Kani-TTS-2 generates up to ~40 seconds per call; split longer texts into sentences and concatenate the outputs |
| Slow on CPU | GPU inference is strongly recommended; even a basic GPU is 10–50× faster than CPU |
Further Reading
GitHub — kani-tts-2 — PyPI package, usage docs, advanced examples
HuggingFace — kani-tts-2-en — English model weights
Pretraining Framework — Train your own TTS model from scratch
OpenAI-Compatible Server — Drop-in replacement for OpenAI TTS API
Speaker Embedding Model — WavLM-based voice embedder
MarkTechPost Overview — Community coverage