# Kani-TTS-2 Voice Cloning

Kani-TTS-2 by nineninesix.ai (released February 15, 2026) is a 400M-parameter open-source text-to-speech model that achieves high-fidelity speech synthesis using only **3GB of VRAM**. Built on LiquidAI's LFM2 architecture with NVIDIA NanoCodec, it treats audio as a language, generating natural-sounding speech with zero-shot voice cloning from a short reference audio clip. With a smaller VRAM footprint than most competing models and a fraction of the compute, Kani-TTS-2 is well suited to real-time conversational AI, audiobook generation, and voice cloning on budget hardware.

**HuggingFace:** [nineninesix/kani-tts-2-en](https://huggingface.co/nineninesix/kani-tts-2-en) **GitHub:** [nineninesix-ai/kani-tts-2](https://github.com/nineninesix-ai/kani-tts-2) **PyPI:** [kani-tts-2](https://pypi.org/project/kani-tts-2/) **License:** Apache 2.0

## Key Features

* **400M parameters, 3GB VRAM** — runs on virtually any modern GPU, including RTX 3060
* **Zero-shot voice cloning** — clone any voice from a 3–30 second reference audio sample
* **Speaker embeddings** — WavLM-based 128-dim speaker representations for precise voice control (see the quick check after this list)
* **Up to 40 seconds of continuous audio** — suitable for longer passages and dialogue
* **Real-time or faster** — RTF \~0.2 on RTX 5080, real-time even on budget GPUs
* **Apache 2.0** — fully open for personal and commercial use
* **Pretraining framework included** — train your own TTS model from scratch on any language
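
A quick way to confirm the embedding format on your own setup, assuming `SpeakerEmbedder` returns an array-like object with a `.shape` attribute (the exact return type, NumPy array or tensor, may vary):

```python
from kani_tts import SpeakerEmbedder

# Embed any short, clean speech clip and inspect the result
embedder = SpeakerEmbedder()
emb = embedder.embed_audio_file("sample_voice.wav")
print(emb.shape)  # expect a 128-dim vector, per the feature list above
```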

## Comparison with Other TTS Models

| Model          | Parameters | Min VRAM | Voice Cloning   | Language             | License      |
| -------------- | ---------- | -------- | --------------- | -------------------- | ------------ |
| **Kani-TTS-2** | 400M       | 3GB      | ✅ Zero-shot     | English (extensible) | Apache 2.0   |
| Kokoro         | 82M        | 2GB      | ❌ Preset voices | EN, JP, CN           | Apache 2.0   |
| Zonos          | 400M       | 8GB      | ✅               | Multi                | Apache 2.0   |
| ChatTTS        | 300M       | 4GB      | ❌ Random seeds  | Chinese, English     | AGPL 3.0     |
| Chatterbox     | 500M       | 6GB      | ✅               | English              | Apache 2.0   |
| XTTS (Coqui)   | 467M       | 6GB      | ✅               | Multi                | MPL 2.0      |
| F5-TTS         | 335M       | 4GB      | ✅               | Multi                | CC-BY-NC 4.0 |

## Requirements

| Component | Minimum           | Recommended        |
| --------- | ----------------- | ------------------ |
| GPU       | Any with 3GB VRAM | RTX 3060 or better |
| VRAM      | 3GB               | 6GB                |
| RAM       | 8GB               | 16GB               |
| Disk      | 2GB               | 5GB                |
| Python    | 3.9+              | 3.11+              |
| CUDA      | 11.8+             | 12.0+              |
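
Before renting or reusing an instance, a minimal pre-flight check that the GPU clears the 3GB bar (a sketch; it assumes PyTorch is installed, which the transformers-based stack needs anyway):

```python
import torch

# Confirm a CUDA device is visible and has enough free memory
assert torch.cuda.is_available(), "No CUDA device visible; check drivers"
free, total = torch.cuda.mem_get_info()
print(f"{torch.cuda.get_device_name(0)}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
assert free >= 3e9, "Less than 3 GB of free VRAM"
```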

**Clore.ai recommendation:** An RTX 3060 (\~$0.15–0.30/day) is more than enough. Even the cheapest GPU instances on Clore.ai will run Kani-TTS-2 comfortably. For batch processing (audiobooks, datasets), an RTX 4090 (\~$0.5–2/day) provides excellent throughput.

## Installation

```bash
# Install the package
pip install kani-tts-2

# IMPORTANT: Install compatible transformers version (required for LFM2 architecture)
pip install -U "transformers==4.56.0"

# Optional: Install soundfile for saving audio
pip install soundfile
```
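
A quick import check confirms the environment, in particular the pinned transformers version:

```python
# Both imports should succeed; an ImportError here usually means
# transformers drifted away from the pinned 4.56.0
import transformers
from kani_tts import KaniTTS

print(transformers.__version__)  # expect 4.56.0
```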

## Quick Start

Three lines to generate speech:

```python
from kani_tts import KaniTTS

# Initialize with the English model
model = KaniTTS('nineninesix/kani-tts-2-en')

# Generate speech
audio, text = model("Hello! Welcome to Kani TTS 2, the next generation of efficient text-to-speech.")

# Save to file
model.save_audio(audio, "output.wav")
```

## Usage Examples

### 1. Basic Text-to-Speech

```python
from kani_tts import KaniTTS

model = KaniTTS('nineninesix/kani-tts-2-en')

# Generate with custom parameters
audio, text = model(
    "The quick brown fox jumps over the lazy dog. "
    "This sentence contains every letter of the English alphabet.",
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1
)

model.save_audio(audio, "pangram.wav")
print(f"Generated {len(audio) / 22000:.1f} seconds of audio")
```

### 2. Voice Cloning

Clone any voice from a short reference audio sample:

```python
from kani_tts import KaniTTS, SpeakerEmbedder

# Initialize models
model = KaniTTS('nineninesix/kani-tts-2-en')
embedder = SpeakerEmbedder()

# Extract speaker embedding from reference audio (3-30 seconds recommended)
speaker_embedding = embedder.embed_audio_file("reference_voice.wav")

# Generate speech in the cloned voice
audio, text = model(
    "This is a demonstration of voice cloning with Kani TTS 2. "
    "The voice you hear should match the reference audio sample.",
    speaker_emb=speaker_embedding
)

model.save_audio(audio, "cloned_output.wav")
```

### 3. Batch Generation for Audiobooks

Generate multiple chapters efficiently:

```python
from kani_tts import KaniTTS, SpeakerEmbedder

model = KaniTTS('nineninesix/kani-tts-2-en')
embedder = SpeakerEmbedder()

# Use a narrator voice
narrator_emb = embedder.embed_audio_file("narrator_sample.wav")

chapters = [
    "Chapter One. It was a bright cold day in April, and the clocks were striking thirteen.",
    "Chapter Two. The hallway smelt of boiled cabbage and old rag mats.",
    "Chapter Three. Outside, even through the shut window pane, the world looked cold.",
]

for i, chapter_text in enumerate(chapters):
    audio, _ = model(chapter_text, speaker_emb=narrator_emb)
    model.save_audio(audio, f"chapter_{i+1}.wav")
    print(f"Generated chapter {i+1}")
```
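
To ship one file instead of per-chapter WAVs, the outputs can be stitched together with `soundfile` and NumPy. A sketch, assuming mono output and an arbitrary one-second pause between chapters:

```python
import numpy as np
import soundfile as sf

# Concatenate the chapter files with a short pause between them
parts = []
rate = None
for i in range(1, 4):
    data, rate = sf.read(f"chapter_{i}.wav")
    parts.append(data)
    parts.append(np.zeros(rate))  # one second of silence

sf.write("audiobook.wav", np.concatenate(parts), rate)
```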

### 4. OpenAI-Compatible Streaming API

For real-time applications, use the OpenAI-compatible server:

```bash
# Clone the server
git clone https://github.com/nineninesix-ai/kani-tts-2-openai-server.git
cd kani-tts-2-openai-server

# Install dependencies
pip install -r requirements.txt

# Start the server
python server.py --model nineninesix/kani-tts-2-en --host 0.0.0.0 --port 8080
```

Then use it with any OpenAI TTS client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Stream the audio to disk as it is generated (the plain
# client.audio.speech.create(...).stream_to_file(...) pattern is
# deprecated in recent openai-python releases)
with client.audio.speech.with_streaming_response.create(
    model="kani-tts-2-en",
    voice="default",
    input="Hello from the OpenAI-compatible Kani TTS server!",
) as response:
    response.stream_to_file("streamed_output.wav")
```

## Tips for Clore.ai Users

1. **This is the cheapest model to run** — At 3GB VRAM, Kani-TTS-2 runs on literally any GPU instance on Clore.ai. An RTX 3060 at $0.15/day is more than sufficient for production TTS.
2. **Combine with a language model** — Rent one GPU instance and run both a small LLM (e.g., Mistral 3 8B) and Kani-TTS-2 simultaneously for a complete voice assistant. They'll share the GPU with room to spare.
3. **Pre-compute speaker embeddings** — Extract speaker embeddings once and save them (see the sketch after this list). This avoids loading the WavLM embedder model on every request.
4. **Use the OpenAI-compatible server** — The `kani-tts-2-openai-server` provides a drop-in replacement for OpenAI's TTS API, making it easy to integrate with existing applications.
5. **Train on custom languages** — Kani-TTS-2 includes a full pretraining framework ([kani-tts-2-pretrain](https://github.com/nineninesix-ai/kani-tts-2-pretrain)). Training the model on your own language dataset takes roughly 8× H100s for \~6 hours.
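
A sketch of tip 3, assuming the embedding converts to a NumPy array (for a torch tensor, `torch.save`/`torch.load` would do the same job) and that the model accepts the loaded array via `speaker_emb`:

```python
import numpy as np
from kani_tts import KaniTTS, SpeakerEmbedder

# One-time offline step: extract and persist the embedding
embedder = SpeakerEmbedder()
emb = embedder.embed_audio_file("narrator_sample.wav")
np.save("narrator_emb.npy", np.asarray(emb))

# At serving time: load the saved embedding, skip the embedder entirely
model = KaniTTS('nineninesix/kani-tts-2-en')
narrator_emb = np.load("narrator_emb.npy")
audio, _ = model("Ready to serve requests.", speaker_emb=narrator_emb)
```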

## Troubleshooting

| Issue                                      | Solution                                                                                              |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| `ImportError: cannot import LFM2`          | Install the correct transformers version: `pip install -U "transformers==4.56.0"`                     |
| Audio quality is poor / robotic            | Increase `temperature` to 0.8–0.9; ensure reference audio for cloning is clean (no background noise)  |
| Voice cloning doesn't sound like reference | Use 5–15 seconds of clear, single-speaker audio. Avoid music or background noise in the reference     |
| `CUDA out of memory`                       | Shouldn't happen with 3GB model — check if other processes are using GPU memory (`nvidia-smi`)        |
| Audio cuts off mid-sentence                | Kani-TTS-2 supports up to \~40 seconds. Split longer texts into sentences and concatenate the outputs (sketch below) |
| Slow on CPU                                | GPU inference is strongly recommended. Even a basic GPU is 10–50× faster than CPU                     |
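
For the cut-off case, a minimal long-text helper, assuming the model's outputs concatenate as NumPy arrays (the sentence-splitting regex is illustrative, not robust):

```python
import re
import numpy as np
from kani_tts import KaniTTS

model = KaniTTS('nineninesix/kani-tts-2-en')

long_text = (
    "First sentence of a passage far longer than forty seconds. "
    "Second sentence. Third sentence, and so on."
)

# Naive sentence split on ., !, or ? followed by whitespace
sentences = [s for s in re.split(r'(?<=[.!?])\s+', long_text.strip()) if s]

# Synthesize each sentence and join the audio back together
chunks = [np.asarray(model(s)[0]) for s in sentences]
model.save_audio(np.concatenate(chunks), "long_output.wav")
```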

## Further Reading

* [GitHub — kani-tts-2](https://github.com/nineninesix-ai/kani-tts-2) — PyPI package, usage docs, advanced examples
* [HuggingFace — kani-tts-2-en](https://huggingface.co/nineninesix/kani-tts-2-en) — English model weights
* [Pretraining Framework](https://github.com/nineninesix-ai/kani-tts-2-pretrain) — Train your own TTS model from scratch
* [OpenAI-Compatible Server](https://github.com/nineninesix-ai/kani-tts-2-openai-server) — Drop-in replacement for OpenAI TTS API
* [Speaker Embedding Model](https://huggingface.co/nineninesix/speaker-emb-tbr) — WavLM-based voice embedder
* [MarkTechPost Overview](https://www.marktechpost.com/2026/02/15/meet-kani-tts-2-a-400m-param-open-source-text-to-speech-model-that-runs-in-3gb-vram-with-voice-cloning-support/) — Community coverage
