Qwen3-TTS Voice Cloning

Multilingual voice cloning and TTS with Qwen3-TTS — 10+ languages, streaming, emotion control

Qwen3-TTS by Alibaba is a state-of-the-art text-to-speech model supporting 10+ languages with voice cloning from just 3 seconds of audio. It features natural language emotion control ("speak happily", "whisper softly"), streaming with 97ms latency, and two model sizes (0.6B and 1.7B). Released under Apache 2.0, it's one of the most capable open-source TTS systems available.

Key Features

  • 10+ languages: English, Chinese, Japanese, Korean, French, German, Spanish, and more

  • 3-second voice cloning: Clone any voice from a short audio sample

  • Natural emotion control: Control style with plain text instructions

  • Streaming support: 97ms first-token latency — great for real-time apps

  • Two sizes: 0.6B (4GB VRAM) and 1.7B (8GB VRAM)

  • Fine-tunable: Base models available for custom training

  • Apache 2.0 license: Full commercial use

Model Variants

| Model | Parameters | VRAM | Quality | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen3-TTS-0.6B-Instruct | 0.6B | 4GB | Good | Fast | Real-time, budget GPUs |
| Qwen3-TTS-1.7B-Instruct | 1.7B | 8GB | Best | Medium | Production quality |
| Qwen3-TTS-0.6B-Base | 0.6B | 4GB | | | Fine-tuning |
| Qwen3-TTS-1.7B-Base | 1.7B | 8GB | | | Fine-tuning |
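
To pull a specific variant onto a rented machine ahead of time, the snapshot download helper from huggingface_hub works well. The repository ID below mirrors the variant name in the table and is an assumption, so confirm the exact name on the Hugging Face Hub first.

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Repo ID assumed from the variant name above -- verify on the Hub before use
snapshot_download(
    repo_id="Qwen/Qwen3-TTS-0.6B-Instruct",
    local_dir="./qwen3-tts-0.6b-instruct",
)
```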

Requirements

| Component | 0.6B | 1.7B |
| --- | --- | --- |
| GPU | RTX 3060 6GB | RTX 3080 10GB |
| VRAM | 4GB | 8GB |
| RAM | 8GB | 16GB |
| Disk | 5GB | 10GB |
| Python | 3.10+ | 3.10+ |

Recommended Clore.ai GPU: RTX 3060 ($0.15–0.30/day) for 0.6B, RTX 3080 ($0.20–0.50/day) for 1.7B
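
Before committing to a rental, a quick PyTorch check confirms the card actually exposes enough VRAM for the variant you plan to run; this uses only standard torch.cuda calls, nothing model-specific.

```python
import torch

# Compare available VRAM against the table above: ~4GB for 0.6B, ~8GB for 1.7B
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}  VRAM: {total_gb:.1f} GB")
    if total_gb >= 8:
        print("Enough for the 1.7B models")
    elif total_gb >= 4:
        print("Enough for the 0.6B models")
    else:
        print("Below the minimum VRAM requirement")
else:
    print("No CUDA GPU detected")
```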

Installation
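
A typical setup looks like the sketch below. PyTorch and the audio I/O packages are standard; the qwen-tts package name is an assumption, so check the official repository or model card for the published name.

```bash
# Python 3.10+ per the requirements table
python -m venv .venv && source .venv/bin/activate

# Core dependencies: PyTorch with CUDA, audio I/O
pip install torch torchaudio soundfile

# Model package -- "qwen-tts" is an assumed name; check the official README
pip install qwen-tts

# If the model is integrated upstream, a recent transformers + accelerate also works
pip install "transformers>=4.45" accelerate
```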

Quick Start — Voice Cloning
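
A minimal cloning sketch, assuming the package exposes a QwenTTS class with a from_pretrained loader and a generate call that takes a reference clip; these names are assumptions, so map them onto whatever the model card actually documents.

```python
import soundfile as sf
from qwen_tts import QwenTTS   # package and class names are assumed

# float16 keeps the 1.7B model within ~8GB of VRAM
model = QwenTTS.from_pretrained(
    "Qwen/Qwen3-TTS-1.7B-Instruct",  # assumed repo ID
    device="cuda",
    dtype="float16",
)

# Clone a voice from a short, clean reference clip (3-10 seconds works best)
audio, sample_rate = model.generate(
    text="Hello! This is my cloned voice reading a short demo sentence.",
    reference_audio="reference_voice.wav",
)

sf.write("cloned_output.wav", audio, sample_rate)
print("Wrote cloned_output.wav")
```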

Emotion Control
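
Because emotion is steered with plain text, the only change over the cloning sketch is an extra instruction string; the instruct parameter name below is an assumption.

```python
import soundfile as sf
from qwen_tts import QwenTTS   # assumed names, as in the Quick Start sketch

model = QwenTTS.from_pretrained("Qwen/Qwen3-TTS-1.7B-Instruct", device="cuda", dtype="float16")

# Natural-language style instructions; "instruct" is an assumed keyword
styles = {
    "happy": "Speak happily and energetically, like sharing great news.",
    "whisper": "Whisper softly, as if telling a secret.",
    "sad": "Speak slowly, with a tired and sad tone.",
}

for name, instruction in styles.items():
    audio, sr = model.generate(
        text="The results are in, and I can hardly believe them.",
        reference_audio="reference_voice.wav",
        instruct=instruction,
    )
    sf.write(f"emotion_{name}.wav", audio, sr)
```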

Multilingual Generation
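
For multiple languages the pattern is the same clip-plus-text call in a loop; the language parameter name is an assumption, but passing it explicitly matches the advice in the troubleshooting table below.

```python
import soundfile as sf
from qwen_tts import QwenTTS   # assumed names, as in the Quick Start sketch

model = QwenTTS.from_pretrained("Qwen/Qwen3-TTS-0.6B-Instruct", device="cuda")

# One reference voice, several target languages
samples = [
    ("en", "Good morning, and welcome to today's broadcast."),
    ("zh", "早上好，欢迎收听今天的节目。"),
    ("ja", "おはようございます。本日の放送へようこそ。"),
    ("fr", "Bonjour et bienvenue dans l'émission d'aujourd'hui."),
]

for lang, text in samples:
    audio, sr = model.generate(
        text=text,
        reference_audio="reference_voice.wav",
        language=lang,   # assumed parameter; see Troubleshooting
    )
    sf.write(f"greeting_{lang}.wav", audio, sr)
```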

Comparison with Other TTS Models

| Feature | Qwen3-TTS | Zonos | Dia | Kokoro | XTTS |
| --- | --- | --- | --- | --- | --- |
| Languages | 10+ | 1 (EN) | 1 (EN) | 1 (EN) | 17 |
| Voice Clone | 3 sec | 2-30 sec | No | No | 6 sec |
| Streaming | ✅ (97ms) | | | | |
| Emotion Control | ✅ Natural | ✅ Auto | | | |
| Multi-Speaker | | | | | |
| Min VRAM | 4GB | 8GB | 8GB | 2GB | 6GB |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | AGPL |

Tips for Clore.ai Users

  • 0.6B on RTX 3060: Best budget option at $0.15/day — good enough for most TTS tasks

  • Batch processing: Generate all audio clips in one session to maximize rental time

  • Cache reference audio: Keep your voice references on persistent storage

  • Streaming for real-time: Use the streaming API for chatbot/assistant applications (see the sketch after this list)

  • Fine-tune for custom voices: Rent an RTX 4090 for a few hours to fine-tune the base model on your voice data
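
For the real-time tip, the shape of a streaming pipeline looks like this: pull audio chunks from the model as they are produced and write them straight to the sound device. The stream_generate method and the 24 kHz output rate are assumptions; only the playback side (sounddevice) is standard.

```python
import numpy as np
import sounddevice as sd
from qwen_tts import QwenTTS   # assumed names, as in the Quick Start sketch

model = QwenTTS.from_pretrained("Qwen/Qwen3-TTS-0.6B-Instruct", device="cuda")
SAMPLE_RATE = 24000  # assumed output rate -- check the model card

# Play each chunk as soon as it arrives instead of waiting for the full clip;
# stream_generate is a hypothetical chunk-yielding API
with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    for chunk in model.stream_generate(
        text="With streaming, the first audio arrives in roughly a hundred milliseconds.",
        reference_audio="reference_voice.wav",
    ):
        stream.write(np.asarray(chunk, dtype=np.float32))
```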

Troubleshooting

| Issue | Solution |
| --- | --- |
| Out of memory on 1.7B | Switch to 0.6B or load with `torch_dtype=torch.float16` |
| Voice clone sounds wrong | Use 5–10 seconds of clean reference audio (no background noise) |
| Wrong language output | Pass the `language` parameter explicitly |
| Slow first generation | Normal: the model loads on the first call; subsequent calls are fast |
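
For the out-of-memory row, half precision is usually a one-argument change; the snippet assumes a transformers-style loader with trust_remote_code, which is an assumption to verify against the model card.

```python
import torch
from transformers import AutoModel

# float16 roughly halves weight memory (~3.4GB for 1.7B vs ~6.8GB in float32),
# which is what keeps the larger model inside an 8GB card
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-1.7B-Instruct",   # assumed repo ID
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
```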
