Qwen3-TTS Voice Cloning

Multilingual voice cloning and TTS with Qwen3-TTS — 10+ languages, streaming, emotion control

Qwen3-TTS by Alibaba is a state-of-the-art text-to-speech model supporting 10+ languages with voice cloning from just 3 seconds of audio. It features natural language emotion control ("speak happily", "whisper softly"), streaming with 97ms latency, and two model sizes (0.6B and 1.7B). Released under Apache 2.0, it's one of the most capable open-source TTS systems available.

Key Features

  • 10+ languages: English, Chinese, Japanese, Korean, French, German, Spanish, and more

  • 3-second voice cloning: Clone any voice from a short audio sample

  • Natural emotion control: Control style with plain text instructions

  • Streaming support: 97ms first-token latency — great for real-time apps

  • Two sizes: 0.6B (4GB VRAM) and 1.7B (8GB VRAM)

  • Fine-tunable: Base models available for custom training

  • Apache 2.0 license: Full commercial use

Model Variants

| Model | Parameters | VRAM | Quality | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen3-TTS-0.6B-Instruct | 0.6B | 4GB | Good | Fast | Real-time, budget GPUs |
| Qwen3-TTS-1.7B-Instruct | 1.7B | 8GB | Best | Medium | Production quality |
| Qwen3-TTS-0.6B-Base | 0.6B | 4GB | | | Fine-tuning |
| Qwen3-TTS-1.7B-Base | 1.7B | 8GB | | | Fine-tuning |
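
To pull a specific variant onto a rented machine ahead of time, the snapshot download helper from huggingface_hub works well. The repository ID below mirrors the variant name in the table and is an assumption, so confirm the exact name on the Hugging Face Hub first.

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Repo ID assumed from the variant name above -- verify on the Hub before use
snapshot_download(
    repo_id="Qwen/Qwen3-TTS-0.6B-Instruct",
    local_dir="./qwen3-tts-0.6b-instruct",
)
```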

Requirements

| Component | 0.6B | 1.7B |
| --- | --- | --- |
| GPU | RTX 3060 6GB | RTX 3080 10GB |
| VRAM | 4GB | 8GB |
| RAM | 8GB | 16GB |
| Disk | 5GB | 10GB |
| Python | 3.10+ | 3.10+ |

Recommended Clore.ai GPU: RTX 3060 ($0.15–0.30/day) for 0.6B, RTX 3080 ($0.20–0.50/day) for 1.7B
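
Before committing to a rental, a quick PyTorch check confirms the card actually exposes enough VRAM for the variant you plan to run; this uses only standard torch.cuda calls, nothing model-specific.

```python
import torch

# Compare available VRAM against the table above: ~4GB for 0.6B, ~8GB for 1.7B
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}  VRAM: {total_gb:.1f} GB")
    if total_gb >= 8:
        print("Enough for the 1.7B models")
    elif total_gb >= 4:
        print("Enough for the 0.6B models")
    else:
        print("Below the minimum VRAM requirement")
else:
    print("No CUDA GPU detected")
```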

Installation
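
A typical setup looks like the sketch below. PyTorch and the audio I/O packages are standard; the qwen-tts package name is an assumption, so check the official repository or model card for the published name.

```bash
# Python 3.10+ per the requirements table
python -m venv .venv && source .venv/bin/activate

# Core dependencies: PyTorch with CUDA, audio I/O
pip install torch torchaudio soundfile

# Model package -- "qwen-tts" is an assumed name; check the official README
pip install qwen-tts

# If the model is integrated upstream, a recent transformers + accelerate also works
pip install "transformers>=4.45" accelerate
```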

Quick Start — Voice Cloning
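
A minimal cloning sketch, assuming the package exposes a QwenTTS class with a from_pretrained loader and a generate call that takes a reference clip; these names are assumptions, so map them onto whatever the model card actually documents.

```python
import soundfile as sf
from qwen_tts import QwenTTS   # package and class names are assumed

# float16 keeps the 1.7B model within ~8GB of VRAM
model = QwenTTS.from_pretrained(
    "Qwen/Qwen3-TTS-1.7B-Instruct",  # assumed repo ID
    device="cuda",
    dtype="float16",
)

# Clone a voice from a short, clean reference clip (3-10 seconds works best)
audio, sample_rate = model.generate(
    text="Hello! This is my cloned voice reading a short demo sentence.",
    reference_audio="reference_voice.wav",
)

sf.write("cloned_output.wav", audio, sample_rate)
print("Wrote cloned_output.wav")
```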

Emotion Control
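
Because emotion is steered with plain text, the only change over the cloning sketch is an extra instruction string; the instruct parameter name below is an assumption.

```python
import soundfile as sf
from qwen_tts import QwenTTS   # assumed names, as in the Quick Start sketch

model = QwenTTS.from_pretrained("Qwen/Qwen3-TTS-1.7B-Instruct", device="cuda", dtype="float16")

# Natural-language style instructions; "instruct" is an assumed keyword
styles = {
    "happy": "Speak happily and energetically, like sharing great news.",
    "whisper": "Whisper softly, as if telling a secret.",
    "sad": "Speak slowly, with a tired and sad tone.",
}

for name, instruction in styles.items():
    audio, sr = model.generate(
        text="The results are in, and I can hardly believe them.",
        reference_audio="reference_voice.wav",
        instruct=instruction,
    )
    sf.write(f"emotion_{name}.wav", audio, sr)
```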

Multilingual Generation
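
For multiple languages the pattern is the same clip-plus-text call in a loop; the language parameter name is an assumption, but passing it explicitly matches the advice in the troubleshooting table below.

```python
import soundfile as sf
from qwen_tts import QwenTTS   # assumed names, as in the Quick Start sketch

model = QwenTTS.from_pretrained("Qwen/Qwen3-TTS-0.6B-Instruct", device="cuda")

# One reference voice, several target languages
samples = [
    ("en", "Good morning, and welcome to today's broadcast."),
    ("zh", "早上好，欢迎收听今天的节目。"),
    ("ja", "おはようございます。本日の放送へようこそ。"),
    ("fr", "Bonjour et bienvenue dans l'émission d'aujourd'hui."),
]

for lang, text in samples:
    audio, sr = model.generate(
        text=text,
        reference_audio="reference_voice.wav",
        language=lang,   # assumed parameter; see Troubleshooting
    )
    sf.write(f"greeting_{lang}.wav", audio, sr)
```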

Comparison with Other TTS Models

| Feature | Qwen3-TTS | Zonos | Dia | Kokoro | XTTS |
| --- | --- | --- | --- | --- | --- |
| Languages | 10+ | 1 (EN) | 1 (EN) | 1 (EN) | 17 |
| Voice Clone | 3 sec | 2-30 sec | No | No | 6 sec |
| Streaming | ✅ (97ms) | | | | |
| Emotion Control | ✅ Natural | ✅ Auto | | | |
| Multi-Speaker | | | | | |
| Min VRAM | 4GB | 8GB | 8GB | 2GB | 6GB |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | AGPL |

Tips for Clore.ai Users

  • 0.6B on RTX 3060: Best budget option at $0.15/day — good enough for most TTS tasks

  • Batch processing: Generate all audio clips in one session to maximize rental time

  • Cache reference audio: Keep your voice references on persistent storage

  • Streaming for real-time: Use the streaming API for chatbot/assistant applications (see the sketch after this list)

  • Fine-tune for custom voices: Rent an RTX 4090 for a few hours to fine-tune the base model on your voice data
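
For the real-time tip, the shape of a streaming pipeline looks like this: pull audio chunks from the model as they are produced and write them straight to the sound device. The stream_generate method and the 24 kHz output rate are assumptions; only the playback side (sounddevice) is standard.

```python
import numpy as np
import sounddevice as sd
from qwen_tts import QwenTTS   # assumed names, as in the Quick Start sketch

model = QwenTTS.from_pretrained("Qwen/Qwen3-TTS-0.6B-Instruct", device="cuda")
SAMPLE_RATE = 24000  # assumed output rate -- check the model card

# Play each chunk as soon as it arrives instead of waiting for the full clip;
# stream_generate is a hypothetical chunk-yielding API
with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    for chunk in model.stream_generate(
        text="With streaming, the first audio arrives in roughly a hundred milliseconds.",
        reference_audio="reference_voice.wav",
    ):
        stream.write(np.asarray(chunk, dtype=np.float32))
```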

Troubleshooting

| Issue | Solution |
| --- | --- |
| Out of memory on 1.7B | Switch to 0.6B or load with `torch_dtype=torch.float16` |
| Voice clone sounds wrong | Use 5–10 seconds of clean reference audio (no background noise) |
| Wrong language output | Pass the `language` parameter explicitly |
| Slow first generation | Normal: the model loads on the first call; subsequent calls are fast |
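
For the out-of-memory row, half precision is usually a one-argument change; the snippet assumes a transformers-style loader with trust_remote_code, which is an assumption to verify against the model card.

```python
import torch
from transformers import AutoModel

# float16 roughly halves weight memory (~3.4GB for 1.7B vs ~6.8GB in float32),
# which is what keeps the larger model inside an 8GB card
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-1.7B-Instruct",   # assumed repo ID
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
```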
