# Qwen3-TTS Voice Cloning

Qwen3-TTS by Alibaba is a state-of-the-art text-to-speech model supporting **10+ languages** with voice cloning from just 3 seconds of audio. It features natural language emotion control ("speak happily", "whisper softly"), streaming with 97ms latency, and two model sizes (0.6B and 1.7B). Released under Apache 2.0, it's one of the most capable open-source TTS systems available.

## Key Features

* **10+ languages**: English, Chinese, Japanese, Korean, French, German, Spanish, and more
* **3-second voice cloning**: Clone any voice from a short audio sample
* **Natural emotion control**: Control style with plain text instructions
* **Streaming support**: 97ms first-token latency — great for real-time apps
* **Two sizes**: 0.6B (4GB VRAM) and 1.7B (8GB VRAM)
* **Fine-tunable**: Base models available for custom training
* **Apache 2.0 license**: Full commercial use

## Model Variants

| Model                   | Parameters | VRAM | Quality | Speed  | Best For               |
| ----------------------- | ---------- | ---- | ------- | ------ | ---------------------- |
| Qwen3-TTS-0.6B-Instruct | 0.6B       | 4GB  | Good    | Fast   | Real-time, budget GPUs |
| Qwen3-TTS-1.7B-Instruct | 1.7B       | 8GB  | Best    | Medium | Production quality     |
| Qwen3-TTS-0.6B-Base     | 0.6B       | 4GB  | —       | —      | Fine-tuning            |
| Qwen3-TTS-1.7B-Base     | 1.7B       | 8GB  | —       | —      | Fine-tuning            |

## Requirements

| Component | 0.6B         | 1.7B          |
| --------- | ------------ | ------------- |
| GPU       | RTX 3060 6GB | RTX 3080 10GB |
| VRAM      | 4GB          | 8GB           |
| RAM       | 8GB          | 16GB          |
| Disk      | 5GB          | 10GB          |
| Python    | 3.10+        | 3.10+         |

**Recommended Clore.ai GPU**: RTX 3060 ($0.15–$0.30/day) for the 0.6B model, RTX 3080 ($0.20–$0.50/day) for the 1.7B
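For budgeting, the daily rates above pro-rate cleanly to shorter sessions. A minimal sketch (the rate is taken from the range above; the 2-hour session length is just an example assumption):

```python
def rental_cost(daily_rate_usd: float, hours: float) -> float:
    """Pro-rated rental cost for a partial day at a given daily rate."""
    return daily_rate_usd * hours / 24

# Example: a 2-hour batch-generation session on an RTX 3060
# at $0.30/day (upper end of the range above)
print(f"${rental_cost(0.30, 2):.3f}")  # $0.025
```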

## Installation

```bash
pip install transformers torch torchaudio soundfile
```

## Quick Start — Voice Cloning

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load reference voice (3+ seconds of any voice)
reference_audio, sr = torchaudio.load("reference_voice.wav")

# Generate speech cloning that voice
text = "Welcome to Clore.ai, the decentralized GPU rental marketplace."
inputs = processor(
    text=text,
    audio=reference_audio,
    sampling_rate=sr,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)

# Decode and save
audio = processor.decode(output[0])
torchaudio.save("output.wav", audio.unsqueeze(0), 24000)
```

## Emotion Control

```python
# Control emotion with natural language instructions
prompts = [
    ("Speak happily and energetically", "Great news! We just launched the new feature!"),
    ("Whisper softly and gently", "Let me tell you a secret about GPU pricing..."),
    ("Speak professionally and clearly", "The quarterly results show a 40% increase in revenue."),
    ("Speak with excitement", "You won't believe the benchmark results!"),
]

for style, text in prompts:
    inputs = processor(
        text=text,
        style_prompt=style,
        audio=reference_audio,
        sampling_rate=sr,
        return_tensors="pt"
    ).to("cuda")
    
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=2048)
    audio = processor.decode(output[0])
    torchaudio.save(f"output_{style[:10].replace(' ', '_')}.wav", audio.unsqueeze(0), 24000)
```
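The loop above derives output filenames from the first ten characters of the style prompt, which can include spaces and punctuation. A small helper (illustrative only, not part of the Qwen3-TTS API) turns any style string into a filesystem-safe slug:

```python
import re

def slugify(text: str, max_len: int = 20) -> str:
    """Lowercase, collapse non-alphanumeric runs to '_', and truncate."""
    slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return slug[:max_len]

print(slugify("Speak happily and energetically"))  # speak_happily_and_en
```

Use it as `torchaudio.save(f"output_{slugify(style)}.wav", ...)` so two styles that share a prefix don't overwrite each other's files.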

## Multilingual Generation

```python
# Generate in different languages (same voice!)
texts = {
    "en": "Hello, welcome to the GPU marketplace.",
    "zh": "你好，欢迎来到GPU市场。",
    "ja": "こんにちは、GPUマーケットプレイスへようこそ。",
    "ko": "안녕하세요, GPU 마켓플레이스에 오신 것을 환영합니다.",
    "fr": "Bonjour, bienvenue sur le marché GPU.",
    "de": "Hallo, willkommen auf dem GPU-Marktplatz.",
}

for lang, text in texts.items():
    inputs = processor(
        text=text, audio=reference_audio, sampling_rate=sr,
        language=lang, return_tensors="pt"
    ).to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=2048)
    audio = processor.decode(output[0])
    torchaudio.save(f"output_{lang}.wav", audio.unsqueeze(0), 24000)
```

## Comparison with Other TTS Models

| Feature         | Qwen3-TTS  | Zonos      | Dia        | Kokoro     | XTTS  |
| --------------- | ---------- | ---------- | ---------- | ---------- | ----- |
| Languages       | 10+        | 1 (EN)     | 1 (EN)     | 1 (EN)     | 17    |
| Voice Clone     | 3 sec      | 2-30 sec   | No         | No         | 6 sec |
| Streaming       | ✅ (97ms)   | ❌          | ❌          | ❌          | ✅     |
| Emotion Control | ✅ Natural  | ❌          | ✅ Auto     | ❌          | ❌     |
| Multi-Speaker   | ❌          | ❌          | ✅          | ❌          | ❌     |
| Min VRAM        | 4GB        | 8GB        | 8GB        | 2GB        | 6GB   |
| License         | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | AGPL  |

## Tips for Clore.ai Users

* **0.6B on RTX 3060**: Best budget option at $0.15/day — good enough for most TTS tasks
* **Batch processing**: Generate all audio clips in one session to maximize rental time
* **Cache reference audio**: Keep your voice references on persistent storage
* **Streaming for real-time**: Use the streaming API for chatbot/assistant applications
* **Fine-tune for custom voices**: Rent an RTX 4090 for a few hours to fine-tune the base model on your voice data

## Troubleshooting

| Issue                    | Solution                                                      |
| ------------------------ | ------------------------------------------------------------- |
| Out of memory on 1.7B    | Switch to 0.6B, or load with `torch_dtype=torch.float16` if not already |
| Voice clone sounds wrong | Use 5-10 seconds of clean audio (no background noise)         |
| Wrong language output    | Explicitly pass `language` parameter                          |
| Slow first generation    | Normal — model loads on first call. Subsequent calls are fast |
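For the "voice clone sounds wrong" row, it helps to verify the reference clip before blaming the model. This stdlib-only check (independent of Qwen3-TTS) reports a WAV file's duration, sample rate, and channel count, and flags clips shorter than the 3-second minimum:

```python
import wave

def check_reference(path: str, min_seconds: float = 3.0):
    """Return (duration_s, sample_rate, channels) for a WAV file and
    warn if it is shorter than the minimum length for cloning."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        channels = wav.getnchannels()
    if duration < min_seconds:
        print(f"Warning: {path} is {duration:.1f}s, below the {min_seconds}s minimum")
    return duration, rate, channels
```

Aim for the 5–10 seconds of clean, noise-free audio recommended in the table above rather than the bare 3-second minimum.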

## Further Reading

* [HuggingFace Models](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Instruct)
* [Qwen3-TTS Documentation](https://qwen.readthedocs.io/)
* [Voice Cloning Guide](https://medium.com/@zh.milo/qwen3-tts-the-complete-2026-guide)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/audio-and-voice/qwen3-tts.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
