
# Qwen3-TTS Voice Cloning

Qwen3-TTS from Alibaba is a state-of-the-art text-to-speech model that supports **10+ languages** and can clone a voice from as little as 3 seconds of audio. It offers natural-language emotion control ("speak cheerfully", "whisper softly"), streaming with 97 ms latency, and two model sizes (0.6B and 1.7B). Released under Apache 2.0, it is one of the most capable open-source TTS systems.

## Key Features

* **10+ languages**: English, Chinese, Japanese, Korean, French, German, Spanish, and more
* **3-second voice cloning**: Clone any voice from a short audio sample
* **Natural emotion control**: Steer the speaking style with simple text instructions
* **Streaming support**: 97 ms first-token latency, suited to real-time applications
* **Two sizes**: 0.6B (4 GB VRAM) and 1.7B (8 GB VRAM)
* **Fine-tunable**: Base models are available for custom training
* **Apache 2.0 license**: Full commercial use permitted

## Model Variants

| Model                   | Parameters | VRAM | Quality | Speed  | Best for                       |
| ----------------------- | ---------- | ---- | ------- | ------ | ------------------------------ |
| Qwen3-TTS-0.6B-Instruct | 0.6B       | 4 GB | Good    | Fast   | Real time, budget GPUs         |
| Qwen3-TTS-1.7B-Instruct | 1.7B       | 8 GB | Best    | Medium | Production quality             |
| Qwen3-TTS-0.6B-Base     | 0.6B       | 4 GB | n/a     | n/a    | Fine-tuning                    |
| Qwen3-TTS-1.7B-Base     | 1.7B       | 8 GB | n/a     | n/a    | Fine-tuning                    |
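
The VRAM column above can be turned into a simple selection rule. The following is a minimal sketch, not part of Qwen3-TTS: `pick_model_size` is a hypothetical helper, and its thresholds are simply the figures from the table.

```python
# VRAM figures in GB, copied from the model variants table (largest first).
VRAM_REQUIRED_GB = {"1.7B": 8.0, "0.6B": 4.0}


def pick_model_size(available_vram_gb: float) -> str:
    """Return the largest model size whose listed VRAM requirement fits."""
    for size, required in VRAM_REQUIRED_GB.items():
        if available_vram_gb >= required:
            return size
    raise ValueError(
        f"{available_vram_gb} GB is below the smallest listed requirement (4 GB)"
    )


print(pick_model_size(12.0))
print(pick_model_size(6.0))
```

On a rented machine, the available figure could come from `torch.cuda.mem_get_info()`, which reports free and total device memory in bytes.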

## Requirements

| Component | 0.6B          | 1.7B           |
| --------- | ------------- | -------------- |
| GPU       | RTX 3060 6 GB | RTX 3080 10 GB |
| VRAM      | 4 GB          | 8 GB           |
| RAM       | 8 GB          | 16 GB          |
| Disk      | 5 GB          | 10 GB          |
| Python    | 3.10+         | 3.10+          |

**Recommended Clore.ai GPU**: RTX 3060 ($0.15–0.30/day) for 0.6B, RTX 3080 ($0.20–0.50/day) for 1.7B
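
Because the prices above are quoted per day, the cost of a shorter session follows by simple proportion. The sketch below shows that arithmetic. It assumes billing scales linearly with time, which may not match actual Clore.ai billing, and `rental_cost` is a hypothetical helper name.

```python
def rental_cost(rate_per_day_usd: float, hours: float) -> float:
    """Prorate a daily rate to a number of hours (assumes linear billing)."""
    if rate_per_day_usd < 0 or hours < 0:
        raise ValueError("rate and hours must be non-negative")
    return rate_per_day_usd * hours / 24.0


# Price range for an RTX 3060 quoted above: 0.15 to 0.30 USD per day.
low = rental_cost(0.15, hours=6)
high = rental_cost(0.30, hours=6)
print(f"6 hours on an RTX 3060: ${low:.4f} to ${high:.4f}")
```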

## Installation

```bash
pip install transformers torch torchaudio soundfile
```

## Quick Start: Voice Cloning

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the reference voice (3+ seconds of any voice)
reference_audio, sr = torchaudio.load("reference_voice.wav")

# Generate speech that clones this voice
text = "Welcome to Clore.ai, the decentralized GPU rental marketplace."
inputs = processor(
    text=text,
    audio=reference_audio,
    sampling_rate=sr,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)

# Decode and save
audio = processor.decode(output[0])
torchaudio.save("output.wav", audio.unsqueeze(0), 24000)
```

## Emotion Control

```python
# Control the emotion with natural-language instructions
prompts = [
    ("Speak happily and energetically", "Great news! We just launched the new feature!"),
    ("Whisper softly and gently", "Let me tell you a secret about GPU prices..."),
    ("Speak professionally and clearly", "The quarterly results show a 40% increase in revenue."),
    ("Speak excitedly", "You won't believe the benchmark results!"),
]

for style, text in prompts:
    inputs = processor(
        text=text,
        style_prompt=style,
        audio=reference_audio,
        sampling_rate=sr,
        return_tensors="pt"
    ).to("cuda")
    
    output = model.generate(**inputs, max_new_tokens=2048)
    audio = processor.decode(output[0])
    torchaudio.save(f"output_{style[:10]}.wav", audio.unsqueeze(0), 24000)
```
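
The loop above derives file names from the first ten characters of each style prompt, which can contain spaces. One possible alternative is a small slug helper. `style_to_filename` is a hypothetical name, not part of any library, and the sketch keeps only ASCII letters and digits.

```python
import re


def style_to_filename(style: str, max_len: int = 32) -> str:
    """Turn a free-text style prompt into a safe .wav file name."""
    # Lowercase, then collapse every run of characters outside a-z and 0-9
    # (spaces, punctuation, non-ASCII letters) into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", style.lower()).strip("_")
    # Truncate, and drop any underscore left dangling by the cut.
    slug = slug[:max_len].rstrip("_") or "style"
    return f"output_{slug}.wav"


print(style_to_filename("Whisper softly and gently"))
```

Two prompts that share the same first `max_len` characters would still collide, so a counter or hash suffix may be worth adding for larger prompt sets.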

## Multilingual Generation

```python
# Generate in different languages (same voice!)
texts = {
    "en": "Hello, welcome to the GPU marketplace.",
    "zh": "你好，欢迎来到GPU市场。",
    "ja": "こんにちは、GPUマーケットプレイスへようこそ。",
    "ko": "안녕하세요, GPU 마켓플레이스에 오신 것을 환영합니다.",
    "fr": "Bonjour, bienvenue sur le marché GPU.",
    "de": "Hallo, willkommen auf dem GPU-Marktplatz.",
}

for lang, text in texts.items():
    inputs = processor(
        text=text, audio=reference_audio, sampling_rate=sr,
        language=lang, return_tensors="pt"
    ).to("cuda")
    output = model.generate(**inputs, max_new_tokens=2048)
    audio = processor.decode(output[0])
    torchaudio.save(f"output_{lang}.wav", audio.unsqueeze(0), 24000)
```

## Comparison with Other TTS Models

| Feature         | Qwen3-TTS          | Zonos      | Dia         | Kokoro     | XTTS |
| --------------- | ------------------ | ---------- | ----------- | ---------- | ---- |
| Languages       | 10+                | 1 (EN)     | 1 (EN)      | 1 (EN)     | 17   |
| Voice cloning   | 3 s                | 2-30 s     | No          | No         | 6 s  |
| Streaming       | ✅ (97 ms)          | ❌          | ❌           | ❌          | ✅    |
| Emotion control | ✅ Natural-language | ❌          | ✅ Automatic | ❌          | ❌    |
| Multi-speaker   | ❌                  | ❌          | ✅           | ❌          | ❌    |
| Min. VRAM       | 4 GB               | 8 GB       | 8 GB        | 2 GB       | 6 GB |
| License         | Apache 2.0         | Apache 2.0 | Apache 2.0  | Apache 2.0 | AGPL |

## Tips for Clore.ai Users

* **0.6B on an RTX 3060**: The best budget option at $0.15/day, sufficient for most TTS tasks
* **Batch processing**: Generate all audio clips in one session to get the most out of the rental time
* **Cache reference audio**: Keep your voice references on persistent storage
* **Streaming for real time**: Use the streaming API for chatbot and assistant applications
* **Fine-tuning for custom voices**: Rent an RTX 4090 for a few hours to fine-tune the base model on your voice data
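
The batch-processing tip can be sketched as a small job runner. This is an illustration under stated assumptions: `run_batch` is a hypothetical helper, and `synthesize` is a placeholder callable standing in for the generate-and-save steps from the quick start. Clips that already exist in the output directory are skipped, so an interrupted session can resume on persistent storage.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def run_batch(
    jobs: Iterable[Tuple[str, str]],
    synthesize: Callable[[str, Path], None],
    out_dir: str,
) -> List[Path]:
    """Run every (name, text) job in one pass, skipping finished clips."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for name, text in jobs:
        out_path = target / f"{name}.wav"
        if out_path.exists():
            continue  # already produced in an earlier session
        synthesize(text, out_path)  # placeholder: must write out_path
        written.append(out_path)
    return written
```

A call might look like `run_batch([("intro", "Hello."), ("outro", "Goodbye.")], my_synthesize, "clips")`, where `my_synthesize` wraps the model calls and writes the audio file.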

## Troubleshooting

| Problem                  | Solution                                                                    |
| ------------------------ | --------------------------------------------------------------------------- |
| Out of memory with 1.7B  | Switch to 0.6B or load the model with `torch_dtype=torch.float16`           |
| Voice clone sounds wrong | Use 5–10 seconds of clean audio (no background noise)                       |
| Wrong output language    | Pass the `language` parameter explicitly                                    |
| Slow first generation    | Expected: the model is loaded on the first call. Subsequent calls are fast  |
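
The reference-audio advice above can be checked before generating. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The function names are illustrative, and the 5 and 10 second bounds are taken from the table.

```python
import wave


def reference_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path: str, min_s: float = 5.0, max_s: float = 10.0) -> str:
    """Classify a reference clip against the recommended length."""
    duration = reference_duration_seconds(path)
    if duration < min_s:
        return f"too short ({duration:.1f} s): record at least {min_s:.0f} s"
    if duration > max_s:
        return f"longer than needed ({duration:.1f} s): consider trimming"
    return f"ok ({duration:.1f} s)"
```

Calling `check_reference("reference_voice.wav")` before the first generation gives a quick hint about clip length. It says nothing about background noise, which still has to be judged by ear.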

## Further Reading

* [Hugging Face models](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Instruct)
* [Qwen3-TTS documentation](https://qwen.readthedocs.io/)
* [Voice cloning guide](https://medium.com/@zh.milo/qwen3-tts-the-complete-2026-guide)

