# Voxtral TTS

> **Mistral's open-weight text-to-speech model: 4B parameters, 9 languages, zero-shot voice cloning, only 3 GB VRAM.**

| Spec              | Value                                                                              |
| ----------------- | ---------------------------------------------------------------------------------- |
| **Developer**     | Mistral AI                                                                         |
| **Parameters**    | 4 billion                                                                          |
| **Architecture**  | Decoder-only TTS                                                                   |
| **Languages**     | 9 (English, French, German, Spanish, Hindi, Arabic, Portuguese, Italian, Japanese) |
| **License**       | Apache 2.0 (open weights)                                                          |
| **VRAM**          | \~3 GB (FP16)                                                                      |
| **Latency**       | 70 ms for a 10-second output                                                       |
| **Voice cloning** | Zero-shot from a 3-second reference                                                |
| **Release**       | March 26, 2026                                                                     |

## Why Voxtral TTS?

Voxtral TTS is Mistral's open-weight answer to ElevenLabs and OpenAI TTS. Key advantages for Clore.ai users:

* **Runs on any GPU**: at only 3 GB of VRAM, even an RTX 3060 works fine
* **No API fees**: self-hosted means unlimited synthesis at zero marginal cost
* **Data privacy**: audio never leaves your machine
* **Zero-shot cloning**: clone any voice from 3 seconds of reference audio
* **9 languages natively**: including Hindi and Arabic, which competitors often lack
* **Real-time speed**: RTF 0.1–0.2× on an RTX 4070 or better (a 10-second clip in 1–2 seconds)
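
To make the real-time factor concrete: RTF is synthesis time divided by audio duration, so lower is faster. The sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of any Voxtral API.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock synthesis time from a real-time factor (RTF)."""
    return audio_seconds * rtf


def speedup_over_realtime(rtf: float) -> float:
    """Convert an RTF into an 'N times faster than real time' figure."""
    return 1.0 / rtf


# The RTF range quoted above for an RTX 4070 or better.
for rtf in (0.1, 0.2):
    print(
        f"RTF {rtf}: 10 s clip in {synthesis_seconds(10, rtf):.1f} s "
        f"({speedup_over_realtime(rtf):.0f}x real-time)"
    )
```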

## GPU Requirements on Clore.ai

| GPU           | VRAM  | Performance                    | Clore.ai price |
| ------------- | ----- | ------------------------------ | -------------- |
| RTX 3060 12GB | 12 GB | ✅ Good: 3–4× real-time         | From $0.10/day |
| RTX 3090 24GB | 24 GB | ✅ Very good: batch processing  | From $0.30/day |
| RTX 4070 12GB | 12 GB | ✅ Excellent: 5–10× real-time   | From $0.25/day |
| RTX 4090 24GB | 24 GB | ✅ Overkill: sub-second latency | From $0.50/day |

> **Recommendation:** For most use cases, the RTX 3060 12GB ($0.10/day on Clore.ai) is the best fit. Voxtral needs only 3 GB of VRAM, so you can run it alongside other models.

## Quick Start on Clore.ai

### Step 1: Rent a GPU server

1. Go to the [Clore.ai marketplace](https://clore.ai/marketplace)
2. Filter for any GPU with 8+ GB of VRAM
3. Choose a **Docker** deployment
4. Use the image: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel`
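
Once the server is up, it can be worth confirming that the GPU you rented matches the listing. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result; the helper names are illustrative, and the 8 GB threshold simply mirrors the filter in step 2.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[str, float]]:
    """Parse 'name, memory.total' CSV rows (MiB) into (name, GiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so GPU names stay intact.
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, float(mem_mib) / 1024))
    return gpus


def query_gpus() -> list[tuple[str, float]]:
    """Ask nvidia-smi for each GPU's name and total memory."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def meets_vram_filter(gib: float, minimum_gib: float = 8.0) -> bool:
    """True if the card satisfies the VRAM filter used in step 2."""
    return gib >= minimum_gib
```

On the rented machine, `query_gpus()` returns one `(name, GiB)` pair per card, which you can pass through `meets_vram_filter` before installing anything.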

### Step 2: Install dependencies

```bash
# Connect over SSH or the Jupyter terminal
pip install torch torchaudio transformers accelerate

# Install the Voxtral TTS package
pip install voxtral-tts

# Or download the weights directly from HuggingFace
pip install huggingface_hub
huggingface-cli download mistralai/Voxtral-TTS --local-dir ./voxtral-tts
```
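
Before moving on, a quick import check can catch a failed install early. This is a minimal sketch using only the standard library; it assumes `voxtral` is the import name, as in the examples in this guide.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the packages installed above ("voxtral" is assumed
# from the import statements used elsewhere in this guide).
required = ["torch", "torchaudio", "transformers", "accelerate", "voxtral"]
missing = missing_modules(required)
print("All dependencies found" if not missing else f"Missing: {', '.join(missing)}")
```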

### Step 3: Basic text-to-speech

```python
from voxtral import VoxtralTTS

# Initialize the model (weights download automatically, ~6 GB)
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS")
model.to("cuda")

# Basic synthesis
audio = model.synthesize(
    text="Welcome to Clore.ai, the decentralized GPU marketplace.",
    language="en"
)
audio.save("output.wav")
print(f"Generated {audio.duration:.1f}s of audio")
```

### Step 4: Zero-shot voice cloning

```python
# Clone a voice from a 3-second reference
audio = model.synthesize(
    text="This is my cloned voice talking about GPU computing.",
    reference_audio="reference_speaker.wav",  # 3+ seconds
    language="en"
)
audio.save("cloned_output.wav")
```
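
Because cloning depends on a reference clip of at least 3 seconds, it may help to validate the file before calling the model. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files; the helper names are illustrative.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path: str, min_seconds: float = 3.0) -> None:
    """Raise ValueError if the reference clip is shorter than min_seconds."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"{path} is {duration:.2f}s long; need at least {min_seconds:.0f}s"
        )
```

Calling `check_reference("reference_speaker.wav")` before `model.synthesize(...)` fails fast with a clear message instead of producing a poor clone.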

### Step 5: Multilingual synthesis

```python
# Synthesize in all 9 supported languages
languages = {
    "en": "Hello, this is Voxtral speaking in English.",
    "fr": "Bonjour, c'est Voxtral qui parle en français.",
    "de": "Hallo, hier spricht Voxtral auf Deutsch.",
    "es": "Hola, Voxtral hablando en español.",
    "hi": "नमस्ते, यह Voxtral हिंदी में बोल रहा है।",
    "ar": "مرحبا، هذا Voxtral يتحدث بالعربية.",
    "pt": "Olá, aqui é o Voxtral falando em português.",
    "it": "Ciao, qui parla Voxtral in italiano.",
    "ja": "こんにちは、Voxtralが日本語で話しています。",
}

for lang, text in languages.items():
    audio = model.synthesize(text=text, language=lang)
    audio.save(f"voxtral_{lang}.wav")
    print(f"[{lang}] generated {audio.duration:.1f}s")
```
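
Since a wrong `language` value is a common cause of poor output, a small guard in front of `synthesize` can help. This is a sketch: the set below is copied from the nine codes used in the loop above, and `validate_language` is a hypothetical helper, not part of the Voxtral package.

```python
SUPPORTED_LANGUAGES = {"en", "fr", "de", "es", "hi", "ar", "pt", "it", "ja"}


def validate_language(code: str) -> str:
    """Normalize a language code and reject any outside the nine listed above."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return normalized
```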

## Production API Server

Deploy Voxtral as a REST API to integrate it into your applications:

```python
# server.py: FastAPI wrapper for Voxtral TTS
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import StreamingResponse
from voxtral import VoxtralTTS
import io
import soundfile as sf

app = FastAPI(title="Voxtral TTS API")
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").to("cuda")

@app.post("/synthesize")
async def synthesize(
    text: str,
    language: str = "en",
    reference: UploadFile = File(None)
):
    kwargs = {"text": text, "language": language}
    if reference:
        ref_bytes = await reference.read()
        kwargs["reference_audio"] = ref_bytes
    
    audio = model.synthesize(**kwargs)
    
    # Return the audio as a WAV stream
    buffer = io.BytesIO()
    sf.write(buffer, audio.numpy(), samplerate=24000, format="WAV")
    buffer.seek(0)
    
    return StreamingResponse(buffer, media_type="audio/wav")

@app.get("/health")
async def health():
    return {"status": "ok", "model": "voxtral-tts", "languages": 9}
```

```bash
# Run the API server
pip install fastapi uvicorn python-multipart soundfile
uvicorn server:app --host 0.0.0.0 --port 8000

# Test it
curl -X POST "http://localhost:8000/synthesize?text=Hello%20world&language=en" \
  --output hello.wav
```
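
The same request can be made from Python without extra dependencies. The sketch below mirrors the `curl` call using only the standard library and covers the text-only case (no reference upload); the function names are illustrative.

```python
import urllib.parse
import urllib.request


def build_synthesize_url(base_url: str, text: str, language: str = "en") -> str:
    """Build the /synthesize URL with text and language as query parameters."""
    query = urllib.parse.urlencode({"text": text, "language": language})
    return f"{base_url.rstrip('/')}/synthesize?{query}"


def synthesize_to_file(
    base_url: str, text: str, out_path: str, language: str = "en"
) -> None:
    """POST to the server and write the returned WAV bytes to out_path."""
    request = urllib.request.Request(
        build_synthesize_url(base_url, text, language), method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        with open(out_path, "wb") as f:
            f.write(response.read())
```

With the server running, `synthesize_to_file("http://localhost:8000", "Hello world", "hello.wav")` produces the same file as the `curl` test.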

## Docker Deployment

```dockerfile
FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel

WORKDIR /app
RUN pip install voxtral-tts fastapi uvicorn python-multipart soundfile

# Pre-download the model weights
RUN python -c "from voxtral import VoxtralTTS; VoxtralTTS.from_pretrained('mistralai/Voxtral-TTS')"

COPY server.py .
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```

```bash
# Build and run
docker build -t voxtral-tts-api .
docker run --gpus all -p 8000:8000 voxtral-tts-api
```
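
Because `server.py` loads the model when the process starts, the container may take a while before it answers requests. One way to handle this is to poll the `/health` endpoint until it reports `ok`. The sketch below is a hypothetical helper with assumed timeout values, not something shipped with Voxtral.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url: str) -> dict:
    """GET the /health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_healthy(url, timeout_s=300.0, interval_s=5.0, probe=fetch_health):
    """Poll until the probe reports status 'ok'; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe(url).get("status") == "ok":
                return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # the server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

After `docker run`, calling `wait_until_healthy("http://localhost:8000/health")` blocks until the API is ready or five minutes have passed.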

## Voxtral vs. Other TTS Models

| Feature           | Voxtral TTS        | ElevenLabs         | Qwen3-TTS          | Kokoro TTS | Fish Speech         |
| ----------------- | ------------------ | ------------------ | ------------------ | ---------- | ------------------- |
| **Open weights**  | ✅ Apache 2.0       | ❌ API only         | ✅                  | ✅          | ✅                   |
| **VRAM**          | 3 GB               | N/A (cloud)        | 8 GB               | 2 GB       | 4 GB                |
| **Languages**     | 9                  | 30+                | 50+                | 5          | 8                   |
| **Voice cloning** | 3-second reference | 1-second reference | 5-second reference | ❌          | 10-second reference |
| **Latency**       | 70 ms              | \~200 ms           | \~150 ms           | 50 ms      | 100 ms              |
| **Quality**       | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐               | ⭐⭐⭐⭐       | ⭐⭐⭐⭐                |
| **Self-hosted**   | ✅                  | ❌                  | ✅                  | ✅          | ✅                   |

## Batch Processing for Large Projects

```python
from voxtral import VoxtralTTS

model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").to("cuda")

# Process an entire audiobook chapter
paragraphs = [
    "Chapter 1: The Beginning...",
    "It was a dark and stormy night...",
    "The hero pressed on...",
    # ... hundreds of paragraphs
]

def process_paragraph(idx_text):
    """Synthesize one (index, text) pair and save it as a numbered WAV file."""
    idx, text = idx_text
    audio = model.synthesize(text=text, language="en")
    audio.save(f"chapter1_part{idx:04d}.wav")
    return idx

# Sequential processing (the workload is GPU-bound)
for i, text in enumerate(paragraphs):
    process_paragraph((i, text))

print(f"Processed {len(paragraphs)} paragraphs")
```
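
The loop above leaves one WAV file per paragraph. If you want a single file per chapter, the parts can be joined with the standard `wave` module. This sketch assumes `audio.save` writes uncompressed PCM WAV files that all share the same sample rate, channel count, and sample width; `concatenate_wavs` and the output filename are illustrative.

```python
import glob
import wave


def concatenate_wavs(part_paths, out_path):
    """Join PCM WAV files with identical audio parameters into one file."""
    if not part_paths:
        raise ValueError("no input files")
    with wave.open(part_paths[0], "rb") as first:
        expected = (first.getnchannels(), first.getsampwidth(), first.getframerate())
    with wave.open(out_path, "wb") as out:
        out.setnchannels(expected[0])
        out.setsampwidth(expected[1])
        out.setframerate(expected[2])
        for path in part_paths:
            with wave.open(path, "rb") as part:
                actual = (part.getnchannels(), part.getsampwidth(), part.getframerate())
                if actual != expected:
                    raise ValueError(f"{path} has different audio parameters")
                out.writeframes(part.readframes(part.getnframes()))


# Zero-padded indices (part0000, part0001, ...) sort correctly as strings.
parts = sorted(glob.glob("chapter1_part*.wav"))
if parts:
    concatenate_wavs(parts, "chapter1_full.wav")
```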

## Streaming Mode for Real-Time Applications

```python
from voxtral import VoxtralTTS

# Load the model once at startup, not on every request
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").to("cuda")

# Streaming synthesis for live applications
async def stream_synthesis(text: str, language: str = "en"):
    """Generate audio in streaming chunks for low-latency playback."""
    async for chunk in model.synthesize_stream(
        text=text,
        language=language,
        chunk_size=4096  # ~170 ms per chunk at 24 kHz
    ):
        yield chunk.numpy().tobytes()
```
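
The `~170 ms` figure in the comment follows from dividing the chunk size by the sample rate. The sketch below shows that arithmetic for a few sizes, assuming `chunk_size` counts audio samples at the 24 kHz rate used elsewhere in this guide.

```python
def chunk_duration_ms(chunk_size: int, sample_rate: int = 24_000) -> float:
    """Duration of one chunk of `chunk_size` samples, in milliseconds."""
    return 1000 * chunk_size / sample_rate


# Smaller chunks reach the listener sooner but mean more chunks per second.
for size in (1024, 2048, 4096, 8192):
    print(f"{size:>5} samples -> {chunk_duration_ms(size):6.1f} ms")
```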

## Troubleshooting

| Problem                           | Solution                                                                                   |
| --------------------------------- | ------------------------------------------------------------------------------------------ |
| OOM on small GPUs                 | Use `model.half()` for FP16 (VRAM drops to about 1.5 GB)                                   |
| Slow first inference              | Expected: the model compiles CUDA kernels on the first run (\~30s)                         |
| Poor quality for a given language | Make sure the `language` parameter is correct; some languages need longer reference audio |
| Audio artifacts                   | Increase the `reference_audio` length to 5–10 seconds for better voice cloning             |
| Model download fails              | Set the `HF_TOKEN` env variable for gated model access                                     |

## Cost Analysis: Voxtral on Clore.ai vs. Cloud TTS

| Service                 | 1M characters/month | Notes                                           |
| ----------------------- | ------------------- | ----------------------------------------------- |
| ElevenLabs Pro          | $99/month           | 500K characters included, overage charges apply |
| OpenAI TTS              | $15/month           | $15 per 1M characters                           |
| Google Cloud TTS        | $16/month           | Standard voices                                 |
| **Voxtral on Clore.ai** | **$3–15/month**     | RTX 3060 @ $0.10–0.50/day, unlimited characters |

> **Bottom line:** At 1M characters per month, self-hosting Voxtral on Clore.ai costs $3–15/month. That is roughly 6–30× less than ElevenLabs Pro and up to about 5× less than OpenAI or Google Cloud TTS, with no character limits and full data privacy.
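
The monthly rental figure comes from multiplying the daily rate by the number of days the server runs. The sketch below reproduces that arithmetic from the prices in the table; it assumes a 30-day month with the server rented around the clock and ignores any overage charges on the cloud plans.

```python
def monthly_cost(daily_rate_usd: float, days: int = 30) -> float:
    """Flat monthly rental cost for a server billed per day."""
    return daily_rate_usd * days


# Prices at 1M characters/month, taken from the table above.
cloud_monthly = {"ElevenLabs Pro": 99.0, "OpenAI TTS": 15.0, "Google Cloud TTS": 16.0}

low, high = monthly_cost(0.10), monthly_cost(0.50)
print(f"Clore.ai rental: ${low:.0f}-${high:.0f}/month")
for service, price in cloud_monthly.items():
    print(f"{service}: {price / high:.1f}x to {price / low:.1f}x the rental cost")
```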

## Further Reading

* [Voxtral TTS on HuggingFace](https://huggingface.co/mistralai/Voxtral-TTS)
* [Mistral AI blog: Voxtral announcement](https://mistral.ai/news/voxtral-tts)
* [Compare TTS models on Clore.ai](/guides/guides_v2-hi/comparisons/tts-comparison.md)
* [More audio and voice guides](/guides/guides_v2-hi/audio-and-voice/audio-voice.md)

***

*Last updated: March 30, 2026*

