# Voxtral TTS

> **Mistral's open-weight text-to-speech model: 4B parameters, 9 languages, zero-shot voice cloning, only 3 GB VRAM.**

| Spec              | Value                                                                              |
| ----------------- | ---------------------------------------------------------------------------------- |
| **Developer**     | Mistral AI                                                                         |
| **Parameters**    | 4 billion                                                                          |
| **Architecture**  | Decoder-only TTS                                                                   |
| **Languages**     | 9 (English, French, German, Spanish, Hindi, Arabic, Portuguese, Italian, Japanese) |
| **License**       | Apache 2.0 (open weights)                                                          |
| **VRAM**          | \~3 GB (FP16)                                                                      |
| **Latency**       | \~70 ms to first audio; a 10-second clip renders in 1–2 s                          |
| **Voice cloning** | Zero-shot from 3-second reference                                                  |
| **Release**       | March 26, 2026                                                                     |

## Why Voxtral TTS?

Voxtral TTS is Mistral's open-weight answer to ElevenLabs and OpenAI TTS. Key advantages for Clore.ai users:

* **Runs on any GPU** — only 3 GB of VRAM needed, so even a budget RTX 3060 runs it with plenty of headroom
* **No API fees** — self-hosted = unlimited synthesis at zero marginal cost
* **Data privacy** — audio never leaves your machine
* **Zero-shot cloning** — clone any voice from 3 seconds of reference audio
* **9 languages natively** — including Hindi and Arabic, often missing from competitors
* **Real-time speed** — RTF 0.1–0.2× on RTX 4070+ (10-second clip in 1–2 seconds)

## GPU Requirements on Clore.ai

| GPU           | VRAM  | Performance                     | Clore.ai Price |
| ------------- | ----- | ------------------------------- | -------------- |
| RTX 3060 12GB | 12 GB | ✅ Good — 3–4× real-time         | from $0.10/day |
| RTX 3090 24GB | 24 GB | ✅ Great — batch processing      | from $0.30/day |
| RTX 4070 12GB | 12 GB | ✅ Excellent — 5–10× real-time   | from $0.25/day |
| RTX 4090 24GB | 24 GB | ✅ Overkill — sub-second latency | from $0.50/day |

> **Recommendation:** An RTX 3060 12GB ($0.10/day on Clore.ai) is the sweet spot for most use cases. Voxtral only needs 3 GB VRAM, so you can run it alongside other models.

## Quick Start on Clore.ai

### Step 1: Rent a GPU Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for any GPU with 8+ GB VRAM (generous headroom — Voxtral itself needs only \~3 GB)
3. Select a **Docker** deployment
4. Use image: `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel`

### Step 2: Install Dependencies

```bash
# Connect via SSH or Jupyter terminal
pip install torch torchaudio transformers accelerate

# Install Voxtral TTS package
pip install voxtral-tts

# Or use HuggingFace directly
pip install huggingface_hub
huggingface-cli download mistralai/Voxtral-TTS --local-dir ./voxtral-tts
```
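
Before pulling the model, it's worth confirming the container actually sees a GPU. A quick sanity check with plain PyTorch (nothing Voxtral-specific):

```python
# Sanity check: confirm CUDA is visible before loading the model
import torch

assert torch.cuda.is_available(), "No CUDA device found: check your instance"
free, total = torch.cuda.mem_get_info()  # bytes
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM free: {free / 1e9:.1f} / {total / 1e9:.1f} GB")
```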

### Step 3: Basic Text-to-Speech

```python
from voxtral import VoxtralTTS

# Initialize model (auto-downloads weights ~6 GB)
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS")
model.to("cuda")

# Basic synthesis
audio = model.synthesize(
    text="Welcome to Clore.ai — the decentralized GPU marketplace.",
    language="en"
)
audio.save("output.wav")
print(f"Generated {audio.duration:.1f}s of audio")
```
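
You can verify the advertised real-time factor on your own instance. A minimal timing sketch reusing the `model` from above (the first call compiles CUDA kernels, so warm up once before measuring):

```python
import time

# Warm-up: the first call compiles CUDA kernels (~30 s, see Troubleshooting)
model.synthesize(text="Warm-up.", language="en")

start = time.perf_counter()
audio = model.synthesize(
    text="Clore.ai connects GPU owners with people who need compute.",
    language="en",
)
elapsed = time.perf_counter() - start

# RTF = wall-clock time / audio duration; 0.2 means 5x faster than real time
print(f"RTF: {elapsed / audio.duration:.2f}")
```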

### Step 4: Zero-Shot Voice Cloning

```python
# Clone a voice from 3-second reference
audio = model.synthesize(
    text="This is my cloned voice speaking about GPU computing.",
    reference_audio="reference_speaker.wav",  # 3+ seconds
    language="en"
)
audio.save("cloned_output.wav")
```
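
Cloning quality tracks reference quality. Here is a small torchaudio sketch for prepping a clip: downmix to mono, resample, and check the length. The 24 kHz rate matches the output rate used in this guide; whether the model resamples references internally is an assumption, so normalizing up front is the safe route:

```python
# Prepare a reference clip: mono, 24 kHz, at least 3 seconds long
import torchaudio

waveform, sr = torchaudio.load("raw_reference.wav")
waveform = waveform.mean(dim=0, keepdim=True)                   # downmix to mono
waveform = torchaudio.functional.resample(waveform, sr, 24000)  # resample to 24 kHz

duration = waveform.shape[1] / 24000
assert duration >= 3.0, f"Reference too short: {duration:.1f}s (need 3s+)"

torchaudio.save("reference_speaker.wav", waveform, 24000)
```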

### Step 5: Multi-Language Synthesis

```python
# Synthesize in 9 supported languages
languages = {
    "en": "Hello, this is Voxtral speaking in English.",
    "fr": "Bonjour, c'est Voxtral qui parle en français.",
    "de": "Hallo, hier spricht Voxtral auf Deutsch.",
    "es": "Hola, Voxtral hablando en español.",
    "hi": "नमस्ते, यह Voxtral हिंदी में बोल रहा है।",
    "ar": "مرحبا، هذا Voxtral يتحدث بالعربية.",
    "pt": "Olá, aqui é o Voxtral falando em português.",
    "it": "Ciao, qui parla Voxtral in italiano.",
    "ja": "こんにちは、Voxtralが日本語で話しています。",
}

for lang, text in languages.items():
    audio = model.synthesize(text=text, language=lang)
    audio.save(f"voxtral_{lang}.wav")
    print(f"[{lang}] Generated {audio.duration:.1f}s")
```

## Production API Server

Deploy Voxtral as a REST API for integration into your applications:

```python
# server.py — FastAPI wrapper for Voxtral TTS
from typing import Optional

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import StreamingResponse
from voxtral import VoxtralTTS
import io
import soundfile as sf

app = FastAPI(title="Voxtral TTS API")
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").to("cuda")

@app.post("/synthesize")
async def synthesize(
    text: str,
    language: str = "en",
    reference: Optional[UploadFile] = File(None)
):
    kwargs = {"text": text, "language": language}
    if reference:
        ref_bytes = await reference.read()
        kwargs["reference_audio"] = ref_bytes
    
    audio = model.synthesize(**kwargs)
    
    # Return as WAV stream
    buffer = io.BytesIO()
    sf.write(buffer, audio.numpy(), samplerate=24000, format="WAV")
    buffer.seek(0)
    
    return StreamingResponse(buffer, media_type="audio/wav")

@app.get("/health")
async def health():
    return {"status": "ok", "model": "voxtral-tts", "languages": 9}
```

```bash
# Run the API server
pip install fastapi uvicorn python-multipart soundfile
uvicorn server:app --host 0.0.0.0 --port 8000

# Test it
curl -X POST "http://localhost:8000/synthesize?text=Hello%20world&language=en" \
  --output hello.wav
```
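
The same endpoint is easy to call from Python. A sketch with `requests`, including the optional voice-cloning upload (the multipart field name `reference` matches the FastAPI parameter above):

```python
# client.py: call the Voxtral API from Python
import requests

API = "http://localhost:8000/synthesize"

# Plain synthesis: text and language travel in the query string
resp = requests.post(API, params={"text": "Hello from the API", "language": "en"})
resp.raise_for_status()
with open("api_output.wav", "wb") as f:
    f.write(resp.content)

# Voice cloning: upload the reference clip as multipart form data
with open("reference_speaker.wav", "rb") as ref:
    resp = requests.post(
        API,
        params={"text": "Cloned voice via the API", "language": "en"},
        files={"reference": ("reference_speaker.wav", ref, "audio/wav")},
    )
resp.raise_for_status()
with open("api_cloned.wav", "wb") as f:
    f.write(resp.content)
```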

## Docker Deployment

```dockerfile
FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel

WORKDIR /app
RUN pip install voxtral-tts fastapi uvicorn python-multipart soundfile

# Pre-download model weights
RUN python -c "from voxtral import VoxtralTTS; VoxtralTTS.from_pretrained('mistralai/Voxtral-TTS')"

COPY server.py .
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```

```bash
# Build and run
docker build -t voxtral-tts-api .
docker run --gpus all -p 8000:8000 voxtral-tts-api
```
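
If you would rather not bake several GB of weights into the image, one alternative (assuming you drop the pre-download `RUN` line from the Dockerfile) is to mount the Hugging Face cache from the host, so weights download once and persist across container restarts:

```bash
# Alternative: keep the image slim and cache weights on the host instead
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  voxtral-tts-api
```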

## Voxtral vs Other TTS Models

| Feature           | Voxtral TTS  | ElevenLabs  | Qwen3-TTS | Kokoro TTS | Fish Speech |
| ----------------- | ------------ | ----------- | --------- | ---------- | ----------- |
| **Open weights**  | ✅ Apache 2.0 | ❌ API only  | ✅         | ✅          | ✅           |
| **VRAM**          | 3 GB         | N/A (cloud) | 8 GB      | 2 GB       | 4 GB        |
| **Languages**     | 9            | 30+         | 50+       | 5          | 8           |
| **Voice cloning** | 3s ref       | 1s ref      | 5s ref    | ❌          | 10s ref     |
| **Latency**       | 70 ms        | \~200 ms    | \~150 ms  | 50 ms      | 100 ms      |
| **Quality**       | ⭐⭐⭐⭐⭐        | ⭐⭐⭐⭐⭐       | ⭐⭐⭐⭐      | ⭐⭐⭐⭐       | ⭐⭐⭐⭐        |
| **Self-hosted**   | ✅            | ❌           | ✅         | ✅          | ✅           |

## Batch Processing for Large Projects

```python
from voxtral import VoxtralTTS

model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").to("cuda")

# Process an entire audiobook chapter
paragraphs = [
    "Chapter 1: The Beginning...",
    "It was a dark and stormy night...",
    "The protagonist stepped forward...",
    # ... hundreds of paragraphs
]

def process_paragraph(idx: int, text: str) -> None:
    audio = model.synthesize(text=text, language="en")
    audio.save(f"chapter1_part{idx:04d}.wav")

# Synthesis is GPU-bound, so a sequential loop already saturates the card;
# threads or processes would add overhead without a speedup on one GPU
for i, text in enumerate(paragraphs):
    process_paragraph(i, text)

print(f"Processed {len(paragraphs)} paragraphs")
```
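
Once all parts are rendered, they can be stitched into a single chapter file with `soundfile` (already installed for the API server). This assumes every part shares the same sample rate, which holds when they all come from one model:

```python
# Stitch the rendered parts into one chapter-length WAV
import numpy as np
import soundfile as sf

parts = [sf.read(f"chapter1_part{i:04d}.wav") for i in range(len(paragraphs))]
samplerate = parts[0][1]
chapter = np.concatenate([data for data, _ in parts])

sf.write("chapter1_full.wav", chapter, samplerate)
print(f"Chapter length: {len(chapter) / samplerate / 60:.1f} minutes")
```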

## Streaming Mode for Real-Time Applications

```python
# Streaming synthesis for live applications
from voxtral import VoxtralTTS

# Load once at startup; reloading inside the generator would add seconds
# of latency to every request
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").to("cuda")

async def stream_synthesis(text: str, language: str = "en"):
    """Generate audio in streaming chunks for low-latency playback."""
    async for chunk in model.synthesize_stream(
        text=text,
        language=language,
        chunk_size=4096  # ~170 ms per chunk at 24 kHz
    ):
        yield chunk.numpy().tobytes()
```
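
To serve this over HTTP, the async generator plugs directly into FastAPI's `StreamingResponse`. A sketch that reuses `stream_synthesis` from above; the chunks are raw PCM bytes, so framing and playback are the client's job:

```python
# Serve the stream over HTTP (raw PCM chunks; client handles playback)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
async def stream(text: str, language: str = "en"):
    # stream_synthesis is the async generator defined above
    return StreamingResponse(
        stream_synthesis(text, language),
        media_type="application/octet-stream",
    )
```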

## Troubleshooting

| Issue                       | Solution                                                                        |
| --------------------------- | ------------------------------------------------------------------------------- |
| OOM on small GPU            | Load in FP16 with `model.half()`: \~3 GB instead of \~6 GB in FP32 (see below)  |
| Slow first inference        | Normal — model compiles CUDA kernels on first run (\~30s)                       |
| Poor quality for language X | Ensure correct `language` parameter; some languages need longer reference audio |
| Audio artifacts             | Increase `reference_audio` length to 5–10s for better voice cloning             |
| Model download fails        | Set `HF_TOKEN` env variable for gated model access                              |
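
For the OOM case, the FP16 cast is a one-liner at load time:

```python
from voxtral import VoxtralTTS

# Cast weights to FP16 before moving them to the GPU
model = VoxtralTTS.from_pretrained("mistralai/Voxtral-TTS").half().to("cuda")
```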

## Cost Analysis: Voxtral on Clore.ai vs Cloud TTS

| Service                 | 1M characters/month | Notes                                      |
| ----------------------- | ------------------- | ------------------------------------------ |
| ElevenLabs Pro          | $99/mo              | 500K chars included, overage fees          |
| OpenAI TTS              | $15/mo              | $15 per 1M characters                      |
| Google Cloud TTS        | $16/mo              | Standard voices                            |
| **Voxtral on Clore.ai** | **$3–15/mo**        | RTX 3060 @ $0.10–0.50/day, unlimited chars |

> **Bottom line:** Self-hosting Voxtral on Clore.ai costs $3–15/month no matter how much you synthesize — roughly 6–30× cheaper than ElevenLabs at the 1M-character mark, and the gap widens with volume since there are no per-character fees. You also keep full data privacy.

## Further Reading

* [Voxtral TTS on HuggingFace](https://huggingface.co/mistralai/Voxtral-TTS)
* [Mistral AI Blog — Voxtral Announcement](https://mistral.ai/news/voxtral-tts)
* [Compare TTS models on Clore.ai](https://docs.clore.ai/guides/comparisons/tts-comparison)
* [Other Audio & Voice guides](https://docs.clore.ai/guides/audio-and-voice/audio-voice)

***

*Last updated: March 30, 2026*

