# XTTS (Coqui)

Generate natural speech with voice cloning using Coqui XTTS.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is XTTS?

XTTS (by Coqui) offers:

* High-quality multilingual text-to-speech
* Voice cloning from as little as 6 seconds of reference audio
* 17 supported languages
* Emotion and style transfer from the reference clip
* Streaming inference with low latency

## Requirements

| Mode           | VRAM | Recommended |
| -------------- | ---- | ----------- |
| Inference      | 4GB  | RTX 3060    |
| Fast Inference | 6GB  | RTX 3080    |
| Streaming      | 4GB  | RTX 3060    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
pip install TTS && \
COQUI_TOS_AGREED=1 tts-server \
  --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
  --port 8000 --use_cuda true
```

`COQUI_TOS_AGREED=1` accepts the model license non-interactively, and `--port 8000` matches the HTTP port configured above (`tts-server` defaults to 5002).

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
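
Once the server is reachable, you can fetch audio over HTTP. A minimal client sketch, assuming the default `tts-server` `/api/tts` endpoint (parameter names can vary by version, and XTTS may additionally need a `speaker_id` or `style_wav`; `synthesize` is a hypothetical helper name):

```python
import requests

def synthesize(base_url, text, language="en"):
    """Fetch WAV bytes from the tts-server /api/tts endpoint."""
    resp = requests.get(
        f"{base_url}/api/tts",
        params={"text": text, "language_id": language},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.content

# Example (replace with your own http_pub URL):
# audio = synthesize("https://abc123.clorecloud.net", "Hello from CLORE.")
# open("output.wav", "wb").write(audio)
```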

## Installation

```bash
pip install TTS
```

At the time of writing, the `TTS` package targets Python 3.9-3.11; use a matching virtual environment on newer interpreters.

## Basic Usage

### Simple TTS

```python
from TTS.api import TTS

# Load XTTS v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Generate speech
tts.tts_to_file(
    text="Hello, this is a test of the XTTS text to speech system.",
    file_path="output.wav",
    language="en"
)
```

### Voice Cloning

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone voice from reference audio (6+ seconds)
tts.tts_to_file(
    text="This is my cloned voice speaking new text.",
    file_path="cloned_output.wav",
    speaker_wav="reference_voice.wav",
    language="en"
)
```

## Multiple Languages

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# English
tts.tts_to_file(
    text="Hello, how are you today?",
    file_path="english.wav",
    speaker_wav="voice.wav",
    language="en"
)

# Spanish
tts.tts_to_file(
    text="Hola, ¿cómo estás hoy?",
    file_path="spanish.wav",
    speaker_wav="voice.wav",
    language="es"
)

# German
tts.tts_to_file(
    text="Hallo, wie geht es dir heute?",
    file_path="german.wav",
    speaker_wav="voice.wav",
    language="de"
)

# Russian
tts.tts_to_file(
    text="Привет, как дела?",
    file_path="russian.wav",
    speaker_wav="voice.wav",
    language="ru"
)
```

### Supported Languages

| Code  | Language   |
| ----- | ---------- |
| en    | English    |
| es    | Spanish    |
| fr    | French     |
| de    | German     |
| it    | Italian    |
| pt    | Portuguese |
| pl    | Polish     |
| tr    | Turkish    |
| ru    | Russian    |
| nl    | Dutch      |
| cs    | Czech      |
| ar    | Arabic     |
| zh-cn | Chinese    |
| ja    | Japanese   |
| hu    | Hungarian  |
| ko    | Korean     |
| hi    | Hindi      |

## Streaming TTS

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch
import sounddevice as sd

# Load model
config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Get speaker embedding
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav"
)

# Stream generation
chunks = model.inference_stream(
    text="This is a streaming test of the XTTS system.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20
)

# Play in real-time
for chunk in chunks:
    audio = chunk.cpu().numpy()
    sd.play(audio, samplerate=24000)
    sd.wait()
```

## Gradio Interface

```python
import gradio as gr
from TTS.api import TTS
import tempfile

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def generate_speech(text, reference_audio, language):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        if reference_audio:
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                speaker_wav=reference_audio,
                language=language
            )
        else:
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                language=language
            )
        return f.name

demo = gr.Interface(
    fn=generate_speech,
    inputs=[
        gr.Textbox(label="Text to speak", lines=5),
        gr.Audio(type="filepath", label="Reference Voice (optional)"),
        gr.Dropdown(
            ["en", "es", "fr", "de", "it", "pt", "ru", "zh-cn", "ja"],
            value="en",
            label="Language"
        )
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="XTTS Voice Cloning"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
from TTS.api import TTS
import tempfile
import os

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    language: str = Form(default="en"),
    speaker: UploadFile = File(default=None)
):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_file:
        if speaker:
            # Save uploaded reference
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as ref_file:
                ref_file.write(await speaker.read())
                ref_path = ref_file.name

            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                speaker_wav=ref_path,
                language=language
            )
            os.unlink(ref_path)
        else:
            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                language=language
            )

        return FileResponse(out_file.name, media_type="audio/wav")

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
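
A client for the endpoint above might look like the following sketch (`request_speech` is a hypothetical helper name; the URL assumes the server runs locally on port 8000):

```python
import requests

def request_speech(base_url, text, language="en", speaker_path=None):
    """POST text (and an optional reference voice) to /synthesize; return WAV bytes."""
    files = {"speaker": open(speaker_path, "rb")} if speaker_path else None
    resp = requests.post(
        f"{base_url}/synthesize",
        data={"text": text, "language": language},
        files=files,
        timeout=300,
    )
    resp.raise_for_status()
    return resp.content

# Example:
# audio = request_speech("http://localhost:8000", "Hello", speaker_path="voice.wav")
# open("result.wav", "wb").write(audio)
```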

## Batch Processing

```python
from TTS.api import TTS
import os

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

texts = [
    "Welcome to our platform.",
    "Please review the following information.",
    "Thank you for your attention.",
    "Have a great day!"
]

reference_voice = "speaker.wav"
output_dir = "./audio_files"
os.makedirs(output_dir, exist_ok=True)

for i, text in enumerate(texts):
    output_path = f"{output_dir}/audio_{i:03d}.wav"

    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_voice,
        language="en"
    )

    print(f"Generated: {output_path}")
```

## Multi-Reference Voice Cloning

For higher-quality cloning, condition the model on several reference samples:

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and weights (same layout as in the streaming example)
config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Use multiple reference samples for better quality
reference_files = [
    "sample1.wav",
    "sample2.wav",
    "sample3.wav"
]

# Extract speaker embedding from multiple samples
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=reference_files
)

# Generate with the combined conditioning and save the result
output = model.inference(
    text="High quality cloned speech.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding
)

import soundfile as sf
sf.write("multi_ref_output.wav", output["wav"], 24000)  # XTTS outputs 24 kHz audio
```

## Audio Preprocessing

```python
import librosa
import soundfile as sf

def prepare_reference(input_path, output_path, target_sr=22050):
    # Load and resample
    audio, sr = librosa.load(input_path, sr=target_sr)

    # Trim silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Normalize
    audio = librosa.util.normalize(audio)

    # Save
    sf.write(output_path, audio, target_sr)

# Prepare reference audio
prepare_reference("raw_voice.wav", "clean_voice.wav")
```

## Performance

| Mode      | GPU      | Speed           |
| --------- | -------- | --------------- |
| Standard  | RTX 3060 | \~0.5x realtime |
| Standard  | RTX 4090 | \~2x realtime   |
| Streaming | RTX 3060 | \~1x realtime   |
| Streaming | RTX 4090 | \~3x realtime   |

## Quality Tips

* Use 6-15 seconds of clean reference audio
* Avoid background noise in reference
* Match language of text and reference
* Use multiple reference samples for better results

## Troubleshooting

### Poor Voice Quality

* Clean reference audio
* Longer reference (10+ seconds)
* Match speaking style

### Wrong Language Pronunciation

* Ensure correct language code
* Use native speaker reference

### Slow Generation

* Enable GPU inference
* Use streaming mode
* Reduce text length per call
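
To confirm GPU inference is actually in use, a quick device check before loading the model:

```python
import torch

# CPU fallback makes XTTS many times slower; verify CUDA is visible first
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Inference device: {device}")
```

Pass the result to `TTS(...).to(device)` so the fallback is explicit rather than silent.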

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Bark TTS](https://docs.clore.ai/guides/audio-and-voice/bark-tts) - Expressive TTS
* [SadTalker](https://docs.clore.ai/guides/talking-heads/sadtalker) - Talking heads
* [RVC Voice Clone](https://docs.clore.ai/guides/audio-and-voice/rvc-voice-clone) - Voice conversion
