# XTTS (Coqui)

Generate natural-sounding speech with voice cloning using Coqui XTTS.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set ports (TCP for SSH, HTTP for web interfaces)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Access your server

* Find the connection details in **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is XTTS?

XTTS (by Coqui) offers:

* High-quality text-to-speech
* Voice cloning from as little as 6 seconds of audio
* 17 supported languages
* Emotion and speaking-style transfer from the reference audio
* Streaming support

## Requirements

| Mode           | VRAM | Recommended GPU |
| -------------- | ---- | --------------- |
| Inference      | 4 GB | RTX 3060        |
| Fast inference | 6 GB | RTX 3080        |
| Streaming      | 4 GB | RTX 3060        |

## Quick deploy

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

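**Command:**

```bash
# COQUI_TOS_AGREED=1 accepts the Coqui model license without an interactive
# prompt; review the license terms before setting it.
# --port 8000 matches the HTTP port exposed above (tts-server defaults to 5002).
export COQUI_TOS_AGREED=1 && \
pip install TTS && \
tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2 --port 8000
```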

## Accessing your service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Find the `http_pub` URL (for example, `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
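
As an illustration of swapping `localhost` for the public URL, the following minimal sketch builds a request URL for the `/api/tts` endpoint served by `tts-server`. The host `abc123.clorecloud.net` is the placeholder from the steps above. The `speaker_id` and `language_id` query parameters and the speaker name `Ana Florence` are assumptions that you should verify against your installed TTS version.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(http_pub, text, language="en", speaker=None):
    """Build a tts-server request URL for a public CLORE.AI host."""
    params = {"text": text, "language_id": language}
    if speaker:
        params["speaker_id"] = speaker  # assumed parameter name
    return f"https://{http_pub}/api/tts?{urlencode(params)}"


def fetch_wav(url, out_path="output.wav", timeout=120):
    """Download the synthesized audio to a local WAV file."""
    with urlopen(url, timeout=timeout) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path


if __name__ == "__main__":
    url = build_tts_url(
        "abc123.clorecloud.net",  # placeholder host
        "Hello from XTTS",
        speaker="Ana Florence",   # assumed built-in speaker name
    )
    print(url)
    # fetch_wav(url)  # uncomment once your order is running
```

Keeping URL construction separate from the download makes it easy to inspect the request before sending it to a running order.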

## Installation

```bash
pip install TTS
```

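## Basic usage

### Simple TTS

```python
from TTS.api import TTS

# Load XTTS v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Generate speech. XTTS v2 is a multi-speaker model, so pass either a
# built-in speaker name or a reference clip via speaker_wav.
tts.tts_to_file(
    text="Hello, this is a test of the XTTS text to speech system.",
    file_path="output.wav",
    speaker="Ana Florence",  # built-in voice; list the names with tts.speakers
    language="en"
)
```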

### Voice cloning

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone a voice from reference audio (6+ seconds)
tts.tts_to_file(
    text="This is my cloned voice speaking new text.",
    file_path="cloned_output.wav",
    speaker_wav="reference_voice.wav",
    language="en"
)
```

## Multiple languages

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# English
tts.tts_to_file(
    text="Hello, how are you today?",
    file_path="english.wav",
    speaker_wav="voice.wav",
    language="en"
)

# Spanish
tts.tts_to_file(
    text="Hola, ¿cómo estás hoy?",
    file_path="spanish.wav",
    speaker_wav="voice.wav",
    language="es"
)

# German
tts.tts_to_file(
    text="Hallo, wie geht es dir heute?",
    file_path="german.wav",
    speaker_wav="voice.wav",
    language="de"
)

# Russian
tts.tts_to_file(
    text="Привет, как дела?",
    file_path="russian.wav",
    speaker_wav="voice.wav",
    language="ru"
)
```

### Supported languages

| Code  | Language   |
| ----- | ---------- |
| en    | English    |
| es    | Spanish    |
| fr    | French     |
| de    | German     |
| it    | Italian    |
| pt    | Portuguese |
| pl    | Polish     |
| tr    | Turkish    |
| ru    | Russian    |
| nl    | Dutch      |
| cs    | Czech      |
| ar    | Arabic     |
| zh-cn | Chinese    |
| ja    | Japanese   |
| hu    | Hungarian  |
| ko    | Korean     |
| hi    | Hindi      |
## Streaming TTS

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch
import sounddevice as sd

# Load the model
config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Compute the speaker conditioning from the reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav"
)

# Streaming generation: yields audio chunks as they are produced
chunks = model.inference_stream(
    text="This is a streaming test of the XTTS system.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20
)

# Play each chunk as it arrives (requires a local audio output device)
for chunk in chunks:
    audio = chunk.squeeze().cpu().numpy()
    sd.play(audio, samplerate=24000)
    sd.wait()
```

## Gradio interface

```python
import gradio as gr
from TTS.api import TTS
import tempfile

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Built-in XTTS v2 voice used when no reference clip is provided
DEFAULT_SPEAKER = "Ana Florence"

def generate_speech(text, reference_audio, language):
    # delete=False keeps the file on disk so Gradio can serve it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        if reference_audio:
            # Clone the uploaded voice
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                speaker_wav=reference_audio,
                language=language
            )
        else:
            # XTTS v2 needs a speaker, so fall back to a built-in voice
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                speaker=DEFAULT_SPEAKER,
                language=language
            )
        return f.name

demo = gr.Interface(
    fn=generate_speech,
    inputs=[
        gr.Textbox(label="Text to speak", lines=5),
        gr.Audio(type="filepath", label="Reference Voice (optional)"),
        gr.Dropdown(
            ["en", "es", "fr", "de", "it", "pt", "ru", "zh-cn", "ja"],
            value="en",
            label="Language"
        )
    ],
    outputs=gr.Audio(label="Generated Voice"),
    title="XTTS Voice Cloning"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API server

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
from TTS.api import TTS
import tempfile
import os

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Built-in XTTS v2 voice used when no reference clip is uploaded
DEFAULT_SPEAKER = "Ana Florence"

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    language: str = Form(default="en"),
    speaker: UploadFile = File(default=None)
):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_file:
        if speaker:
            # Save the uploaded reference clip to a temporary file
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as ref_file:
                ref_file.write(await speaker.read())
                ref_path = ref_file.name

            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                speaker_wav=ref_path,
                language=language
            )
            os.unlink(ref_path)  # remove the temporary reference clip
        else:
            # XTTS v2 needs a speaker, so fall back to a built-in voice
            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                speaker=DEFAULT_SPEAKER,
                language=language
            )

        return FileResponse(out_file.name, media_type="audio/wav")

# Save as server.py and run: uvicorn server:app --host 0.0.0.0 --port 8000
```
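
A client for the `/synthesize` endpoint might look like the sketch below. It assumes the third-party `requests` package is installed and that the server is reachable at your `http_pub` host. The names `endpoint_url` and `synthesize_remote` are hypothetical, and the field names `text`, `language`, and `speaker` mirror the form parameters defined in the server code.

```python
def endpoint_url(base_url):
    """Join the public base URL with the /synthesize route."""
    return f"{base_url.rstrip('/')}/synthesize"


def synthesize_remote(base_url, text, language="en",
                      speaker_path=None, out_path="speech.wav"):
    """POST text (and an optional reference clip) and save the returned WAV."""
    import requests  # third-party: pip install requests

    data = {"text": text, "language": language}
    if speaker_path:
        # Upload the reference clip as the "speaker" multipart field
        with open(speaker_path, "rb") as ref:
            resp = requests.post(endpoint_url(base_url), data=data,
                                 files={"speaker": ref}, timeout=300)
    else:
        resp = requests.post(endpoint_url(base_url), data=data, timeout=300)
    resp.raise_for_status()

    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path


if __name__ == "__main__":
    print(endpoint_url("https://abc123.clorecloud.net/"))  # placeholder host
    # synthesize_remote("https://abc123.clorecloud.net", "Hello there",
    #                   speaker_path="reference_voice.wav")
```

The generous timeout reflects that synthesis of longer text can take a while on smaller GPUs.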

## Batch processing

```python
from TTS.api import TTS
import os

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

texts = [
    "Welcome to our platform.",
    "Please review the following information.",
    "Thank you for your attention.",
    "Have a great day!"
]

reference_voice = "speaker.wav"
output_dir = "./audio_files"
os.makedirs(output_dir, exist_ok=True)

# Synthesize each text into its own numbered file
for i, text in enumerate(texts):
    output_path = f"{output_dir}/audio_{i:03d}.wav"

    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_voice,
        language="en"
    )

    print(f"Generated: {output_path}")
```

## Fine-tuning the voice

For better voice cloning:

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the config and weights before computing latents
config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Use multiple reference samples for better quality
reference_files = [
    "sample1.wav",
    "sample2.wav",
    "sample3.wav"
]

# Extract the speaker embedding from multiple samples
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=reference_files
)

# Generate with the averaged embedding
output = model.inference(
    text="High quality cloned speech.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding
)
```

## Audio preprocessing

```python
import librosa
import soundfile as sf

def prepare_reference(input_path, output_path, target_sr=22050):
    # Load and resample
    audio, sr = librosa.load(input_path, sr=target_sr)

    # Trim leading and trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Normalize peak amplitude
    audio = librosa.util.normalize(audio)

    # Save
    sf.write(output_path, audio, target_sr)

# Prepare the reference audio
prepare_reference("raw_voice.wav", "clean_voice.wav")
```

## Performance

| Mode      | GPU      | Speed             |
| --------- | -------- | ----------------- |
| Standard  | RTX 3060 | \~0.5x real time  |
| Standard  | RTX 4090 | \~2x real time    |
| Streaming | RTX 3060 | \~1x real time    |
| Streaming | RTX 4090 | \~3x real time    |
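
To turn these figures into a rough wait time, the sketch below assumes that "Nx real time" means N seconds of audio produced per second of compute. That reading is an assumption, and the figures in the table are approximate.

```python
def generation_time(audio_seconds, realtime_factor):
    """Estimate compute seconds needed to produce audio_seconds of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    # One minute of speech at the approximate factors from the table
    for label, factor in [("RTX 3060 standard", 0.5), ("RTX 4090 standard", 2.0)]:
        print(f"{label}: ~{generation_time(60, factor):.0f} s")
```

Under this reading, one minute of speech takes about two minutes on an RTX 3060 in standard mode and about 30 seconds on an RTX 4090.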

## Quality tips

* Use 6 to 15 seconds of clean reference audio
* Avoid background noise in the reference
* Match the language of the text to the language of the reference
* Use multiple reference samples for better results
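
The 6 to 15 second guideline can be checked before synthesis. The sketch below uses only the standard-library `wave` module, so it works for uncompressed PCM WAV files only. `check_reference` is a hypothetical helper.

```python
import wave


def reference_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path, min_seconds=6.0, max_seconds=15.0):
    """Return (is_within_range, duration_seconds) for a reference clip."""
    duration = reference_duration(path)
    return min_seconds <= duration <= max_seconds, duration


if __name__ == "__main__":
    # Create 8 seconds of silence as a stand-in for a real recording
    with wave.open("demo_reference.wav", "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(8000)
        wav.writeframes(b"\x00\x00" * 8000 * 8)
    print(check_reference("demo_reference.wav"))
```

A clip that falls outside the range is not necessarily unusable. The check only flags recordings that are worth trimming or re-recording.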

## Troubleshooting

### Poor voice quality

* Use clean reference audio
* Use a longer reference clip (10+ seconds)
* Match the speaking style of the reference to the output you want

### Wrong-language pronunciation

* Check that the language code is correct
* Use a reference recording from a native speaker

### Slow generation

* Enable GPU inference
* Use streaming mode
* Reduce the text length per call
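
One way to reduce the text length per call is to split long input at sentence boundaries and synthesize each piece separately. The sketch below is a simple illustration. The 200-character default is an arbitrary assumption, not a documented XTTS limit, and a single sentence longer than the limit is kept whole.

```python
import re


def split_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in split_text("One. Two. Three.", max_chars=9):
        print(piece)
```

Each chunk can then be passed to `tts_to_file` in a loop, as in the batch processing example.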

## Cost estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use the **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers
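
The arithmetic behind the table can be sketched as follows. The hourly rates are the approximate 2024 figures listed above, and `spot_discount` is a hypothetical parameter for modelling the 30-50% Spot savings. Actual marketplace prices will differ.

```python
# Approximate hourly rates in USD, copied from the table above.
HOURLY_RATES_USD = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}


def estimate_cost(gpu, hours, spot_discount=0.0):
    """Estimate rental cost in USD, optionally applying a Spot discount."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return round(HOURLY_RATES_USD[gpu] * hours * (1.0 - spot_discount), 2)


if __name__ == "__main__":
    print(estimate_cost("RTX 3060", 4))                      # 4-hour session
    print(estimate_cost("RTX 4090", 10, spot_discount=0.3))  # Spot at 30% off
```

The daily figures in the table are rounded, so a straight 24-hour multiplication can differ from them by a few cents.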

## Next steps

* [Bark TTS](/guides/guides_v2-es/audio-y-voz/bark-tts.md) - Expressive TTS
* [SadTalker](/guides/guides_v2-es/cabezas-parlantes/sadtalker.md) - Talking heads
* [RVC Voice Clone](/guides/guides_v2-es/audio-y-voz/rvc-voice-clone.md) - Voice conversion
