# MeloTTS

MeloTTS es una biblioteca de texto a voz (TTS) multilingüe y de alta calidad desarrollada por **MyShell AI**. Ofrece síntesis de voz rápida y con sonido natural en múltiples idiomas y acentos del inglés, diseñada tanto para investigación como para despliegue en producción. MeloTTS está optimizado para la velocidad: puede generar voz significativamente más rápido que en tiempo real incluso en CPU, manteniendo una alta calidad de audio adecuada para uso comercial.

MeloTTS actualmente soporta:

* **Inglés** (Americano, Británico, Indio, Australiano, Predeterminado)
* **Chino (simplificado y chino-inglés mixto)**
* **Japonés**
* **Coreano**
* **Español**
* **Francés**

Aspectos destacados:

* ⚡ **Inferencia rápida** — más rápido que en tiempo real en CPU, extremadamente rápido en GPU
* 🌍 **Multilingüe** — 6 idiomas con variantes de acento para inglés
* 🐳 **Listo para Docker** — se despliega fácilmente sobre una imagen base NVIDIA CUDA o construyendo la imagen desde el repositorio
* 🔌 **API REST** — API HTTP para integración en cualquier aplicación
* 📱 **Calidad de nivel de producción** — usado en los productos de consumo de MyShell

{% hint style="success" %}
Todos los ejemplos se pueden ejecutar en servidores GPU alquilados a través de [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## Requisitos del servidor

| Parámetro | Mínimo                 | Recomendado             |
| --------- | ---------------------- | ----------------------- |
| GPU       | NVIDIA GTX 1080 (8 GB) | NVIDIA RTX 3090 (24 GB) |
| VRAM      | 4 GB                   | 8–16 GB                 |
| RAM       | 8 GB                   | 16 GB                   |
| CPU       | 4 núcleos              | 8 núcleos               |
| Disco     | 10 GB                  | 20 GB                   |
| SO        | Ubuntu 20.04+          | Ubuntu 22.04            |
| CUDA      | 11.7+ (opcional)       | 12.1+                   |
| Python    | 3.8+                   | 3.10                    |
| Puertos   | 22, 8888               | 22, 8888                |

{% hint style="info" %}
MeloTTS es excepcionalmente eficiente: funciona bien en CPU para solicitudes individuales y se beneficia enormemente de la GPU para procesamiento por lotes. Incluso una GPU económica multiplica el rendimiento de forma drástica.
{% endhint %}

***

## Despliegue rápido en CLORE.AI

{% hint style="warning" %}
**Nota:** MeloTTS no tiene una imagen Docker preconstruida oficial en Docker Hub (`myshell-ai/melotts` no existe). El enfoque recomendado es usar una imagen base NVIDIA CUDA e instalar MeloTTS vía pip desde el repositorio oficial de GitHub.
{% endhint %}

### 1. Encuentra un servidor adecuado

Ve a [CLORE.AI Marketplace](https://clore.ai/marketplace) y filtra por:

* **VRAM**: ≥ 4 GB (o solo CPU para bajo volumen)
* **GPU**: Cualquier GPU NVIDIA (GTX 1080+, serie RTX, A100)
* **Disco**: ≥ 10 GB

### 2. Configura tu despliegue

**Imagen Docker:**

```
nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Mapeo de puertos:**

```
22   → acceso SSH
8888 → servidor API de MeloTTS
```

**Variables de entorno:**

```
NVIDIA_VISIBLE_DEVICES=all
```

**Comando de inicio** (ejecutar después de hacer SSH al servidor):

```bash
apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git && \
git clone https://github.com/myshell-ai/MeloTTS.git && \
cd MeloTTS && pip install -e . && \
python3 -m unidic download && \
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')" && \
python3 -m melo.api_server --host 0.0.0.0 --port 8888
```

### 3. Accede a la API

```
http://<tu-ip-servidor-clore>:8888
```

Prueba con:

```bash
curl -X POST http://<ip-del-servidor>:8888/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Clore.ai!", "language": "EN", "speaker_id": "EN-Default"}'
```

***

## Configuración paso a paso

### Paso 1: Conéctate por SSH a tu servidor

```bash
ssh root@<your-clore-server-ip> -p <ssh-port>
```

### Paso 2: Construir y ejecutar el contenedor

Dado que MeloTTS no tiene una imagen preconstruida en Docker Hub, usa una base NVIDIA CUDA e instala MeloTTS desde la fuente:

```bash
# Ejecutar un contenedor CUDA e instalar MeloTTS dentro de él
docker run -d \
  --name melotts \
  --gpus all \
  -p 8888:8888 \
  -v /workspace/melotts/outputs:/app/outputs \
  -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:12.1.0-devel-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git && \
    git clone https://github.com/myshell-ai/MeloTTS.git /app/MeloTTS && \
    cd /app/MeloTTS && pip install -e . && \
    python3 -m unidic download && \
    python3 -c \"import nltk; nltk.download('averaged_perceptron_tagger_eng')\" && \
    python3 -m melo.api_server --host 0.0.0.0 --port 8888"
```

Alternativamente, construye una imagen Docker personalizada desde la fuente:

```bash
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
docker build -t melotts:local .
docker run -d \
  --name melotts \
  --gpus all \
  -p 8888:8888 \
  melotts:local
```

### Paso 3: Verificar que el servicio esté en ejecución

```bash
# Ver los logs del contenedor (Ctrl+C para salir)
docker logs -f melotts

# Cuando termine la instalación, probar desde el host
curl http://localhost:8888/health
```
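
La primera puesta en marcha puede tardar varios minutos (instalación de paquetes y descarga de modelos), así que conviene sondear el servicio en lugar de probar a mano. El siguiente es un esbozo mínimo con la biblioteca estándar; los nombres `is_healthy` y `wait_until_ready` son hipotéticos y se asume que el endpoint `/health` de esta guía responde con HTTP 200 cuando el servidor está listo.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=3.0):
    """Devuelve True si `url` responde con HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Conexión rechazada, timeout o respuesta de error
        return False


def wait_until_ready(check, max_wait=300.0, interval=5.0, sleep=time.sleep):
    """Llama a `check()` hasta que devuelva True o pasen `max_wait` segundos."""
    deadline = time.monotonic() + max_wait
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Uso posible: `wait_until_ready(lambda: is_healthy("http://localhost:8888/health"))`. La comprobación se pasa como función para poder sustituirla en pruebas o apuntar a otro endpoint.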

### Paso 4: Alternativa — interfaz Jupyter Notebook

```bash
docker run -d \
  --name melotts-jupyter \
  --gpus all \
  -p 8888:8888 \
  nvidia/cuda:12.1.0-devel-ubuntu22.04 \
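  bash -c "apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git && \
    pip install jupyter git+https://github.com/myshell-ai/MeloTTS.git && \
    python3 -m unidic download && \
    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root"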
```

Acceder en: `http://<ip-del-servidor>:8888`

### Paso 5: Instalar desde pip (sin Docker)

```bash
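# Instalar dependencias del sistema
apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git

# Instalar MeloTTS desde el repositorio oficial de GitHub
pip install git+https://github.com/myshell-ai/MeloTTS.git

# Descargar el diccionario UniDic y los datos NLTK requeridos
python3 -m unidic download
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"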
```

***

## Ejemplos de uso

### Ejemplo 1: TTS básico en inglés (Python)

```python
from melo.api import TTS

# Inicializar TTS en inglés
speed = 1.0  # Ajustar la velocidad de habla (0.5 = lento, 2.0 = rápido)
device = 'cuda'  # Usar 'cpu' si no hay GPU disponible

tts = TTS(language='EN', device=device)

# Obtener los IDs de hablantes disponibles
speaker_ids = tts.hps.data.spk2id
print("Hablantes disponibles:", list(speaker_ids.keys()))
# Claves esperadas para inglés: 'EN-US', 'EN-BR' (británico), 'EN_INDIA', 'EN-AU', 'EN-Default'

# Generar voz
output_path = "output_english.wav"

tts.tts_to_file(
    text="Welcome to Clore.ai, your GPU cloud marketplace for AI workloads. Rent powerful GPUs in minutes.",
    speaker_id=speaker_ids['EN-Default'],
    output_path=output_path,
    speed=speed
)

print(f"Guardado en: {output_path}")
```

***

### Ejemplo 2: TTS multilingüe

```python
from melo.api import TTS

device = 'cuda'

# Definir pares idioma-texto
language_texts = [
    ('EN', 'EN-US', "GPU computing has transformed artificial intelligence research and development."),
    ('EN', 'EN-BR', "The United Kingdom leads Europe in AI investment and innovation."),  # EN-BR = acento británico
    ('ZH', 'ZH', "Clore.ai是一个去中心化的GPU云计算市场，为AI开发者提供算力服务。"),
    ('JP', 'JP', "人工知能の発展には大規模な計算資源が必要です。"),
    ('KR', 'KR', "Clore.ai는 AI 연구자를 위한 GPU 클라우드 마켓플레이스입니다."),
    ('ES', 'ES', "La inteligencia artificial está transformando todas las industrias del mundo."),
    ('FR', 'FR', "L'intelligence artificielle révolutionne la façon dont nous travaillons et vivons."),
]

for lang, speaker, text in language_texts:
    try:
        tts = TTS(language=lang, device=device)
        speaker_id = tts.hps.data.spk2id[speaker]

        output_file = f"output_{lang}_{speaker}.wav"
        tts.tts_to_file(text=text, speaker_id=speaker_id, output_path=output_file)
        print(f"✓ Generado [{lang}]: {output_file}")
    except Exception as e:
        print(f"✗ Error [{lang}]: {e}")
```

***

### Ejemplo 3: Uso de la API REST

```python
import requests
import json

API_BASE = "http://<tu-ip-servidor-clore>:8888"

# Consultar voces disponibles
response = requests.get(f"{API_BASE}/voices")
print("Voces disponibles:", json.dumps(response.json(), indent=2))

# Sintetizar voz
def synthesize(text, language="EN", speaker="EN-Default", speed=1.0):
    payload = {
        "text": text,
        "language": language,
        "speaker_id": speaker,
        "speed": speed,
        "format": "wav"
    }

    response = requests.post(
        f"{API_BASE}/synthesize",
        json=payload,
        timeout=30
    )

    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f"Error de la API: {response.status_code} - {response.text}")

# Generar muestras
samples = [
    ("Hello, this is MeloTTS running on Clore.ai GPU servers.", "EN", "EN-US"),
    ("This is the British English accent variant.", "EN", "EN-BR"),
    ("Let me demonstrate the Indian English accent.", "EN", "EN_INDIA"),
]

for text, lang, speaker in samples:
    audio_bytes = synthesize(text, lang, speaker)
    filename = f"api_output_{speaker.replace('-', '_')}.wav"
    with open(filename, "wb") as f:
        f.write(audio_bytes)
    print(f"Guardado: {filename}")
```
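
Cuando hay muchas peticiones, enviarlas en paralelo suele aprovechar mejor el servidor que un bucle secuencial. El siguiente esbozo usa `ThreadPoolExecutor` de la biblioteca estándar; `synthesize_many` es un nombre hipotético y la función de síntesis se recibe como parámetro, de modo que sirve con la función `synthesize` del ejemplo anterior o con cualquier otra.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_many(synthesize, jobs, max_workers=4):
    """Ejecuta `synthesize(*job)` en paralelo conservando el orden de `jobs`.

    Devuelve una lista de tuplas (job, resultado, error); `error` es None
    cuando la llamada tuvo éxito.
    """
    def run(job):
        try:
            return job, synthesize(*job), None
        except Exception as exc:  # un fallo no debe abortar el resto del lote
            return job, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, jobs))
```

Con el cliente de arriba se usaría como `synthesize_many(synthesize, samples)`. El valor de `max_workers` es una suposición de partida: conviene ajustarlo según lo que el servidor soporte sin degradar la latencia.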

***

### Ejemplo 4: Procesamiento por lotes de alta velocidad

```python
from melo.api import TTS
import time
from pathlib import Path

device = 'cuda'
tts = TTS(language='EN', device=device)
speaker_id = tts.hps.data.spk2id['EN-US']

# Gran lote de textos
texts = [
    f"This is sentence number {i}. It demonstrates fast batch processing with MeloTTS on Clore.ai GPU infrastructure."
    for i in range(1, 51)  # 50 oraciones
]

output_dir = Path("batch_output")
output_dir.mkdir(exist_ok=True)

start_time = time.time()

# Procesar lote
for i, text in enumerate(texts):
    output_path = str(output_dir / f"batch_{i+1:03d}.wav")
    tts.tts_to_file(
        text=text,
        speaker_id=speaker_id,
        output_path=output_path,
        speed=1.0,
        quiet=True
    )
    if (i + 1) % 10 == 0:
        elapsed = time.time() - start_time
        print(f"Progreso: {i+1}/{len(texts)} | Tiempo: {elapsed:.1f}s | Ritmo: {(i+1)/elapsed:.1f} oraciones/s")

total_time = time.time() - start_time
print(f"\nLote completo: {len(texts)} oraciones en {total_time:.1f}s")
print(f"Promedio: {total_time/len(texts)*1000:.0f}ms por oración")
```

***

### Ejemplo 5: TTS mixto chino-inglés

```python
from melo.api import TTS

device = 'cuda'
tts = TTS(language='ZH', device=device)
speaker_id = tts.hps.data.spk2id['ZH']

# Texto en idioma mixto (chino + inglés)
mixed_texts = [
    "我们使用Clore.ai的GPU服务器来运行machine learning workloads。",
    "今天的AI conference讨论了large language models和speech synthesis技术。",
    "我的startup需要GPU资源来训练我们的deep learning模型。",
    "Clore.ai提供了非常competitive的价格，比AWS和GCP便宜很多。",
]

for i, text in enumerate(mixed_texts):
    output_file = f"mixed_zh_en_{i+1}.wav"
    tts.tts_to_file(
        text=text,
        speaker_id=speaker_id,
        output_path=output_file,
        speed=0.9  # Ligeramente más lento para mayor claridad
    )
    print(f"Generado: {output_file}")
    print(f"  Texto: {text[:60]}...")
```

***

## Configuración

### Configuración con Docker Compose

Dado que MeloTTS no tiene una imagen oficial en Docker Hub, usa la imagen base NVIDIA CUDA e instala MeloTTS desde la fuente al iniciar:

```yaml
version: '3.8'

services:
  melotts:
    image: nvidia/cuda:12.1.0-devel-ubuntu22.04
    container_name: melotts
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - PYTHONDONTWRITEBYTECODE=1
    ports:
      - "8888:8888"
    volumes:
      - ./outputs:/app/outputs
      - ./cache:/root/.cache
    command: >
      bash -c "apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git curl &&
      git clone https://github.com/myshell-ai/MeloTTS.git /app/MeloTTS &&
      cd /app/MeloTTS && pip install -e . &&
      python3 -m unidic download &&
      python3 -c 'import nltk; nltk.download(\"averaged_perceptron_tagger_eng\")' &&
      python3 -m melo.api_server --host 0.0.0.0 --port 8888"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8888/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

### Opciones de configuración de la API

| Parámetro   | Por defecto | Descripción                                       |
| ----------- | ----------- | ------------------------------------------------- |
| `--host`    | `127.0.0.1` | Dirección de enlace (usar `0.0.0.0` para público) |
| `--port`    | `8888`      | Puerto del servidor API                           |
| `--workers` | `1`         | Número de procesos worker                         |
| `--device`  | `auto`      | `cuda`, `cpu`, o `auto`                           |

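Si la versión de MeloTTS instalada no incluye el módulo `melo.api_server`, una alternativa es envolver la API de Python en un servidor HTTP propio. Lo que sigue es solo un esbozo con la biblioteca estándar que expone los endpoints `/health` y `/synthesize` usados en esta guía; `build_server` es un nombre hipotético, el campo `language` de la petición se ignora (se asume un único modelo cargado) y la función `synthesize(text, speaker_id, speed)` se inyecta desde fuera.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def build_server(synthesize, host="127.0.0.1", port=8888):
    """Crea un servidor HTTP con GET /health y POST /synthesize.

    `synthesize(text, speaker_id, speed)` debe devolver los bytes del audio.
    """

    class Handler(BaseHTTPRequestHandler):
        def _reply(self, status, body, content_type="application/json"):
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def _reply_json(self, status, obj):
            self._reply(status, json.dumps(obj).encode("utf-8"))

        def do_GET(self):
            if self.path == "/health":
                self._reply_json(200, {"status": "ok"})
            else:
                self._reply_json(404, {"error": "not found"})

        def do_POST(self):
            if self.path != "/synthesize":
                self._reply_json(404, {"error": "not found"})
                return
            try:
                length = int(self.headers.get("Content-Length", 0))
                payload = json.loads(self.rfile.read(length))
                audio = synthesize(
                    payload["text"],
                    payload.get("speaker_id", "EN-Default"),
                    float(payload.get("speed", 1.0)),
                )
            except (KeyError, ValueError, TypeError) as exc:
                # JSON inválido, falta "text" o tipos incorrectos
                self._reply_json(400, {"error": str(exc)})
                return
            self._reply(200, audio, "audio/wav")

        def log_message(self, format, *args):
            pass  # silencia el log por petición

    return ThreadingHTTPServer((host, port), Handler)
```

Para conectarlo con MeloTTS, `synthesize` podría llamar a `tts.tts_to_file(...)` sobre un archivo temporal `.wav` y devolver su contenido; después bastaría con `build_server(synthesize, host="0.0.0.0").serve_forever()`. Este esbozo no incluye autenticación ni límites de tamaño, por lo que no debería exponerse tal cual a Internet.
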
### Idiomas y hablantes soportados

| Idioma  | Código | IDs de hablantes                                                  |
| ------- | ------ | ----------------------------------------------------------------- |
| Inglés  | `EN`   | `EN-Default`, `EN-US`, `EN-BR` (británico), `EN_INDIA`, `EN-AU`   |
| Chino   | `ZH`   | `ZH`                                                              |
| Japonés | `JP`   | `JP`                                                              |
| Coreano | `KR`   | `KR`                                                              |
| Español | `ES`   | `ES`                                                              |
| Francés | `FR`   | `FR`                                                              |

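Como los IDs exactos dependen del modelo cargado, puede ser útil validarlos contra `tts.hps.data.spk2id` antes de sintetizar, en lugar de confiar en una lista escrita a mano. Un esbozo mínimo, con el nombre hipotético `resolve_speaker`, que solo asume que el objeto recibido admite `in`, indexación y `keys()`:

```python
def resolve_speaker(spk2id, name):
    """Devuelve el ID numérico de `name` o lanza un error con las opciones válidas."""
    if name not in spk2id:
        available = ", ".join(sorted(spk2id.keys()))
        raise ValueError(f"Hablante desconocido {name!r}. Disponibles: {available}")
    return spk2id[name]
```

Uso posible: `speaker_id = resolve_speaker(tts.hps.data.spk2id, "EN-US")`. Así, un ID mal escrito produce un mensaje con las claves reales del modelo en vez de un `KeyError` sin contexto.
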
***

## Consejos de rendimiento

### 1. Comparativa GPU vs CPU

Rendimiento de MeloTTS (RTF = Factor de Tiempo Real, más bajo es mejor):

| Dispositivo     | RTF     | Notas                              |
| --------------- | ------- | ---------------------------------- |
| CPU (8 núcleos) | \~0.3x  | Rápido, ideal para baja carga      |
| RTX 3080        | \~0.05x | 20× más rápido que en tiempo real  |
| RTX 4090        | \~0.02x | 50× más rápido que en tiempo real  |
| A100            | \~0.01x | 100× más rápido que en tiempo real |

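El RTF es el cociente entre el tiempo de síntesis y la duración del audio generado, y su inverso indica cuántas veces más rápido que el tiempo real se sintetiza. Las cifras de la tabla son aproximadas, así que lo más fiable es medirlo en tu propio servidor. Un esbozo con nombres hipotéticos (`real_time_factor`, `speedup`, `measure_rtf`):

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = tiempo de síntesis / duración del audio (más bajo es mejor)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds debe ser positivo")
    return synthesis_seconds / audio_seconds


def speedup(rtf):
    """Cuántas veces más rápido que el tiempo real (inverso del RTF)."""
    return 1.0 / rtf


def measure_rtf(synthesize, sample_rate):
    """Cronometra `synthesize()`, que debe devolver la secuencia de muestras de audio."""
    start = time.perf_counter()
    samples = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

Por ejemplo, generar 20 s de audio en 1 s da un RTF de 0.05, es decir, 20× tiempo real. Con MeloTTS, `synthesize` podría ser una función que llame a `tts.tts_to_file(texto, speaker_id, output_path=None)`, asumiendo que con `output_path=None` devuelve el audio en memoria y que la frecuencia de muestreo está en `tts.hps.data.sampling_rate`.
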
### 2. Optimizar para rendimiento (throughput)

```python
# Desactivar el cálculo de gradientes para inferencia
import torch

with torch.no_grad():
    tts.tts_to_file(text, speaker_id, output_path)
```

### 3. Calentar el modelo (pre-warm)

```python
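# Ejecutar una inferencia de calentamiento para cargar los kernels de CUDA.
# Con output_path=None el audio se devuelve en memoria y no se escribe ningún
# archivo ("/dev/null" no sirve: sin extensión no se puede inferir el formato).
_ = tts.tts_to_file(
    text="warmup",
    speaker_id=speaker_id,
    output_path=None,
    quiet=True
)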
print("Modelo calentado, listo para inferencia rápida")
```

### 4. Ajustar la velocidad del habla

```python
# `speed` cambia el ritmo del habla, no la calidad del modelo.
# Habla más rápida: el audio resultante es más corto
tts.tts_to_file(text, speaker_id, output_path, speed=1.2)

# Habla más lenta (mejor articulación)
tts.tts_to_file(text, speaker_id, output_path, speed=0.8)
```

### 5. Eficiencia de memoria

```python
# Liberar memoria GPU entre lotes grandes
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
```

***

## Solución de problemas

### Problema: `espeak-ng` no encontrado

```bash
apt-get install -y espeak-ng
python3 -c "import phonemizer; print('phonemizer OK')"
```

### Problema: faltan datos de NLTK

```bash
python3 -c "
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt')
"
```

### Problema: El puerto 8888 entra en conflicto con Jupyter

MeloTTS usa el puerto 8888 por defecto, que choca con Jupyter Notebook. Soluciones:

```bash
# Opción 1: Ejecutar MeloTTS en un puerto diferente
python3 -m melo.api_server --host 0.0.0.0 --port 8889

# Opción 2: Ejecutar Jupyter en un puerto diferente
jupyter notebook --port 8890
```
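
Antes de elegir un puerto alternativo se puede comprobar desde Python cuáles están ocupados. Un esbozo con la biblioteca estándar; `port_in_use` y `first_free_port` son nombres hipotéticos y la comprobación solo detecta procesos que ya estén escuchando en ese host.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """True si ya hay un proceso aceptando conexiones TCP en host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex devuelve 0 cuando la conexión se establece
        return sock.connect_ex((host, port)) == 0


def first_free_port(candidates, host="127.0.0.1"):
    """Devuelve el primer puerto libre de `candidates`, o None si no hay ninguno."""
    for port in candidates:
        if not port_in_use(port, host):
            return port
    return None
```

Por ejemplo, `first_free_port([8888, 8889, 8890])` devolvería el primer puerto de la lista sin nadie escuchando, que luego puede pasarse a `--port`.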

### Problema: El texto chino no se muestra correctamente

```bash
# Instalar soporte de idioma chino
pip install jieba
apt-get install -y python3-opencc

# Prueba
python3 -c "from melo.api import TTS; t = TTS('ZH'); print('ZH OK')"
```

### Problema: Falló la descarga de la imagen Docker

```bash
# Construir desde la fuente en su lugar
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"
```

### Problema: Inferencia lenta en GPU

```bash
# Verificar que la GPU está siendo usada
python3 -c "
import torch
from melo.api import TTS
tts = TTS('EN', device='cuda')
print(f'Dispositivo: {next(tts.model.parameters()).device}')
print(f'CUDA disponible: {torch.cuda.is_available()}')
"
```

***

## Recomendaciones de GPU en Clore.ai

MeloTTS es ligero: funciona bien en CPU para bajo volumen y se acelera notablemente con GPU, aunque la ganancia no es proporcional en las gamas más altas. No necesitas hardware caro.

| GPU       | VRAM  | Precio en Clore.ai | RTF (Factor de Tiempo Real) | Capacidad     |
| --------- | ----- | ------------------ | --------------------------- | ------------- |
| Solo CPU  | —     | \~$0.02/h          | \~0.3×                      | \~3 req/min   |
| RTX 3090  | 24 GB | \~$0.12/h          | \~0.02× (50× tiempo real)   | \~100 req/min |
| RTX 4090  | 24 GB | \~$0.70/h          | \~0.01× (100× tiempo real)  | \~200 req/min |
| A100 40GB | 40 GB | \~$1.20/h          | \~0.005× (200× tiempo real) | \~400 req/min |

{% hint style="info" %}
**Mejor relación calidad/precio para cargas TTS:** la RTX 3090, a \~$0.12/hora, sintetiza unas 50 veces más rápido que en tiempo real. Para una API de producción que atiende a cientos de usuarios, esto es más que suficiente. Las instancias solo CPU (\~$0.02/hora) funcionan bien para desarrollo y despliegues de bajo tráfico.
{% endhint %}

**Recomendación para producción:** Para una API TTS multilingüe que sirva de 10 a 50 usuarios concurrentes, la RTX 3090 es el punto óptimo. Escala horizontalmente (múltiples instancias) en lugar de actualizar a una A100 costosa: MeloTTS no se beneficia proporcionalmente de GPUs de gama más alta.
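
Para comparar opciones resulta útil estimar el coste por hora de audio generado: precio por hora multiplicado por el RTF. El cálculo siguiente es una aproximación que asume que la instancia sintetiza sin pausas (100 % de utilización) y usa los valores orientativos de la tabla; `cost_per_audio_hour` es un nombre hipotético.

```python
def cost_per_audio_hour(price_per_hour, rtf):
    """Coste estimado (USD) de generar una hora de audio con utilización total."""
    return price_per_hour * rtf


# Valores aproximados tomados de la tabla de esta sección: (precio USD/h, RTF)
options = {
    "Solo CPU": (0.02, 0.3),
    "RTX 3090": (0.12, 0.02),
    "RTX 4090": (0.70, 0.01),
    "A100 40GB": (1.20, 0.005),
}

for name, (price, rtf) in options.items():
    print(f"{name}: ~${cost_per_audio_hour(price, rtf):.4f} por hora de audio")
```

Con esas cifras, la RTX 3090 sale a unos $0.0024 por hora de audio, la opción más barata de la tabla, lo que concuerda con la recomendación anterior. Con tráfico real la utilización será menor y el coste efectivo, mayor.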

***

## Enlaces

* **GitHub**: <https://github.com/myshell-ai/MeloTTS>
* **Docker**: No hay imagen oficial en Docker Hub; instala desde el [código fuente en GitHub](https://github.com/myshell-ai/MeloTTS) usando la imagen base `nvidia/cuda:12.1.0-devel-ubuntu22.04`
* **Artículo (Paper)**: <https://arxiv.org/abs/2406.06753>
* **Hugging Face**: <https://huggingface.co/myshell-ai/MeloTTS-English>
* **MyShell AI**: <https://myshell.ai>
* **CLORE.AI Marketplace**: <https://clore.ai/marketplace>

