# MeloTTS

MeloTTS is a high-quality, multilingual text-to-speech library developed by **MyShell AI**. It delivers fast, natural-sounding speech synthesis across multiple languages and English accents, designed for both research and production deployment. MeloTTS is optimized for speed — it can generate speech significantly faster than real-time even on CPU — while maintaining high audio quality suitable for commercial use.

MeloTTS currently supports:

* **English** (American, British, Indian, Australian, Default)
* **Chinese** (Simplified & mixed Chinese-English)
* **Japanese**
* **Korean**
* **Spanish**
* **French**

Key highlights:

* ⚡ **Fast inference** — faster than real-time on CPU, blazing fast on GPU
* 🌍 **Multilingual** — 6 languages with accent variants for English
* 🐳 **Docker-ready** — official Docker image available
* 🔌 **REST API** — HTTP API for integration into any application
* 📱 **Production-grade** — used in MyShell's consumer products

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## Server Requirements

| Parameter | Minimum                | Recommended             |
| --------- | ---------------------- | ----------------------- |
| GPU       | NVIDIA GTX 1080 (8 GB) | NVIDIA RTX 3090 (24 GB) |
| VRAM      | 4 GB                   | 8–16 GB                 |
| RAM       | 8 GB                   | 16 GB                   |
| CPU       | 4 cores                | 8 cores                 |
| Disk      | 10 GB                  | 20 GB                   |
| OS        | Ubuntu 20.04+          | Ubuntu 22.04            |
| CUDA      | 11.7+ (optional)       | 12.1+                   |
| Python    | 3.8+                   | 3.10                    |
| Ports     | 22, 8888               | 22, 8888                |

{% hint style="info" %}
MeloTTS is unusually efficient — it runs well on CPU for single requests and benefits from GPU for batch processing. Even a budget GPU increases throughput several-fold.
{% endhint %}
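
As a sketch of how device selection could be automated (this helper is our own, not part of MeloTTS), probe for an NVIDIA driver and fall back to CPU:

```python
import shutil

def pick_device():
    """Return 'cuda' if nvidia-smi is on PATH, else 'cpu'.

    A cheap heuristic; if torch is installed,
    torch.cuda.is_available() is the more reliable check.
    """
    return "cuda" if shutil.which("nvidia-smi") else "cpu"

print(pick_device())
```

The returned string can be passed straight to `TTS(language='EN', device=pick_device())`.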

***

## Quick Deploy on CLORE.AI

{% hint style="warning" %}
**Note:** MeloTTS does not have an official pre-built Docker image on Docker Hub (`myshell-ai/melotts` does not exist). The recommended approach is to use an NVIDIA CUDA base image and install MeloTTS via pip from the official GitHub repository.
{% endhint %}

### 1. Find a suitable server

Go to [CLORE.AI Marketplace](https://clore.ai/marketplace) and filter by:

* **VRAM**: ≥ 4 GB (or CPU-only for low volume)
* **GPU**: Any NVIDIA GPU (GTX 1080+, RTX series, A100)
* **Disk**: ≥ 10 GB

### 2. Configure your deployment

**Docker Image:**

```
nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Port Mappings:**

```
22   → SSH access
8888 → MeloTTS API server
```

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
```

**Startup Command** (run after SSH into the server):

```bash
apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git && \
git clone https://github.com/myshell-ai/MeloTTS.git && \
cd MeloTTS && pip install -e . && \
python -m unidic download && \
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')" && \
python -m melo.api_server --host 0.0.0.0 --port 8888
```

### 3. Access the API

```
http://<your-clore-server-ip>:8888
```

Test with:

```bash
curl -X POST http://<server-ip>:8888/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Clore.ai!", "language": "EN", "speaker_id": "EN-Default"}'
```
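
The server can take a minute or two to come up while models download. A small stdlib poller (our own helper; it assumes the `/health` route used elsewhere in this guide) saves repeated manual checks:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(base_url, timeout=120.0, interval=3.0):
    """Poll <base_url>/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval)
    return False

# Example: wait_for_health("http://<your-clore-server-ip>:8888")
```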

***

## Step-by-Step Setup

### Step 1: SSH into your server

```bash
ssh root@<your-clore-server-ip> -p <ssh-port>
```

### Step 2: Build and run the container

Since MeloTTS has no pre-built Docker Hub image, use an NVIDIA CUDA base and install MeloTTS from source:

```bash
# Run a CUDA container and install MeloTTS inside it
docker run -d \
  --name melotts \
  --gpus all \
  -p 8888:8888 \
  -v /workspace/melotts/outputs:/app/outputs \
  -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:12.1.0-devel-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git && \
    git clone https://github.com/myshell-ai/MeloTTS.git /app/MeloTTS && \
    cd /app/MeloTTS && pip install -e . && \
    python -m unidic download && \
    python3 -c \"import nltk; nltk.download('averaged_perceptron_tagger_eng')\" && \
    python -m melo.api_server --host 0.0.0.0 --port 8888"
```

Alternatively, build a custom Docker image from source:

```bash
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
docker build -t melotts:local .
docker run -d \
  --name melotts \
  --gpus all \
  -p 8888:8888 \
  melotts:local
```

### Step 3: Verify the service is running

```bash
# Check container logs
docker logs -f melotts

# Wait for startup, then test
curl http://localhost:8888/health
```

### Step 4: Alternative — Jupyter Notebook interface

```bash
docker run -d \
  --name melotts-jupyter \
  --gpus all \
  -p 8888:8888 \
  nvidia/cuda:12.1.0-devel-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git && \
    pip install jupyter && pip install git+https://github.com/myshell-ai/MeloTTS.git && \
    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root"
```

Access at: `http://<server-ip>:8888`

### Step 5: Install from pip (without Docker)

```bash
# Install system dependencies
apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git

# Install MeloTTS from the official repository
pip install git+https://github.com/myshell-ai/MeloTTS.git

# Download the Japanese dictionary and required NLTK data
python -m unidic download
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"
```
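
To confirm the install succeeded, import the package in a fresh interpreter. This is a generic smoke-test helper, not a MeloTTS utility:

```python
import subprocess
import sys

def can_import(module):
    """Check whether a module imports cleanly in a fresh interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
    )
    return result.returncode == 0

for mod in ("melo.api", "nltk"):
    print(f"{mod}: {'OK' if can_import(mod) else 'MISSING'}")
```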

***

## Usage Examples

### Example 1: Basic English TTS (Python)

```python
from melo.api import TTS

# Initialize English TTS
speed = 1.0  # Adjust speech speed (0.5 = slow, 2.0 = fast)
device = 'cuda'  # Use 'cpu' if no GPU available

tts = TTS(language='EN', device=device)

# Get available speaker IDs (keys can vary by release; check at runtime)
speaker_ids = tts.hps.data.spk2id
print("Available speakers:", list(speaker_ids.keys()))
# e.g. ['EN-US', 'EN-BR', 'EN_INDIA', 'EN-AU', 'EN-Default']

# Generate speech
output_path = "output_english.wav"

tts.tts_to_file(
    text="Welcome to Clore.ai, your GPU cloud marketplace for AI workloads. Rent powerful GPUs in minutes.",
    speaker_id=speaker_ids['EN-Default'],
    output_path=output_path,
    speed=speed
)

print(f"Saved to: {output_path}")
```
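
To sanity-check generated audio without extra dependencies, the standard-library `wave` module can read the PCM WAV files MeloTTS writes. This helper is our own addition:

```python
import wave

def wav_info(path):
    """Return (duration_seconds, sample_rate) of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
    return frames / rate, rate

# duration, rate = wav_info("output_english.wav")
# print(f"{duration:.2f}s at {rate} Hz")
```

A zero-length or unreadable file here usually means synthesis failed silently upstream.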

***

### Example 2: Multilingual TTS

```python
from melo.api import TTS

device = 'cuda'

# Define language-text pairs
language_texts = [
    ('EN', 'EN-US', "GPU computing has transformed artificial intelligence research and development."),
    ('EN', 'EN-BR', "The United Kingdom leads Europe in AI investment and innovation."),
    ('ZH', 'ZH', "Clore.ai是一个去中心化的GPU云计算市场，为AI开发者提供算力服务。"),
    ('JP', 'JP', "人工知能の発展には大規模な計算資源が必要です。"),
    ('KR', 'KR', "Clore.ai는 AI 연구자를 위한 GPU 클라우드 마켓플레이스입니다."),
    ('ES', 'ES', "La inteligencia artificial está transformando todas las industrias del mundo."),
    ('FR', 'FR', "L'intelligence artificielle révolutionne la façon dont nous travaillons et vivons."),
]

for lang, speaker, text in language_texts:
    try:
        tts = TTS(language=lang, device=device)
        speaker_id = tts.hps.data.spk2id[speaker]

        output_file = f"output_{lang}_{speaker}.wav"
        tts.tts_to_file(text=text, speaker_id=speaker_id, output_path=output_file)
        print(f"✓ Generated [{lang}]: {output_file}")
    except Exception as e:
        print(f"✗ Error [{lang}]: {e}")
```

***

### Example 3: REST API Usage

```python
import requests
import json

API_BASE = "http://<your-clore-server-ip>:8888"

# Check available voices
response = requests.get(f"{API_BASE}/voices")
print("Available voices:", json.dumps(response.json(), indent=2))

# Synthesize speech
def synthesize(text, language="EN", speaker="EN-Default", speed=1.0):
    payload = {
        "text": text,
        "language": language,
        "speaker_id": speaker,
        "speed": speed,
        "format": "wav"
    }

    response = requests.post(
        f"{API_BASE}/synthesize",
        json=payload,
        timeout=30
    )

    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f"API error: {response.status_code} - {response.text}")

# Generate samples
samples = [
    ("Hello, this is MeloTTS running on Clore.ai GPU servers.", "EN", "EN-US"),
    ("This is the British English accent variant.", "EN", "EN-BR"),
    ("Let me demonstrate the Indian English accent.", "EN", "EN_INDIA"),
]

for text, lang, speaker in samples:
    audio_bytes = synthesize(text, lang, speaker)
    filename = f"api_output_{speaker.replace('-', '_')}.wav"
    with open(filename, "wb") as f:
        f.write(audio_bytes)
    print(f"Saved: {filename}")
```

***

### Example 4: High-Speed Batch Processing

```python
from melo.api import TTS
import time
from pathlib import Path

device = 'cuda'
tts = TTS(language='EN', device=device)
speaker_id = tts.hps.data.spk2id['EN-US']

# Large batch of texts
texts = [
    f"This is sentence number {i}. It demonstrates fast batch processing with MeloTTS on Clore.ai GPU infrastructure."
    for i in range(1, 51)  # 50 sentences
]

output_dir = Path("batch_output")
output_dir.mkdir(exist_ok=True)

start_time = time.time()

# Process batch
for i, text in enumerate(texts):
    output_path = str(output_dir / f"batch_{i+1:03d}.wav")
    tts.tts_to_file(
        text=text,
        speaker_id=speaker_id,
        output_path=output_path,
        speed=1.0,
        quiet=True
    )
    if (i + 1) % 10 == 0:
        elapsed = time.time() - start_time
        print(f"Progress: {i+1}/50 | Time: {elapsed:.1f}s | Rate: {(i+1)/elapsed:.1f} sentences/sec")

total_time = time.time() - start_time
print(f"\nBatch complete: {len(texts)} sentences in {total_time:.1f}s")
print(f"Average: {total_time/len(texts)*1000:.0f}ms per sentence")
```
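
Long documents synthesize more reliably when split into sentence-sized chunks first. The splitter below is our own sketch (MeloTTS also performs internal text splitting for some languages); each chunk can then be fed to `tts.tts_to_file` in turn:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text on sentence boundaries, packing sentences into
    chunks of at most max_chars characters each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```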

***

### Example 5: Mixed Chinese-English TTS

```python
from melo.api import TTS

device = 'cuda'
tts = TTS(language='ZH', device=device)
speaker_id = tts.hps.data.spk2id['ZH']

# Mixed language text (Chinese + English)
mixed_texts = [
    "我们使用Clore.ai的GPU服务器来运行machine learning workloads。",
    "今天的AI conference讨论了large language models和speech synthesis技术。",
    "我的startup需要GPU资源来训练我们的deep learning模型。",
    "Clore.ai提供了非常competitive的价格，比AWS和GCP便宜很多。",
]

for i, text in enumerate(mixed_texts):
    output_file = f"mixed_zh_en_{i+1}.wav"
    tts.tts_to_file(
        text=text,
        speaker_id=speaker_id,
        output_path=output_file,
        speed=0.9  # Slightly slower for clarity
    )
    print(f"Generated: {output_file}")
    print(f"  Text: {text[:60]}...")
```

***

## Configuration

### Docker Compose Setup

Since MeloTTS has no official Docker Hub image, use the NVIDIA CUDA base image and install MeloTTS from source at startup:

```yaml
version: '3.8'

services:
  melotts:
    image: nvidia/cuda:12.1.0-devel-ubuntu22.04
    container_name: melotts
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - PYTHONDONTWRITEBYTECODE=1
    ports:
      - "8888:8888"
    volumes:
      - ./outputs:/app/outputs
      - ./cache:/root/.cache
    command: >
      bash -c "apt-get update && apt-get install -y python3-pip ffmpeg espeak-ng git curl &&
      git clone https://github.com/myshell-ai/MeloTTS.git /app/MeloTTS &&
      cd /app/MeloTTS && pip install -e . &&
      python -m unidic download &&
      python3 -c 'import nltk; nltk.download(\"averaged_perceptron_tagger_eng\")' &&
      python -m melo.api_server --host 0.0.0.0 --port 8888"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8888/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

### API Configuration Options

| Parameter   | Default     | Description                             |
| ----------- | ----------- | --------------------------------------- |
| `--host`    | `127.0.0.1` | Bind address (use `0.0.0.0` for public) |
| `--port`    | `8888`      | API server port                         |
| `--workers` | `1`         | Number of worker processes              |
| `--device`  | `auto`      | `cuda`, `cpu`, or `auto`                |

### Supported Languages and Speakers

| Language | Code | Speaker IDs                                                   |
| -------- | ---- | ------------------------------------------------------------- |
| English  | `EN` | `EN-Default`, `EN-US`, `EN-BR` (British), `EN_INDIA`, `EN-AU` |
| Chinese  | `ZH` | `ZH`                                                          |
| Japanese | `JP` | `JP`                                                          |
| Korean   | `KR` | `KR`                                                          |
| Spanish  | `ES` | `ES`                                                          |
| French   | `FR` | `FR`                                                          |

Speaker ID keys can differ between releases; confirm with `tts.hps.data.spk2id` at runtime.
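
When accepting language/speaker pairs from users, it helps to validate them before loading a model. The mapping below hardcodes the speaker IDs observed in recent MeloTTS releases for illustration; the authoritative list is whatever `tts.hps.data.spk2id` reports for your installed version:

```python
# Speaker IDs as observed in recent MeloTTS releases; verify against
# tts.hps.data.spk2id, since keys can change between versions.
SPEAKERS = {
    "EN": {"EN-Default", "EN-US", "EN-BR", "EN_INDIA", "EN-AU"},
    "ZH": {"ZH"},
    "JP": {"JP"},
    "KR": {"KR"},
    "ES": {"ES"},
    "FR": {"FR"},
}

def validate_request(language, speaker):
    """Raise ValueError for unknown language/speaker combinations."""
    if language not in SPEAKERS:
        raise ValueError(f"Unsupported language: {language!r}")
    if speaker not in SPEAKERS[language]:
        raise ValueError(f"Unknown speaker {speaker!r} for {language}")

validate_request("EN", "EN-US")  # passes silently
```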

***

## Performance Tips

### 1. GPU vs CPU Benchmark

MeloTTS performance (RTF = Real-Time Factor, lower is better):

| Device        | RTF     | Notes                      |
| ------------- | ------- | -------------------------- |
| CPU (8 cores) | \~0.3x  | Fast, great for low load   |
| RTX 3080      | \~0.05x | 20x faster than real-time  |
| RTX 4090      | \~0.02x | 50x faster than real-time  |
| A100          | \~0.01x | 100x faster than real-time |
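
RTF is simply synthesis time divided by the duration of the audio produced, and the "faster than real-time" figures in the table follow directly. A tiny helper (our own) makes the conversion explicit:

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-Time Factor: processing time per second of audio produced."""
    return synthesis_seconds / audio_seconds

def speedup(rtf_value):
    """How many times faster than real-time a given RTF is."""
    return 1.0 / rtf_value

# 0.5 s to synthesize a 10 s clip:
print(rtf(0.5, 10.0))   # 0.05  -> "0.05x RTF"
print(speedup(0.05))    # 20.0  -> "20x faster than real-time"
```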

### 2. Optimize for Throughput

```python
# Disable gradient computation for inference
import torch

with torch.no_grad():
    tts.tts_to_file(text, speaker_id, output_path)
```

### 3. Pre-warm the Model

```python
# Run a warmup inference to load CUDA kernels
tts.tts_to_file(
    text="warmup",
    speaker_id=speaker_id,
    output_path="warmup.wav"  # throwaway file; WAV writers need a seekable target
)
print("Model warmed up, ready for fast inference")
```

### 4. Adjust Speech Rate

```python
# Faster speech (shorter audio, brisker delivery)
tts.tts_to_file(text, speaker_id, output_path, speed=1.2)

# Slower speech (better articulation)
tts.tts_to_file(text, speaker_id, output_path, speed=0.8)
```

### 5. Memory Efficiency

```python
# Free GPU memory between large batches
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
```

***

## Troubleshooting

### Issue: `espeak-ng` not found

```bash
apt-get install -y espeak-ng
python3 -c "import phonemizer; print('phonemizer OK')"
```

### Issue: NLTK data missing

```bash
python3 -c "
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt')
"
```

### Issue: Port 8888 conflicts with Jupyter

MeloTTS uses port 8888 by default, which clashes with Jupyter Notebook. Solutions:

```bash
# Option 1: Run MeloTTS on a different port
python -m melo.api_server --host 0.0.0.0 --port 8889

# Option 2: Run Jupyter on a different port
jupyter notebook --port 8890
```
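
A programmatic alternative (a common stdlib pattern, not a MeloTTS feature) is to let the kernel assign a free port before launching either service:

```python
import socket

def find_free_port():
    """Ask the kernel for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
print(f"python -m melo.api_server --host 0.0.0.0 --port {port}")
```

Note the small race window: another process could claim the port between this check and the server binding it, so treat this as a convenience rather than a guarantee.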

### Issue: Chinese text not rendering correctly

```bash
# (Re)install the Chinese text-processing dependencies
pip install jieba pypinyin cn2an

# Test
python3 -c "from melo.api import TTS; t = TTS('ZH'); print('ZH OK')"
```

### Issue: Docker image pull fails

```bash
# Build from source instead
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python -m unidic download
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"
```

### Issue: Slow inference on GPU

```bash
# Verify GPU is being used
python3 -c "
import torch
from melo.api import TTS
tts = TTS('EN', device='cuda')
print(f'Device: {next(tts.model.parameters()).device}')
print(f'CUDA available: {torch.cuda.is_available()}')
"
```

***

## Clore.ai GPU Recommendations

MeloTTS is lightweight — it runs well on CPU for low volume, and even a modest GPU delivers a large speedup. You don't need expensive hardware.

| GPU       | VRAM  | Clore.ai Price | RTF (Real-Time Factor)    | Capacity      |
| --------- | ----- | -------------- | ------------------------- | ------------- |
| CPU-only  | —     | \~$0.02/hr     | \~0.3×                    | \~3 req/min   |
| RTX 3090  | 24 GB | \~$0.12/hr     | \~0.02× (50× real-time)   | \~100 req/min |
| RTX 4090  | 24 GB | \~$0.70/hr     | \~0.01× (100× real-time)  | \~200 req/min |
| A100 40GB | 40 GB | \~$1.20/hr     | \~0.005× (200× real-time) | \~400 req/min |

{% hint style="info" %}
**Best value for TTS workloads:** RTX 3090 at \~$0.12/hr delivers 50× real-time TTS speed. For a production API serving hundreds of users, this is more than sufficient. CPU-only instances (\~$0.02/hr) work fine for development and low-traffic deployments.
{% endhint %}

**Production recommendation:** For a multilingual TTS API serving 10–50 concurrent users, RTX 3090 is the sweet spot. Scale horizontally (multiple instances) rather than upgrading to expensive A100 — MeloTTS doesn't benefit proportionally from higher-end GPUs.
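
To compare instances against your own workload, hourly price and RTF translate directly into compute cost per synthesized clip. The numbers below reuse the illustrative figures from the table above; actual marketplace prices vary:

```python
def cost_per_1000_clips(price_per_hour, rtf, clip_seconds=10.0):
    """Estimated compute cost (USD) to synthesize 1000 clips,
    assuming perfect utilization (no batching or queueing overhead)."""
    gpu_seconds = 1000 * clip_seconds * rtf
    return price_per_hour * gpu_seconds / 3600

# RTX 3090 at ~$0.12/hr, RTF ~0.02, 10 s clips:
print(f"${cost_per_1000_clips(0.12, 0.02):.4f}")
```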

***

## Links

* **GitHub**: <https://github.com/myshell-ai/MeloTTS>
* **Docker**: No official Docker Hub image — install from [GitHub source](https://github.com/myshell-ai/MeloTTS) using `nvidia/cuda:12.1.0-devel-ubuntu22.04` base image
* **Paper**: <https://arxiv.org/abs/2406.06753>
* **Hugging Face**: <https://huggingface.co/myshell-ai/MeloTTS-English>
* **MyShell AI**: <https://myshell.ai>
* **CLORE.AI Marketplace**: <https://clore.ai/marketplace>

