MeloTTS

Run MeloTTS high-quality multilingual TTS with fast inference on Clore.ai GPUs

MeloTTS is a high-quality, multilingual text-to-speech library developed by MyShell AI. It delivers fast, natural-sounding speech synthesis across multiple languages and English accents, designed for both research and production deployment. MeloTTS is optimized for speed — it can generate speech significantly faster than real-time even on CPU — while maintaining high audio quality suitable for commercial use.

MeloTTS currently supports:

  • English (American, British, Indian, Australian, Default)

  • Chinese (Simplified & Mixed Chinese-English)

  • Japanese

  • Korean

  • Spanish

  • French

Key highlights:

  • ⚡ Fast inference — faster than real-time on CPU, blazing fast on GPU

  • 🌍 Multilingual — 6 languages with accent variants for English

  • 🐳 Docker-ready — official Docker image available

  • 🔌 REST API — HTTP API for integration into any application

  • 📱 Production-grade — used in MyShell's consumer products

Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA GTX 1080 (8 GB) | NVIDIA RTX 3090 (24 GB) |
| VRAM | 4 GB | 8–16 GB |
| RAM | 8 GB | 16 GB |
| CPU | 4 cores | 8 cores |
| Disk | 10 GB | 20 GB |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 |
| CUDA | 11.7+ (optional) | 12.1+ |
| Python | 3.8+ | 3.10 |
| Ports | 22, 8888 | 22, 8888 |

MeloTTS is unusually efficient: it runs well on CPU for single requests and benefits greatly from GPU for batch processing. Even a budget GPU increases throughput dramatically.


Quick Deploy on CLORE.AI

1. Find a suitable server

Go to the CLORE.AI Marketplace and filter by:

  • VRAM: ≥ 4 GB (or CPU-only for low volume)

  • GPU: Any NVIDIA GPU (GTX 1080+, RTX series, A100)

  • Disk: ≥ 10 GB

2. Configure your deployment

Docker Image:
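There is no official MeloTTS image on Docker Hub, so a plain NVIDIA CUDA runtime image works as the base (the exact tag below is an assumption; any CUDA 11.7+ runtime tag should work):

```
nvidia/cuda:12.1.1-runtime-ubuntu22.04
```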

Port Mappings:
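Map SSH and the MeloTTS web UI/API port:

```
22:22      # SSH
8888:8888  # MeloTTS web UI / API
```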

Environment Variables:
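MeloTTS itself requires no mandatory environment variables. The one below is a standard NVIDIA container toolkit variable, useful if your template does not already expose the GPU:

```
NVIDIA_VISIBLE_DEVICES=all
```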

Startup Command (run after SSH into the server):
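A minimal startup sketch, assuming a bare CUDA/Ubuntu image (the `--host`/`--port` flags refer to the Gradio launcher in `melo/app.py`; adjust if your version differs):

```shell
apt-get update && apt-get install -y git python3-pip
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python -m unidic download          # Japanese dictionary used by the text frontend
python melo/app.py --host 0.0.0.0 --port 8888
```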

3. Access the API

Test with:
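A quick reachability check from your local machine (replace `<server-ip>`; a 200 status means the UI is serving):

```shell
curl -s -o /dev/null -w "%{http_code}\n" http://<server-ip>:8888
```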


Step-by-Step Setup

Step 1: SSH into your server
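Clore.ai shows the exact SSH command on the server card; it generally looks like this (host, port, and user come from your rental details):

```shell
ssh -p <ssh-port> root@<server-ip>
```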

Step 2: Build and run the container

Since MeloTTS has no pre-built Docker Hub image, use an NVIDIA CUDA base and install MeloTTS from source:
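One way to do this in a single command, sketched against a generic CUDA runtime image (the image tag and container name are assumptions; the install steps follow the MeloTTS README):

```shell
docker run -d --gpus all --name melotts -p 8888:8888 \
  nvidia/cuda:12.1.1-runtime-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y git python3-pip && \
           git clone https://github.com/myshell-ai/MeloTTS.git && \
           cd MeloTTS && pip install -e . && python -m unidic download && \
           python melo/app.py --host 0.0.0.0 --port 8888"
```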

Alternatively, build a custom Docker image from source:
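The repository ships its own Dockerfile, so you can build locally:

```shell
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
docker build -t melotts .
docker run -d --gpus all -p 8888:8888 melotts
```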

Step 3: Verify the service is running
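Check the container logs and the HTTP endpoint (the container name `melotts` is an assumption; substitute whatever you named yours):

```shell
docker logs melotts              # look for the Gradio startup message
curl -I http://localhost:8888    # expect an HTTP 200 response
```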

Step 4: Alternative — Jupyter Notebook interface
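If you prefer working in notebooks, run Jupyter on the server instead (note that MeloTTS's web UI also defaults to 8888, so only one of the two can own that port at a time):

```shell
pip install notebook
jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root --no-browser
```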

Access at: http://<server-ip>:8888

Step 5: Install from pip (without Docker)
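The install steps from the MeloTTS README (install from the Git repository, then fetch the Japanese dictionary once):

```shell
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python -m unidic download
```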


Usage Examples

Example 1: Basic English TTS (Python)
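A minimal sketch using the `melo.api.TTS` class; the speaker table comes from the loaded model, and `device='auto'` falls back to CPU when no GPU is present:

```python
from melo.api import TTS

# Load the English model once; 'auto' picks CUDA when available, otherwise CPU
model = TTS(language='EN', device='auto')
speaker_ids = model.hps.data.spk2id  # EN-US, EN-BR, EN_INDIA, EN-AU, EN-Default

text = "MeloTTS runs faster than real time, even on CPU."
model.tts_to_file(text, speaker_ids['EN-US'], 'en-us.wav', speed=1.0)
```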


Example 2: Multilingual TTS
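The same pattern covers the other languages; you load one model per language code and use its matching speaker ID (note that Spanish uses the code `ES`):

```python
from melo.api import TTS

texts = {
    'ES': "El sol brilla sobre las olas del mar.",
    'FR': "La lumière du soleil caresse les vagues.",
    'JP': "毎朝ジョギングをして健康を保っています。",
    'KR': "오늘은 날씨가 정말 좋네요.",
}

for lang, text in texts.items():
    model = TTS(language=lang, device='auto')   # one model per language
    speaker_ids = model.hps.data.spk2id
    model.tts_to_file(text, speaker_ids[lang], f'{lang.lower()}.wav')
```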


Example 3: REST API Usage
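MeloTTS does not ship a REST server of its own, so the endpoint below is a minimal FastAPI wrapper you would run yourself; the `/tts` route, request fields, and port are inventions of this sketch, not part of the library:

```python
import os
import tempfile

from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

from melo.api import TTS

app = FastAPI()
model = TTS(language='EN', device='auto')   # load once at startup
speaker_ids = model.hps.data.spk2id

class TTSRequest(BaseModel):
    text: str
    speaker: str = 'EN-US'
    speed: float = 1.0

@app.post('/tts')
def synthesize(req: TTSRequest) -> Response:
    # tts_to_file writes a WAV to disk, so round-trip through a temp file
    fd, path = tempfile.mkstemp(suffix='.wav')
    os.close(fd)
    try:
        model.tts_to_file(req.text, speaker_ids[req.speaker], path, speed=req.speed)
        with open(path, 'rb') as f:
            audio = f.read()
    finally:
        os.remove(path)
    return Response(content=audio, media_type='audio/wav')
```

Run it with `uvicorn server:app --host 0.0.0.0 --port 8888`, then call it with `curl -X POST http://<server-ip>:8888/tts -H 'Content-Type: application/json' -d '{"text": "Hello"}' -o hello.wav`.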


Example 4: High-Speed Batch Processing
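For batch jobs the dominant cost is model loading, so load once and stream text through it. The chunking helper below is plain Python (the 200-character limit is an arbitrary choice for this sketch):

```python
import re

def chunk_text(text, max_chars=200):
    """Split text on sentence boundaries into chunks under max_chars."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f'{current} {s}'.strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_chunks(chunks, language='EN', speaker='EN-US'):
    # One model load amortized over the whole batch is where the GPU speedup shows
    from melo.api import TTS
    model = TTS(language=language, device='auto')
    spk = model.hps.data.spk2id[speaker]
    for i, chunk in enumerate(chunks):
        model.tts_to_file(chunk, spk, f'part_{i:03d}.wav')
```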


Example 5: Mixed Chinese-English TTS
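Mixed text goes through the Chinese model, which handles embedded English natively:

```python
from melo.api import TTS

model = TTS(language='ZH', device='auto')
speaker_ids = model.hps.data.spk2id

text = "我最近在学习 machine learning，希望能在 AI 领域有所建树。"
model.tts_to_file(text, speaker_ids['ZH'], 'zh-mix.wav')
```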


Configuration

Docker Compose Setup

Since MeloTTS has no official Docker Hub image, use the NVIDIA CUDA base image and install MeloTTS from source at startup:
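A sketch of that setup as a Compose file (the image tag, service name, and inline install command are assumptions; persisting the cloned repo in a volume would avoid reinstalling on every restart):

```yaml
services:
  melotts:
    image: nvidia/cuda:12.1.1-runtime-ubuntu22.04
    ports:
      - "8888:8888"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      bash -c "apt-get update && apt-get install -y git python3-pip &&
               git clone https://github.com/myshell-ai/MeloTTS.git &&
               cd MeloTTS && pip install -e . && python -m unidic download &&
               python melo/app.py --host 0.0.0.0 --port 8888"
    restart: unless-stopped
```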

API Configuration Options

| Parameter | Default | Description |
| --- | --- | --- |
| `--host` | `127.0.0.1` | Bind address (use `0.0.0.0` for public access) |
| `--port` | `8888` | API server port |
| `--workers` | `1` | Number of worker processes |
| `--device` | `auto` | `cuda`, `cpu`, or `auto` |

Supported Languages and Speakers

| Language | Code | Speaker IDs |
| --- | --- | --- |
| English | `EN` | `EN-Default`, `EN-US`, `EN-BR` (British), `EN_INDIA`, `EN-AU` |
| Chinese | `ZH` | `ZH` |
| Japanese | `JP` | `JP` |
| Korean | `KR` | `KR` |
| Spanish | `ES` | `ES` |
| French | `FR` | `FR` |


Performance Tips

1. GPU vs CPU Benchmark

MeloTTS performance (RTF = Real-Time Factor, lower is better):

| Device | RTF | Notes |
| --- | --- | --- |
| CPU (8 cores) | ~0.3× | Fast; fine for low load |
| RTX 3080 | ~0.05× | 20× faster than real-time |
| RTX 4090 | ~0.02× | 50× faster than real-time |
| A100 | ~0.01× | 100× faster than real-time |

2. Optimize for Throughput
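One common pattern: several worker processes, each holding its own model copy, fed from a shared job list (a sketch using the standard library; the worker count should match available VRAM):

```python
from concurrent.futures import ProcessPoolExecutor

_model = None  # one model instance per worker process

def _synthesize(job):
    global _model
    text, out_path = job
    if _model is None:           # lazy-load once per worker, then reuse
        from melo.api import TTS
        _model = TTS(language='EN', device='auto')
    spk = _model.hps.data.spk2id['EN-US']
    _model.tts_to_file(text, spk, out_path)
    return out_path

def synthesize_many(jobs, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_synthesize, jobs))
```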

3. Pre-warm the Model
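The first synthesis call pays one-time costs (CUDA context creation, weight loading, dictionary downloads), so run a throwaway sentence at startup before accepting traffic:

```python
from melo.api import TTS

model = TTS(language='EN', device='auto')
speaker = model.hps.data.spk2id['EN-US']

# Throwaway call: after this, steady-state latency is what users see
model.tts_to_file("Warm up.", speaker, '/tmp/warmup.wav')
```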

4. Adjust Audio Quality vs Speed
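`tts_to_file` exposes a handful of knobs: `speed` changes the output rate directly, while the noise/SDP parameters trade prosody variation against consistency (the values below are the library defaults, shown explicitly as a starting point):

```python
from melo.api import TTS

model = TTS(language='EN', device='auto')
speaker = model.hps.data.spk2id['EN-US']

model.tts_to_file(
    "Tuning example.",
    speaker,
    'tuned.wav',
    speed=1.0,        # >1.0 speaks faster
    sdp_ratio=0.2,    # stochastic duration predictor weight
    noise_scale=0.6,  # prosody variation
    noise_scale_w=0.8,
)
```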

5. Memory Efficiency
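If you serve several languages from one box and VRAM is tight, load models on demand and release them after use; keep them resident instead when you have headroom, since reloading is the slow part (a sketch, assuming PyTorch is present):

```python
import gc

import torch
from melo.api import TTS

def synthesize_once(text, language, speaker, out_path):
    model = TTS(language=language, device='cuda:0')
    model.tts_to_file(text, model.hps.data.spk2id[speaker], out_path)
    del model                   # drop the reference...
    gc.collect()                # ...collect it...
    torch.cuda.empty_cache()    # ...and return the VRAM to the pool
```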


Troubleshooting

Issue: espeak-ng not found
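If the text frontend complains about a missing espeak binary, installing the system package usually resolves it:

```shell
apt-get update && apt-get install -y espeak-ng
```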

Issue: NLTK data missing
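MeloTTS's English frontend pulls NLTK resources on first use; fetching them explicitly avoids runtime failures (exact resource names can vary slightly across NLTK versions):

```python
import nltk

nltk.download('averaged_perceptron_tagger')
nltk.download('cmudict')
```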

Issue: Port 8888 conflicts with Jupyter

MeloTTS uses port 8888 by default, which clashes with Jupyter Notebook. Solutions:
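Either remap the container port or move Jupyter (the `melotts` image name assumes a locally built image):

```shell
# Option 1: expose the MeloTTS UI on a different host port
docker run -d --gpus all -p 8899:8888 melotts

# Option 2: run Jupyter on another port instead
jupyter notebook --ip 0.0.0.0 --port 8899 --allow-root
```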

Issue: Chinese text not rendering correctly
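Garbled Chinese is usually an encoding problem rather than a model problem; make sure the container locale is UTF-8 and that you loaded the `ZH` model for Chinese input:

```shell
apt-get install -y locales && locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
```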

Issue: Docker image pull fails
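There is no official MeloTTS image on Docker Hub, so a failing pull is expected; build the image from the repository instead:

```shell
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
docker build -t melotts .
```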

Issue: Slow inference on GPU
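First confirm PyTorch can actually see the GPU, then pin the device explicitly instead of relying on `auto`:

```python
import torch
print(torch.cuda.is_available())   # must be True, otherwise you are running on CPU

from melo.api import TTS
model = TTS(language='EN', device='cuda:0')  # force the GPU explicitly
```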


Clore.ai GPU Recommendations

MeloTTS is lightweight — it runs well on CPU for low volume and scales linearly with GPU compute. You don't need expensive hardware.

| GPU | VRAM | Clore.ai Price | RTF (Real-Time Factor) | Capacity |
| --- | --- | --- | --- | --- |
| CPU-only | n/a | ~$0.02/hr | ~0.3× | ~3 req/min |
| RTX 3090 | 24 GB | ~$0.12/hr | ~0.02× (50× real-time) | ~100 req/min |
| RTX 4090 | 24 GB | ~$0.70/hr | ~0.01× (100× real-time) | ~200 req/min |
| A100 40GB | 40 GB | ~$1.20/hr | ~0.005× (200× real-time) | ~400 req/min |

Best value for TTS workloads: RTX 3090 at $0.12/hr delivers 50× real-time TTS speed. For a production API serving hundreds of users, this is more than sufficient. CPU-only instances ($0.02/hr) work fine for development and low-traffic deployments.

Production recommendation: For a multilingual TTS API serving 10–50 concurrent users, RTX 3090 is the sweet spot. Scale horizontally (multiple instances) rather than upgrading to expensive A100 — MeloTTS doesn't benefit proportionally from higher-end GPUs.

