TTS Engine Comparison

Compare the leading open-source text-to-speech engines for deployment on Clore.ai GPU servers.

Text-to-Speech (TTS) converts written text into natural-sounding audio. This guide compares five leading open-source TTS engines (XTTS v2, Bark, Kokoro, Fish Speech, and MeloTTS) on quality, speed, language support, and voice cloning capabilities.


Quick Decision Matrix

| Feature | XTTS v2 | Bark | Kokoro | Fish Speech | MeloTTS |
| --- | --- | --- | --- | --- | --- |
| Developer | Coqui AI | Suno AI | Hexgrad | Fish Audio | MyShell AI |
| Quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Speed | Medium | Slow | Fast | Fast | Fastest |
| Voice cloning | ✅ (3s clip) | ✅ (voice presets) | ✅ (limited) | ✅ (10s clip) | ❌ |
| Languages | 17 | 10+ | English | 8+ | 8 |
| Min VRAM | 4GB | 8GB | CPU ok | 4GB | CPU ok |
| License | CPML (non-commercial) | MIT | Apache 2.0 | CC BY-NC-SA | MIT |
| GitHub stars | 35K+ (Coqui TTS) | 38K+ | 12K+ | 14K+ | 15K+ |


Overview

XTTS v2

Coqui's XTTS v2 is widely treated as the reference point for open-source voice-cloning TTS. It can clone a voice from a reference clip as short as 3 seconds, with high fidelity to the original speaker.

Philosophy: Maximum expressiveness and voice cloning quality.

Bark

Suno's Bark is a transformer-based TTS model that generates highly expressive speech, including non-speech sounds: laughter, sighs, music, and sound effects.

Philosophy: Not just speech — full audio generation.

Kokoro

Kokoro is a lightweight, fast TTS model optimized for English. Despite its small size (~82M parameters), it delivers surprisingly high quality.

Philosophy: Small model, big quality, runs anywhere.

Fish Speech

Fish Audio's Fish Speech is a production-grade TTS with exceptional voice cloning from short clips. It uses a novel codec + language model architecture.

Philosophy: Production quality, fast inference, excellent cloning.

MeloTTS

MyShell's MeloTTS is ultra-fast, multi-accent TTS optimized for real-time applications. It runs efficiently on CPU and supports multiple English accents and Asian languages.

Philosophy: Real-time speed at any scale.


Quality Comparison

Naturalness Scores (MOS — Mean Opinion Score, 1-5)

MOS scores are approximate values based on published papers and community evaluations. Actual quality depends heavily on text content and voice configuration.

| Model | English MOS | Multilingual MOS | Expressiveness |
| --- | --- | --- | --- |
| XTTS v2 | 4.3 | 4.1 | ⭐⭐⭐⭐⭐ |
| Bark | 3.9 | 3.7 | ⭐⭐⭐⭐⭐ (unique) |
| Kokoro | 4.2 | N/A (EN only) | ⭐⭐⭐ |
| Fish Speech | 4.4 | 4.2 | ⭐⭐⭐⭐ |
| MeloTTS | 3.8 | 3.6 | ⭐⭐ |

What Each Model Does Best

| Model | Standout Quality Feature |
| --- | --- |
| XTTS v2 | Near-perfect voice cloning, emotional range |
| Bark | Non-speech sounds, laughter, music, effects |
| Kokoro | Best quality-to-size ratio, natural cadence |
| Fish Speech | Best overall naturalness + cloning accuracy |
| MeloTTS | Consistent, clean output for long texts |


Speed Benchmarks

Characters Per Second (CPU vs GPU)

Test: "The quick brown fox jumps over the lazy dog. How are you today?" (63 characters, including spaces and punctuation)

| Model | CPU Speed | GPU Speed (RTX 3080) | Real-time Factor |
| --- | --- | --- | --- |
| XTTS v2 | ~15 chars/s | ~150 chars/s | 0.3× (GPU) |
| Bark | ~5 chars/s | ~40 chars/s | 0.1× (GPU) |
| Kokoro | ~200 chars/s | ~800 chars/s | 5× (GPU) |
| Fish Speech | ~80 chars/s | ~500 chars/s | 3× (GPU) |
| MeloTTS | ~500 chars/s | ~2000 chars/s | 12× (GPU) |

Real-time factor > 1.0 means faster than playback speed
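
As a small worked example of the note above, the real-time factor can be computed as seconds of audio produced per second of compute. The function name and the sample numbers below are illustrative assumptions, not measurements from this guide.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return seconds of audio produced per second of wall-clock compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Hypothetical run: 60 s of audio synthesized in 12 s of wall-clock time.
rtf = real_time_factor(60.0, 12.0)
print(f"{rtf:.1f}x real time")  # prints "5.0x real time"
```

A value above 1.0 means the engine can keep up with live playback; a value below 1.0 means audio has to be generated ahead of time.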

Time to Generate 1 Minute of Audio

| Model | CPU | RTX 3080 | A100 |
| --- | --- | --- | --- |
| XTTS v2 | ~8 min | ~30s | ~10s |
| Bark | ~20 min | ~3 min | ~45s |
| Kokoro | ~20s | ~5s | ~2s |
| Fish Speech | ~45s | ~8s | ~3s |
| MeloTTS | ~8s | ~2s | <1s |

Language Support

Supported Languages

| Model | Languages | Notable |
| --- | --- | --- |
| XTTS v2 | 17 | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO, HI |
| Bark | 10+ | EN, ZH, FR, DE, HI, IT, JA, KO, PL, PT, RU, ES, TR |
| Kokoro | 2 | English (US/UK), Japanese (limited) |
| Fish Speech | 8 | EN, ZH, JA, KO, FR, DE, AR, ES |
| MeloTTS | 8 | EN (4 accents), ES, FR, ZH, JA, KO |

Language Quality Notes

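| Model | English | Chinese | Japanese | European |
| --- | --- | --- | --- | --- |
| XTTS v2 | Excellent | Good | Good | Excellent |
| Bark | Good | Fair | Fair | Good |
| Kokoro | Excellent | ❌ | Limited | ❌ |
| Fish Speech | Excellent | Best | Good | Good |
| MeloTTS | Good | Good | Good | Good |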

For Chinese TTS: Fish Speech and MeloTTS are the best open-source options. Both handle tones and characters naturally.

For multilingual applications: XTTS v2 supports the most languages with consistent quality across all of them.


Voice Cloning Comparison

Cloning Capabilities

| Model | Reference Length | Cloning Quality | Zero-Shot |
| --- | --- | --- | --- |
| XTTS v2 | 3 seconds | ⭐⭐⭐⭐⭐ | ✅ |
| Bark | Voice presets only | ⭐⭐⭐ | Partial |
| Kokoro | Not supported | ❌ | ❌ |
| Fish Speech | 10 seconds | ⭐⭐⭐⭐⭐ | ✅ |
| MeloTTS | Not supported | ❌ | ❌ |

XTTS v2 Voice Cloning
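
A minimal sketch of zero-shot cloning through the Coqui `TTS` Python package, assuming it is installed and a GPU is available. The `clone_voice` helper, its defaults, and the file names are illustrative assumptions; the heavy import is deferred so the module loads even without the package.

```python
from pathlib import Path

# Model identifier used by the Coqui TTS package for XTTS v2.
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text, reference_wav, out_path="cloned.wav",
                language="en", device="cuda"):
    """Synthesize `text` in the voice of `reference_wav` (hypothetical helper)."""
    ref = Path(reference_wav)
    if not ref.is_file():
        raise FileNotFoundError(f"reference clip not found: {ref}")

    # Imported lazily: requires the Coqui TTS package and downloads
    # the model weights on first use.
    from TTS.api import TTS

    tts = TTS(XTTS_MODEL).to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(ref),   # short clip of the target speaker
        language=language,
        file_path=str(out_path),
    )
    return Path(out_path)
```

A call such as `clone_voice("Hello from Clore.ai", "speaker.wav")` would write `cloned.wav`; `speaker.wav` is a placeholder for your own reference recording.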

Fish Speech Voice Cloning
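
Fish Speech's inference entry points have changed between releases, so this sketch does not call them. Instead it shows a pre-flight check on the reference clip, using only the standard library and the roughly 10-second reference length from the table above. The helper names are assumptions made for illustration.

```python
import wave

# Reference length taken from the cloning table in this guide.
MIN_REFERENCE_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str,
                        min_seconds: float = MIN_REFERENCE_SECONDS) -> bool:
    """True when the clip is at least `min_seconds` long."""
    return wav_duration_seconds(path) >= min_seconds
```

Rejecting short clips before they reach the GPU avoids paying for a cloning run that is likely to produce a poor match.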

Bark Voice Presets
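
Bark selects a voice through a preset name passed as `history_prompt` rather than through a reference clip. The sketch below assumes the `bark` package and SciPy are installed; `preset_name` and `speak_with_preset` are hypothetical helpers, and the preset index range reflects the ten speakers per language shipped with Bark's v2 presets.

```python
def preset_name(language: str, speaker_index: int) -> str:
    """Build a Bark v2 preset identifier such as 'v2/en_speaker_6'."""
    if not 0 <= speaker_index <= 9:
        raise ValueError("speaker_index must be between 0 and 9")
    return f"v2/{language}_speaker_{speaker_index}"


def speak_with_preset(text, preset="v2/en_speaker_6", out_path="bark.wav"):
    """Generate speech with a fixed Bark preset and write it to a WAV file."""
    # Imported lazily: requires the bark package and downloads models on first use.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    audio = generate_audio(text, history_prompt=preset)
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path
```

Reusing the same preset across calls keeps the speaker roughly consistent, which helps with Bark's run-to-run variation.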


XTTS v2: Deep Dive

Architecture

  • VITS + GPT hybrid architecture

  • Trained on 16K+ hours across 17 languages

  • 3-second minimum for zero-shot cloning

Installation on Clore.ai

Docker Deployment

Weaknesses: CPML license (non-commercial without permission), slower than Kokoro/MeloTTS


Bark: Deep Dive

Architecture

  • GPT-style transformer for audio token generation

  • Three-stage process: text → semantic → coarse → fine tokens

  • Generates actual audio codec tokens (EnCodec)

What Makes Bark Unique

Among the engines compared here, Bark is the only one that natively generates:

  • 🎵 Background music within speech

  • 😂 Laughter, sighs, throat-clearing

  • 🎭 Multiple speakers in one generation

  • 🌍 Mixed-language utterances

Markup Language
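
Bark reads non-speech cues directly from the prompt text. The cue strings below are the ones documented in the Bark README; the helper functions are illustrative assumptions, and the model treats cues as hints rather than guarantees.

```python
# Cues documented in the Bark README; output is not guaranteed to honor them.
BARK_CUES = ("[laughter]", "[laughs]", "[sighs]", "[music]",
             "[gasps]", "[clears throat]")


def with_cue(text: str, cue: str) -> str:
    """Append a known non-speech cue to a prompt."""
    if cue not in BARK_CUES:
        raise ValueError(f"unknown cue: {cue}")
    return f"{text} {cue}"


def sing(lyrics: str) -> str:
    """Wrap lyrics in note symbols, which Bark reads as singing."""
    return f"♪ {lyrics} ♪"


prompt = with_cue("Well, that did not go as planned...", "[sighs]")
print(prompt)  # Well, that did not go as planned... [sighs]
```

The resulting string is passed to Bark as the ordinary text prompt; capitalizing a word is another documented way to add emphasis.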

Installation

Weaknesses: Slow (3-stage pipeline), inconsistent across runs, no true voice cloning


Kokoro: Deep Dive

Architecture

  • 82M parameter StyleTTS2-based model

  • Extremely small but surprisingly high quality

  • Fast inference on CPU and GPU

Voices Available

Streaming Support
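
Kokoro's `KPipeline` yields audio segment by segment, which makes streaming straightforward. The sketch below assumes the `kokoro` package is installed; `stream_speech` and `chunk_seconds` are hypothetical helpers, and `af_heart` with `lang_code="a"` (American English) is one voice from the published voice pack.

```python
KOKORO_SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio


def chunk_seconds(num_samples: int) -> float:
    """Duration of an audio chunk at Kokoro's sample rate."""
    return num_samples / KOKORO_SAMPLE_RATE


def stream_speech(text, voice="af_heart", lang_code="a", speed=1.0):
    """Yield audio chunks as soon as each text segment is synthesized."""
    # Imported lazily: requires the kokoro package.
    from kokoro import KPipeline

    pipeline = KPipeline(lang_code=lang_code)
    segments = pipeline(text, voice=voice, speed=speed, split_pattern=r"\n+")
    for _graphemes, _phonemes, audio in segments:
        yield audio
```

A caller can forward each yielded chunk to a socket or audio device immediately instead of waiting for the full utterance.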

Weaknesses: English only (primarily), no voice cloning, limited expressiveness


Fish Speech: Deep Dive

Architecture

  • VQGAN + Language Model architecture

  • Trained on 700K+ hours of audio

  • Strong multilingual with Asian language support

Installation

Python API

Voice Cloning

Weaknesses: CC BY-NC-SA license (non-commercial), higher VRAM for best quality


MeloTTS: Deep Dive

Architecture

  • VITS2-based architecture

  • Multi-accent English training

  • Extremely optimized for inference speed

Accents and Languages
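
MeloTTS exposes English accents as speaker IDs on a single English model. The sketch below assumes the `melo` package is installed; `speak_accent` is a hypothetical helper, and the accent keys are the speaker names used in the MeloTTS README.

```python
# Speaker keys for the English model, as listed in the MeloTTS README.
EN_ACCENTS = ("EN-US", "EN-BR", "EN_INDIA", "EN-AU")


def speak_accent(text, accent="EN-US", out_path="melo.wav",
                 speed=1.0, device="auto"):
    """Synthesize English text with the chosen accent."""
    if accent not in EN_ACCENTS:
        raise ValueError(f"unsupported accent: {accent}")

    # Imported lazily: requires the melo package.
    from melo.api import TTS

    model = TTS(language="EN", device=device)
    speaker_ids = model.hps.data.spk2id
    model.tts_to_file(text, speaker_ids[accent], out_path, speed=speed)
    return out_path
```

Other languages follow the same pattern with a different `language` argument and that model's own speaker keys.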

Batch Processing (Very Fast)
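
For batch work, most of the gain comes from loading the model once and looping over inputs. This sketch separates the pure planning step from the MeloTTS call so the plan can be inspected first; `plan_batch` and `run_batch` are hypothetical helpers, and the file naming scheme is an assumption.

```python
from pathlib import Path


def plan_batch(texts, out_dir="out"):
    """Pair each non-empty text with a numbered output path."""
    cleaned = [t.strip() for t in texts if t.strip()]
    out = Path(out_dir)
    return [(t, out / f"clip_{i:04d}.wav") for i, t in enumerate(cleaned)]


def run_batch(texts, out_dir="out", speaker="EN-US", speed=1.0, device="auto"):
    """Synthesize every text with one shared MeloTTS model instance."""
    # Imported lazily: requires the melo package.
    from melo.api import TTS

    jobs = plan_batch(texts, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    model = TTS(language="EN", device=device)      # loaded once
    speaker_id = model.hps.data.spk2id[speaker]
    for text, path in jobs:
        model.tts_to_file(text, speaker_id, str(path), speed=speed)
    return [path for _, path in jobs]
```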

Weaknesses: No voice cloning, robotic at high speed, limited expressiveness


Deployment on Clore.ai

All-in-One TTS Server
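
One way to host several engines in a single process is a registry that loads each model on first use and keeps it in memory afterwards. This is a design sketch under that assumption, not the server from the original guide; `EngineRegistry` and the loader signature are hypothetical.

```python
from typing import Callable, Dict

# An engine is any callable that turns (text, out_path) into a written file path.
Engine = Callable[[str, str], str]


class EngineRegistry:
    """Loads each TTS engine lazily and reuses it for later requests."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Engine]] = {}
        self._engines: Dict[str, Engine] = {}

    def register(self, name: str, loader: Callable[[], Engine]) -> None:
        """Record how to build an engine without building it yet."""
        self._loaders[name] = loader

    def synthesize(self, name: str, text: str, out_path: str) -> str:
        """Run the named engine, loading it on the first request."""
        if name not in self._loaders:
            raise KeyError(f"unknown engine: {name}")
        if name not in self._engines:
            self._engines[name] = self._loaders[name]()
        return self._engines[name](text, out_path)
```

Lazy loading keeps VRAM free for the engines that are actually requested, which matters when a 24GB card is shared between several models. An HTTP layer can call `synthesize` from its request handler.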

VRAM Requirements Summary

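| Model | CPU | 4GB GPU | 8GB GPU | 16GB GPU |
| --- | --- | --- | --- | --- |
| XTTS v2 | Slow | ✅ | ✅ | ✅ |
| Bark | Very slow | ❌ | ✅ | ✅ |
| Kokoro | Fast | ✅ | ✅ | ✅ |
| Fish Speech | Medium | ✅ | ✅ | ✅ |
| MeloTTS | Very fast | ✅ | ✅ | ✅ |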


Integration Examples

OpenAI-Compatible API (for drop-in replacement)
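
A client-side sketch using only the standard library, assuming a self-hosted server that mirrors OpenAI's `/v1/audio/speech` route and its `model`, `input`, `voice`, and `response_format` fields. The base URL, model name, and voice name are placeholders; which values a given server accepts depends on that server.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice="default", model="tts-1",
                         response_format="mp3", api_key="not-needed"):
    """Build a POST request in the shape of OpenAI's speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize_to_file(base_url, text, out_path="speech.mp3", **kwargs):
    """Send the request and save the returned audio bytes."""
    request = build_speech_request(base_url, text, **kwargs)
    with urllib.request.urlopen(request) as response:
        with open(out_path, "wb") as f:
            f.write(response.read())
    return out_path
```

Because the request shape matches OpenAI's, existing client code usually only needs its base URL changed to point at the self-hosted server.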

LangChain Integration


When to Use Which

Decision Guide
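
The recommendations in this guide can be condensed into a small selection function. This is one possible encoding of them, with the rule order and the flag names chosen for illustration.

```python
def choose_engine(*, voice_cloning=False, sound_effects=False,
                  real_time=False, commercial=False, cjk=False):
    """Return (engine, caveat) following this guide's recommendations."""
    if sound_effects:
        # Only Bark generates laughter, music, and similar cues.
        return ("Bark", "")
    if voice_cloning:
        engine = "Fish Speech" if cjk else "XTTS v2"
        # Both cloning engines ship under non-commercial licenses.
        caveat = "non-commercial license, check licensing first" if commercial else ""
        return (engine, caveat)
    if real_time or commercial:
        # Permissively licensed and fast; MeloTTS covers Chinese, Japanese, Korean.
        return ("MeloTTS" if cjk else "Kokoro", "")
    return ("XTTS v2", "")


print(choose_engine(real_time=True))  # ('Kokoro', '')
```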

By Application Type

| Application | Best Choice | Why |
| --- | --- | --- |
| Audiobook generation | XTTS v2 | Natural, consistent voice |
| Real-time chatbot | MeloTTS or Kokoro | Fastest inference |
| Podcast automation | XTTS v2 or Fish Speech | Best cloning |
| Game characters | Bark | Expressive, varied voices |
| Customer service | MeloTTS | Scalable, fast |
| Accessibility tools | Kokoro | Lightweight, free |
| Voice dubbing | Fish Speech | Best cloning quality |
| Long-form narration | XTTS v2 | Consistent quality |


License Summary

| Model | License | Commercial? | Notes |
| --- | --- | --- | --- |
| XTTS v2 | Coqui Public Model License | ❌ | Requires a license for commercial use |
| Bark | MIT | ✅ | Free for all use |
| Kokoro | Apache 2.0 | ✅ | Free for all use |
| Fish Speech | CC BY-NC-SA 4.0 | ❌ | Non-commercial only |
| MeloTTS | MIT | ✅ | Free for all use |

Fully open for commercial use: Bark, Kokoro, MeloTTS


Cost on Clore.ai
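
A rough way to estimate spend is to combine the per-minute generation times from the speed benchmarks with the hourly GPU prices listed in the GPU recommendations. The sketch below uses this guide's approximate A100 figures (about 10 s per audio minute for XTTS v2, about 45 s for Bark, about $1.20 per GPU-hour); the function name is an assumption, and real prices and speeds will vary.

```python
def cost_per_audio_hour(gen_seconds_per_audio_minute: float,
                        gpu_price_per_hour: float) -> float:
    """Estimated GPU cost in dollars to synthesize one hour of audio."""
    gpu_hours = gen_seconds_per_audio_minute * 60 / 3600
    return gpu_hours * gpu_price_per_hour


# Approximate A100 figures quoted in this guide.
A100_PRICE = 1.20
print(f"XTTS v2: ${cost_per_audio_hour(10, A100_PRICE):.2f}")  # XTTS v2: $0.20
print(f"Bark:    ${cost_per_audio_hour(45, A100_PRICE):.2f}")  # Bark:    $0.90
```

The estimate ignores model load time and idle time on the rented server, so treat it as a lower bound.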



Summary

| Model | Use When |
| --- | --- |
| XTTS v2 | Best voice cloning (3s ref), 17 languages, non-commercial |
| Bark | Expressive, laughter/effects, MIT license |
| Kokoro | Fast, high-quality English, Apache license |
| Fish Speech | Best CJK, production cloning, non-commercial |
| MeloTTS | Fastest, real-time, multi-accent English, MIT license |

For most production Clore.ai deployments:

  • Real-time voice apps → MeloTTS or Kokoro (free, fast, permissively licensed under MIT and Apache 2.0 respectively)

  • Voice cloning service → XTTS v2 or Fish Speech (check licensing)

  • Expressive narration → Bark or XTTS v2


Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Scale | A100 80GB | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour, with no commitments and full root access.
