StyleTTS2

Run StyleTTS2 human-level text-to-speech via style diffusion on Clore.ai GPUs

StyleTTS2 achieves human-level text-to-speech quality: in listener studies it surpasses ground-truth recordings on the single-speaker LJSpeech benchmark (reported MOS 4.55 vs. 4.23 for ground truth) and matches them on multi-speaker data such as LibriTTS. It uses style diffusion and adversarial training to model speaking styles as a latent random variable, enabling expressive synthesis and zero-shot speaker adaptation from a short reference clip.

Unlike traditional TTS systems, StyleTTS2 generalizes to speakers unseen during training from a short reference audio clip, producing speech that rivals professional recordings. It was among the first open-source TTS models to exceed human-rated naturalness on a public benchmark.

Key features:

  • Human-level naturalness — surpasses human MOS scores on LJSpeech

  • Zero-shot speaker adaptation — clone any voice from a short audio sample

  • Style diffusion — expressive, varied prosody and speaking style

  • Multi-speaker support — trained on LibriTTS (2,300+ speakers)

  • Lightweight inference — runs efficiently on consumer GPUs


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA RTX 3070 (8 GB) | NVIDIA RTX 4090 (24 GB) |
| VRAM | 6 GB | 12–24 GB |
| RAM | 16 GB | 32 GB |
| CPU | 4 cores | 8+ cores |
| Disk | 15 GB | 30 GB |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 |
| CUDA | 11.7+ | 12.1+ |
| Python | 3.8+ | 3.10 |
| Ports | 22, 7860 | 22, 7860 |


StyleTTS2 is relatively lightweight — an RTX 3070 or 3080 handles real-time inference comfortably. For batch processing or serving concurrent users, use a 4090 or A100.


Quick Deploy on CLORE.AI

StyleTTS2 requires a custom Docker build as there is no official pre-built image. The setup takes ~10 minutes.

1. Find a suitable server

Go to the CLORE.AI Marketplace and filter by:

  • VRAM: ≥ 6 GB

  • GPU: RTX 3070, 3080, 3090, 4080, 4090, A100

  • Disk: ≥ 20 GB

2. Configure your deployment

Docker Image (base):
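
There is no official StyleTTS2 image, so start from an NVIDIA CUDA base; this tag is one reasonable choice for CUDA 12.1 (match it to the host driver):

```
nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
```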

Port Mappings:
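
Map SSH and the Gradio UI:

```
22:22      # SSH
7860:7860  # Gradio web interface
```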

Startup Command:
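
A sketch that activates the environment and starts the web demo at boot. The `/opt/StyleTTS2` path and the `app.py` Gradio entry point are assumptions; see the Step-by-Step Setup and Example 4 under Usage Examples:

```shell
cd /opt/StyleTTS2 && . venv/bin/activate && python app.py
```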

3. Access the interface

Once the startup command reports that Gradio is running, open http://<server-ip>:7860 in your browser.


Step-by-Step Setup

Step 1: SSH into your server
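
The connection details (IP, port, user) come from your Clore.ai dashboard; the values below are placeholders:

```shell
ssh -p <ssh-port> root@<server-ip>
```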

Step 2: Install system dependencies
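
espeak-ng supplies the phonemizer backend StyleTTS2 relies on; the rest are common build and audio tools. A sketch for Ubuntu:

```shell
sudo apt-get update
sudo apt-get install -y git python3-venv python3-pip \
  espeak-ng ffmpeg libsndfile1
```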

Step 3: Clone StyleTTS2 repository
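
Clone the official repository:

```shell
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
```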

Step 4: Create Python virtual environment
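
Keep StyleTTS2's dependencies isolated in a venv:

```shell
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```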

Step 5: Install dependencies
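
Install PyTorch for your CUDA version first, then the repo's requirements. The cu121 wheel index here is an assumption; match it to the CUDA version on your server:

```shell
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```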

Step 6: Download pre-trained models
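
The checkpoints are hosted on Hugging Face. The repo ids below are believed to match the author's uploads, and the internal layout may differ; verify both against the StyleTTS2 README:

```shell
git lfs install
git clone https://huggingface.co/yl4579/StyleTTS2-LJSpeech
git clone https://huggingface.co/yl4579/StyleTTS2-LibriTTS
# The demo code expects checkpoints under Models/ - adjust paths after
# inspecting what the clones actually contain.
mkdir -p Models
cp -r StyleTTS2-LJSpeech/Models/LJSpeech Models/ 2>/dev/null || true
cp -r StyleTTS2-LibriTTS/Models/LibriTTS Models/ 2>/dev/null || true
```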

Step 7: Build and run the Dockerfile
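
If you prefer the containerized route, build from the repo root. This assumes you have written a Dockerfile that mirrors Steps 2–6; the image and container names are placeholders:

```shell
docker build -t styletts2 .
docker run -d --gpus all -p 7860:7860 --name styletts2 styletts2
```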

Step 8: Launch Gradio demo directly
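
To run without Docker, launch a Gradio wrapper from inside the venv. `app.py` is a script you provide (the repo ships inference demos as notebooks, not a web app); see Example 4 under Usage Examples:

```shell
source venv/bin/activate
python app.py
```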

Access at http://<server-ip>:7860


Usage Examples

Example 1: Basic TTS via Python API
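
The repo exposes inference through its demo notebooks rather than a packaged API. This sketch assumes you have collected the notebook's model-loading and `inference()` code into a local module, here called `styletts2_infer` (a hypothetical name). The path helper is plain Python; the synthesis call runs only inside the prepared environment:

```python
from pathlib import Path

def output_path(text: str, out_dir: str = "outputs") -> Path:
    """Derive a filesystem-safe .wav filename from the input text."""
    stem = "".join(c if c.isalnum() else "_" for c in text[:40]).strip("_")
    Path(out_dir).mkdir(exist_ok=True)
    return Path(out_dir) / f"{stem or 'utterance'}.wav"

if __name__ == "__main__":
    try:
        import soundfile as sf
        # Hypothetical module: collect the demo notebook's model-loading
        # and inference() code into styletts2_infer.py yourself.
        from styletts2_infer import load_models, inference

        load_models("Models/LJSpeech")  # single-speaker checkpoint
        text = "StyleTTS2 synthesizes natural, expressive speech."
        wav = inference(text, diffusion_steps=10, embedding_scale=1.5)
        sf.write(str(output_path(text)), wav, 24000)  # demos output 24 kHz audio
    except ImportError as exc:
        print(f"StyleTTS2 environment not set up: {exc}")
```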


Example 2: Zero-Shot Voice Cloning
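
A sketch following the LibriTTS demo notebook's flow: `compute_style()` encodes the reference clip and `inference()` mixes it with the text-predicted style via `alpha`/`beta` (see the Configuration table). The `styletts2_infer` module is the same hypothetical notebook-derived module as in Example 1:

```python
def style_mix(alpha: float, beta: float) -> dict:
    """Clamp the style-mixing weights to the valid 0..1 range.
    alpha controls acoustic style, beta controls prosodic style."""
    clamp = lambda v: max(0.0, min(1.0, v))
    return {"alpha": clamp(alpha), "beta": clamp(beta)}

if __name__ == "__main__":
    try:
        import soundfile as sf
        # Hypothetical module built from the LibriTTS demo notebook.
        from styletts2_infer import load_models, compute_style, inference

        load_models("Models/LibriTTS")        # multi-speaker checkpoint
        ref = compute_style("reference.wav")  # 15-30 s of clean audio works best
        wav = inference(
            "This voice was cloned from a short reference clip.",
            ref,
            diffusion_steps=10,
            embedding_scale=1.0,
            **style_mix(alpha=0.3, beta=0.7),  # see Configuration for alpha/beta
        )
        sf.write("cloned.wav", wav, 24000)
    except ImportError as exc:
        print(f"StyleTTS2 environment not set up: {exc}")
```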


Example 3: Expressive Style Control
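
Expressiveness is mostly steered by `embedding_scale` (style intensity) and `diffusion_steps` (quality). The preset values below are illustrative starting points, and `styletts2_infer` is again the hypothetical notebook-derived module:

```python
# Illustrative presets - tune to taste.
PRESETS = {
    "neutral":    {"embedding_scale": 1.0, "diffusion_steps": 5},
    "expressive": {"embedding_scale": 1.5, "diffusion_steps": 10},
    "dramatic":   {"embedding_scale": 2.5, "diffusion_steps": 20},
}

def preset(name: str) -> dict:
    """Look up an inference preset, falling back to 'expressive'."""
    return dict(PRESETS.get(name, PRESETS["expressive"]))

if __name__ == "__main__":
    try:
        import soundfile as sf
        from styletts2_infer import load_models, inference  # hypothetical module

        load_models("Models/LJSpeech")
        for name in PRESETS:
            wav = inference("What a surprising turn of events!", **preset(name))
            sf.write(f"style_{name}.wav", wav, 24000)
    except ImportError as exc:
        print(f"StyleTTS2 environment not set up: {exc}")
```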


Example 4: Gradio Web Interface
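
A minimal Gradio wrapper you could save as `app.py`. The `synth()` body is a placeholder wired to the hypothetical notebook-derived `styletts2_infer` module; the Gradio calls themselves are standard `Interface` usage:

```python
def clamp_steps(n) -> int:
    """Keep diffusion steps inside the supported 1-30 range."""
    return max(1, min(30, int(n)))

if __name__ == "__main__":
    try:
        import gradio as gr
        from styletts2_infer import load_models, inference  # hypothetical module

        load_models("Models/LJSpeech")

        def synth(text, steps, scale):
            wav = inference(text, diffusion_steps=clamp_steps(steps),
                            embedding_scale=scale)
            return 24000, wav  # (sample_rate, ndarray) for gr.Audio

        demo = gr.Interface(
            fn=synth,
            inputs=[gr.Textbox(label="Text"),
                    gr.Slider(1, 30, value=10, step=1, label="Diffusion steps"),
                    gr.Slider(1.0, 3.0, value=1.5, label="Embedding scale")],
            outputs=gr.Audio(label="Speech"),
        )
        demo.launch(server_name="0.0.0.0", server_port=7860)
    except ImportError as exc:
        print(f"Gradio/StyleTTS2 environment not set up: {exc}")
```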


Example 5: Batch Audiobook Generation
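
For long texts, split on sentence boundaries, synthesize each chunk with a fixed reference style for consistency, and concatenate the audio. The chunker is plain Python; the synthesis loop assumes the hypothetical `styletts2_infer` module from the earlier examples:

```python
import re
from pathlib import Path

def chunk_text(text: str, max_chars: int = 400) -> list:
    """Split text into sentence-aligned chunks of at most ~max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks

if __name__ == "__main__":
    try:
        import numpy as np
        import soundfile as sf
        from styletts2_infer import load_models, compute_style, inference  # hypothetical

        load_models("Models/LibriTTS")
        ref = compute_style("narrator.wav")   # one reference keeps the voice stable
        book = Path("chapter1.txt").read_text()
        pieces = [inference(c, ref, diffusion_steps=15) for c in chunk_text(book)]
        sf.write("chapter1.wav", np.concatenate(pieces), 24000)
    except (ImportError, FileNotFoundError) as exc:
        print(f"StyleTTS2 environment not set up: {exc}")
```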


Configuration

config.yml Key Parameters
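
The configs live under `Configs/` in the repo. An illustrative fragment only; field names may differ, so treat the shipped YAML as authoritative:

```yaml
# Illustrative - check Configs/*.yml in the repo for the real schema.
preprocess_params:
  sr: 24000            # the model operates at 24 kHz
model_params:
  multispeaker: true   # LibriTTS checkpoint; false for LJSpeech
```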

Inference Parameters

| Parameter | Range | Default | Effect |
| --- | --- | --- | --- |
| `diffusion_steps` | 1–30 | 10 | Quality vs. speed trade-off |
| `alpha` | 0.0–1.0 | 0.3 | Acoustic style weight from reference |
| `beta` | 0.0–1.0 | 0.7 | Prosodic style weight from reference |
| `embedding_scale` | 1.0–3.0 | 1.5 | Overall style intensity |
| `t` | 0.6–1.0 | 0.7 | Noise level (higher = more variation) |


Performance Tips

1. Optimize Diffusion Steps

The default of 10 steps balances quality and speed. For real-time applications, use 5 steps; for maximum quality, use 20–30.

2. Use torch.compile (PyTorch 2.0+)
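
A sketch that compiles each submodule once at startup. `model` stands in for the dict of `nn.Module`s the demo notebook builds; the first call per compiled module is slow while kernels are generated:

```python
import torch

def compile_submodules(model: dict) -> dict:
    """torch.compile every nn.Module in the dict, passing other values through."""
    return {
        name: torch.compile(m) if isinstance(m, torch.nn.Module) else m
        for name, m in model.items()
    }
```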

3. Mixed Precision Inference
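
A sketch that wraps any synthesis callable in CUDA autocast to reduce activation memory on Ampere-class GPUs, falling back to full precision on CPU; `inference_fn` is whatever inference entry point you use:

```python
import torch

def synth_fp16(inference_fn, text: str, **kwargs):
    """Run inference under float16 autocast when a CUDA device is available."""
    if torch.cuda.is_available():
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            return inference_fn(text, **kwargs)
    return inference_fn(text, **kwargs)  # CPU: full precision
```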

4. Batch Multiple Sentences

Process multiple sentences together when possible to maximize GPU utilization and reduce overhead.

5. Cache Reference Speaker Embeddings
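
Since the reference style vector depends only on the reference file, memoize it so repeated requests for the same speaker skip the style-encoder forward pass. `compute_style` is assumed to come from your notebook-derived inference code:

```python
from functools import lru_cache

def make_style_cache(compute_style, maxsize: int = 64):
    """Return a compute_style wrapper cached by reference-audio path."""
    @lru_cache(maxsize=maxsize)
    def cached(path: str):
        return compute_style(path)
    return cached
```

Usage: `cached = make_style_cache(compute_style)` once at startup, then `ref = cached("voices/alice.wav")` per request.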


Troubleshooting

Issue: espeak-ng not found
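
Install the system package and confirm the binary is on PATH:

```shell
sudo apt-get install -y espeak-ng
espeak-ng --version
```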

Issue: Phonemizer fails
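
phonemizer sometimes cannot locate the espeak-ng shared library. Pointing it at the library explicitly usually helps; the path below is the common Ubuntu location, so verify it on your system first:

```shell
export PHONEMIZER_ESPEAK_LIBRARY=/usr/lib/x86_64-linux-gnu/libespeak-ng.so.1
python -c "import phonemizer; print(phonemizer.phonemize('hello'))"
```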

Issue: CUDA out of memory

  • Lower diffusion_steps and synthesize shorter text segments

  • Check for other processes holding VRAM with nvidia-smi

  • Restart the Python process between long batches to clear fragmentation

Issue: Poor audio quality

  • Increase diffusion_steps to 15–20

  • Ensure reference audio is clean, 16kHz minimum

  • Try adjusting alpha and beta parameters

  • Use a longer reference audio clip (15–30 seconds)

Issue: Model download fails from Hugging Face
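
Retry with git-lfs directly. The repo id below is believed to match the author's LibriTTS checkpoint upload; substitute the one named in the README if it differs:

```shell
git lfs install
git clone https://huggingface.co/yl4579/StyleTTS2-LibriTTS
# or resume a partial download inside an existing clone:
cd StyleTTS2-LibriTTS && git lfs pull
```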


Clore.ai GPU Recommendations

StyleTTS2 is a lightweight model: the LibriTTS checkpoint is roughly 300 MB, and inference is fast even on modest GPUs.

| GPU | VRAM | Clore.ai Price | Inference Speed | Best For |
| --- | --- | --- | --- | --- |
| CPU-only | N/A | ~$0.02/hr | ~0.5× real-time | Development, testing |
| RTX 3090 | 24 GB | ~$0.12/hr | ~15× real-time | Production API, voice cloning |
| RTX 4090 | 24 GB | ~$0.70/hr | ~25× real-time | High-concurrency API |
| A100 40GB | 40 GB | ~$1.20/hr | ~40× real-time | Large-batch audiobook generation |


RTX 3090 at ~$0.12/hr is the optimal choice for StyleTTS2. The model is small enough that you spend almost nothing on GPU time — a full hour of synthesized audio costs under $0.01 in GPU rental. For audiobook production or voice cloning services, this is extremely cost-efficient.

Zero-shot voice cloning quality tip: Provide 15–30 seconds of clean reference audio at 22kHz or 24kHz. The style diffusion module needs enough audio to accurately capture speaking style, pace, and prosody. Noisy or short references degrade output quality significantly.

