# F5-TTS

Generate natural speech with F5-TTS - a fast and fluent TTS system.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is F5-TTS?

F5-TTS offers:

* Fast inference (faster than real-time)
* Natural prosody and intonation
* Zero-shot voice cloning
* Multi-language support (the base model covers English and Chinese; other languages rely on fine-tuned checkpoints)

## Resources

* **GitHub:** [SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)
* **HuggingFace:** [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS)
* **Paper:** [F5-TTS Paper](https://arxiv.org/abs/2410.06885)
* **Demo:** [HuggingFace Space](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)

## Recommended Hardware

| Component | Minimum       | Recommended   | Optimal       |
| --------- | ------------- | ------------- | ------------- |
| GPU       | RTX 3060 12GB | RTX 4080 16GB | RTX 4090 24GB |
| VRAM      | 6GB           | 12GB          | 16GB          |
| CPU       | 4 cores       | 8 cores       | 16 cores      |
| RAM       | 16GB          | 32GB          | 64GB          |
| Storage   | 20GB SSD      | 50GB NVMe     | 100GB NVMe    |
| Internet  | 100 Mbps      | 500 Mbps      | 1 Gbps        |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install f5-tts && \
f5-tts_infer-gradio --host 0.0.0.0 --port 7860
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
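
As a small illustration of the substitution above, the sketch below builds a full URL from the `http_pub` host. The helper name `service_url` is hypothetical, and the host is the placeholder used in this guide.

```python
def service_url(http_pub: str, path: str = "/") -> str:
    """Build an HTTPS URL from the http_pub host shown in My Orders."""
    # Accept the host with or without a scheme or trailing slash
    host = http_pub.strip()
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    host = host.rstrip("/")
    return f"https://{host}/{path.lstrip('/')}"


print(service_url("abc123.clorecloud.net"))
```

Pass a path as the second argument to address a specific route on the same host.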

## Installation

```bash
pip install f5-tts

# Or from source
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .
```

## What You Can Create

### Voice Content

* Podcast production
* Audiobook narration
* Voice-over for videos

### Accessibility

* Screen readers
* Document readers
* Learning materials

### Interactive Applications

* Voice assistants
* Gaming NPCs
* Customer service bots

### Creative Projects

* Character voices
* Audio dramas
* Music vocals

## Basic Usage

### Simple TTS

```python
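from importlib.resources import files

from f5_tts.api import F5TTS

# Initialize; model weights are downloaded on first use
tts = F5TTS(device="cuda")

# F5-TTS always conditions on a reference clip. This uses the sample clip
# bundled with the package (path as in the upstream repo; adjust if it differs).
ref_audio = str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav"))

# Generate speech; infer() returns the waveform, sample rate, and spectrogram
wav, sr, spec = tts.infer(
    ref_file=ref_audio,
    ref_text="Some call me nature, others call me mother nature.",
    gen_text="Hello! This is F5-TTS generating natural speech.",
    file_wave="output.wav",
)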
```

### Voice Cloning

```python
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Clone the voice from a reference clip and its transcript
wav, sr, spec = tts.infer(
    ref_file="reference_voice.wav",
    ref_text="This is the reference text spoken in the audio.",
    gen_text="This is my cloned voice speaking new text.",
    file_wave="cloned_output.wav",
)
```

## Multi-Language Support

```python
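from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# An empty ref_text asks F5-TTS to transcribe the reference clip automatically

# English
tts.infer(
    ref_file="english_speaker.wav",
    ref_text="",
    gen_text="Hello, how are you today?",
    file_wave="english.wav",
)

# Chinese
tts.infer(
    ref_file="chinese_speaker.wav",
    ref_text="",
    gen_text="你好，今天怎么样？",
    file_wave="chinese.wav",
)

# French: the base checkpoint is trained on English and Chinese, so this
# needs a French fine-tuned checkpoint loaded via ckpt_file and vocab_file
tts.infer(
    ref_file="french_speaker.wav",
    ref_text="",
    gen_text="Bonjour, comment allez-vous?",
    file_wave="french.wav",
)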
```

## Batch Processing

```python
import os

from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

texts = [
    "Welcome to our product demonstration.",
    "Today we'll show you the key features.",
    "Let's start with the main dashboard.",
    "As you can see, the interface is intuitive.",
    "Thank you for watching!"
]

ref_audio = "narrator_voice.wav"
ref_text = "Sample text from the reference audio."
output_dir = "./narration"
os.makedirs(output_dir, exist_ok=True)

# Reuse one loaded model and one reference voice for every segment
for i, text in enumerate(texts):
    print(f"Generating {i+1}/{len(texts)}: {text[:50]}...")

    tts.infer(
        ref_file=ref_audio,
        ref_text=ref_text,
        gen_text=text,
        file_wave=f"{output_dir}/segment_{i:03d}.wav",
    )
```
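
The batch loop above leaves one file per segment. If a single narration track is needed, the segments can be joined with the standard-library `wave` module, as in this sketch. It assumes every segment is PCM WAV with the same channel count, sample width, and sample rate; `concat_wavs` is a hypothetical helper name, not part of F5-TTS.

```python
import wave


def concat_wavs(segment_paths, output_path):
    """Join WAV files that share the same format into one file."""
    segment_paths = [str(p) for p in segment_paths]
    if not segment_paths:
        raise ValueError("no segments to join")

    # Take the audio format from the first segment
    with wave.open(segment_paths[0], "rb") as first:
        params = first.getparams()

    with wave.open(str(output_path), "wb") as out:
        out.setparams(params)
        for path in segment_paths:
            with wave.open(path, "rb") as seg:
                # Channels, sample width, and sample rate must all match
                if seg.getparams()[:3] != params[:3]:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(seg.readframes(seg.getnframes()))
```

For the loop above, `concat_wavs(sorted(pathlib.Path("./narration").glob("segment_*.wav")), "full_narration.wav")` would join the segments in numeric order, since the file names are zero-padded.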

## Long-Form Audio

```python
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

long_text = """
Welcome to this comprehensive guide on machine learning.
In this chapter, we will explore the fundamentals of neural networks.
Neural networks are computing systems inspired by biological neural networks.
They consist of interconnected nodes that process information.
Let's begin with the basic concepts.
"""

# infer() has no chunk-size argument: it splits long text into chunks
# internally and cross-fades the generated pieces into one waveform
wav, sr, spec = tts.infer(
    ref_file="narrator.wav",
    ref_text="",  # empty string: transcribe the reference automatically
    gen_text=long_text,
    file_wave="long_narration.wav",
)
```
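
To make the idea of sentence-based chunking concrete, here is a pure-Python sketch that groups whole sentences into chunks under a character budget. This is not F5-TTS's internal splitter; `split_into_chunks` and `max_chars` are hypothetical names chosen for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", max_chars=9))
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, so chunks can exceed the budget in that case.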

## Gradio Interface

```python
import tempfile

import gradio as gr
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

def generate_speech(text, ref_audio, ref_text):
    # Reserve a temporary output path, then let F5-TTS write the audio there
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        output_path = f.name

    tts.infer(
        ref_file=ref_audio,
        ref_text=ref_text,
        gen_text=text,
        file_wave=output_path,
    )
    return output_path

demo = gr.Interface(
    fn=generate_speech,
    inputs=[
        gr.Textbox(label="Text to Speak", lines=5),
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Reference Text", lines=2)
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="F5-TTS Voice Cloning",
    description="Clone any voice with F5-TTS on CLORE.AI servers"
)

# Bind to all interfaces so the CLORE.AI HTTP port can reach the app
demo.launch(server_name="0.0.0.0", server_port=7860)
```

## API Server

```python
import tempfile

from f5_tts.api import F5TTS
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import FileResponse

app = FastAPI()
tts = F5TTS(device="cuda")

# Form and file uploads require the python-multipart package
@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    ref_audio: UploadFile = File(...),
    ref_text: str = Form(...)
):
    # Save the uploaded reference clip so F5-TTS can read it by path
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as ref_file:
        ref_file.write(await ref_audio.read())
        ref_path = ref_file.name

    # Reserve an output path, then let F5-TTS write the generated audio there
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_file:
        out_path = out_file.name

    tts.infer(
        ref_file=ref_path,
        ref_text=ref_text,
        gen_text=text,
        file_wave=out_path,
    )
    return FileResponse(out_path, media_type="audio/wav")

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
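
A client for the `/synthesize` endpoint above could look like the following sketch, which uses only the standard library. `build_multipart` and `synthesize` are hypothetical helper names, and the base URL is an assumption: it must point at wherever the server is reachable on port 8000.

```python
import uuid
from urllib import request


def build_multipart(fields, file_field, filename, file_bytes, content_type="audio/wav"):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    body = b""
    for name, value in fields.items():
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8")
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    body += file_bytes + b"\r\n" + f"--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def synthesize(base_url, text, ref_audio_path, ref_text, out_path="speech.wav"):
    """POST to /synthesize and save the returned WAV (server must be running)."""
    with open(ref_audio_path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"text": text, "ref_text": ref_text}, "ref_audio", "reference.wav", audio
    )
    req = request.Request(
        f"{base_url.rstrip('/')}/synthesize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
    return out_path
```

The field names `text`, `ref_audio`, and `ref_text` match the endpoint's parameters; a call such as `synthesize("http://localhost:8000", "Hello", "reference_voice.wav", "Reference transcript")` would write the response to `speech.wav`.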

## Performance

| Text Length | GPU      | Generation Time | Real-time Factor |
| ----------- | -------- | --------------- | ---------------- |
| 100 chars   | RTX 3090 | 0.5s            | 5x               |
| 100 chars   | RTX 4090 | 0.3s            | 8x               |
| 500 chars   | RTX 4090 | 1.2s            | 10x              |
| 1000 chars  | A100     | 2.0s            | 12x              |
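
The last column of the table expresses speed as a multiple of real time: seconds of audio produced per second of compute. The sketch below shows that arithmetic plus a simple timing wrapper for taking your own measurements; `realtime_speedup` and `timed` are hypothetical helper names.

```python
import time


def realtime_speedup(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (5.0 means 5x real time)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# 2.5 s of audio generated in 0.5 s is 5x real time
print(realtime_speedup(2.5, 0.5))
```

Audio length in seconds is the number of samples divided by the sample rate, so both inputs can be taken from a single generation call wrapped in `timed`.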

## Common Problems & Solutions

### Poor Voice Match

**Problem:** Generated voice doesn't match reference

**Solutions:**

* Use 5-15 seconds of clear reference audio
* Provide accurate reference text transcription
* Avoid background noise in reference
* Match language of text and reference
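
The length guideline above can be checked before generation. This sketch reads a PCM WAV file with the standard-library `wave` module and reports a clip that falls outside 5-15 seconds or below a 24 kHz sample rate. `check_reference` and its thresholds are illustrative assumptions, not part of F5-TTS.

```python
import wave


def check_reference(path, min_seconds=5.0, max_seconds=15.0, min_rate=24000):
    """Return a list of warnings for a PCM WAV reference clip."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    warnings = []
    if duration < min_seconds:
        warnings.append(f"clip is {duration:.1f}s, shorter than {min_seconds:.0f}s")
    elif duration > max_seconds:
        warnings.append(f"clip is {duration:.1f}s, longer than {max_seconds:.0f}s")
    if rate < min_rate:
        warnings.append(f"sample rate is {rate} Hz, below {min_rate} Hz")
    return warnings
```

An empty list means the clip passed both checks. Background noise and transcript accuracy still need a manual listen.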

### Pronunciation Issues

**Problem:** Mispronounces words or names

**Solutions:**

```python

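# F5-TTS reads the text literally, so a hint in parentheses would be spoken aloud.
# Respell difficult words the way they should sound instead.
text = "Welcome to the Klor AI platform."  # "CLORE" respelled phonetically

# The same approach works for names
text = "The CEO, John Smith, will speak."  # try "Smyth" if "Smith" is misread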
```

### Audio Quality Issues

**Problem:** Output sounds robotic or distorted

**Solutions:**

* Use high-quality reference audio (24kHz+)
* Clean reference from noise
* Try different reference samples
* Increase the number of sampling steps (`nfe_step` in the Python API) for higher quality at the cost of speed

### Memory Issues

**Problem:** Out of memory for long texts

**Solutions:**

```python

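# infer() has no chunk_size or overlap arguments.
# To cap memory use, synthesize one paragraph at a time.
paragraphs = [p.strip() for p in long_text.split("\n\n") if p.strip()]
for i, paragraph in enumerate(paragraphs):
    tts.infer(
        ref_file="narrator.wav",
        ref_text="",  # empty string: transcribe the reference automatically
        gen_text=paragraph,
        file_wave=f"part_{i:03d}.wav",
    )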
```

### Slow Generation

**Problem:** Takes too long to generate

**Solutions:**

* Use GPU inference (CUDA)
* Lower the number of sampling steps (`nfe_step` in the Python API) to trade some quality for speed
* Use RTX 4090 or better
* Enable half-precision (fp16)

## Troubleshooting

### Voice doesn't match reference

* Use 5-15 seconds of clear reference audio
* Transcribe reference text accurately
* Avoid background noise in reference

### Audio quality issues

* Use high sample rate reference (24kHz+)
* Clean reference from noise
* Try different reference samples

### Slow generation

* Use CUDA (not CPU)
* Reduce text length or chunk it
* Use smaller batch sizes

### Language mismatch

* Match text language with reference audio language
* Some languages need specific models

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
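
The table's session figures are just hourly rate times hours, rounded. The sketch below reproduces that arithmetic and applies an optional Spot discount; `session_cost` is a hypothetical helper, and the rates passed in are the approximate values from the table, not live prices.

```python
def session_cost(hourly_rate: float, hours: float, discount: float = 0.0) -> float:
    """Estimate cost in USD; discount is a fraction, e.g. 0.3 for 30% off."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return round(hourly_rate * hours * (1.0 - discount), 2)


# RTX 4090 at about $0.10/hour for a 4-hour session
print(session_cost(0.10, 4))
# Same session with a 30% Spot discount
print(session_cost(0.10, 4, discount=0.3))
```

Swap in the current marketplace rate for your chosen GPU before budgeting.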

## Next Steps

* [XTTS](https://docs.clore.ai/guides/audio-and-voice/xtts-coqui) - Alternative TTS
* [Bark TTS](https://docs.clore.ai/guides/audio-and-voice/bark-tts) - Expressive TTS
* [SadTalker](https://docs.clore.ai/guides/talking-heads/sadtalker) - Talking heads
