> For the complete documentation index, see [llms.txt](https://docs.clore.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.clore.ai/guides/guides_v2-fr/audio-et-voix/xtts-coqui.md).

# XTTS (Coqui)

Générez une parole naturelle avec clonage de voix en utilisant Coqui XTTS.

{% hint style="success" %}
Tous les exemples peuvent être exécutés sur des serveurs GPU loués via [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Location sur CLORE.AI

1. Visitez [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filtrer par type de GPU, VRAM et prix
3. Choisir **À la demande** (tarif fixe) ou **Spot** (prix d'enchère)
4. Configurez votre commande :
   * Sélectionnez l'image Docker
   * Définissez les ports (TCP pour SSH, HTTP pour les interfaces web)
   * Ajoutez des variables d'environnement si nécessaire
   * Entrez la commande de démarrage
5. Sélectionnez le paiement : **CLORE**, **BTC**, ou **USDT/USDC**
6. Créez la commande et attendez le déploiement

### Accédez à votre serveur

* Trouvez les détails de connexion dans **Mes commandes**
* Interfaces Web : utilisez l'URL du port HTTP
* SSH : `ssh -p <port> root@<adresse-proxy>`

## Qu'est-ce que XTTS ?

XTTS (par Coqui) offre :

* Synthèse vocale de haute qualité
* Clonage de voix à partir de 6 secondes d'audio
* 17 langues prises en charge
* Contrôle émotionnel
* Prise en charge du streaming

## Exigences

| Mode             | VRAM | Recommandé |
| ---------------- | ---- | ---------- |
| Inférence        | 4 Go | RTX 3060   |
| Inférence rapide | 6 Go | RTX 3080   |
| Streaming        | 4 Go | RTX 3060   |

## Déploiement rapide

**Image Docker :**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports :**

```
22/tcp
8000/http
```

**Commande :**

```bash
pip install TTS && \
tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2
```

## Accéder à votre service

Après le déploiement, trouvez votre `http_pub` URL dans **Mes commandes**:

1. Aller à la **Mes commandes** page
2. Cliquez sur votre commande
3. Trouvez l' `http_pub` URL (par ex., `abc123.clorecloud.net`)

Utilisez `https://VOTRE_HTTP_PUB_URL` au lieu de `localhost` dans les exemples ci-dessous.

## Installation

```bash
pip install TTS
```

## Utilisation de base

### TTS simple

```python
from TTS.api import TTS

# Charger XTTS v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Générer la parole
tts.tts_to_file(
    text="Hello, this is a test of the XTTS text to speech system.",
    file_path="output.wav",
    language="en"
)
```

### Clonage de voix

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Cloner la voix à partir d'un audio de référence (6+ secondes)
tts.tts_to_file(
    text="Ceci est ma voix clonée parlant un nouveau texte.",
    file_path="cloned_output.wav",
    speaker_wav="reference_voice.wav",
    language="en"
)
```

## Plusieurs langues

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# English
tts.tts_to_file(
    text="Hello, how are you today?",
    file_path="english.wav",
    speaker_wav="voice.wav",
    language="en"
)

# Espagnol
tts.tts_to_file(
    text="Hola, ¿cómo estás hoy?",
    file_path="spanish.wav",
    speaker_wav="voice.wav",
    language="es"
)

# German
tts.tts_to_file(
    text="Hallo, wie geht es dir heute?",
    file_path="german.wav",
    speaker_wav="voice.wav",
    language="de"
)

# Russian
tts.tts_to_file(
    text="Привет, как дела?",
    file_path="russian.wav",
    speaker_wav="voice.wav",
    language="ru"
)
```

### Langues prises en charge

| Code  | Langue      |
| ----- | ----------- |
| en    | Anglais     |
| es    | Espagnol    |
| fr    | Français    |
| de    | Allemand    |
| it    | Italien     |
| pt    | Portugais   |
| pl    | Polonais    |
| tr    | Turc        |
| ru    | Russe       |
| nl    | Néerlandais |
| cs    | Tchèque     |
| ar    | Arabe       |
| zh-cn | Chinois     |
| ja    | Japonais    |
| hu    | Hongrois    |
| ko    | Coréen      |
| hi    | Hindi       |

## TTS en streaming

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch
import sounddevice as sd

# Charger le modèle
config = XttsConfig()
config.load_json("path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="path/to/model")
model.cuda()

# Obtenir l'embedding du locuteur
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav"
)

# Génération en streaming
chunks = model.inference_stream(
    text="This is a streaming test of the XTTS system.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20
)

# Lecture en temps réel
for chunk in chunks:
    audio = chunk.cpu().numpy()
    sd.play(audio, samplerate=24000)
    sd.wait()
```

## Interface Gradio

```python
import gradio as gr
from TTS.api import TTS
import tempfile

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def generate_speech(text, reference_audio, language):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        if reference_audio:
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                speaker_wav=reference_audio,
                language=language
            )
        else:
            tts.tts_to_file(
                text=text,
                file_path=f.name,
                language=language
            )
        return f.name

demo = gr.Interface(
    fn=generate_speech,
    inputs=[
        gr.Textbox(label="Text to speak", lines=5),
        gr.Audio(type="filepath", label="Reference Voice (optional)"),
        gr.Dropdown(
            ["en", "es", "fr", "de", "it", "pt", "ru", "zh-cn", "ja"],
            value="en",
            label="Language"
        )
    ],
    outputs=gr.Audio(label="Discours généré"),
    title="XTTS Voice Cloning"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Serveur API

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
from TTS.api import TTS
import tempfile
import os

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    language: str = Form(default="en"),
    speaker: UploadFile = File(default=None)
):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_file:
        if speaker:
            # Enregistrer la référence téléchargée
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as ref_file:
                ref_file.write(await speaker.read())
                ref_path = ref_file.name

            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                speaker_wav=ref_path,
                language=language
            )
            os.unlink(ref_path)
        else:
            tts.tts_to_file(
                text=text,
                file_path=out_file.name,
                language=language
            )

        return FileResponse(out_file.name, media_type="audio/wav")

# Lancer : uvicorn server:app --host 0.0.0.0 --port 8000
```

## Traitement par lots

```python
from TTS.api import TTS
import os

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

texts = [
    "Bienvenue sur notre plateforme.",
    "Veuillez consulter les informations suivantes.",
    "Merci pour votre attention.",
    "Bonne journée !"
]

reference_voice = "speaker.wav"
output_dir = "./audio_files"
os.makedirs(output_dir, exist_ok=True)

for i, text in enumerate(texts):
    output_path = f"{output_dir}/audio_{i:03d}.wav"

    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_voice,
        language="en"
    )

    print(f"Generated: {output_path}")
```

## Affinage de la voix

Pour un meilleur clonage de voix :

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
model = Xtts.init_from_config(config)

# Utiliser plusieurs échantillons de référence pour une meilleure qualité
reference_files = [
    "sample1.wav",
    "sample2.wav",
    "sample3.wav"
]

# Extraire l'embedding du locuteur à partir de plusieurs échantillons
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=reference_files
)

# Générer avec embedding moyenné
output = model.inference(
    text="High quality cloned speech.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding
)
```

## Prétraitement audio

```python
import librosa
import soundfile as sf

def prepare_reference(input_path, output_path, target_sr=22050):
    # Charger et rééchantillonner
    audio, sr = librosa.load(input_path, sr=target_sr)

    # Couper le silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Normaliser
    audio = librosa.util.normalize(audio)

    # Enregistrer
    sf.write(output_path, audio, target_sr)

# Préparer l'audio de référence
prepare_reference("raw_voice.wav", "clean_voice.wav")
```

## Performances

| Mode      | GPU      | Vitesse              |
| --------- | -------- | -------------------- |
| Standard  | RTX 3060 | \~0.5x en temps réel |
| Standard  | RTX 4090 | \~2x en temps réel   |
| Streaming | RTX 3060 | \~1x en temps réel   |
| Streaming | RTX 4090 | \~3x en temps réel   |

## Conseils de qualité

* Utiliser 6 à 15 secondes d'audio de référence propre
* Évitez le bruit de fond dans la référence
* Faites correspondre la langue du texte et de la référence
* Utiliser plusieurs échantillons de référence pour de meilleurs résultats

## Dépannage

### Mauvaise qualité de la voix

* Audio de référence propre
* Référence plus longue (10+ secondes)
* Adapter le style de parole

### Mauvaise prononciation de la langue

* Assurer le code langue correct
* Utiliser une référence d'un locuteur natif

### Génération lente

* Activer l'inférence GPU
* Utiliser le mode streaming
* Réduire la longueur du texte par appel

## Estimation des coûts

Tarifs typiques du marché CLORE.AI (à partir de 2024) :

| GPU       | Tarif horaire | Tarif journalier | Session de 4 heures |
| --------- | ------------- | ---------------- | ------------------- |
| RTX 3060  | \~$0.03       | \~$0.70          | \~$0.12             |
| RTX 3090  | \~$0.06       | \~$1.50          | \~$0.25             |
| RTX 4090  | \~$0.10       | \~$2.30          | \~$0.40             |
| A100 40GB | \~$0.17       | \~$4.00          | \~$0.70             |
| A100 80GB | \~$0.25       | \~$6.00          | \~$1.00             |

*Les prix varient selon le fournisseur et la demande. Vérifiez* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *pour les tarifs actuels.*

**Économisez de l'argent :**

* Utilisez **Spot** market pour les charges de travail flexibles (souvent 30-50 % moins cher)
* Payer avec **CLORE** jetons
* Comparer les prix entre différents fournisseurs

## Prochaines étapes

* [Bark TTS](/guides/guides_v2-fr/audio-et-voix/bark-tts.md) - TTS expressif
* [SadTalker](/guides/guides_v2-fr/tetes-parlantes/sadtalker.md) - Têtes parlantes
* [RVC Voice Clone](/guides/guides_v2-fr/audio-et-voix/rvc-voice-clone.md) - Conversion de voix


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://docs.clore.ai/guides/guides_v2-fr/audio-et-voix/xtts-coqui.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.