# OpenVoice

Clone any voice from just a few seconds of audio using OpenVoice.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select a Docker image
   * Set ports (TCP for SSH, HTTP for web interfaces)
   * Add environment variables if needed
   * Enter the startup command
5. Select a payment method: **CLORE**, **BTC**, or **USDT/USDC**
6. Create the order and wait for deployment

### Access your server

* Find the connection details in **My Orders**
* Web interfaces: use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is OpenVoice?

OpenVoice from MyShell can:

* Clone voices from \~10 seconds of audio
* Control emotion, accent, and rhythm
* Clone voices across languages (cross-lingual)
* Convert voices zero-shot

## Requirements

| Task             | Minimum VRAM | Recommended |
| ---------------- | ------------ | ----------- |
| Inference        | 4 GB         | RTX 3060    |
| Batch processing | 6 GB         | RTX 3070    |

## Quick deploy

**Docker image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
7860/http
```

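**Command:**

The command downloads the converter checkpoints before starting the Gradio interface. The `myshell-ai/OpenVoiceV2` Hugging Face repository used here is an assumption; point it at wherever your V2 checkpoints are hosted.

```bash
pip install git+https://github.com/myshell-ai/OpenVoice.git gradio huggingface_hub && \
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='myshell-ai/OpenVoiceV2', local_dir='checkpoints_v2')" && \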
python -c "
import gradio as gr
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
import torch

ckpt_converter = 'checkpoints_v2/converter'
device = 'cuda'
tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

def clone(source_audio, reference_audio):
    source_se, _ = se_extractor.get_se(source_audio, tone_color_converter, vad=False)
    target_se, _ = se_extractor.get_se(reference_audio, tone_color_converter, vad=False)

    output_path = 'output.wav'
    tone_color_converter.convert(
        audio_src_path=source_audio,
        src_se=source_se,
        tgt_se=target_se,
        output_path=output_path
    )
    return output_path

demo = gr.Interface(
    fn=clone,
    inputs=[gr.Audio(type='filepath', label='Source'), gr.Audio(type='filepath', label='Target Voice')],
    outputs=gr.Audio(label='Cloned'),
    title='OpenVoice Clone'
)
demo.launch(server_name='0.0.0.0', server_port=7860)
"
```

## Accessing your service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to the **My Orders** page
2. Click your order
3. Find the `http_pub` URL (for example, `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in the examples below.
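As a small illustration of that substitution, the sketch below turns an `http_pub` host into a full HTTPS URL for a given path. The helper name `service_url` is hypothetical and not part of any CLORE.AI or OpenVoice API; it only uses the Python standard library.

```python
from urllib.parse import urljoin


def service_url(http_pub: str, path: str = "/") -> str:
    """Build an HTTPS URL from the http_pub host shown in the order details."""
    host = http_pub.strip()
    # Drop any scheme the copied value may already carry, then force HTTPS.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    base = f"https://{host.rstrip('/')}/"
    return urljoin(base, path.lstrip("/"))


# Example host taken from the steps above; replace it with your own value.
print(service_url("abc123.clorecloud.net", "/clone"))
```

Keeping the host in one place makes it easy to switch between a local test run and the rented server.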

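## Installation

```bash
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice
pip install -e .

# Download checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='myshell-ai/OpenVoice', local_dir='checkpoints')"

# The examples below read from checkpoints_v2/converter (assumed source: myshell-ai/OpenVoiceV2)
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='myshell-ai/OpenVoiceV2', local_dir='checkpoints_v2')"

# MeloTTS provides the `melo` package used in the text-to-speech examples
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download
```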

## Basic voice cloning

```python
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
import torch

# Initialize
device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_converter = 'checkpoints_v2/converter'

tone_color_converter = ToneColorConverter(
    f'{ckpt_converter}/config.json',
    device=device
)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

# Extract speaker embeddings
source_se, _ = se_extractor.get_se("source_audio.wav", tone_color_converter, vad=False)
target_se, _ = se_extractor.get_se("target_voice.wav", tone_color_converter, vad=False)

# Convert the voice
tone_color_converter.convert(
    audio_src_path="source_audio.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="output.wav"
)
```

## With text-to-speech

Generate speech in any voice. This snippet reuses `device` and `tone_color_converter` from the basic voice cloning example:

```python
from openvoice import se_extractor
from melo.api import TTS

# Initialize TTS
tts = TTS(language='EN', device=device)
speaker_ids = tts.hps.data.spk2id

# Generate base speech
tts.tts_to_file("Hello, this is a test.", speaker_ids['EN-US'], "base.wav")

# Clone to the target voice
source_se, _ = se_extractor.get_se("base.wav", tone_color_converter, vad=False)
target_se, _ = se_extractor.get_se("target_voice.wav", tone_color_converter, vad=False)

tone_color_converter.convert(
    audio_src_path="base.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="cloned_speech.wav"
)
```

## Multilingual support

```python
from melo.api import TTS

# Available languages (`device` comes from the basic voice cloning example)
languages = ['EN', 'ES', 'FR', 'ZH', 'JP', 'KR']

# English
tts_en = TTS(language='EN', device=device)
tts_en.tts_to_file("Hello world", tts_en.hps.data.spk2id['EN-US'], "en.wav")

# Chinese
tts_zh = TTS(language='ZH', device=device)
tts_zh.tts_to_file("你好世界", tts_zh.hps.data.spk2id['ZH'], "zh.wav")

# Japanese
tts_jp = TTS(language='JP', device=device)
tts_jp.tts_to_file("こんにちは", tts_jp.hps.data.spk2id['JP'], "jp.wav")
```

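## Emotion control

Emotion and style control is exposed through the V1 base speaker model (`BaseSpeakerTTS`), which takes the style name as its `speaker` argument:

```python
from openvoice.api import BaseSpeakerTTS

# Assumption: the V1 English base speaker checkpoints live here; adjust to your layout
ckpt_base = 'checkpoints/base_speakers/EN'

# Base TTS with styles (reuses `device` from the basic voice cloning example)
base_speaker_tts = BaseSpeakerTTS(
    f'{ckpt_base}/config.json',
    device=device
)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

# Available styles
styles = ['default', 'whispering', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']

for style in styles:
    # The style is selected through `speaker`; tts() has no separate `style` keyword
    base_speaker_tts.tts(
        "This is a test sentence.",
        f"output_{style}.wav",
        speaker=style,
        language='English'
    )
```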

## Batch processing

```python
import os
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

ckpt_converter = 'checkpoints_v2/converter'

tone_color_converter = ToneColorConverter(
    f'{ckpt_converter}/config.json',
    device='cuda'
)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

# Extract the target voice embedding once and reuse it for every file
target_se, _ = se_extractor.get_se("target_voice.wav", tone_color_converter, vad=False)

input_dir = "./audio_files"
output_dir = "./cloned"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.lower().endswith(('.wav', '.mp3')):
        input_path = os.path.join(input_dir, filename)
        # Always write WAV output, even when the input is an MP3
        stem = os.path.splitext(filename)[0]
        output_path = os.path.join(output_dir, f"cloned_{stem}.wav")

        source_se, _ = se_extractor.get_se(input_path, tone_color_converter, vad=False)

        tone_color_converter.convert(
            audio_src_path=input_path,
            src_se=source_se,
            tgt_se=target_se,
            output_path=output_path
        )
        print(f"Cloned: {filename}")
```

## API server

```python
# Requires: pip install fastapi uvicorn python-multipart
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
import tempfile
import shutil

app = FastAPI()

tone_color_converter = ToneColorConverter(
    'checkpoints_v2/converter/config.json',
    device='cuda'
)
tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')

@app.post("/clone")
def clone_voice(source: UploadFile, target: UploadFile):
    # Plain `def` so FastAPI runs the blocking GPU work in its thread pool

    # Save the uploaded files
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as src_tmp:
        shutil.copyfileobj(source.file, src_tmp)
        src_path = src_tmp.name

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tgt_tmp:
        shutil.copyfileobj(target.file, tgt_tmp)
        tgt_path = tgt_tmp.name

    # Extract embeddings
    source_se, _ = se_extractor.get_se(src_path, tone_color_converter, vad=False)
    target_se, _ = se_extractor.get_se(tgt_path, tone_color_converter, vad=False)

    # Reserve the output path safely instead of using the deprecated tempfile.mktemp
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_tmp:
        output_path = out_tmp.name

    # Convert
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=output_path
    )

    return FileResponse(output_path, media_type="audio/wav")

# Run (assuming this file is saved as server.py):
# uvicorn server:app --host 0.0.0.0 --port 8000
```

## Quality tips

### For best results

* Use 10-30 seconds of clear reference audio
* Avoid background noise
* Keep a single speaker in the reference
* Keep the speaking pace of the source close to that of the reference
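The length guideline above can be checked before cloning. The following is a minimal sketch that reads a PCM WAV file with Python's standard `wave` module; the helper names `inspect_wav` and `reference_ok` are hypothetical, and other formats such as MP3 would need a different reader.

```python
import wave


def inspect_wav(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return frames / rate, rate, channels


def reference_ok(path, min_s=10.0, max_s=30.0):
    """True if the clip length falls inside the 10-30 second range suggested above."""
    duration, _, _ = inspect_wav(path)
    return min_s <= duration <= max_s
```

For example, `reference_ok("target_voice.wav")` returns `False` for a 5-second clip, which is a cue to record a longer reference. The check covers length only; noise and speaker count still need a listen.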

### Audio preprocessing

```python
import librosa
import soundfile as sf

def preprocess_audio(input_path, output_path, target_sr=22050):
    audio, sr = librosa.load(input_path, sr=target_sr)

    # Trim leading and trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Normalize peak amplitude
    audio = librosa.util.normalize(audio)

    sf.write(output_path, audio, target_sr)
    return output_path

preprocess_audio("raw_reference.wav", "clean_reference.wav")
```

## Comparison with other tools

| Feature         | OpenVoice    | RVC      | Bark |
| --------------- | ------------ | -------- | ---- |
| Reference audio | 10-30s       | 10+ min  | N/A  |
| Training        | Not required | Required | N/A  |
| Speed           | Fast         | Medium   | Slow |
| Quality         | Great        | Best     | Good |
| Cross-lingual   | Yes          | Limited  | Yes  |

## Performance

| Task                   | GPU      | Time |
| ---------------------- | -------- | ---- |
| Extract embedding      | RTX 3090 | \~1s |
| Convert 10s of audio   | RTX 3090 | \~2s |
| Convert 1 min of audio | RTX 3090 | \~8s |
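The timings above can be expressed as a real-time factor (processing time divided by audio duration), which makes it easier to estimate how long a larger job might take. This is a rough sketch using only the approximate figures from the table; `real_time_factor` is an illustrative helper, and actual throughput depends on the GPU and the clips.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Approximate RTX 3090 figures taken from the table above.
for label, audio_s, proc_s in [("10 s clip", 10, 2), ("1 min clip", 60, 8)]:
    rtf = real_time_factor(audio_s, proc_s)
    print(f"{label}: RTF ~{rtf:.2f}, ~{1 / rtf:.1f}x faster than real time")
```

The longer clip has the lower factor, which suggests a fixed per-call overhead that is amortized over longer inputs.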

## Troubleshooting

### Poor voice match

* Use longer reference audio
* Make sure the audio is clear
* Check for background noise

### Audio artifacts

* Reduce speed/emphasis settings
* Use a consistent audio format
* Check that sample rates match

### Out of memory

* Process shorter clips
* Reduce the batch size
* Clear the CUDA cache
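Processing shorter clips, as suggested above, needs a way to split a long recording into fixed-length pieces. The sketch below only computes the sample ranges; `chunk_ranges` is a hypothetical helper, the 30-second default is an arbitrary choice, and hard cuts at chunk boundaries may be audible, so treat it as a starting point.

```python
def chunk_ranges(total_samples, sample_rate, chunk_seconds=30.0):
    """Return (start, end) sample index pairs that cover the clip in order."""
    if total_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("total_samples must be >= 0; sample_rate and chunk_seconds must be > 0")
    # Guard against a zero step for very small rate/duration combinations.
    step = max(1, int(sample_rate * chunk_seconds))
    return [
        (start, min(start + step, total_samples))
        for start in range(0, total_samples, step)
    ]


# A 70-second clip at 22050 Hz splits into two full chunks and one shorter tail.
print(chunk_ranges(22050 * 70, 22050))
```

Each range can be sliced out of the loaded audio array, written to its own file, and converted separately. Calling `torch.cuda.empty_cache()` between chunks releases cached GPU memory that PyTorch is no longer using.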

## Cost estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly rate | Daily rate | 4-hour session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use the **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across providers
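To budget a session, the hourly rates in the table can be multiplied out, optionally applying a Spot discount in the 30-50% range mentioned above. This is a back-of-the-envelope sketch: the rates are the approximate 2024 figures from the table, and `session_cost` is an illustrative helper rather than anything provided by CLORE.AI.

```python
# Approximate hourly rates in USD, copied from the table above.
HOURLY_USD = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}


def session_cost(gpu, hours, spot_discount=0.0):
    """Estimated cost in USD; spot_discount is a fraction such as 0.3 for 30%."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return HOURLY_USD[gpu] * hours * (1.0 - spot_discount)


for gpu in HOURLY_USD:
    on_demand = session_cost(gpu, 4)
    spot = session_cost(gpu, 4, spot_discount=0.3)
    print(f"{gpu}: 4 h on-demand ~${on_demand:.2f}, Spot at 30% off ~${spot:.2f}")
```

Because the table values are rounded, the computed figures can differ from it by a cent or two.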

## Next steps

* [Bark TTS](/guides/guides_v2-es/audio-y-voz/bark-tts.md) - Text-to-speech
* [RVC Voice Clone](/guides/guides_v2-es/audio-y-voz/rvc-voice-clone.md) - Training-based cloning
* [Whisper Transcription](/guides/guides_v2-es/audio-y-voz/whisper-transcription.md) - Speech-to-text

