# OpenVoice

Clone any voice with just seconds of audio using OpenVoice.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is OpenVoice?

OpenVoice is a voice cloning model from MyShell. It can:

* Clone a voice from \~10 seconds of reference audio
* Control emotion, accent, and rhythm
* Clone voices across languages (cross-lingual)
* Convert voices zero-shot, with no per-speaker training

## Requirements

| Task             | Min VRAM | Recommended GPU |
| ---------------- | -------- | --------------- |
| Inference        | 4GB      | RTX 3060        |
| Batch processing | 6GB      | RTX 3070        |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install git+https://github.com/myshell-ai/OpenVoice.git gradio && \
python -c "
import gradio as gr
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
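import torch
from huggingface_hub import snapshot_download

# Fetch the V2 checkpoints before loading the converter
snapshot_download(repo_id='myshell-ai/OpenVoiceV2', local_dir='checkpoints_v2')

ckpt_converter = 'checkpoints_v2/converter'
device = 'cuda'
tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')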

def clone(source_audio, reference_audio):
    source_se, _ = se_extractor.get_se(source_audio, tone_color_converter, vad=False)
    target_se, _ = se_extractor.get_se(reference_audio, tone_color_converter, vad=False)

    output_path = 'output.wav'
    tone_color_converter.convert(
        audio_src_path=source_audio,
        src_se=source_se,
        tgt_se=target_se,
        output_path=output_path
    )
    return output_path

demo = gr.Interface(
    fn=clone,
    inputs=[gr.Audio(type='filepath', label='Source'), gr.Audio(type='filepath', label='Target Voice')],
    outputs=gr.Audio(label='Cloned'),
    title='OpenVoice Clone'
)
demo.launch(server_name='0.0.0.0', server_port=7860)
"
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
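
As a small convenience, the substitution described above can be wrapped in a helper. This is a minimal sketch: `service_url` is a hypothetical name, and the host shown is the placeholder from the steps above, not a real deployment.

```python
from urllib.parse import urlunsplit

def service_url(http_pub, path="/"):
    """Build the public HTTPS URL for a service from its http_pub host."""
    host = http_pub.strip()
    # Drop a scheme if it was pasted along with the host.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    host = host.rstrip("/")
    if not path.startswith("/"):
        path = "/" + path
    return urlunsplit(("https", host, path, "", ""))

print(service_url("abc123.clorecloud.net"))
```

Pass a path such as `"/clone"` as the second argument to address a specific endpoint.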

## Installation

```bash
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice
pip install -e .

# Download the V2 checkpoints into checkpoints_v2/, the path the examples below load from
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='myshell-ai/OpenVoiceV2', local_dir='checkpoints_v2')"
```
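
The examples on this page load the converter from `checkpoints_v2/converter`, so it can help to confirm the files are in place before starting. The sketch below assumes that directory layout; `missing_converter_files` is a hypothetical helper, not part of OpenVoice.

```python
from pathlib import Path

def missing_converter_files(ckpt_dir="checkpoints_v2/converter"):
    """Return the converter files that are not present under ckpt_dir."""
    required = ("config.json", "checkpoint.pth")
    root = Path(ckpt_dir)
    return [str(root / name) for name in required if not (root / name).is_file()]

missing = missing_converter_files()
if missing:
    print("Missing checkpoint files:", ", ".join(missing))
else:
    print("Converter checkpoint looks complete.")
```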

## Basic Voice Cloning

```python
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
import torch

# Initialize
device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_converter = 'checkpoints_v2/converter'

tone_color_converter = ToneColorConverter(
    f'{ckpt_converter}/config.json',
    device=device
)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

# Extract speaker embeddings
source_se, _ = se_extractor.get_se("source_audio.wav", tone_color_converter, vad=False)
target_se, _ = se_extractor.get_se("target_voice.wav", tone_color_converter, vad=False)

# Convert voice
tone_color_converter.convert(
    audio_src_path="source_audio.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="output.wav"
)
```

## With Text-to-Speech

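Generate speech with MeloTTS, then convert it to the target voice. This example reuses `device` and `tone_color_converter` from Basic Voice Cloning and needs MeloTTS installed (`pip install git+https://github.com/myshell-ai/MeloTTS.git`, then `python -m unidic download`):

```python
from openvoice import se_extractor
from melo.api import TTS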

# Initialize TTS
tts = TTS(language='EN', device=device)
speaker_ids = tts.hps.data.spk2id

# Generate base speech
tts.tts_to_file("Hello, this is a test.", speaker_ids['EN-US'], "base.wav")

# Clone to target voice
source_se, _ = se_extractor.get_se("base.wav", tone_color_converter, vad=False)
target_se, _ = se_extractor.get_se("target_voice.wav", tone_color_converter, vad=False)

tone_color_converter.convert(
    audio_src_path="base.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="cloned_speech.wav"
)
```

## Multi-Language Support

```python
import torch
from melo.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Language codes supported by MeloTTS
languages = ['EN', 'ES', 'FR', 'ZH', 'JP', 'KR']

# English
tts_en = TTS(language='EN', device=device)
tts_en.tts_to_file("Hello world", tts_en.hps.data.spk2id['EN-US'], "en.wav")

# Chinese
tts_zh = TTS(language='ZH', device=device)
tts_zh.tts_to_file("你好世界", tts_zh.hps.data.spk2id['ZH'], "zh.wav")

# Japanese
tts_jp = TTS(language='JP', device=device)
tts_jp.tts_to_file("こんにちは", tts_jp.hps.data.spk2id['JP'], "jp.wav")
```

## Emotion Control

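Style control comes from the V1 base speaker model (`BaseSpeakerTTS`), where the style name is passed through the `speaker` argument. The example assumes the V1 checkpoints are available under `checkpoints/base_speakers/EN`:

```python
import torch
from openvoice.api import BaseSpeakerTTS

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_base = 'checkpoints/base_speakers/EN'  # assumed location of the V1 English base speaker

# Base TTS with styles
base_speaker_tts = BaseSpeakerTTS(
    f'{ckpt_base}/config.json',
    device=device
)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

# Available styles
styles = ['default', 'whispering', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']

for style in styles:
    base_speaker_tts.tts(
        "This is a test sentence.",
        f"output_{style}.wav",
        speaker=style,  # the style is selected via the speaker name
        language='English'
    )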
```

## Batch Processing

```python
import os
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

ckpt_converter = 'checkpoints_v2/converter'

tone_color_converter = ToneColorConverter(
    f'{ckpt_converter}/config.json',
    device='cuda'
)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

# Get target voice embedding once
target_se, _ = se_extractor.get_se("target_voice.wav", tone_color_converter, vad=False)

input_dir = "./audio_files"
output_dir = "./cloned"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.endswith(('.wav', '.mp3')):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, f"cloned_{filename}")

        source_se, _ = se_extractor.get_se(input_path, tone_color_converter, vad=False)

        tone_color_converter.convert(
            audio_src_path=input_path,
            src_se=source_se,
            tgt_se=target_se,
            output_path=output_path
        )
        print(f"Cloned: {filename}")
```
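
The file selection in the loop above can be factored into a small planning step. The sketch below is one possible variant: it matches extensions case-insensitively, processes files in sorted order, and names every output with a `.wav` suffix. Both `plan_batch` and the output naming are assumptions made for illustration.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3"}

def plan_batch(input_dir, output_dir):
    """Return sorted (input_path, output_path) pairs for a batch run."""
    out_root = Path(output_dir)
    pairs = []
    for path in sorted(Path(input_dir).iterdir()):
        # Compare the lowercased suffix so "CLIP.WAV" is picked up too.
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            pairs.append((path, out_root / f"cloned_{path.stem}.wav"))
    return pairs
```

Iterating over `plan_batch("./audio_files", "./cloned")` yields the paths to pass as `audio_src_path` and `output_path`; convert them with `str()` if the callee expects plain strings.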

## API Server

```python
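import os
import shutil
import tempfile

from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

app = FastAPI()

tone_color_converter = ToneColorConverter(
    'checkpoints_v2/converter/config.json',
    device='cuda'
)
tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')

def save_upload(upload: UploadFile) -> str:
    """Copy an uploaded file to a temporary .wav file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        shutil.copyfileobj(upload.file, tmp)
        return tmp.name

def remove_files(*paths: str) -> None:
    """Delete temporary files, ignoring any that are already gone."""
    for path in paths:
        if os.path.exists(path):
            os.remove(path)

# Plain `def` so FastAPI runs the blocking inference in its thread pool
@app.post("/clone")
def clone_voice(source: UploadFile, target: UploadFile):
    # Save uploaded files
    src_path = save_upload(source)
    tgt_path = save_upload(target)

    # Reserve an output path (tempfile.mktemp is deprecated and race-prone)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as out_tmp:
        output_path = out_tmp.name

    # Extract embeddings
    source_se, _ = se_extractor.get_se(src_path, tone_color_converter, vad=False)
    target_se, _ = se_extractor.get_se(tgt_path, tone_color_converter, vad=False)

    # Convert
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=output_path
    )

    # Remove the temporary files after the response has been sent
    cleanup = BackgroundTask(remove_files, src_path, tgt_path, output_path)
    return FileResponse(output_path, media_type="audio/wav", background=cleanup)

# Requires: pip install fastapi uvicorn python-multipart
# Run: uvicorn server:app --host 0.0.0.0 --port 8000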
```
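
To call the `/clone` endpoint, the client has to send both files as `multipart/form-data` under the field names `source` and `target`. The sketch below builds such a request with the standard library only; `encode_multipart` and `build_clone_request` are hypothetical helper names, and the base URL assumes the server runs on port 8000 as in the run command above.

```python
import uuid
import urllib.request

def encode_multipart(files):
    """Encode {field: (filename, bytes)} as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for field, (filename, payload) in files.items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            'Content-Type: audio/wav\r\n\r\n'
        )
        parts.append(header.encode() + payload + b'\r\n')
    parts.append(f'--{boundary}--\r\n'.encode())
    return b''.join(parts), f'multipart/form-data; boundary={boundary}'

def build_clone_request(base_url, source_bytes, target_bytes):
    """Build (but do not send) a POST request for the /clone endpoint."""
    body, content_type = encode_multipart({
        'source': ('source.wav', source_bytes),
        'target': ('target.wav', target_bytes),
    })
    return urllib.request.Request(
        f'{base_url}/clone',
        data=body,
        headers={'Content-Type': content_type},
        method='POST',
    )
```

To send it, pass the request to `urllib.request.urlopen(...)` and write the bytes from the response's `read()` to a `.wav` file.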

## Quality Tips

### For Best Results

* Use 10-30 seconds of clear reference audio
* Avoid background noise
* Use a reference clip that contains a single speaker
* Keep the speaking pace of the source and reference roughly similar
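
The length guideline is easy to check automatically. The following is a minimal sketch using Python's built-in `wave` module, which only reads PCM WAV files; the function names are hypothetical and the thresholds simply mirror the 10-30 second range above.

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def reference_length_warnings(path, min_seconds=10.0, max_seconds=30.0):
    """Return a list of warnings if the clip falls outside the suggested range."""
    duration = wav_duration_seconds(path)
    warnings = []
    if duration < min_seconds:
        warnings.append(f"{duration:.1f}s is shorter than {min_seconds:.0f}s")
    elif duration > max_seconds:
        warnings.append(f"{duration:.1f}s is longer than {max_seconds:.0f}s")
    return warnings
```

An empty list from `reference_length_warnings("target_voice.wav")` means the clip length is within the suggested range; it says nothing about noise or the number of speakers.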

### Audio Preprocessing

```python
import librosa
import soundfile as sf

def preprocess_audio(input_path, output_path, target_sr=22050):
    audio, sr = librosa.load(input_path, sr=target_sr)

    # Trim silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Normalize
    audio = librosa.util.normalize(audio)

    sf.write(output_path, audio, target_sr)
    return output_path

preprocess_audio("raw_reference.wav", "clean_reference.wav")
```

## Comparison with Other Tools

| Feature         | OpenVoice  | RVC      | Bark |
| --------------- | ---------- | -------- | ---- |
| Reference Audio | 10-30s     | 10+ min  | N/A  |
| Training        | Not needed | Required | N/A  |
| Speed           | Fast       | Medium   | Slow |
| Quality         | Great      | Best     | Good |
| Cross-lingual   | Yes        | Limited  | Yes  |

## Performance

| Task                | GPU      | Time |
| ------------------- | -------- | ---- |
| Extract embedding   | RTX 3090 | \~1s |
| Convert 10s audio   | RTX 3090 | \~2s |
| Convert 1 min audio | RTX 3090 | \~8s |
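
One way to read these figures is as a real-time factor (processing time divided by audio duration), where values below 1 mean faster than real time. The snippet below only reworks the approximate numbers from the table; it is not a new measurement.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# Approximate RTX 3090 conversion times taken from the table above
for label, processing, duration in [("10s clip", 2, 10), ("1 min clip", 8, 60)]:
    print(f"{label}: RTF ~ {real_time_factor(processing, duration):.2f}")
```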

## Troubleshooting

### Poor Voice Match

* Use longer reference audio
* Ensure clear audio quality
* Check for background noise

### Audio Artifacts

* Keep the TTS `speed` setting close to its default
* Use consistent audio format
* Check sampling rate match

### Out of Memory

* Process shorter clips
* Reduce batch size
* Clear CUDA cache
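
Processing shorter clips usually means splitting a long recording into fixed-length spans and converting them one at a time. The sketch below only computes the span boundaries; `chunk_spans` and the 30-second default are assumptions for illustration. Between chunks, `torch.cuda.empty_cache()` can be called to release cached GPU memory.

```python
def chunk_spans(total_seconds, chunk_seconds=30.0):
    """Split a duration into consecutive (start, end) spans in seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    total = float(total_seconds)
    spans = []
    start = 0.0
    while start < total:
        end = min(start + chunk_seconds, total)
        spans.append((start, end))
        start = end
    return spans

print(chunk_spans(75))
```

Each span can then be cut from the source audio (for example by slicing the sample array at `start * sample_rate` and `end * sample_rate`) before conversion.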

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
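
For a rough budget, session cost is the hourly rate multiplied by the number of hours. The sketch below uses the approximate hourly rates from the table above; `session_cost` and the `spot_discount` parameter are hypothetical, and the discount value should be treated as a guess within the 30-50% range mentioned, not a quoted price.

```python
# Approximate hourly rates in USD, copied from the table above
HOURLY_RATE_USD = {
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}

def session_cost(gpu, hours, spot_discount=0.0):
    """Estimate the cost of a session; spot_discount is a fraction such as 0.3."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return round(HOURLY_RATE_USD[gpu] * hours * (1.0 - spot_discount), 2)

print(session_cost("RTX 3060", 4))
print(session_cost("RTX 4090", 24, spot_discount=0.3))
```

Results can differ slightly from the table, whose daily and 4-hour figures are rounded.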

## Next Steps

* [Bark TTS](/guides/audio-and-voice/bark-tts.md) - Text-to-speech
* [RVC Voice Clone](/guides/audio-and-voice/rvc-voice-clone.md) - Training-based cloning
* [Whisper Transcription](/guides/audio-and-voice/whisper-transcription.md) - Speech-to-text

