# RVC Voice Clone

Clone and convert voices using Retrieval-based Voice Conversion.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is RVC?

RVC (Retrieval-based Voice Conversion) can:

* Clone a voice from minimal training data
* Convert singing and speaking voices
* Run real-time voice conversion
* Produce high-quality output

## Requirements

| Task      | Min VRAM | Recommended |
| --------- | -------- | ----------- |
| Inference | 4GB      | RTX 3060    |
| Training  | 8GB      | RTX 3090    |
| Real-time | 6GB      | RTX 3070    |
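The VRAM floors above can be encoded as a small helper for checking whether a rented GPU fits a task; a minimal sketch (the `meets_requirement` helper is illustrative, not part of RVC), with thresholds simply mirroring the table:

```python
# Minimum VRAM (GB) per task, mirroring the requirements table
MIN_VRAM_GB = {"inference": 4, "training": 8, "realtime": 6}

def meets_requirement(task: str, vram_gb: float) -> bool:
    """Return True if a GPU with `vram_gb` of VRAM can run `task`."""
    try:
        return vram_gb >= MIN_VRAM_GB[task.lower()]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}")

# Example: a 6 GB card covers inference and real-time, but not training
print(meets_requirement("training", 6))  # False
```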

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7865/http
```

**Command:**

```bash
apt-get update && apt-get install -y ffmpeg git && \
cd /workspace && \
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git && \
cd Retrieval-based-Voice-Conversion-WebUI && \
pip install -r requirements.txt && \
python infer-web.py --host 0.0.0.0 --port 7865
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
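The substitution can be sketched with the standard library; the `to_public_url` helper below is hypothetical (not part of RVC or CLORE tooling), and assumes `http_pub` endpoints are served over HTTPS on the default port:

```python
from urllib.parse import urlsplit, urlunsplit

def to_public_url(local_url: str, http_pub_host: str) -> str:
    """Swap a localhost example URL for the server's public http_pub URL."""
    parts = urlsplit(local_url)
    # Keep path/query/fragment, replace scheme and host
    return urlunsplit(("https", http_pub_host, parts.path, parts.query, parts.fragment))

print(to_public_url("http://localhost:7865/convert", "abc123.clorecloud.net"))
# https://abc123.clorecloud.net/convert
```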

## Installation

```bash
# Clone repository
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI

# Install dependencies
pip install -r requirements.txt

# Download models
python tools/download_models.py
```

## Voice Conversion (Inference)

### Using Web UI

1. Open your `http_pub` URL from **My Orders** (or `http://<proxy>:7865`)
2. Go to "Model Inference" tab
3. Upload audio file
4. Select voice model
5. Adjust settings
6. Click "Convert"

### Python API

```python
from vc_infer_pipeline import VC
import soundfile as sf

# Load model
model_path = "./models/my_voice.pth"
index_path = "./models/my_voice.index"

vc = VC(
    model_path=model_path,
    config_path="./configs/v2/48k.json",
    device="cuda"
)

# Convert audio
audio, sr = sf.read("input.wav")
output = vc.convert(
    audio=audio,
    f0_method="rmvpe",  # Pitch extraction method
    index_path=index_path,
    index_rate=0.75,
    f0_up_key=0,  # Pitch shift (semitones)
    protect=0.33
)

sf.write("output.wav", output, sr)
```

## Training Custom Voice

### Prepare Dataset

1. Collect 10-30 minutes of clean audio
2. Cut into 5-15 second clips
3. Remove background noise/music

```bash
# Split audio into clips
ffmpeg -i full_audio.mp3 -f segment -segment_time 10 -c copy clips/clip_%03d.mp3
```
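After splitting, it helps to verify that clips land in the 5-15 second range; a minimal stdlib sketch that partitions durations you have already measured (e.g. with `ffprobe`) — the `filter_clips` helper is illustrative:

```python
def filter_clips(durations_s, min_s=5.0, max_s=15.0):
    """Partition clip durations (seconds) into usable and rejected lists."""
    usable, rejected = [], []
    for d in durations_s:
        (usable if min_s <= d <= max_s else rejected).append(d)
    return usable, rejected

usable, rejected = filter_clips([3.2, 9.8, 10.0, 17.5])
print(len(usable), len(rejected))  # 2 2
```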

### Train via Web UI

1. Go to "Train" tab
2. Enter experiment name
3. Set training folder path
4. Click "Process data"
5. Click "Feature extraction"
6. Click "Train"

### Train via Command Line

```bash
# Step 1: Process audio
python trainset_preprocess_pipeline_print.py \
    "./dataset" \
    48000 \
    8 \
    "./logs/experiment" \
    False

# Step 2: Extract features
python extract_f0_print.py \
    "./logs/experiment" \
    8 \
    "rmvpe"

python extract_feature_print.py \
    "cuda:0" \
    "1" \
    "0" \
    "0" \
    "./logs/experiment" \
    "v2"

# Step 3: Train
python train_nsf_sim_cache_sid_load_pretrain.py \
    -e "experiment" \
    -sr "48k" \
    -f0 1 \
    -bs 8 \
    -g 0 \
    -te 200 \
    -se 20 \
    -pg "./pretrained/f0G48k.pth" \
    -pd "./pretrained/f0D48k.pth" \
    -l 0 \
    -c 1 \
    -sw 0 \
    -v "v2"
```
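To script step 3 (e.g. for sweeps over batch size or epochs), the argument list can be built programmatically; a hedged sketch — the flag names are copied from the command above, but the `build_train_cmd` helper itself is hypothetical:

```python
def build_train_cmd(exp, sample_rate="48k", batch_size=8, gpu="0",
                    total_epochs=200, save_every=20, version="v2"):
    """Assemble the argv list for train_nsf_sim_cache_sid_load_pretrain.py."""
    return [
        "python", "train_nsf_sim_cache_sid_load_pretrain.py",
        "-e", exp, "-sr", sample_rate, "-f0", "1",
        "-bs", str(batch_size), "-g", gpu,
        "-te", str(total_epochs), "-se", str(save_every),
        "-pg", f"./pretrained/f0G{sample_rate}.pth",
        "-pd", f"./pretrained/f0D{sample_rate}.pth",
        "-l", "0", "-c", "1", "-sw", "0", "-v", version,
    ]

cmd = build_train_cmd("experiment")
print(" ".join(cmd))
```

Pass the result to `subprocess.run(cmd)` rather than a shell string to avoid quoting issues.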

## Training Parameters

| Parameter   | Description          | Recommended |
| ----------- | -------------------- | ----------- |
| Sample Rate | Audio quality        | 48000       |
| Batch Size  | Training batch       | 8-16        |
| Epochs      | Training iterations  | 200-500     |
| Save Every  | Checkpoint frequency | 20-50       |
| f0 Method   | Pitch extraction     | rmvpe       |

## F0 Methods

| Method  | Quality | Speed  | Best For |
| ------- | ------- | ------ | -------- |
| pm      | OK      | Fast   | Testing  |
| harvest | Good    | Slow   | General  |
| crepe   | Great   | Medium | Singing  |
| rmvpe   | Best    | Medium | All      |
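The table boils down to a simple default rule; a sketch encoding the same recommendation (`pick_f0_method` is an illustrative helper, not an RVC API):

```python
def pick_f0_method(use_case: str) -> str:
    """Map a use case to the table's recommendation; rmvpe is the
    safe default for everything else."""
    return {"testing": "pm"}.get(use_case, "rmvpe")

print(pick_f0_method("singing"))  # rmvpe
```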

## Real-Time Conversion

### Setup

```python
import pyaudio
import numpy as np
from vc_infer_pipeline import VC

# Initialize
vc = VC(model_path="./models/voice.pth", device="cuda")

# Audio setup
CHUNK = 1024
FORMAT = pyaudio.paFloat32
CHANNELS = 1
RATE = 48000

p = pyaudio.PyAudio()
stream_in = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                   input=True, frames_per_buffer=CHUNK)
stream_out = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    output=True, frames_per_buffer=CHUNK)

# Real-time loop: naive chunk-by-chunk conversion; production setups
# buffer several chunks and crossfade to avoid boundary artifacts
while True:
    audio_in = np.frombuffer(stream_in.read(CHUNK), dtype=np.float32)
    audio_out = vc.convert(audio_in)
    stream_out.write(audio_out.tobytes())
```

## Model Formats

### Convert to ONNX

```python
import torch

# Load PyTorch model (map to CPU for export)
model = torch.load("model.pth", map_location="cpu")

# Dummy input matching the model's expected input shape
# (illustrative; adjust to your model's actual signature)
dummy_input = torch.randn(1, 48000)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["audio"],
    output_names=["converted"],
    dynamic_axes={"audio": {0: "length"}}
)
```

## Audio Preprocessing

### Remove Noise

```python
import noisereduce as nr
import soundfile as sf

audio, sr = sf.read("noisy.wav")
reduced_noise = nr.reduce_noise(y=audio, sr=sr)
sf.write("clean.wav", reduced_noise, sr)
```

### Normalize Volume

```python
from pydub import AudioSegment

audio = AudioSegment.from_wav("input.wav")
normalized = audio.normalize()
normalized.export("normalized.wav", format="wav")
```

### Remove Silence

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("input.wav")
chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)
combined = sum(chunks)
combined.export("no_silence.wav", format="wav")
```

## Batch Processing

```python
import os
from vc_infer_pipeline import VC
import soundfile as sf

vc = VC(model_path="./models/voice.pth", device="cuda")

input_dir = "./inputs"
output_dir = "./outputs"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.endswith(('.wav', '.mp3', '.flac')):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, f"converted_{filename}")

        audio, sr = sf.read(input_path)
        converted = vc.convert(audio)
        sf.write(output_path, converted, sr)

        print(f"Converted: {filename}")
```

## Singing Voice Conversion

For songs, use appropriate settings:

```python
output = vc.convert(
    audio=audio,
    f0_method="rmvpe",  # Best for singing
    index_rate=0.5,     # Lower for singing
    f0_up_key=-2,       # Adjust pitch to match
    protect=0.5         # Protect consonants
)
```

## Common Issues

### Voice Sounds Robotic

* Use higher quality source audio
* Increase protect value (0.4-0.5)
* Try different f0 method

### Pitch Issues

* Adjust f0\_up\_key
* Use rmvpe f0 method
* Ensure consistent pitch in training data

### Audio Quality

* Use 48kHz sample rate
* Remove background noise from training data
* Train for more epochs

## API Server

```python
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from vc_infer_pipeline import VC
import soundfile as sf
import tempfile

app = FastAPI()
vc = VC(model_path="./models/voice.pth", device="cuda")

@app.post("/convert")
async def convert_voice(file: UploadFile, pitch: int = 0):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_in:
        content = await file.read()
        tmp_in.write(content)
        tmp_in_path = tmp_in.name

    audio, sr = sf.read(tmp_in_path)
    converted = vc.convert(audio, f0_up_key=pitch)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_out:
        sf.write(tmp_out.name, converted, sr)
        return FileResponse(tmp_out.name, media_type="audio/wav")
```

## Training Tips

### For Better Quality

* Use 20+ minutes of clean audio
* Remove all background noise
* Consistent microphone/recording setup
* Include varied expressions/emotions

### For Faster Training

* Use 8-16 batch size
* Enable mixed precision
* Use NVMe SSD for dataset

## Performance

| Task                      | GPU      | Time          |
| ------------------------- | -------- | ------------- |
| Inference (1 min audio)   | RTX 3090 | \~5s          |
| Training (30 min dataset) | RTX 3090 | \~2 hours     |
| Real-time conversion      | RTX 3070 | <50ms latency |
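Assuming inference time scales roughly linearly with audio length (an assumption, not a benchmark), the table's ~5 s per minute on an RTX 3090 extrapolates as:

```python
def estimate_inference_seconds(audio_minutes: float, sec_per_min: float = 5.0) -> float:
    """Rough inference-time estimate, assuming linear scaling from the
    benchmark table (~5 s per minute of audio on an RTX 3090)."""
    return audio_minutes * sec_per_min

print(estimate_inference_seconds(10))  # 50.0
```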

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
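Combining the rate table with the training-time figures above gives a rough budget; a sketch assuming the approximate hourly rates as listed and linear billing:

```python
# Approximate hourly rates (USD) from the table above
HOURLY_RATE_USD = {
    "RTX 3060": 0.03, "RTX 3090": 0.06, "RTX 4090": 0.10,
    "A100 40GB": 0.17, "A100 80GB": 0.25,
}

def estimate_cost(gpu: str, hours: float) -> float:
    """Approximate rental cost in USD for `hours` on `gpu`."""
    return round(HOURLY_RATE_USD[gpu] * hours, 2)

# e.g. a ~2-hour training run on an RTX 3090
print(estimate_cost("RTX 3090", 2))  # 0.12
```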

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Bark TTS](/guides/audio-and-voice/bark-tts.md) - Text-to-speech
* [AudioCraft Music](/guides/audio-and-voice/audiocraft-music.md) - Music generation
* [Whisper Transcription](/guides/audio-and-voice/whisper-transcription.md) - Speech-to-text

