# RVC Voice Clone

Clone and convert voices using Retrieval-based Voice Conversion.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is RVC?

RVC (Retrieval-based Voice Conversion) can:

* Clone a voice from a small amount of training audio
* Convert both singing and speaking voices
* Run in real time with low latency
* Produce high-quality, natural-sounding output

## Requirements

| Task      | Min VRAM | Recommended |
| --------- | -------- | ----------- |
| Inference | 4GB      | RTX 3060    |
| Training  | 8GB      | RTX 3090    |
| Real-time | 6GB      | RTX 3070    |
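To confirm a rented server meets these VRAM figures, you can query `nvidia-smi` after connecting over SSH. A minimal sketch (assumes `nvidia-smi` is on the PATH, as it is on CUDA Docker images):

```python
import shutil
import subprocess

def meets_vram(total_mib: int, min_gb: float) -> bool:
    """True if the reported VRAM (in MiB) covers the requirement."""
    return total_mib / 1024 >= min_gb

# Query each visible GPU's name and total memory in MiB
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        name, mem = (s.strip() for s in line.split(","))
        status = "OK for training" if meets_vram(int(mem), 8) else "below 8GB"
        print(f"{name}: {status}")
else:
    print("nvidia-smi not found")
```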

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7865/http
```

**Command:**

```bash
apt-get update && apt-get install -y ffmpeg git && \
cd /workspace && \
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git && \
cd Retrieval-based-Voice-Conversion-WebUI && \
pip install -r requirements.txt && \
python infer-web.py --host 0.0.0.0 --port 7865
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
# Clone repository
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI

# Install dependencies
pip install -r requirements.txt

# Download models
python tools/download_models.py
```

## Voice Conversion (Inference)

### Using Web UI

1. Open the web UI at your `http_pub` URL (or `http://<proxy>:7865` via the mapped port)
2. Go to "Model Inference" tab
3. Upload audio file
4. Select voice model
5. Adjust settings
6. Click "Convert"

### Python API

```python
from vc_infer_pipeline import VC
import soundfile as sf

# Load model
model_path = "./models/my_voice.pth"
index_path = "./models/my_voice.index"

vc = VC(
    model_path=model_path,
    config_path="./configs/v2/48k.json",
    device="cuda"
)

# Convert audio
audio, sr = sf.read("input.wav")
output = vc.convert(
    audio=audio,
    f0_method="rmvpe",  # Pitch extraction method
    index_path=index_path,
    index_rate=0.75,
    f0_up_key=0,  # Pitch shift (semitones)
    protect=0.33
)

sf.write("output.wav", output, sr)
```

## Training Custom Voice

### Prepare Dataset

1. Collect 10-30 minutes of clean audio
2. Cut into 5-15 second clips
3. Remove background noise/music

```bash
# Split audio into clips
ffmpeg -i full_audio.mp3 -f segment -segment_time 10 -c copy clips/clip_%03d.mp3
```
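After splitting, it helps to verify that every clip actually lands in the 5-15 second window before training. A quick check (sketch; assumes clips were exported to a `clips/` folder in a format `soundfile` can read, e.g. WAV or FLAC):

```python
import os

MIN_S, MAX_S = 5.0, 15.0

def usable(duration_s: float) -> bool:
    """Clips in the 5-15 s window suit RVC training."""
    return MIN_S <= duration_s <= MAX_S

# Flag clips outside the window if soundfile is available
try:
    import soundfile as sf
    for name in sorted(os.listdir("clips")):
        info = sf.info(os.path.join("clips", name))
        dur = info.frames / info.samplerate
        if not usable(dur):
            print(f"{name}: {dur:.1f}s - trim or drop")
except (ImportError, FileNotFoundError):
    pass
```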

### Train via Web UI

1. Go to "Train" tab
2. Enter experiment name
3. Set training folder path
4. Click "Process data"
5. Click "Feature extraction"
6. Click "Train"

### Train via Command Line

```bash
# Step 1: Process audio
python trainset_preprocess_pipeline_print.py \
    "./dataset" \
    48000 \
    8 \
    "./logs/experiment" \
    False

# Step 2: Extract features
python extract_f0_print.py \
    "./logs/experiment" \
    8 \
    "rmvpe"

python extract_feature_print.py \
    "cuda:0" \
    "1" \
    "0" \
    "0" \
    "./logs/experiment" \
    "v2"

# Step 3: Train
python train_nsf_sim_cache_sid_load_pretrain.py \
    -e "experiment" \
    -sr "48k" \
    -f0 1 \
    -bs 8 \
    -g 0 \
    -te 200 \
    -se 20 \
    -pg "./pretrained/f0G48k.pth" \
    -pd "./pretrained/f0D48k.pth" \
    -l 0 \
    -c 1 \
    -sw 0 \
    -v "v2"
```

## Training Parameters

| Parameter   | Description          | Recommended |
| ----------- | -------------------- | ----------- |
| Sample Rate | Audio quality        | 48000       |
| Batch Size  | Training batch       | 8-16        |
| Epochs      | Training iterations  | 200-500     |
| Save Every  | Checkpoint frequency | 20-50       |
| f0 Method   | Pitch extraction     | rmvpe       |

## F0 Methods

| Method  | Quality | Speed  | Best For |
| ------- | ------- | ------ | -------- |
| pm      | OK      | Fast   | Testing  |
| harvest | Good    | Slow   | General  |
| crepe   | Great   | Medium | Singing  |
| rmvpe   | Best    | Medium | All      |

## Real-Time Conversion

### Setup

```python
import pyaudio
import numpy as np
from vc_infer_pipeline import VC

# Initialize
vc = VC(model_path="./models/voice.pth", device="cuda")

# Audio setup
CHUNK = 1024
FORMAT = pyaudio.paFloat32
CHANNELS = 1
RATE = 48000

p = pyaudio.PyAudio()
stream_in = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                   input=True, frames_per_buffer=CHUNK)
stream_out = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    output=True, frames_per_buffer=CHUNK)

# Real-time loop (naive chunk-by-chunk conversion; production setups buffer
# with overlap to avoid artifacts at chunk boundaries)
try:
    while True:
        audio_in = np.frombuffer(stream_in.read(CHUNK), dtype=np.float32)
        audio_out = vc.convert(audio_in)
        stream_out.write(audio_out.astype(np.float32).tobytes())
except KeyboardInterrupt:
    stream_in.close()
    stream_out.close()
    p.terminate()
```

## Model Formats

### Convert to ONNX

```python
import torch

# Load the PyTorch checkpoint (assumes it was saved as a full model object,
# not just a state_dict)
model = torch.load("model.pth", map_location="cpu")
model.eval()

# Dummy 1-D audio tensor; adjust the length to the model's expected input
dummy_input = torch.randn(48000)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["audio"],
    output_names=["converted"],
    dynamic_axes={"audio": {0: "length"}}
)
```

## Audio Preprocessing

### Remove Noise

```python
import noisereduce as nr
import soundfile as sf

audio, sr = sf.read("noisy.wav")
reduced_noise = nr.reduce_noise(y=audio, sr=sr)
sf.write("clean.wav", reduced_noise, sr)
```

### Normalize Volume

```python
from pydub import AudioSegment

audio = AudioSegment.from_wav("input.wav")
normalized = audio.normalize()
normalized.export("normalized.wav", format="wav")
```

### Remove Silence

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("input.wav")
chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)
combined = sum(chunks, AudioSegment.empty())  # start from an empty segment so concatenation is well-defined
combined.export("no_silence.wav", format="wav")
```
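The three preprocessing steps above are order-sensitive: denoise first so normalization doesn't amplify the noise floor, and strip silence last so the threshold is applied to the leveled signal. A small helper makes that ordering explicit (toy numeric stand-ins here; with `noisereduce` and `pydub` installed, the functions from the sections above would slot in):

```python
def run_pipeline(audio, steps):
    """Apply preprocessing steps in order; each step maps audio -> audio."""
    for step in steps:
        audio = step(audio)
    return audio

# Toy stand-ins for denoise -> normalize (real steps take/return audio objects)
steps = [lambda x: x + 1, lambda x: x * 2]
print(run_pipeline(3, steps))  # (3 + 1) * 2 = 8
```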

## Batch Processing

```python
import os
from vc_infer_pipeline import VC
import soundfile as sf

vc = VC(model_path="./models/voice.pth", device="cuda")

input_dir = "./inputs"
output_dir = "./outputs"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.endswith(('.wav', '.mp3', '.flac')):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, f"converted_{filename}")

        audio, sr = sf.read(input_path)
        converted = vc.convert(audio)
        sf.write(output_path, converted, sr)

        print(f"Converted: {filename}")
```

## Singing Voice Conversion

For songs, these settings typically work better than the speech defaults:

```python
output = vc.convert(
    audio=audio,
    f0_method="rmvpe",  # Best for singing
    index_rate=0.5,     # Lower for singing
    f0_up_key=-2,       # Adjust pitch to match
    protect=0.5         # Protect consonants
)
```

## Common Issues

### Voice Sounds Robotic

* Use higher quality source audio
* Increase protect value (0.4-0.5)
* Try different f0 method

### Pitch Issues

* Adjust `f0_up_key`
* Use the `rmvpe` f0 method
* Ensure consistent pitch in training data
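A sensible starting value for `f0_up_key` is the semitone distance between the source and target speakers' median pitch (the medians themselves can be estimated with a pitch tracker such as `librosa.pyin`; the 120/220 Hz figures below are illustrative, not measured):

```python
import math

def semitone_shift(source_hz: float, target_hz: float) -> int:
    """Starting f0_up_key: semitones from the source's median f0 to the target's."""
    return round(12 * math.log2(target_hz / source_hz))

# Illustrative medians: ~120 Hz (typical male) -> ~220 Hz (typical female)
print(semitone_shift(120.0, 220.0))  # 10
print(semitone_shift(220.0, 110.0))  # -12 (one octave down)
```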

### Audio Quality

* Use 48kHz sample rate
* Remove background noise from training data
* Train for more epochs
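If training data isn't already at 48 kHz, resample it up front rather than relying on implicit conversions in the pipeline. The helper below just predicts the resulting frame count; the commented call shows the actual resampling (an assumption: `librosa` >= 0.10 and `soundfile` installed):

```python
def resampled_frames(n_frames: int, sr_in: int, sr_out: int) -> int:
    """Frame count after resampling from sr_in to sr_out."""
    return round(n_frames * sr_out / sr_in)

# One minute of 44.1 kHz audio becomes this many frames at 48 kHz
print(resampled_frames(44100 * 60, 44100, 48000))  # 2880000

# The resampling itself (assumes librosa >= 0.10 and soundfile):
# import librosa, soundfile as sf
# y, sr = librosa.load("input.wav", sr=None)
# sf.write("input_48k.wav", librosa.resample(y, orig_sr=sr, target_sr=48000), 48000)
```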

## API Server

```python
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from vc_infer_pipeline import VC
import soundfile as sf
import tempfile

app = FastAPI()
vc = VC(model_path="./models/voice.pth", device="cuda")

@app.post("/convert")
async def convert_voice(file: UploadFile, pitch: int = 0):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_in:
        content = await file.read()
        tmp_in.write(content)
        tmp_in_path = tmp_in.name

    audio, sr = sf.read(tmp_in_path)
    converted = vc.convert(audio, f0_up_key=pitch)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_out:
        sf.write(tmp_out.name, converted, sr)
        return FileResponse(tmp_out.name, media_type="audio/wav")
```
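The endpoint above can be exercised from any HTTP client. A sketch using `requests` (the `abc123.clorecloud.net` host is a placeholder for your own `http_pub` URL; `pitch` travels as a query parameter, matching the FastAPI signature):

```python
import requests

def build_request(server: str, pitch: int):
    """URL and query parameters for the /convert endpoint."""
    return f"{server}/convert", {"pitch": pitch}

def convert(server: str, wav_path: str, pitch: int = 0) -> bytes:
    """POST a WAV file to the server and return the converted audio bytes."""
    url, params = build_request(server, pitch)
    with open(wav_path, "rb") as f:
        resp = requests.post(
            url,
            params=params,
            files={"file": ("input.wav", f, "audio/wav")},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.content

# audio = convert("https://abc123.clorecloud.net", "input.wav", pitch=2)
```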

## Training Tips

### For Better Quality

* Use 20+ minutes of clean audio
* Remove all background noise
* Consistent microphone/recording setup
* Include varied expressions/emotions

### For Faster Training

* Use 8-16 batch size
* Enable mixed precision
* Use NVMe SSD for dataset

## Performance

| Task                      | GPU      | Time          |
| ------------------------- | -------- | ------------- |
| Inference (1 min audio)   | RTX 3090 | \~5s          |
| Training (30 min dataset) | RTX 3090 | \~2 hours     |
| Real-time conversion      | RTX 3070 | <50ms latency |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Bark TTS](https://docs.clore.ai/guides/audio-and-voice/bark-tts) - Text-to-speech
* [AudioCraft Music](https://docs.clore.ai/guides/audio-and-voice/audiocraft-music) - Music generation
* [Whisper Transcription](https://docs.clore.ai/guides/audio-and-voice/whisper-transcription) - Speech-to-text
