# RVC Voice Clone

Clone and convert voices using Retrieval-based Voice Conversion.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is RVC?

RVC (Retrieval-based Voice Conversion) can:

* Clone a voice from a small amount of training audio
* Convert both singing and speaking voices
* Run in real time with low latency
* Produce high-quality, natural-sounding output

## Requirements

| Task      | Min VRAM | Recommended |
| --------- | -------- | ----------- |
| Inference | 4GB      | RTX 3060    |
| Training  | 8GB      | RTX 3090    |
| Real-time | 6GB      | RTX 3070    |
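To confirm a rented server meets these VRAM figures, you can query `nvidia-smi` after connecting over SSH. A minimal sketch (assumes `nvidia-smi` is on the PATH, as it is on CUDA Docker images):

```python
import shutil
import subprocess

def meets_vram(total_mib: int, min_gb: float) -> bool:
    """True if the reported VRAM (in MiB) covers the requirement."""
    return total_mib / 1024 >= min_gb

# Query each visible GPU's name and total memory in MiB
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        name, mem = (s.strip() for s in line.split(","))
        status = "OK for training" if meets_vram(int(mem), 8) else "below 8GB"
        print(f"{name}: {status}")
else:
    print("nvidia-smi not found")
```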

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7865/http
```

**Command:**

```bash
apt-get update && apt-get install -y ffmpeg git && \
cd /workspace && \
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git && \
cd Retrieval-based-Voice-Conversion-WebUI && \
pip install -r requirements.txt && \
python infer-web.py --host 0.0.0.0 --port 7865
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash
# Clone repository
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI

# Install dependencies
pip install -r requirements.txt

# Download models
python tools/download_models.py
```

## Voice Conversion (Inference)

### Using Web UI

1. Open the web UI at your `http_pub` URL (or `http://<proxy>:7865` via the mapped port)
2. Go to "Model Inference" tab
3. Upload audio file
4. Select voice model
5. Adjust settings
6. Click "Convert"

### Python API

```python
from vc_infer_pipeline import VC
import soundfile as sf

# Load model
model_path = "./models/my_voice.pth"
index_path = "./models/my_voice.index"

vc = VC(
    model_path=model_path,
    config_path="./configs/v2/48k.json",
    device="cuda"
)

# Convert audio
audio, sr = sf.read("input.wav")
output = vc.convert(
    audio=audio,
    f0_method="rmvpe",  # Pitch extraction method
    index_path=index_path,
    index_rate=0.75,
    f0_up_key=0,  # Pitch shift (semitones)
    protect=0.33
)

sf.write("output.wav", output, sr)
```

## Training Custom Voice

### Prepare Dataset

1. Collect 10-30 minutes of clean audio
2. Cut into 5-15 second clips
3. Remove background noise/music

```bash
# Split audio into clips
ffmpeg -i full_audio.mp3 -f segment -segment_time 10 -c copy clips/clip_%03d.mp3
```
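After splitting, it helps to verify that every clip actually lands in the 5-15 second window before training. A quick check (sketch; assumes clips were exported to a `clips/` folder in a format `soundfile` can read, e.g. WAV or FLAC):

```python
import os

MIN_S, MAX_S = 5.0, 15.0

def usable(duration_s: float) -> bool:
    """Clips in the 5-15 s window suit RVC training."""
    return MIN_S <= duration_s <= MAX_S

# Flag clips outside the window if soundfile is available
try:
    import soundfile as sf
    for name in sorted(os.listdir("clips")):
        info = sf.info(os.path.join("clips", name))
        dur = info.frames / info.samplerate
        if not usable(dur):
            print(f"{name}: {dur:.1f}s - trim or drop")
except (ImportError, FileNotFoundError):
    pass
```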

### Train via Web UI

1. Go to "Train" tab
2. Enter experiment name
3. Set training folder path
4. Click "Process data"
5. Click "Feature extraction"
6. Click "Train"

### Train via Command Line

```bash
# Step 1: Process audio
python trainset_preprocess_pipeline_print.py \
    "./dataset" \
    48000 \
    8 \
    "./logs/experiment" \
    False

# Step 2: Extract features
python extract_f0_print.py \
    "./logs/experiment" \
    8 \
    "rmvpe"

python extract_feature_print.py \
    "cuda:0" \
    "1" \
    "0" \
    "0" \
    "./logs/experiment" \
    "v2"

# Step 3: Train
python train_nsf_sim_cache_sid_load_pretrain.py \
    -e "experiment" \
    -sr "48k" \
    -f0 1 \
    -bs 8 \
    -g 0 \
    -te 200 \
    -se 20 \
    -pg "./pretrained/f0G48k.pth" \
    -pd "./pretrained/f0D48k.pth" \
    -l 0 \
    -c 1 \
    -sw 0 \
    -v "v2"
```

## Training Parameters

| Parameter   | Description          | Recommended |
| ----------- | -------------------- | ----------- |
| Sample Rate | Audio quality        | 48000       |
| Batch Size  | Training batch       | 8-16        |
| Epochs      | Training iterations  | 200-500     |
| Save Every  | Checkpoint frequency | 20-50       |
| f0 Method   | Pitch extraction     | rmvpe       |

## F0 Methods

| Method  | Quality | Speed  | Best For |
| ------- | ------- | ------ | -------- |
| pm      | OK      | Fast   | Testing  |
| harvest | Good    | Slow   | General  |
| crepe   | Great   | Medium | Singing  |
| rmvpe   | Best    | Medium | All      |

## Real-Time Conversion

### Setup

```python
import pyaudio
import numpy as np
from vc_infer_pipeline import VC

# Initialize
vc = VC(model_path="./models/voice.pth", device="cuda")

# Audio setup
CHUNK = 1024
FORMAT = pyaudio.paFloat32
CHANNELS = 1
RATE = 48000

p = pyaudio.PyAudio()
stream_in = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                   input=True, frames_per_buffer=CHUNK)
stream_out = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    output=True, frames_per_buffer=CHUNK)

# Real-time loop (naive chunk-by-chunk conversion; production setups buffer
# with overlap to avoid artifacts at chunk boundaries)
try:
    while True:
        audio_in = np.frombuffer(stream_in.read(CHUNK), dtype=np.float32)
        audio_out = vc.convert(audio_in)
        stream_out.write(audio_out.astype(np.float32).tobytes())
except KeyboardInterrupt:
    stream_in.close()
    stream_out.close()
    p.terminate()
```

## Model Formats

### Convert to ONNX

```python
import torch

# Load the PyTorch checkpoint (assumes it was saved as a full model object,
# not just a state_dict)
model = torch.load("model.pth", map_location="cpu")
model.eval()

# Dummy 1-D audio tensor; adjust the length to the model's expected input
dummy_input = torch.randn(48000)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["audio"],
    output_names=["converted"],
    dynamic_axes={"audio": {0: "length"}}
)
```

## Audio Preprocessing

### Remove Noise

```python
import noisereduce as nr
import soundfile as sf

audio, sr = sf.read("noisy.wav")
reduced_noise = nr.reduce_noise(y=audio, sr=sr)
sf.write("clean.wav", reduced_noise, sr)
```

### Normalize Volume

```python
from pydub import AudioSegment

audio = AudioSegment.from_wav("input.wav")
normalized = audio.normalize()
normalized.export("normalized.wav", format="wav")
```

### Remove Silence

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("input.wav")
chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)
combined = sum(chunks, AudioSegment.empty())  # start from an empty segment so concatenation is well-defined
combined.export("no_silence.wav", format="wav")
```
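The three preprocessing steps above are order-sensitive: denoise first so normalization doesn't amplify the noise floor, and strip silence last so the threshold is applied to the leveled signal. A small helper makes that ordering explicit (toy numeric stand-ins here; with `noisereduce` and `pydub` installed, the functions from the sections above would slot in):

```python
def run_pipeline(audio, steps):
    """Apply preprocessing steps in order; each step maps audio -> audio."""
    for step in steps:
        audio = step(audio)
    return audio

# Toy stand-ins for denoise -> normalize (real steps take/return audio objects)
steps = [lambda x: x + 1, lambda x: x * 2]
print(run_pipeline(3, steps))  # (3 + 1) * 2 = 8
```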

## Batch Processing

```python
import os
from vc_infer_pipeline import VC
import soundfile as sf

vc = VC(model_path="./models/voice.pth", device="cuda")

input_dir = "./inputs"
output_dir = "./outputs"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.endswith(('.wav', '.mp3', '.flac')):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, f"converted_{filename}")

        audio, sr = sf.read(input_path)
        converted = vc.convert(audio)
        sf.write(output_path, converted, sr)

        print(f"Converted: {filename}")
```

## Singing Voice Conversion

For songs, these settings typically work better than the speech defaults:

```python
output = vc.convert(
    audio=audio,
    f0_method="rmvpe",  # Best for singing
    index_rate=0.5,     # Lower for singing
    f0_up_key=-2,       # Adjust pitch to match
    protect=0.5         # Protect consonants
)
```

## Common Issues

### Voice Sounds Robotic

* Use higher quality source audio
* Increase protect value (0.4-0.5)
* Try different f0 method

### Pitch Issues

* Adjust `f0_up_key`
* Use the `rmvpe` f0 method
* Ensure consistent pitch in training data
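A sensible starting value for `f0_up_key` is the semitone distance between the source and target speakers' median pitch (the medians themselves can be estimated with a pitch tracker such as `librosa.pyin`; the 120/220 Hz figures below are illustrative, not measured):

```python
import math

def semitone_shift(source_hz: float, target_hz: float) -> int:
    """Starting f0_up_key: semitones from the source's median f0 to the target's."""
    return round(12 * math.log2(target_hz / source_hz))

# Illustrative medians: ~120 Hz (typical male) -> ~220 Hz (typical female)
print(semitone_shift(120.0, 220.0))  # 10
print(semitone_shift(220.0, 110.0))  # -12 (one octave down)
```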

### Audio Quality

* Use 48kHz sample rate
* Remove background noise from training data
* Train for more epochs
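If training data isn't already at 48 kHz, resample it up front rather than relying on implicit conversions in the pipeline. The helper below just predicts the resulting frame count; the commented call shows the actual resampling (an assumption: `librosa` >= 0.10 and `soundfile` installed):

```python
def resampled_frames(n_frames: int, sr_in: int, sr_out: int) -> int:
    """Frame count after resampling from sr_in to sr_out."""
    return round(n_frames * sr_out / sr_in)

# One minute of 44.1 kHz audio becomes this many frames at 48 kHz
print(resampled_frames(44100 * 60, 44100, 48000))  # 2880000

# The resampling itself (assumes librosa >= 0.10 and soundfile):
# import librosa, soundfile as sf
# y, sr = librosa.load("input.wav", sr=None)
# sf.write("input_48k.wav", librosa.resample(y, orig_sr=sr, target_sr=48000), 48000)
```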

## API Server

```python
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from vc_infer_pipeline import VC
import soundfile as sf
import tempfile

app = FastAPI()
vc = VC(model_path="./models/voice.pth", device="cuda")

@app.post("/convert")
async def convert_voice(file: UploadFile, pitch: int = 0):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_in:
        content = await file.read()
        tmp_in.write(content)
        tmp_in_path = tmp_in.name

    audio, sr = sf.read(tmp_in_path)
    converted = vc.convert(audio, f0_up_key=pitch)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_out:
        sf.write(tmp_out.name, converted, sr)
        return FileResponse(tmp_out.name, media_type="audio/wav")
```
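The endpoint above can be exercised from any HTTP client. A sketch using `requests` (the `abc123.clorecloud.net` host is a placeholder for your own `http_pub` URL; `pitch` travels as a query parameter, matching the FastAPI signature):

```python
import requests

def build_request(server: str, pitch: int):
    """URL and query parameters for the /convert endpoint."""
    return f"{server}/convert", {"pitch": pitch}

def convert(server: str, wav_path: str, pitch: int = 0) -> bytes:
    """POST a WAV file to the server and return the converted audio bytes."""
    url, params = build_request(server, pitch)
    with open(wav_path, "rb") as f:
        resp = requests.post(
            url,
            params=params,
            files={"file": ("input.wav", f, "audio/wav")},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.content

# audio = convert("https://abc123.clorecloud.net", "input.wav", pitch=2)
```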

## Training Tips

### For Better Quality

* Use 20+ minutes of clean audio
* Remove all background noise
* Consistent microphone/recording setup
* Include varied expressions/emotions

### For Faster Training

* Use 8-16 batch size
* Enable mixed precision
* Use NVMe SSD for dataset

## Performance

| Task                      | GPU      | Time          |
| ------------------------- | -------- | ------------- |
| Inference (1 min audio)   | RTX 3090 | \~5s          |
| Training (30 min dataset) | RTX 3090 | \~2 hours     |
| Real-time conversion      | RTX 3070 | <50ms latency |

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* [Bark TTS](https://docs.clore.ai/guides/audio-and-voice/bark-tts) - Text-to-speech
* [AudioCraft Music](https://docs.clore.ai/guides/audio-and-voice/audiocraft-music) - Music generation
* [Whisper Transcription](https://docs.clore.ai/guides/audio-and-voice/whisper-transcription) - Speech-to-text
