# Whisper Transcription

Transcribe audio and video files using OpenAI's Whisper on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum       | Recommended      |
| ------------ | ------------- | ---------------- |
| RAM          | 8GB           | 16GB+            |
| VRAM         | 4GB (small)   | 10GB+ (large-v3) |
| Network      | 200Mbps       | 500Mbps+         |
| Startup Time | \~1-2 minutes | -                |

## What is Whisper?

OpenAI Whisper is a speech recognition model that can:

* Transcribe audio in 99 languages
* Translate to English
* Generate timestamps
* Handle noisy audio

## Model Sizes

| Model              | VRAM    | Speed             | Quality  | Notes                                        |
| ------------------ | ------- | ----------------- | -------- | -------------------------------------------- |
| tiny               | 1GB     | \~32x realtime    | Basic    | Fastest, lowest accuracy                     |
| base               | 1GB     | \~16x realtime    | Good     | Good balance for quick tasks                 |
| small              | 2GB     | \~6x realtime     | Better   | Recommended for most use cases               |
| medium             | 5GB     | \~2x realtime     | Great    | High accuracy, moderate speed                |
| large-v3           | 10GB    | \~1x realtime     | Best     | Highest accuracy                             |
| **large-v3-turbo** | **6GB** | **\~8x realtime** | **Best** | **8x faster than large-v3, similar quality** |

> **💡 Recommendation:** Use `large-v3-turbo` for the best speed/quality tradeoff. It delivers comparable accuracy to `large-v3` at 8x the speed with lower VRAM requirements.
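
A realtime factor translates directly into wall-clock time: at \~8x realtime, one hour of audio is transcribed in roughly 60 / 8 ≈ 7.5 minutes. A back-of-the-envelope helper (illustrative only; actual throughput depends on GPU, audio, and settings):

```python
# Rough estimate of transcription time from the realtime factors above.
# Illustrative only - actual throughput depends on GPU, audio, and settings.
def estimated_minutes(audio_minutes: float, realtime_factor: float) -> float:
    return audio_minutes / realtime_factor

print(estimated_minutes(60, 8))  # large-v3-turbo: ~7.5 min for 1h of audio
print(estimated_minutes(60, 1))  # large-v3: ~60 min for the same hour
```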

### Using large-v3-turbo

```python
import whisper

# Load large-v3-turbo (8x faster than large-v3, similar quality)
model = whisper.load_model("large-v3-turbo", device="cuda")

result = model.transcribe("audio.mp3")
print(result["text"])
```

With Faster-Whisper:

```python
from faster_whisper import WhisperModel

# large-v3-turbo via faster-whisper
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

***

## WhisperX: Enhanced Alternative

For **word-level timestamps**, **speaker diarization**, and **up to 70x realtime** processing, consider [WhisperX](https://docs.clore.ai/guides/audio-and-voice/whisperx):

```bash
pip install whisperx
```

```python
import whisperx

model = whisperx.load_model("large-v3-turbo", device="cuda", compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device="cuda")
result = whisperx.align(result["segments"], model_a, metadata, audio, device="cuda")

# result["word_segments"] contains word-level timestamps
```

➡️ See the full [WhisperX guide](https://docs.clore.ai/guides/audio-and-voice/whisperx) for speaker diarization and advanced features.

## Quick Deploy (Recommended)

Use the pre-built Faster-Whisper server for instant deployment:

**Docker Image:**

```
fedirz/faster-whisper-server:latest-cuda
```

**Ports:**

```
22/tcp
8000/http
```

**No command needed** - server starts automatically.

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
curl https://your-http-pub.clorecloud.net/

# Expected: server info JSON response
```

{% hint style="warning" %}
If you get HTTP 502, wait 1-2 minutes - the service is still starting.
{% endhint %}

### Transcribe via API

```bash
# Transcribe audio file
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "response_format=json"

# With timestamps
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
```

## Complete API Reference (Faster-Whisper-Server)

### Endpoints

| Endpoint                   | Method | Description                          |
| -------------------------- | ------ | ------------------------------------ |
| `/v1/audio/transcriptions` | POST   | Transcribe audio (OpenAI-compatible) |
| `/v1/audio/translations`   | POST   | Translate audio to English           |
| `/v1/models`               | GET    | List all available models            |
| `/v1/models/{model_name}`  | GET    | Get specific model info              |
| `/api/ps`                  | GET    | List currently loaded models         |
| `/api/ps/{model_name}`     | GET    | Check if specific model is loaded    |
| `/api/pull/{model_name}`   | POST   | Download and load a model            |
| `/health`                  | GET    | Health check endpoint                |
| `/docs`                    | GET    | Swagger UI documentation             |
| `/openapi.json`            | GET    | OpenAPI specification                |

#### List Available Models

```bash
curl https://your-http-pub.clorecloud.net/v1/models
```

Response:

```json
{
  "data": [
    {"id": "Systran/faster-whisper-large-v3", "object": "model"},
    {"id": "Systran/faster-whisper-medium", "object": "model"}
  ]
}
```

#### Swagger Documentation

Open in browser for interactive API testing:

```
https://your-http-pub.clorecloud.net/docs
```

### Transcription Options

| Parameter                   | Type   | Description                                               |
| --------------------------- | ------ | --------------------------------------------------------- |
| `file`                      | File   | Audio file to transcribe                                  |
| `model`                     | String | Model to use (default: `Systran/faster-whisper-large-v3`) |
| `language`                  | String | Force specific language (e.g., `en`, `ja`, `ru`)          |
| `response_format`           | String | `json`, `text`, `srt`, `vtt`, `verbose_json`              |
| `temperature`               | Float  | Sampling temperature (0.0-1.0)                            |
| `timestamp_granularities[]` | Array  | `word` or `segment` for timestamps                        |
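
As a sketch, the options above can be assembled into form fields before posting. The `build_fields` helper below is hypothetical (for illustration only); the field names match the table:

```python
# Hypothetical helper: assemble form fields for /v1/audio/transcriptions
# from the options in the table above. Only set fields are included.
def build_fields(model="Systran/faster-whisper-large-v3", language=None,
                 response_format="json", temperature=None,
                 word_timestamps=False):
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language  # e.g. "en", "ja", "ru"
    if temperature is not None:
        fields["temperature"] = str(temperature)
    if word_timestamps:
        fields["timestamp_granularities[]"] = "word"
    return fields

# Example: fields for a Japanese transcription with word timestamps
print(build_fields(language="ja", response_format="verbose_json",
                   word_timestamps=True))
```

With `requests`, these fields would go in `data=` alongside `files={"file": open("audio.mp3", "rb")}`.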

#### Response Formats

**JSON (default):**

```json
{"text": "Transcribed text here..."}
```

**Verbose JSON:**

```json
{
  "text": "Full transcription...",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "First segment"},
    {"start": 2.5, "end": 5.0, "text": "Second segment"}
  ],
  "language": "en"
}
```

**SRT:**

```
1
00:00:00,000 --> 00:00:02,500
First segment

2
00:00:02,500 --> 00:00:05,000
Second segment
```
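
SRT timestamps use the `HH:MM:SS,mmm` layout shown above; converting one back to seconds takes a few lines of Python (a minimal sketch, no error handling):

```python
# Parse an SRT timestamp ("HH:MM:SS,mmm") into seconds.
def srt_timestamp_to_seconds(ts: str) -> float:
    hms, millis = ts.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000

print(srt_timestamp_to_seconds("00:00:02,500"))  # 2.5
```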

## Alternative: Manual Installation

If you need more control, deploy with manual installation:

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install openai-whisper faster-whisper
```

{% hint style="info" %}
Manual installation takes 3-5 minutes. The pre-built image above is recommended for faster startup.
{% endhint %}

## Basic Usage (SSH)

```bash
ssh -p <port> root@<proxy>

# Transcribe audio file
whisper audio.mp3 --model large-v3 --device cuda

# Output to specific format
whisper audio.mp3 --model large-v3 --output_format txt

# Specify language
whisper audio.mp3 --model large-v3 --language Japanese
```

### Transcribe with Timestamps

```bash
whisper audio.mp3 --model large-v3 --word_timestamps True
```

## Upload Audio Files

```bash
# Upload single file
scp -P <port> interview.mp3 root@<proxy>:/workspace/

# Upload folder
scp -P <port> -r ./audio_files/ root@<proxy>:/workspace/
```

## Python API

```python
import whisper

# Load model (downloads on first use)
model = whisper.load_model("large-v3", device="cuda")

# Transcribe
result = model.transcribe("audio.mp3")

# Print text
print(result["text"])

# Print with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
```

## Faster-Whisper (Recommended)

Faster-Whisper is a CTranslate2 reimplementation of Whisper that runs up to 4x faster than the original with lower VRAM usage:

```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

# Load model with optimizations
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe("audio.mp3")

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

## Language Options

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

# Auto-detect language
segments, info = model.transcribe("audio.mp3")
print(f"Language: {info.language} ({info.language_probability:.0%})")

# Force specific language
segments, _ = model.transcribe("audio.mp3", language="ja")
```

## Translation to English

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
segments, _ = model.transcribe("japanese.mp3", task="translate")

for segment in segments:
    print(segment.text)
```

CLI:

```bash
whisper japanese.mp3 --model large-v3 --task translate
```

## Subtitle Generation

### SRT Format

```python
from faster_whisper import WhisperModel

def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

model = WhisperModel("large-v3", device="cuda")
segments, _ = model.transcribe("video.mp4")

with open("subtitles.srt", "w") as f:
    for i, segment in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n")
        f.write(f"{segment.text.strip()}\n\n")
```

### VTT Format

```bash
whisper video.mp4 --model large-v3 --output_format vtt
```

## Word-Level Timestamps

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s] {word.word}")
```

## Speaker Diarization

Identify who said what (requires `pyannote.audio` and a Hugging Face access token):

```bash
pip install pyannote.audio
```

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# Diarization
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
diarization_result = diarization("audio.mp3")

# Transcription
whisper = WhisperModel("large-v3", device="cuda")
segments, _ = whisper.transcribe("audio.mp3")

# Combine
for segment in segments:
    # Find speaker at segment time
    speaker = None
    for turn, _, spk in diarization_result.itertracks(yield_label=True):
        if turn.start <= segment.start <= turn.end:
            speaker = spk
            break
    print(f"[{speaker}] {segment.text}")
```

## REST API Server

Create a transcription API:

```python
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import os
import tempfile

app = FastAPI()
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        # transcribe() returns a generator - materialize it once,
        # since a second iteration would yield nothing
        segments, info = model.transcribe(tmp_path)
        segments = list(segments)
    finally:
        os.unlink(tmp_path)  # clean up the uploaded temp file

    return {
        "language": info.language,
        "text": " ".join(s.text for s in segments),
        "segments": [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in segments
        ]
    }

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```

## Performance Benchmarks

| Model    | GPU      | 1hr Audio |
| -------- | -------- | --------- |
| large-v3 | RTX 3090 | \~5 min   |
| large-v3 | RTX 4090 | \~3 min   |
| large-v3 | A100     | \~2 min   |
| medium   | RTX 3090 | \~2 min   |

## Memory-Efficient Processing

For very long audio:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Process in chunks
segments, _ = model.transcribe(
    "long_audio.mp3",
    vad_filter=True,  # Skip silence
    vad_parameters=dict(min_silence_duration_ms=500)
)
```

## Download Results

```bash
# Download transcripts
scp -P <port> -r root@<proxy>:/workspace/transcripts/ ./

# Download subtitles
scp -P <port> root@<proxy>:/workspace/subtitles.srt ./
```

## Troubleshooting

{% hint style="danger" %}
**CUDA out of memory**
{% endhint %}

* Use a smaller model (`medium` instead of `large-v3`)
* Use `compute_type="int8"` with faster-whisper
* Process shorter audio segments
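
The bullets above can be condensed into a rough selection rule. The thresholds below are guides taken from the Model Sizes table, not hard limits:

```python
# Rough model/compute_type choice by available VRAM (GB).
# Thresholds follow the Model Sizes table; treat them as guides, not limits.
def pick_config(vram_gb: float) -> tuple[str, str]:
    if vram_gb >= 10:
        return ("large-v3", "float16")
    if vram_gb >= 6:
        return ("large-v3-turbo", "float16")
    if vram_gb >= 5:
        return ("medium", "float16")
    if vram_gb >= 2:
        return ("small", "int8")
    return ("base", "int8")

print(pick_config(24))  # ('large-v3', 'float16')
print(pick_config(4))   # ('small', 'int8')
```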

### HTTP 502 on http\_pub URL

The service is still starting. Wait 1-2 minutes and retry:

```bash
curl https://your-http-pub.clorecloud.net/
```
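
The wait-and-retry can also be scripted. A minimal standard-library sketch (the `/health` path comes from the API table above; the URL is a placeholder):

```python
# Poll the server until it answers, tolerating the 502s seen during startup.
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str, timeout_s: float = 180, interval_s: float = 10) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status < 500:
                    return True
        except urllib.error.URLError:
            pass  # not up yet (connection refused, 502, DNS, ...)
        time.sleep(interval_s)
    return False

# wait_until_ready("https://your-http-pub.clorecloud.net/health")
```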

### Poor accuracy

* Use a larger model
* Specify the language explicitly: `--language English`
* Increase `beam_size` for faster-whisper

### Slow processing

* Ensure GPU is used: `nvidia-smi`
* Use faster-whisper instead of original
* Enable VAD to skip silence

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For            |
| -------- | ---- | ---------- | ------------------- |
| RTX 3060 | 12GB | $0.15–0.30 | small/medium models |
| RTX 3090 | 24GB | $0.30–1.00 | large-v3            |
| RTX 4090 | 24GB | $0.50–2.00 | large-v3, fast      |
| A100     | 40GB | $1.50–3.00 | batch processing    |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
