For the complete documentation index, see llms.txt. This page is also available as Markdown.

Whisper Transcription

Transcribe audio and video with OpenAI Whisper on Clore.ai GPUs

Transcribe audio and video files using OpenAI's Whisper on CLORE.AI GPUs.

Server Requirements

Parameter
Minimum
Recommended

RAM

8GB

16GB+

VRAM

4GB (small)

10GB+ (large-v3)

Network

200Mbps

500Mbps+

Startup Time

~1-2 minutes

-

What is Whisper?

OpenAI Whisper is a speech recognition model that can:

  • Transcribe audio in 99 languages

  • Translate to English

  • Generate timestamps

  • Handle noisy audio

Model Sizes

Model
VRAM
Speed
Quality
Notes

tiny

1GB

~32x realtime

Basic

Fastest, lowest accuracy

base

1GB

~16x realtime

Good

Good balance for quick tasks

small

2GB

~6x realtime

Better

Recommended for most use cases

medium

5GB

~2x realtime

Great

High accuracy, moderate speed

large-v3

10GB

~1x realtime

Best

Highest accuracy

large-v3-turbo

6GB

~8x realtime

Best

8x faster than large-v3, similar quality

💡 Recommendation: Use large-v3-turbo for the best speed/quality tradeoff. It delivers comparable accuracy to large-v3 at 8x the speed with lower VRAM requirements.

Using large-v3-turbo

With Faster-Whisper:


WhisperX: Enhanced Alternative

For word-level timestamps, speaker diarization, and up to 70x faster processing, consider WhisperX:

➡️ See the full WhisperX guide for speaker diarization and advanced features.

Use the pre-built Faster-Whisper server for instant deployment:

Docker Image:

Ports:

No command needed - server starts automatically.

Verify It's Working

After deployment, find your http_pub URL in My Orders and test:

Transcribe via API

Complete API Reference (Faster-Whisper-Server)

Endpoints

Endpoint
Method
Description

/v1/audio/transcriptions

POST

Transcribe audio (OpenAI-compatible)

/v1/audio/translations

POST

Translate audio to English

/v1/models

GET

List all available models

/v1/models/{model_name}

GET

Get specific model info

/api/ps

GET

List currently loaded models

/api/ps/{model_name}

GET

Check if specific model is loaded

/api/pull/{model_name}

POST

Download and load a model

/health

GET

Health check endpoint

/docs

GET

Swagger UI documentation

/openapi.json

GET

OpenAPI specification

List Available Models

Response:

Swagger Documentation

Open in browser for interactive API testing:

Transcription Options

Parameter
Type
Description

file

File

Audio file to transcribe

model

String

Model to use (default: Systran/faster-whisper-large-v3)

language

String

Force specific language (e.g., en, ja, ru)

response_format

String

json, text, srt, vtt, verbose_json

temperature

Float

Sampling temperature (0.0-1.0)

timestamp_granularities[]

Array

word or segment for timestamps

Response Formats

JSON (default):

Verbose JSON:

SRT:

Alternative: Manual Installation

If you need more control, deploy with manual installation:

Docker Image:

Ports:

Command:

Manual installation takes 3-5 minutes. The pre-built image above is recommended for faster startup.

Basic Usage (SSH)

Transcribe with Timestamps

Upload Audio Files

Python API

Faster-Whisper is 4x faster with lower VRAM usage:

Language Options

Translation to English

CLI:

Subtitle Generation

SRT Format

VTT Format

Word-Level Timestamps

Speaker Diarization

Who said what (requires pyannote):

REST API Server

Create a transcription API:

Performance Benchmarks

Model
GPU
1hr Audio

large-v3

RTX 3090

~5 min

large-v3

RTX 4090

~3 min

large-v3

A100

~2 min

medium

RTX 3090

~2 min

Memory-Efficient Processing

For very long audio:

Download Results

Troubleshooting

  • Use smaller model (medium instead of large)

  • Use compute_type="int8" for faster-whisper

  • Process shorter audio segments

HTTP 502 on http_pub URL

The service is still starting. Wait 1-2 minutes and retry:

Poor accuracy

  • Use larger model

  • Specify language: --language English

  • Increase beam_size for faster-whisper

Slow processing

  • Ensure GPU is used: nvidia-smi

  • Use faster-whisper instead of original

  • Enable VAD to skip silence

Cost Estimate

Typical CLORE.AI marketplace rates:

GPU
VRAM
Price/day
Good For

RTX 3060

12GB

$0.15–0.30

small/medium models

RTX 3090

24GB

$0.30–1.00

large-v3

RTX 4090

24GB

$0.50–2.00

large-v3, fast

A100

40GB

$1.50–3.00

batch processing

Prices in USD/day. Rates vary by provider — check CLORE.AI Marketplace for current rates.

Last updated

Was this helpful?