Whisper Transcription

Transcribe audio and video files using OpenAI's Whisper on CLORE.AI GPUs.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8GB | 16GB+ |
| VRAM | 4GB (small) | 10GB+ (large-v3) |
| Network | 200Mbps | 500Mbps+ |
| Startup Time | ~1-2 minutes | - |

What is Whisper?

OpenAI Whisper is a speech recognition model that can:

  • Transcribe audio in 99 languages

  • Translate to English

  • Generate timestamps

  • Handle noisy audio

Model Sizes

| Model | VRAM | Speed | Quality |
| --- | --- | --- | --- |
| tiny | 1GB | ~32x realtime | Basic |
| base | 1GB | ~16x realtime | Good |
| small | 2GB | ~6x realtime | Better |
| medium | 5GB | ~2x realtime | Great |
| large-v3 | 10GB | ~1x realtime | Best |

Use the pre-built Faster-Whisper server for instant deployment:

Docker Image:

Ports:

No command is needed; the server starts automatically.

Verify It's Working

After deployment, find your http_pub URL in My Orders and test:
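For example, assuming your http_pub URL is `http://203.0.113.10:8000` (a placeholder; use the actual host and port shown in My Orders), a quick check might look like:

```shell
# Placeholder URL - substitute your actual http_pub host and port
curl http://203.0.113.10:8000/health
```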


Transcribe via API

Complete API Reference (Faster-Whisper-Server)

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/v1/audio/transcriptions` | POST | Transcribe audio (OpenAI-compatible) |
| `/v1/audio/translations` | POST | Translate audio to English |
| `/v1/models` | GET | List all available models |
| `/v1/models/{model_name}` | GET | Get specific model info |
| `/api/ps` | GET | List currently loaded models |
| `/api/ps/{model_name}` | GET | Check if specific model is loaded |
| `/api/pull/{model_name}` | POST | Download and load a model |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Swagger UI documentation |
| `/openapi.json` | GET | OpenAPI specification |
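A basic transcription request against the OpenAI-compatible endpoint might look like this (the host, port, and file name are placeholders):

```shell
# Placeholder host/port - use your http_pub URL from My Orders
curl http://203.0.113.10:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3"
```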

List Available Models
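A request to list models might look like this (placeholder host/port; the exact model IDs returned depend on what the server has downloaded):

```shell
curl http://203.0.113.10:8000/v1/models
```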

Response:

Swagger Documentation

Open in browser for interactive API testing:

Transcription Options

| Parameter | Type | Description |
| --- | --- | --- |
| `file` | File | Audio file to transcribe |
| `model` | String | Model to use (default: Systran/faster-whisper-large-v3) |
| `language` | String | Force specific language (e.g., en, ja, ru) |
| `response_format` | String | json, text, srt, vtt, verbose_json |
| `temperature` | Float | Sampling temperature (0.0-1.0) |
| `timestamp_granularities[]` | Array | word or segment for timestamps |
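Combining several of the parameters above in one request (placeholder host, port, and file name):

```shell
curl http://203.0.113.10:8000/v1/audio/transcriptions \
  -F "file=@interview.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "language=en" \
  -F "response_format=srt" \
  -F "temperature=0.0"
```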

Response Formats

JSON (default):

Verbose JSON:

SRT:
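To illustrate the SRT structure, here is a minimal sketch that renders segment data (the segments below are hypothetical, not real model output) in SRT form:

```python
def to_srt(segments):
    """Render (start_s, end_s, text) segments as an SRT document."""
    def ts(seconds):
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello, world."), (2.5, 5.0, "Second line.")]))
```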

Alternative: Manual Installation

If you need more control, deploy with manual installation:

Docker Image:

Ports:

Command:


Manual installation takes 3-5 minutes. The pre-built image above is recommended for faster startup.
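A typical manual setup might look like the following (an assumed sequence, not the exact commands this guide's template uses; it presumes a CUDA-enabled PyTorch base image):

```shell
# Assumed manual install steps - adjust to your base image
apt-get update && apt-get install -y ffmpeg
pip install -U openai-whisper
whisper --help
```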

Basic Usage (SSH)

Transcribe with Timestamps
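With the openai-whisper CLI over SSH, a timestamped transcription might look like this (the file name is a placeholder; srt and vtt outputs include timestamps):

```shell
whisper meeting.mp3 --model large-v3 --output_format srt
```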

Upload Audio Files
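Assuming SSH access on a placeholder host and port, files can be copied up with scp:

```shell
# Placeholder port and host - use the SSH details from My Orders
scp -P 22222 meeting.mp3 root@203.0.113.10:/root/
```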

Python API

Faster-Whisper is roughly 4x faster than the original Whisper implementation, with lower VRAM usage:
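A minimal faster-whisper sketch (the file name and model size are assumptions; requires a CUDA GPU):

```python
from faster_whisper import WhisperModel

# compute_type trades accuracy for speed and VRAM (float16, int8, ...)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter skips silent stretches; transcribe returns a lazy segment generator
segments, info = model.transcribe("meeting.mp3", beam_size=5, vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```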

Language Options

Translation to English

CLI:
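With the openai-whisper CLI, translation to English is a task flag (placeholder file name):

```shell
whisper interview_ja.mp3 --model large-v3 --task translate
```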

Subtitle Generation

SRT Format

VTT Format

Word-Level Timestamps
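With faster-whisper, word timestamps are a flag on `transcribe` (a sketch; file name and model size are assumptions):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _ = model.transcribe("meeting.mp3", word_timestamps=True)
for seg in segments:
    # Each segment carries per-word start/end times
    for word in seg.words:
        print(f"{word.start:.2f}-{word.end:.2f}: {word.word}")
```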

Speaker Diarization

Who said what (requires pyannote):
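A rough pyannote sketch (the pipeline name is the commonly used one, treat it as an assumption; it requires a Hugging Face token and acceptance of the model's terms):

```python
from pyannote.audio import Pipeline

# Placeholder token - requires access to the gated pyannote pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",
)
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```

Pair the speaker turns with Whisper segment timestamps to label who said what.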

REST API Server

Create a transcription API:
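One way to wrap faster-whisper in a small API of your own is a FastAPI sketch like the one below (the endpoint name and defaults are choices, not a standard):

```python
import tempfile

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Spool the upload to disk so ffmpeg can read it
    with tempfile.NamedTemporaryFile(suffix=file.filename or ".audio") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        segments, info = model.transcribe(tmp.name)
        return {
            "language": info.language,
            "text": " ".join(seg.text.strip() for seg in segments),
        }

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```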

Performance Benchmarks

| Model | GPU | 1hr Audio |
| --- | --- | --- |
| large-v3 | RTX 3090 | ~5 min |
| large-v3 | RTX 4090 | ~3 min |
| large-v3 | A100 | ~2 min |
| medium | RTX 3090 | ~2 min |

Memory-Efficient Processing

For very long audio:
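One memory-friendly pattern is to split the recording into overlapping chunks and transcribe them one at a time; the overlap avoids cutting words at a chunk boundary. A sketch of the boundary arithmetic (the chunk and overlap lengths are arbitrary choices):

```python
def chunk_spans(duration_s, chunk_s=600, overlap_s=5):
    """Split a recording of duration_s seconds into overlapping
    (start, end) spans of at most chunk_s seconds each."""
    spans, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end == duration_s:
            break
        # Back up by the overlap so boundary words appear in both chunks
        start = end - overlap_s
    return spans

print(chunk_spans(1500))  # [(0.0, 600.0), (595.0, 1195.0), (1190.0, 1500)]
```

Feed each span to ffmpeg (`-ss start -to end`) and transcribe the pieces independently.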

Download Results

Troubleshooting

Out of memory
  • Use smaller model (medium instead of large)

  • Use compute_type="int8" for faster-whisper

  • Process shorter audio segments

HTTP 502 on http_pub URL

The service is still starting. Wait 1-2 minutes and retry:
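A simple wait-and-retry loop (placeholder URL):

```shell
# Poll the health endpoint every 10s until the server responds
until curl -sf http://203.0.113.10:8000/health; do
  sleep 10
done
echo "Server is up"
```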

Poor accuracy

  • Use larger model

  • Specify language: --language English

  • Increase beam_size for faster-whisper

Slow processing

  • Ensure GPU is used: nvidia-smi

  • Use faster-whisper instead of original

  • Enable VAD to skip silence

Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU | CLORE/day | Approx USD/hr | Good For |
| --- | --- | --- | --- |
| RTX 3060 | ~80 | ~$0.02 | small/medium models |
| RTX 3090 | ~150 | ~$0.03 | large-v3 |
| RTX 4090 | ~200 | ~$0.04 | large-v3, fast |
| A100 40GB | ~400 | ~$0.08 | batch processing |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates. Pay with CLORE tokens for best value.
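To turn a CLORE/day rate into USD/hour, multiply by the token price and divide by 24. A sketch, assuming an illustrative price of $0.006 per CLORE (check live prices, this is not a quote):

```python
def usd_per_hour(clore_per_day: float, clore_price_usd: float) -> float:
    """Convert a marketplace rate quoted in CLORE/day to USD/hour."""
    return clore_per_day * clore_price_usd / 24

# RTX 3090 at ~150 CLORE/day with CLORE at an assumed $0.006:
print(round(usd_per_hour(150, 0.006), 4))  # 0.0375
```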
