# WhisperX with Diarization

WhisperX extends OpenAI's Whisper with three critical upgrades: **word-level timestamps** via forced phoneme alignment, **speaker diarization** using pyannote.audio, and **up to 70× real-time speed** through batched inference with faster-whisper. It is the go-to tool for production transcription pipelines that need precise timing and speaker identification.

**GitHub:** [m-bain/whisperX](https://github.com/m-bain/whisperX) **PyPI:** [whisperx](https://pypi.org/project/whisperx/) **License:** BSD-4-Clause **Paper:** [arxiv.org/abs/2303.00747](https://arxiv.org/abs/2303.00747)

## Key Features

* **Word-level timestamps** — ±50 ms accuracy via wav2vec2 forced alignment (vs ±500 ms in vanilla Whisper)
* **Speaker diarization** — identify who said what via pyannote.audio 3.1
* **Batched inference** — up to 70× real-time speed on RTX 4090
* **VAD pre-filtering** — voice activity detection (pyannote by default, Silero optional) removes silence before transcription
* **All Whisper models** — tiny through large-v3-turbo
* **Multiple output formats** — JSON, SRT, VTT, TXT, TSV
* **Automatic language detection** — or force a specific language for faster processing

## Requirements

| Component | Minimum            | Recommended             |
| --------- | ------------------ | ----------------------- |
| GPU       | RTX 3060 12 GB     | RTX 4090 24 GB          |
| VRAM      | 4 GB (small model) | 10 GB+ (large-v3-turbo) |
| RAM       | 8 GB               | 16 GB+                  |
| Disk      | 5 GB               | 20 GB (model cache)     |
| Python    | 3.9+               | 3.11                    |
| CUDA      | 11.8+              | 12.1+                   |

**HuggingFace token required** for speaker diarization — accept the license at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1).

**Clore.ai recommendation:** RTX 3090 (\~$0.30–1.00/day) for the large-v3-turbo model with batch size 16. RTX 4090 (\~$0.50–2.00/day) for maximum throughput at batch size 32.

## Installation

```bash
# Install WhisperX
pip install whisperx

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

If you hit CUDA version conflicts:

```bash
pip install torch==2.5.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
```

## Quick Start

```python
import whisperx
import json

device = "cuda"
compute_type = "float16"  # "int8" for lower VRAM
batch_size = 16            # reduce to 4-8 if VRAM is tight

# 1. Load model
model = whisperx.load_model("large-v3-turbo", device, compute_type=compute_type)

# 2. Load and transcribe audio
audio = whisperx.load_audio("interview.mp3")
result = model.transcribe(audio, batch_size=batch_size)
print(f"Language: {result['language']}")

# 3. Align for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 4. Print results
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s → {seg['end']:.2f}s] {seg['text']}")
    for w in seg.get("words", []):
        print(f"  '{w['word']}' @ {w.get('start', 0):.2f}s")

# 5. Save
with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
```
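The aligned result is a plain dict, so post-processing needs no WhisperX APIs. As one example, here is a small helper (ours, not part of WhisperX) that flattens the result into `(word, start, end)` tuples — note that words the aligner could not time (numerals and other out-of-vocabulary tokens) can come back without `start`/`end` keys and are skipped:

```python
def flatten_words(result):
    """Flatten an aligned WhisperX-style result into (word, start, end) tuples.

    Skips words the aligner could not time (they lack start/end keys).
    """
    words = []
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            if "start" in w and "end" in w:
                words.append((w["word"], w["start"], w["end"]))
    return words

# Works on any dict with the same shape as the aligned result:
sample = {"segments": [{"words": [
    {"word": "hello", "start": 0.1, "end": 0.4},
    {"word": "42"},  # no timing from the aligner
]}]}
print(flatten_words(sample))  # [('hello', 0.1, 0.4)]
```

A flat word list like this is convenient for search indexing or clip extraction.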

## Usage Examples

### Transcription with Speaker Diarization

```python
import whisperx
import gc
import torch

device = "cuda"
HF_TOKEN = "hf_your_token_here"  # from huggingface.co/settings/tokens

# Step 1: Transcribe
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio("meeting.mp3")
result = model.transcribe(audio, batch_size=16)

# Free GPU memory before loading alignment model
del model; gc.collect(); torch.cuda.empty_cache()

# Step 2: Align
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
del model_a; gc.collect(); torch.cuda.empty_cache()

# Step 3: Diarize
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)

# Step 4: Assign speakers to words
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{speaker}] [{seg['start']:.1f}s → {seg['end']:.1f}s] {seg['text']}")
```
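pyannote often splits one speaker's continuous speech into several short segments. For a readable transcript you can merge consecutive same-speaker segments into conversational turns; this is a post-processing sketch (`merge_speaker_turns` is our helper, not WhisperX API), shown on a hand-made segment list with the same shape as the output above:

```python
def merge_speaker_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous turn: extend it.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({"speaker": speaker,
                          "start": seg["start"],
                          "end": seg["end"],
                          "text": seg["text"].strip()})
    return turns

segs = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": " Hi."},
    {"speaker": "SPEAKER_00", "start": 2.1, "end": 4.0, "text": " How are you?"},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 5.0, "text": " Fine."},
]
for t in merge_speaker_turns(segs):
    print(f"[{t['speaker']}] [{t['start']:.1f}s → {t['end']:.1f}s] {t['text']}")
```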

### Command-Line Usage

```bash
# Basic transcription
whisperx audio.mp3 --model large-v3-turbo --device cuda

# Force language (faster, skips detection)
whisperx audio.mp3 --model large-v3-turbo --language en --device cuda

# With speaker diarization
whisperx audio.mp3 --model large-v3-turbo --diarize --hf_token hf_your_token

# SRT subtitle output
whisperx audio.mp3 --model large-v3-turbo --output_format srt --output_dir ./subs/

# Low-VRAM mode
whisperx audio.mp3 --model medium --compute_type int8 --batch_size 4 --device cuda

# Batch process a directory
for f in /data/audio/*.mp3; do
  whisperx "$f" --model large-v3-turbo --output_dir /data/transcripts/
done
```


### SRT Generation Script

```python
import whisperx

def transcribe_to_srt(audio_path, output_path, model_name="large-v3-turbo"):
    device = "cuda"
    model = whisperx.load_model(model_name, device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)

    model_a, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], model_a, metadata, audio, device)

    with open(output_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_ts(seg["start"])
            end = format_ts(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

    print(f"SRT saved to {output_path}")

def format_ts(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

transcribe_to_srt("podcast.mp3", "podcast.srt")
```
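If you also ran diarization, the same loop can emit speaker-labeled cues. A sketch that factors cue generation into a helper (`srt_cues` is our name, not a WhisperX function; `format_ts` is repeated so the snippet stands alone):

```python
def format_ts(seconds):
    """SRT timestamp (HH:MM:SS,mmm) — same helper as above."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cues(segments, with_speaker=False):
    """Yield numbered SRT cue blocks; optionally prefix the speaker label."""
    for i, seg in enumerate(segments, 1):
        text = seg["text"].strip()
        if with_speaker and "speaker" in seg:
            text = f"[{seg['speaker']}] {text}"
        yield f"{i}\n{format_ts(seg['start'])} --> {format_ts(seg['end'])}\n{text}\n"

segs = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": " Hello."}]
print("\n".join(srt_cues(segs, with_speaker=True)))
```

A generator keeps cue formatting separate from file I/O, so the same cues can feed an SRT file, a VTT converter, or a terminal preview.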

## Performance Benchmarks

| Method          | Model              | 1h Audio     | GPU          | Approx. Speed |
| --------------- | ------------------ | ------------ | ------------ | ------------- |
| Vanilla Whisper | large-v3           | \~60 min     | RTX 3090     | 1×            |
| faster-whisper  | large-v3           | \~5 min      | RTX 3090     | \~12×         |
| **WhisperX**    | **large-v3-turbo** | **\~1 min**  | **RTX 3090** | **\~60×**     |
| **WhisperX**    | **large-v3-turbo** | **\~50 sec** | **RTX 4090** | **\~70×**     |

| Batch Size | Speed (RTX 4090) | VRAM  |
| ---------- | ---------------- | ----- |
| 4          | \~30× real-time  | 6 GB  |
| 8          | \~45× real-time  | 8 GB  |
| 16         | \~60× real-time  | 10 GB |
| 32         | \~70× real-time  | 14 GB |
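The VRAM figures above vary with audio length and model, so a fixed batch size can still OOM. A defensive pattern is to step down through batch sizes on failure. This is an illustrative sketch, not WhisperX API: `transcribe_with_fallback` and `FakeModel` are our names, and it catches `RuntimeError` (PyTorch's `torch.cuda.OutOfMemoryError` is a `RuntimeError` subclass) with a stand-in model so it runs without a GPU:

```python
def transcribe_with_fallback(model, audio, batch_sizes=(32, 16, 8, 4)):
    """Retry transcription with progressively smaller batch sizes on OOM.

    Catching RuntimeError covers CUDA OOM (torch.cuda.OutOfMemoryError is a
    subclass); in real use, also call torch.cuda.empty_cache() between tries.
    """
    last_err = None
    for bs in batch_sizes:
        try:
            return model.transcribe(audio, batch_size=bs)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # not an OOM — don't mask real errors
            last_err = e
    raise last_err

class FakeModel:
    """Stand-in that 'runs out of memory' above batch size 8 (demo only)."""
    def transcribe(self, audio, batch_size):
        if batch_size > 8:
            raise RuntimeError("CUDA out of memory")
        return {"batch_size": batch_size, "segments": []}

result = transcribe_with_fallback(FakeModel(), audio=None)
print(result["batch_size"])  # 8
```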

## Tips for Clore.ai Users

* **Free VRAM between steps** — delete models and call `torch.cuda.empty_cache()` between transcription, alignment, and diarization
* **HuggingFace token** — you must accept the pyannote model license before diarization works; set `HF_TOKEN` as an environment variable
* **Batch size tuning** — start with `batch_size=16`, reduce to 4–8 on 12 GB cards, increase to 32 on 24 GB cards
* **`int8` compute** — use `compute_type="int8"` to halve VRAM usage with minimal quality loss
* **Docker image** — `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`
* **Persistent model cache** — mount `/root/.cache/huggingface` to avoid re-downloading models on each container restart
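The hardcoded `HF_TOKEN` string in the diarization example above is fine for a quick test, but on rented GPUs an environment variable is safer. A minimal sketch — `get_hf_token` is our helper, not part of WhisperX:

```python
import os

def get_hf_token():
    """Read the HuggingFace token from the environment rather than hardcoding it."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "Set HF_TOKEN (create one at huggingface.co/settings/tokens) and "
            "accept the pyannote/speaker-diarization-3.1 license first."
        )
    return token

# Demo only — in practice, export HF_TOKEN in your shell or Docker env instead.
os.environ["HF_TOKEN"] = "hf_example_token"
print(get_hf_token())  # hf_example_token
```

Failing fast with an explicit message beats the opaque authorization errors pyannote raises when the token is missing or the license is unaccepted.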

## Troubleshooting

| Problem                       | Solution                                                                                               |
| ----------------------------- | ------------------------------------------------------------------------------------------------------ |
| `CUDA out of memory`          | Reduce `batch_size`, use `compute_type="int8"`, or use a smaller model (medium, small)                 |
| Diarization returns `UNKNOWN` | Ensure HuggingFace token is valid and you accepted the pyannote license                                |
| `No module named 'whisperx'`  | `pip install whisperx` — ensure no typo (it's `whisperx`, not `whisper-x`)                             |
| Poor word timestamps          | Check that `whisperx.align()` is called after `transcribe()` — raw Whisper output lacks word precision |
| Language detection wrong      | Force language with `--language en` or `language="en"` in Python API                                   |
| Slow processing               | Increase `batch_size`, use `large-v3-turbo` instead of `large-v3`, ensure GPU is not shared            |
