WhisperX with Diarization

Run WhisperX for fast speech transcription with word-level timestamps and speaker diarization on Clore.ai GPUs.

WhisperX extends OpenAI's Whisper with three critical upgrades: word-level timestamps via forced phoneme alignment, speaker diarization using pyannote.audio, and up to 70× real-time speed through batched inference with faster-whisper. It is the go-to tool for production transcription pipelines that need precise timing and speaker identification.

GitHub: m-bain/whisperX · PyPI: whisperx · License: BSD-4-Clause · Paper: arxiv.org/abs/2303.00747

Key Features

  • Word-level timestamps — ±50 ms accuracy via wav2vec2 forced alignment (vs ±500 ms in vanilla Whisper)

  • Speaker diarization — identify who said what via pyannote.audio 3.1

  • Batched inference — up to 70× real-time speed on RTX 4090

  • VAD pre-filtering — Silero VAD removes silence before transcription

  • All Whisper models — tiny through large-v3-turbo

  • Multiple output formats — JSON, SRT, VTT, TXT, TSV

  • Automatic language detection — or force a specific language for faster processing

Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3060 12 GB | RTX 4090 24 GB |
| VRAM | 4 GB (small model) | 10 GB+ (large-v3-turbo) |
| RAM | 8 GB | 16 GB+ |
| Disk | 5 GB | 20 GB (model cache) |
| Python | 3.9+ | 3.11 |
| CUDA | 11.8+ | 12.1+ |

HuggingFace token required for speaker diarization — accept the license at huggingface.co/pyannote/speaker-diarization-3.1.

Clore.ai recommendation: RTX 3090 ($0.30–1.00/day) for the large-v3-turbo model with batch size 16. RTX 4090 ($0.50–2.00/day) for maximum throughput at batch size 32.

Installation
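A typical setup inside a fresh Clore.ai container (a sketch assuming Python 3.10+ and working NVIDIA drivers; whisperx is the official PyPI package name):

```bash
# Isolate WhisperX's pinned dependencies from the system Python
python3 -m venv whisperx-env
source whisperx-env/bin/activate

# Installs whisperx plus its backends: faster-whisper, ctranslate2, pyannote.audio
pip install -U pip
pip install whisperx
```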

If you hit CUDA version conflicts:
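One common remedy, offered as a sketch rather than an official fix: reinstall the PyTorch wheels built for the CUDA runtime your driver actually supports (cu121 below is an example tag; pick the one matching your nvidia-smi output):

```bash
# Check which CUDA version the driver reports
nvidia-smi

# Reinstall PyTorch against a matching CUDA build (cu121 is an example tag)
pip install --force-reinstall torch torchaudio \
  --index-url https://download.pytorch.org/whl/cu121
```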

Quick Start
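A minimal transcribe-then-align run using the whisperx Python API from the project README (the audio path is a placeholder; float16 assumes a reasonably recent GPU):

```python
import whisperx

device = "cuda"
audio_file = "audio.mp3"  # placeholder path

# 1. Transcribe with the batched faster-whisper backend
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the output against a wav2vec2 model for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```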

Usage Examples

Transcription with Speaker Diarization
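A sketch extending the quick start with speaker labels. It assumes the pyannote license has been accepted and HF_TOKEN is exported; DiarizationPipeline lives in the whisperx.diarize module (older releases also exposed it as whisperx.DiarizationPipeline):

```python
import os

import whisperx
from whisperx.diarize import DiarizationPipeline

device = "cuda"
audio = whisperx.load_audio("meeting.wav")  # placeholder path

# Transcribe + align as in the quick start
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Diarize with pyannote.audio (requires accepted license + valid HF token)
diarize_model = DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
diarize_segments = diarize_model(audio)

# Merge speaker labels into the aligned transcript
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(f"[{seg.get('speaker', 'UNKNOWN')}] {seg['text'].strip()}")
```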

Command-Line Usage
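The same pipeline from the shell, using flags from the whisperx CLI (the token value and file name are placeholders):

```bash
export HF_TOKEN=hf_...  # placeholder; only needed for --diarize

whisperx meeting.wav \
  --model large-v3-turbo \
  --language en \
  --batch_size 16 \
  --compute_type float16 \
  --diarize \
  --hf_token "$HF_TOKEN" \
  --output_format srt \
  --output_dir ./out
```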

SRT Generation Script
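Since aligned segments carry precise start/end times, writing SRT by hand takes only a few lines. A self-contained sketch (file paths are placeholders):

```python
import whisperx

def srt_time(seconds: float) -> str:
    # SRT timestamps use the HH:MM:SS,mmm format
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

device = "cuda"
audio = whisperx.load_audio("input.wav")  # placeholder path

model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

with open("output.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(f"{seg['text'].strip()}\n\n")
```

The CLI reaches the same result with --output_format srt; a script like this is mainly useful when you want to filter or post-process segments before writing them out.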

Performance Benchmarks

| Method | Model | Time for 1 h Audio | GPU | Approx. Speed |
|---|---|---|---|---|
| Vanilla Whisper | large-v3 | ~60 min | RTX 3090 | ~1× |
| faster-whisper | large-v3 | ~5 min | RTX 3090 | ~12× |
| WhisperX | large-v3-turbo | ~1 min | RTX 3090 | ~60× |
| WhisperX | large-v3-turbo | ~50 sec | RTX 4090 | ~70× |

Throughput by batch size on an RTX 4090:

| Batch Size | Speed (RTX 4090) | VRAM |
|---|---|---|
| 4 | ~30× real-time | 6 GB |
| 8 | ~45× real-time | 8 GB |
| 16 | ~60× real-time | 10 GB |
| 32 | ~70× real-time | 14 GB |

Tips for Clore.ai Users

  • Free VRAM between steps — delete models and call torch.cuda.empty_cache() between transcription, alignment, and diarization (see the sketch after this list)

  • HuggingFace token — you must accept the pyannote model license before diarization works; set HF_TOKEN as an environment variable

  • Batch size tuning — start with batch_size=16, reduce to 4–8 on 12 GB cards, increase to 32 on 24 GB cards

  • int8 compute — use compute_type="int8" to halve VRAM usage with minimal quality loss

  • Docker image — pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime (CUDA 12.4 and cuDNN 9 preinstalled)

  • Persistent model cache — mount /root/.cache/huggingface to avoid re-downloading models on each container restart
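
The VRAM-freeing tip as a concrete pattern (a sketch; the model name and audio path are placeholders):

```python
import gc

import torch
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio("audio.mp3")  # placeholder path
result = model.transcribe(audio, batch_size=16)

# Transcription done: drop the ASR model before loading the alignment model
del model
gc.collect()
torch.cuda.empty_cache()

align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Repeat the same release pattern before loading the diarization pipeline
del align_model
gc.collect()
torch.cuda.empty_cache()
```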

Troubleshooting

| Problem | Solution |
|---|---|
| CUDA out of memory | Reduce `batch_size`, use `compute_type="int8"`, or switch to a smaller model (`medium`, `small`) |
| Diarization returns UNKNOWN | Ensure your HuggingFace token is valid and you accepted the pyannote license |
| No module named 'whisperx' | `pip install whisperx` — check the spelling (it's `whisperx`, not `whisper-x`) |
| Poor word timestamps | Call `whisperx.align()` after `transcribe()` — raw Whisper output lacks word-level precision |
| Wrong language detected | Force the language with `--language en` (CLI) or `language="en"` (Python API) |
| Slow processing | Increase `batch_size`, use `large-v3-turbo` instead of `large-v3`, ensure the GPU is not shared |
