# WhisperX with Diarization

WhisperX extends OpenAI's Whisper with three critical upgrades: **word-level timestamps** via forced phoneme alignment, **speaker diarization** using pyannote.audio, and **up to 70× real-time speed** through batched inference with faster-whisper. It is the go-to tool for production transcription pipelines that need precise timing and speaker identification.

**GitHub:** [m-bain/whisperX](https://github.com/m-bain/whisperX) **PyPI:** [whisperx](https://pypi.org/project/whisperx/) **License:** BSD-4-Clause **Paper:** [arxiv.org/abs/2303.00747](https://arxiv.org/abs/2303.00747)

## Key Features

* **Word-level timestamps** — ±50 ms accuracy via wav2vec2 forced alignment (vs ±500 ms in vanilla Whisper)
* **Speaker diarization** — identify who said what via pyannote.audio 3.1
* **Batched inference** — up to 70× real-time speed on RTX 4090
* **VAD pre-filtering** — voice activity detection (pyannote by default, Silero optional) removes silence before transcription
* **All Whisper models** — tiny through large-v3-turbo
* **Multiple output formats** — JSON, SRT, VTT, TXT, TSV
* **Automatic language detection** — or force a specific language for faster processing

## Requirements

| Component | Minimum            | Recommended             |
| --------- | ------------------ | ----------------------- |
| GPU       | RTX 3060 12 GB     | RTX 4090 24 GB          |
| VRAM      | 4 GB (small model) | 10 GB+ (large-v3-turbo) |
| RAM       | 8 GB               | 16 GB+                  |
| Disk      | 5 GB               | 20 GB (model cache)     |
| Python    | 3.9+               | 3.11                    |
| CUDA      | 11.8+              | 12.1+                   |

**HuggingFace token required** for speaker diarization — accept the license at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1).

**Clore.ai recommendation:** RTX 3090 (\~$0.30–1.00/day) for the large-v3-turbo model with batch size 16. RTX 4090 (\~$0.50–2.00/day) for maximum throughput at batch size 32.

## Installation

```bash
# Install WhisperX
pip install whisperx

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

If you hit CUDA version conflicts:

```bash
pip install torch==2.5.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
```

## Quick Start

```python
import whisperx
import json

device = "cuda"
compute_type = "float16"  # "int8" for lower VRAM
batch_size = 16            # reduce to 4-8 if VRAM is tight

# 1. Load model
model = whisperx.load_model("large-v3-turbo", device, compute_type=compute_type)

# 2. Load and transcribe audio
audio = whisperx.load_audio("interview.mp3")
result = model.transcribe(audio, batch_size=batch_size)
print(f"Language: {result['language']}")

# 3. Align for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 4. Print results
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s → {seg['end']:.2f}s] {seg['text']}")
    for w in seg.get("words", []):
        print(f"  '{w['word']}' @ {w.get('start', 0):.2f}s")

# 5. Save
with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
```
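The aligned result is a plain dict, so post-processing needs no WhisperX APIs. As one example, here is a small helper (ours, not part of WhisperX) that flattens the result into `(word, start, end)` tuples — note that words the aligner could not time (numerals and other out-of-vocabulary tokens) can come back without `start`/`end` keys and are skipped:

```python
def flatten_words(result):
    """Flatten an aligned WhisperX-style result into (word, start, end) tuples.

    Skips words the aligner could not time (they lack start/end keys).
    """
    words = []
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            if "start" in w and "end" in w:
                words.append((w["word"], w["start"], w["end"]))
    return words

# Works on any dict with the same shape as the aligned result:
sample = {"segments": [{"words": [
    {"word": "hello", "start": 0.1, "end": 0.4},
    {"word": "42"},  # no timing from the aligner
]}]}
print(flatten_words(sample))  # [('hello', 0.1, 0.4)]
```

A flat word list like this is convenient for search indexing or clip extraction.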

## Usage Examples

### Transcription with Speaker Diarization

```python
import whisperx
import gc
import torch

device = "cuda"
HF_TOKEN = "hf_your_token_here"  # from huggingface.co/settings/tokens

# Step 1: Transcribe
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio("meeting.mp3")
result = model.transcribe(audio, batch_size=16)

# Free GPU memory before loading alignment model
del model; gc.collect(); torch.cuda.empty_cache()

# Step 2: Align
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
del model_a; gc.collect(); torch.cuda.empty_cache()

# Step 3: Diarize
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)

# Step 4: Assign speakers to words
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{speaker}] [{seg['start']:.1f}s → {seg['end']:.1f}s] {seg['text']}")
```
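pyannote often splits one speaker's continuous speech into several short segments. For a readable transcript you can merge consecutive same-speaker segments into conversational turns; this is a post-processing sketch (`merge_speaker_turns` is our helper, not WhisperX API), shown on a hand-made segment list with the same shape as the output above:

```python
def merge_speaker_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous turn: extend it.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({"speaker": speaker,
                          "start": seg["start"],
                          "end": seg["end"],
                          "text": seg["text"].strip()})
    return turns

segs = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": " Hi."},
    {"speaker": "SPEAKER_00", "start": 2.1, "end": 4.0, "text": " How are you?"},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 5.0, "text": " Fine."},
]
for t in merge_speaker_turns(segs):
    print(f"[{t['speaker']}] [{t['start']:.1f}s → {t['end']:.1f}s] {t['text']}")
```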

### Command-Line Usage

```bash
# Basic transcription
whisperx audio.mp3 --model large-v3-turbo --device cuda

# Force language (faster, skips detection)
whisperx audio.mp3 --model large-v3-turbo --language en --device cuda

# With speaker diarization
whisperx audio.mp3 --model large-v3-turbo --diarize --hf_token hf_your_token

# SRT subtitle output
whisperx audio.mp3 --model large-v3-turbo --output_format srt --output_dir ./subs/

# Low-VRAM mode
whisperx audio.mp3 --model medium --compute_type int8 --batch_size 4 --device cuda

# Batch process a directory
for f in /data/audio/*.mp3; do
  whisperx "$f" --model large-v3-turbo --output_dir /data/transcripts/
done
```


### SRT Generation Script

```python
import whisperx

def transcribe_to_srt(audio_path, output_path, model_name="large-v3-turbo"):
    device = "cuda"
    model = whisperx.load_model(model_name, device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)

    model_a, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], model_a, metadata, audio, device)

    with open(output_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_ts(seg["start"])
            end = format_ts(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

    print(f"SRT saved to {output_path}")

def format_ts(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

transcribe_to_srt("podcast.mp3", "podcast.srt")
```
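If you also ran diarization, the same loop can emit speaker-labeled cues. A sketch that factors cue generation into a helper (`srt_cues` is our name, not a WhisperX function; `format_ts` is repeated so the snippet stands alone):

```python
def format_ts(seconds):
    """SRT timestamp (HH:MM:SS,mmm) — same helper as above."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cues(segments, with_speaker=False):
    """Yield numbered SRT cue blocks; optionally prefix the speaker label."""
    for i, seg in enumerate(segments, 1):
        text = seg["text"].strip()
        if with_speaker and "speaker" in seg:
            text = f"[{seg['speaker']}] {text}"
        yield f"{i}\n{format_ts(seg['start'])} --> {format_ts(seg['end'])}\n{text}\n"

segs = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": " Hello."}]
print("\n".join(srt_cues(segs, with_speaker=True)))
```

A generator keeps cue formatting separate from file I/O, so the same cues can feed an SRT file, a VTT converter, or a terminal preview.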

## Performance Benchmarks

| Method          | Model              | 1h Audio     | GPU          | Approx. Speed |
| --------------- | ------------------ | ------------ | ------------ | ------------- |
| Vanilla Whisper | large-v3           | \~60 min     | RTX 3090     | 1×            |
| faster-whisper  | large-v3           | \~5 min      | RTX 3090     | \~12×         |
| **WhisperX**    | **large-v3-turbo** | **\~1 min**  | **RTX 3090** | **\~60×**     |
| **WhisperX**    | **large-v3-turbo** | **\~50 sec** | **RTX 4090** | **\~70×**     |

| Batch Size | Speed (RTX 4090) | VRAM  |
| ---------- | ---------------- | ----- |
| 4          | \~30× real-time  | 6 GB  |
| 8          | \~45× real-time  | 8 GB  |
| 16         | \~60× real-time  | 10 GB |
| 32         | \~70× real-time  | 14 GB |
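The VRAM figures above vary with audio length and model, so a fixed batch size can still OOM. A defensive pattern is to step down through batch sizes on failure. This is an illustrative sketch, not WhisperX API: `transcribe_with_fallback` and `FakeModel` are our names, and it catches `RuntimeError` (PyTorch's `torch.cuda.OutOfMemoryError` is a `RuntimeError` subclass) with a stand-in model so it runs without a GPU:

```python
def transcribe_with_fallback(model, audio, batch_sizes=(32, 16, 8, 4)):
    """Retry transcription with progressively smaller batch sizes on OOM.

    Catching RuntimeError covers CUDA OOM (torch.cuda.OutOfMemoryError is a
    subclass); in real use, also call torch.cuda.empty_cache() between tries.
    """
    last_err = None
    for bs in batch_sizes:
        try:
            return model.transcribe(audio, batch_size=bs)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # not an OOM — don't mask real errors
            last_err = e
    raise last_err

class FakeModel:
    """Stand-in that 'runs out of memory' above batch size 8 (demo only)."""
    def transcribe(self, audio, batch_size):
        if batch_size > 8:
            raise RuntimeError("CUDA out of memory")
        return {"batch_size": batch_size, "segments": []}

result = transcribe_with_fallback(FakeModel(), audio=None)
print(result["batch_size"])  # 8
```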

## Tips for Clore.ai Users

* **Free VRAM between steps** — delete models and call `torch.cuda.empty_cache()` between transcription, alignment, and diarization
* **HuggingFace token** — you must accept the pyannote model license before diarization works; set `HF_TOKEN` as an environment variable
* **Batch size tuning** — start with `batch_size=16`, reduce to 4–8 on 12 GB cards, increase to 32 on 24 GB cards
* **`int8` compute** — use `compute_type="int8"` to halve VRAM usage with minimal quality loss
* **Docker image** — `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`
* **Persistent model cache** — mount `/root/.cache/huggingface` to avoid re-downloading models on each container restart
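The hardcoded `HF_TOKEN` string in the diarization example above is fine for a quick test, but on rented GPUs an environment variable is safer. A minimal sketch — `get_hf_token` is our helper, not part of WhisperX:

```python
import os

def get_hf_token():
    """Read the HuggingFace token from the environment rather than hardcoding it."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "Set HF_TOKEN (create one at huggingface.co/settings/tokens) and "
            "accept the pyannote/speaker-diarization-3.1 license first."
        )
    return token

# Demo only — in practice, export HF_TOKEN in your shell or Docker env instead.
os.environ["HF_TOKEN"] = "hf_example_token"
print(get_hf_token())  # hf_example_token
```

Failing fast with an explicit message beats the opaque authorization errors pyannote raises when the token is missing or the license is unaccepted.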

## Troubleshooting

| Problem                       | Solution                                                                                               |
| ----------------------------- | ------------------------------------------------------------------------------------------------------ |
| `CUDA out of memory`          | Reduce `batch_size`, use `compute_type="int8"`, or use a smaller model (medium, small)                 |
| Diarization returns `UNKNOWN` | Ensure HuggingFace token is valid and you accepted the pyannote license                                |
| `No module named 'whisperx'`  | `pip install whisperx` — ensure no typo (it's `whisperx`, not `whisper-x`)                             |
| Poor word timestamps          | Check that `whisperx.align()` is called after `transcribe()` — raw Whisper output lacks word precision |
| Language detection wrong      | Force language with `--language en` or `language="en"` in Python API                                   |
| Slow processing               | Increase `batch_size`, use `large-v3-turbo` instead of `large-v3`, ensure GPU is not shared            |
