# WhisperX with Diarization

WhisperX extends OpenAI's Whisper with three critical upgrades: **word-level timestamps** via forced phoneme alignment, **speaker diarization** using pyannote.audio, and **up to 70× real-time speed** through batched inference with faster-whisper. It is the go-to tool for production transcription pipelines that need precise timing and speaker identification.

**GitHub:** [m-bain/whisperX](https://github.com/m-bain/whisperX) **PyPI:** [whisperx](https://pypi.org/project/whisperx/) **License:** BSD-4-Clause **Paper:** [arxiv.org/abs/2303.00747](https://arxiv.org/abs/2303.00747)

## Key Features

* **Word-level timestamps** — ±50 ms accuracy via wav2vec2 forced alignment (vs ±500 ms in vanilla Whisper)
* **Speaker diarization** — identify who said what via pyannote.audio 3.1
* **Batched inference** — up to 70× real-time speed on RTX 4090
* **VAD pre-filtering** — voice activity detection (pyannote or Silero) cuts out silence before transcription
* **All Whisper models** — tiny through large-v3-turbo
* **Multiple output formats** — JSON, SRT, VTT, TXT, TSV
* **Automatic language detection** — or force a specific language for faster processing

## Requirements

| Component | Minimum            | Recommended             |
| --------- | ------------------ | ----------------------- |
| GPU       | RTX 3060 12 GB     | RTX 4090 24 GB          |
| VRAM      | 4 GB (small model) | 10 GB+ (large-v3-turbo) |
| RAM       | 8 GB               | 16 GB+                  |
| Disk      | 5 GB               | 20 GB (model cache)     |
| Python    | 3.9+               | 3.11                    |
| CUDA      | 11.8+              | 12.1+                   |

**HuggingFace token required** for speaker diarization — accept the license at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1).
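
One way to set the token up (note that the diarization pipeline also pulls the gated [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) model, so accept that license as well if prompted):

```bash
# Create a read-access token at huggingface.co/settings/tokens, then either
# log in interactively (stores the token under ~/.cache/huggingface):
huggingface-cli login

# ...or export it for the current session (the value below is a placeholder):
export HF_TOKEN=hf_your_token_here
```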

**Clore.ai recommendation:** RTX 3090 (~$0.30–1.00/day) for the large-v3-turbo model with batch size 16. RTX 4090 (~$0.50–2.00/day) for maximum throughput at batch size 32.

## Installation

```bash
# Install WhisperX
pip install whisperx

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

If you hit CUDA version conflicts:

```bash
pip install torch==2.5.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
```
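
To confirm which CUDA build of PyTorch is actually installed:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```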

## Quick Start

```python
import whisperx
import json

device = "cuda"
compute_type = "float16"  # "int8" for lower VRAM
batch_size = 16            # reduce to 4-8 if VRAM is tight

# 1. Load model
model = whisperx.load_model("large-v3-turbo", device, compute_type=compute_type)

# 2. Load and transcribe audio
audio = whisperx.load_audio("interview.mp3")
result = model.transcribe(audio, batch_size=batch_size)
print(f"Language: {result['language']}")

# 3. Align for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 4. Print results
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s → {seg['end']:.2f}s] {seg['text']}")
    for w in seg.get("words", []):
        print(f"  '{w['word']}' @ {w.get('start', 0):.2f}s")

# 5. Save
with open("transcript.json", "w") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
```
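
If you need the word timings in a flat format, a short loop over the aligned result does the job. This continues from the script above and assumes each aligned word dict carries `word`, `start`, and `end` keys (words the aligner cannot time may lack timestamps, hence the guard):

```python
import csv

# Flatten word-level timestamps into a CSV alongside the JSON transcript
with open("words.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "start", "end"])
    for seg in result["segments"]:
        for w in seg.get("words", []):
            if "start" in w and "end" in w:  # skip words without timestamps
                writer.writerow([w["word"], f"{w['start']:.2f}", f"{w['end']:.2f}"])
```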

## Usage Examples

### Transcription with Speaker Diarization

```python
import whisperx
import gc
import torch

device = "cuda"
HF_TOKEN = "hf_your_token_here"  # from huggingface.co/settings/tokens

# Step 1: Transcribe
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio("meeting.mp3")
result = model.transcribe(audio, batch_size=16)

# Free GPU memory before loading alignment model
del model; gc.collect(); torch.cuda.empty_cache()

# Step 2: Align
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
del model_a; gc.collect(); torch.cuda.empty_cache()

# Step 3: Diarize
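# Note: depending on the installed WhisperX version, this class may be
# exposed as whisperx.diarize.DiarizationPipeline instead.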
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)

# Step 4: Assign speakers to words
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{speaker}] [{seg['start']:.1f}s → {seg['end']:.1f}s] {seg['text']}")
```
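
For readable transcripts you usually want consecutive segments from the same speaker collapsed into single turns. A minimal sketch (the `turns` structure and the merging rule here are illustrative, not part of the WhisperX API):

```python
# Merge consecutive same-speaker segments into speaker turns
turns = []
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    if turns and turns[-1]["speaker"] == speaker:
        turns[-1]["text"] += " " + seg["text"].strip()
        turns[-1]["end"] = seg["end"]
    else:
        turns.append({
            "speaker": speaker,
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        })

for t in turns:
    print(f"{t['speaker']} ({t['start']:.1f}s–{t['end']:.1f}s): {t['text']}")
```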

### Command-Line Usage

```bash
# Basic transcription
whisperx audio.mp3 --model large-v3-turbo --device cuda

# Force language (faster, skips detection)
whisperx audio.mp3 --model large-v3-turbo --language en --device cuda

# With speaker diarization
whisperx audio.mp3 --model large-v3-turbo --diarize --hf_token hf_your_token

# SRT subtitle output
whisperx audio.mp3 --model large-v3-turbo --output_format srt --output_dir ./subs/

# Low-VRAM mode
whisperx audio.mp3 --model medium --compute_type int8 --batch_size 4 --device cuda

# Batch process a directory
for f in /data/audio/*.mp3; do
  whisperx "$f" --model large-v3-turbo --output_dir /data/transcripts/
done
```

### SRT Generation Script

```python
import whisperx

def transcribe_to_srt(audio_path, output_path, model_name="large-v3-turbo"):
    device = "cuda"
    model = whisperx.load_model(model_name, device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)

    model_a, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], model_a, metadata, audio, device)

    with open(output_path, "w") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_ts(seg["start"])
            end = format_ts(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

    print(f"SRT saved to {output_path}")

def format_ts(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

transcribe_to_srt("podcast.mp3", "podcast.srt")
```

## Performance Benchmarks

| Method          | Model              | 1h Audio     | GPU          | Approx. Speed |
| --------------- | ------------------ | ------------ | ------------ | ------------- |
| Vanilla Whisper | large-v3           | \~60 min     | RTX 3090     | 1×            |
| faster-whisper  | large-v3           | \~5 min      | RTX 3090     | \~12×         |
| **WhisperX**    | **large-v3-turbo** | **\~1 min**  | **RTX 3090** | **\~60×**     |
| **WhisperX**    | **large-v3-turbo** | **\~50 sec** | **RTX 4090** | **\~70×**     |

| Batch Size | Speed (RTX 4090) | VRAM  |
| ---------- | ---------------- | ----- |
| 4          | \~30× real-time  | 6 GB  |
| 8          | \~45× real-time  | 8 GB  |
| 16         | \~60× real-time  | 10 GB |
| 32         | \~70× real-time  | 14 GB |
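
If you don't know a rented card's free VRAM in advance, one pragmatic pattern is to start with a large batch and back off on out-of-memory errors. A sketch assuming a simple halving strategy (this is not a built-in WhisperX feature):

```python
import gc
import torch

def transcribe_with_backoff(model, audio, batch_size=32):
    """Retry transcription with a halved batch size after each CUDA OOM."""
    while batch_size >= 1:
        try:
            return model.transcribe(audio, batch_size=batch_size)
        except torch.cuda.OutOfMemoryError:
            gc.collect()
            torch.cuda.empty_cache()
            batch_size //= 2
            print(f"OOM: retrying with batch_size={batch_size}")
    raise RuntimeError("out of memory even at batch_size=1")
```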

## Tips for Clore.ai Users

* **Free VRAM between steps** — delete models and call `torch.cuda.empty_cache()` between transcription, alignment, and diarization
* **HuggingFace token** — you must accept the pyannote model license before diarization works; set `HF_TOKEN` as an environment variable
* **Batch size tuning** — start with `batch_size=16`, reduce to 4–8 on 12 GB cards, increase to 32 on 24 GB cards
* **`int8` compute** — use `compute_type="int8"` to halve VRAM usage with minimal quality loss
* **Docker image** — `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime` is a solid base
* **Persistent model cache** — mount `/root/.cache/huggingface` to avoid re-downloading models on each container restart; see the example `docker run` below
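
A typical launch combining the last two tips (the host paths are placeholders):

```bash
docker run --gpus all -it \
  -v /host/hf-cache:/root/.cache/huggingface \
  -v /host/audio:/data \
  pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime \
  bash -c "pip install whisperx && whisperx /data/audio.mp3 --model large-v3-turbo --output_dir /data"
```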

## Troubleshooting

| Problem                       | Solution                                                                                               |
| ----------------------------- | ------------------------------------------------------------------------------------------------------ |
| `CUDA out of memory`          | Reduce `batch_size`, use `compute_type="int8"`, or use a smaller model (medium, small)                 |
| Diarization returns `UNKNOWN` | Ensure HuggingFace token is valid and you accepted the pyannote license                                |
| `No module named 'whisperx'`  | `pip install whisperx` — ensure no typo (it's `whisperx`, not `whisper-x`)                             |
| Poor word timestamps          | Check that `whisperx.align()` is called after `transcribe()` — raw Whisper output lacks word precision |
| Language detection wrong      | Force language with `--language en` or `language="en"` in Python API                                   |
| Slow processing               | Increase `batch_size`, use `large-v3-turbo` instead of `large-v3`, ensure GPU is not shared            |


