# WhisperX with Speaker Diarization

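WhisperX extends OpenAI's Whisper with three key upgrades: **word-level timestamps** via forced phoneme alignment, **speaker diarization** via pyannote.audio, and **up to 70× real-time speed** via batched inference with faster-whisper. It is the go-to tool for production transcription pipelines that need precise timing and speaker identification.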

**GitHub:** [m-bain/whisperX](https://github.com/m-bain/whisperX) **PyPI:** [whisperx](https://pypi.org/project/whisperx/) **License:** BSD-4-Clause **Paper:** [arxiv.org/abs/2303.00747](https://arxiv.org/abs/2303.00747)

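## Key Features

* **Word-level timestamps**: ±50 ms precision via wav2vec2 forced alignment (versus ±500 ms for vanilla Whisper)
* **Speaker diarization**: identifies who said what, using pyannote.audio 3.1
* **Batched inference**: up to 70× real-time speed on an RTX 4090
* **VAD pre-filtering**: Silero VAD removes silent stretches before transcription
* **All Whisper models supported**: from tiny to large-v3-turbo
* **Multiple output formats**: JSON, SRT, VTT, TXT, TSV
* **Automatic language detection**: or force a specific language to speed up processing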

## Requirements

| Component | Minimum             | Recommended             |
| --------- | ------------------- | ----------------------- |
| GPU       | RTX 3060 12 GB      | RTX 4090 24 GB          |
| VRAM      | 4 GB (small models) | 10 GB+ (large-v3-turbo) |
| RAM       | 8 GB                | 16 GB+                  |
| Disk      | 5 GB                | 20 GB (model cache)     |
| Python    | 3.9+                | 3.11                    |
| CUDA      | 11.8+               | 12.1+                   |

**A HuggingFace token is required** for speaker diarization. Accept the model license at [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) before using it.
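One way to keep the token out of source code is to read it from the environment. The sketch below assumes the variable is called `HF_TOKEN`; the helper name `get_hf_token` is hypothetical and not part of WhisperX.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the HuggingFace token stored in an environment variable.

    Raises RuntimeError with a hint when the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; create a token at "
            "huggingface.co/settings/tokens and export it first."
        )
    return token
```

The returned string can then be passed wherever the examples on this page use a hard-coded `HF_TOKEN` value.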

**Clore.ai recommendation:** an RTX 3090 (\~$0.30–1.00/day) for the large-v3-turbo model at batch size 16. An RTX 4090 (\~$0.50–2.00/day) for maximum throughput at batch size 32.

## Installation

```bash
# Install WhisperX
pip install whisperx

# Verify that the GPU is visible
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

If you run into CUDA version conflicts:

```bash
pip install torch==2.5.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
```

## Quick Start

```python
import whisperx
import json

device = "cuda"
compute_type = "float16"  # use "int8" for lower VRAM usage
batch_size = 16            # drop to 4-8 if VRAM is tight

# 1. Load the model
model = whisperx.load_model("large-v3-turbo", device, compute_type=compute_type)

# 2. Load and transcribe the audio
audio = whisperx.load_audio("interview.mp3")
result = model.transcribe(audio, batch_size=batch_size)
print(f"Language: {result['language']}")

# 3. Align to get word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 4. Print the results
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s → {seg['end']:.2f}s] {seg['text']}")
    for w in seg.get("words", []):
        print(f"  '{w['word']}' @ {w.get('start', 0):.2f}s")

# 5. Save (UTF-8, because ensure_ascii=False writes non-ASCII text as-is)
with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
```
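The aligned result is a dictionary of segments, each carrying a `words` list. Some word entries may come back without `start`/`end` keys, which is why the loop above guards with `.get`. As a minimal sketch, the hypothetical helper below flattens the structure into one list and skips untimed words; the sample data is invented for illustration and only mirrors the keys used in the quick start.

```python
def flatten_words(result: dict) -> list[dict]:
    """Collect every timed word from an aligned result into one flat list."""
    words = []
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            # Skip entries that carry no timing information
            if "start" not in w or "end" not in w:
                continue
            words.append({"word": w["word"], "start": w["start"], "end": w["end"]})
    return words


# Invented sample mirroring the shape printed in the quick start
sample = {
    "segments": [
        {
            "start": 0.0, "end": 1.2, "text": " Hello 2024 world",
            "words": [
                {"word": "Hello", "start": 0.0, "end": 0.4},
                {"word": "2024"},  # no timestamps on this entry
                {"word": "world", "start": 0.8, "end": 1.2},
            ],
        }
    ]
}

timed = flatten_words(sample)
```

A flat list like this is convenient for building karaoke-style captions or searching for the time at which a word was spoken.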

## Usage Examples

### Transcription with Speaker Diarization

```python
import whisperx
import gc
import torch

device = "cuda"
HF_TOKEN = "hf_your_token_here"  # from huggingface.co/settings/tokens

# Step 1: transcribe
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio("meeting.mp3")
result = model.transcribe(audio, batch_size=16)

# Free GPU memory before loading the alignment model
del model; gc.collect(); torch.cuda.empty_cache()

# Step 2: align
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
del model_a; gc.collect(); torch.cuda.empty_cache()

# Step 3: speaker diarization
# Assumption: some newer WhisperX releases may expose this class as
# whisperx.diarize.DiarizationPipeline instead; check your installed version.
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)

# Step 4: assign speakers to words
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{speaker}] [{seg['start']:.1f}s → {seg['end']:.1f}s] {seg['text']}")
```
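Diarized output often contains several consecutive segments from the same speaker. A minimal sketch of merging them into speaker turns follows; `merge_speaker_turns` is a hypothetical helper, and the sample segments are invented to match the keys printed above.

```python
def merge_speaker_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments that share a speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"],
                 "end": seg["end"], "text": text}
            )
    return turns


# Invented sample segments
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": " Hi there."},
    {"speaker": "SPEAKER_00", "start": 2.1, "end": 4.0, "text": " How are you?"},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 5.0, "text": " Fine."},
]
turns = merge_speaker_turns(segments)
```

Segments with no `speaker` key are grouped under `UNKNOWN`, matching the fallback used in the example above.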

### Command-Line Usage

```bash
# Basic transcription
whisperx audio.mp3 --model large-v3-turbo --device cuda

# Force the language (faster, skips detection)
whisperx audio.mp3 --model large-v3-turbo --language en --device cuda

# Enable speaker diarization
whisperx audio.mp3 --model large-v3-turbo --diarize --hf_token hf_your_token

# Write SRT subtitles
whisperx audio.mp3 --model large-v3-turbo --output_format srt --output_dir ./subs/

# Low-VRAM mode
whisperx audio.mp3 --model medium --compute_type int8 --batch_size 4 --device cuda

# Batch-process a directory
for f in /data/audio/*.mp3; do
  whisperx "$f" --model large-v3-turbo --output_dir /data/transcripts/
done
```
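The directory loop can also be driven from Python when you want to record which files failed. This is a sketch only: it shells out to the same `whisperx` command with the flags shown above, and the helper names `build_whisperx_cmd` and `transcribe_directory` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_whisperx_cmd(audio: Path, output_dir: Path,
                       model: str = "large-v3-turbo",
                       language: Optional[str] = None) -> list[str]:
    """Build the argument list for one whisperx CLI call."""
    cmd = ["whisperx", str(audio), "--model", model,
           "--output_dir", str(output_dir)]
    if language:
        cmd += ["--language", language]  # skips language detection
    return cmd


def transcribe_directory(audio_dir: str, output_dir: str) -> list[Path]:
    """Run whisperx on every .mp3 in audio_dir; return the files that failed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for audio in sorted(Path(audio_dir).glob("*.mp3")):
        proc = subprocess.run(build_whisperx_cmd(audio, out))
        if proc.returncode != 0:
            failed.append(audio)
    return failed
```

Note that each CLI call reloads the model, so for large batches the Python API with a single loaded model is likely to be faster.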

### SRT Generation Script

```python
import whisperx

def transcribe_to_srt(audio_path, output_path, model_name="large-v3-turbo"):
    device = "cuda"
    model = whisperx.load_model(model_name, device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)

    model_a, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], model_a, metadata, audio, device)

    # Write UTF-8 so non-ASCII transcripts are saved correctly
    with open(output_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_ts(seg["start"])
            end = format_ts(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

    print(f"SRT saved to {output_path}")

def format_ts(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

transcribe_to_srt("podcast.mp3", "podcast.srt")
```
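The `format_ts` helper above truncates the fractional part, so floating-point error can land a timestamp one millisecond low (2.3 s, for example, comes out as `,299`). A possible alternative, shown as a sketch with the hypothetical name `format_srt_timestamp`, rounds to whole milliseconds first and then splits with integer arithmetic.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm), rounded to the ms."""
    total_ms = max(0, round(seconds * 1000))  # clamp negatives to zero
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"
```

It is a drop-in replacement for `format_ts` in the script above.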

## Performance Benchmarks

| Method          | Model              | Time for 1 hour of audio | Approx. speed |
| --------------- | ------------------ | ------------------------ | ------------- |
| Vanilla Whisper | large-v3           | \~60 min                 | 1×            |
| faster-whisper  | large-v3           | \~5 min                  | \~12×         |
| **WhisperX**    | **large-v3-turbo** | **\~1 min**              | **\~60×**     |
| **WhisperX**    | **large-v3-turbo** | **\~50 s**               | **\~70×**     |
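The speed column is a real-time factor: processing time is roughly the audio duration divided by that factor. A small sketch of the arithmetic, using a hypothetical helper name:

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimate processing time from audio length and a real-time speed factor."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


one_hour = 3600
at_12x = processing_seconds(one_hour, 12)  # 300 s, i.e. 5 minutes
at_60x = processing_seconds(one_hour, 60)  # 60 s, i.e. 1 minute
at_70x = processing_seconds(one_hour, 70)  # a little over 51 s
```

These are estimates only; model loading, alignment, and diarization add time on top of the transcription step.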

| Batch size | Speed (RTX 4090)  | VRAM  |
| ---------- | ----------------- | ----- |
| 4          | \~30× real time   | 6 GB  |
| 8          | \~45× real time   | 8 GB  |
| 16         | \~60× real time   | 10 GB |
| 32         | \~70× real time   | 14 GB |
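The table can be turned into a simple lookup for choosing a batch size from the VRAM on the card. This is a sketch under assumptions: the VRAM figures are copied from the table above, while the 4 GB default headroom (for the alignment and diarization models, CUDA context, and so on) and the name `pick_batch_size` are illustrative choices, not measured values.

```python
# (batch_size, approx. VRAM in GB) pairs copied from the table above
BATCH_VRAM_GB = [(4, 6), (8, 8), (16, 10), (32, 14)]


def pick_batch_size(total_vram_gb: float, headroom_gb: float = 4.0) -> int:
    """Return the largest tabulated batch size that fits after reserving headroom.

    headroom_gb is an assumed safety margin, not a measured requirement.
    """
    usable = total_vram_gb - headroom_gb
    best = None
    for batch, needed in BATCH_VRAM_GB:
        if needed <= usable:
            best = batch
    if best is None:
        raise ValueError(
            "Not enough VRAM for any tabulated batch size; "
            "consider a smaller model or compute_type='int8'."
        )
    return best
```

With the default headroom this yields 8 on a 12 GB card and 32 on a 24 GB card; treat the result as a starting point and adjust if you hit out-of-memory errors.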

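## Tips for Clore.ai Users

* **Free VRAM between steps**: delete the model and call `torch.cuda.empty_cache()` between transcription, alignment, and diarization
* **HuggingFace token**: you must accept the pyannote model license before diarization will work; set `HF_TOKEN` as an environment variable
* **Batch size tuning**: start with `batch_size=16`, drop to 4–8 on 12 GB cards, and raise to 32 on 24 GB cards
* **`int8` compute**: use `compute_type="int8"` to halve VRAM usage with minimal quality loss
* **Docker image**: `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime`
* **Persistent model cache**: mount `/root/.cache/huggingface` to avoid re-downloading models on every container restart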

## Troubleshooting

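| Problem                          | Solution                                                                                                              |
| -------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory`             | Reduce `batch_size`, use `compute_type="int8"`, or switch to a smaller model (medium, small)                          |
| Diarization returns `UNKNOWN`    | Make sure the HuggingFace token is valid and that you have accepted the pyannote license                              |
| `No module named 'whisperx'`     | Run `pip install whisperx` and check the spelling: the package is `whisperx`, not `whisper-x`                         |
| Inaccurate word-level timestamps | Check that `whisperx.align()` is called after `transcribe()`; vanilla Whisper output lacks word-level precision       |
| Wrong language detected          | Force the language with `--language en` on the command line or `language="en"` in the Python API                      |
| Slow processing                  | Increase `batch_size`, use `large-v3-turbo` instead of `large-v3`, and make sure the GPU is not shared                |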

