# Whisper Transcription

Transcribe audio and video files using OpenAI's Whisper on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum       | Recommended      |
| ------------ | ------------- | ---------------- |
| RAM          | 8GB           | 16GB+            |
| VRAM         | 4GB (small)   | 10GB+ (large-v3) |
| Network      | 200Mbps       | 500Mbps+         |
| Startup Time | \~1-2 minutes | -                |

## What is Whisper?

OpenAI Whisper is a speech recognition model that can:

* Transcribe audio in 99 languages
* Translate to English
* Generate timestamps
* Handle noisy audio

## Model Sizes

| Model              | VRAM    | Speed             | Quality  | Notes                                        |
| ------------------ | ------- | ----------------- | -------- | -------------------------------------------- |
| tiny               | 1GB     | \~32x realtime    | Basic    | Fastest, lowest accuracy                     |
| base               | 1GB     | \~16x realtime    | Good     | Good balance for quick tasks                 |
| small              | 2GB     | \~6x realtime     | Better   | Recommended for most use cases               |
| medium             | 5GB     | \~2x realtime     | Great    | High accuracy, moderate speed                |
| large-v3           | 10GB    | \~1x realtime     | Best     | Highest accuracy                             |
| **large-v3-turbo** | **6GB** | **\~8x realtime** | **Best** | **8x faster than large-v3, similar quality** |

> **💡 Recommendation:** Use `large-v3-turbo` for the best speed/quality tradeoff. It delivers comparable accuracy to `large-v3` at 8x the speed with lower VRAM requirements.

### Using large-v3-turbo

```python
import whisper

# Load large-v3-turbo (8x faster than large-v3, similar quality)
model = whisper.load_model("large-v3-turbo", device="cuda")

result = model.transcribe("audio.mp3")
print(result["text"])
```

With Faster-Whisper:

```python
from faster_whisper import WhisperModel

# large-v3-turbo via faster-whisper
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

***

## WhisperX: Enhanced Alternative

For **word-level timestamps**, **speaker diarization**, and **up to 70x realtime** processing, consider [WhisperX](/guides/audio-and-voice/whisperx.md):

```bash
pip install whisperx
```

```python
import whisperx

model = whisperx.load_model("large-v3-turbo", device="cuda", compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device="cuda")
result = whisperx.align(result["segments"], model_a, metadata, audio, device="cuda")

# result["word_segments"] contains word-level timestamps
```

➡️ See the full [WhisperX guide](/guides/audio-and-voice/whisperx.md) for speaker diarization and advanced features.

## Quick Deploy (Recommended)

Use the pre-built Faster-Whisper server for instant deployment:

**Docker Image:**

```
fedirz/faster-whisper-server:latest-cuda
```

**Ports:**

```
22/tcp
8000/http
```

**No command needed** - server starts automatically.

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
curl https://your-http-pub.clorecloud.net/

# Expected: server info JSON response
```

{% hint style="warning" %}
If you get HTTP 502, wait 1-2 minutes - the service is still starting.
{% endhint %}
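
To wait for startup automatically instead of retrying by hand, you can poll the server until it responds. This sketch uses the `/health` endpoint documented in the API reference below:

```bash
# Poll every 10s until the server answers (each attempt times out after 5s)
until curl -sf --max-time 5 https://your-http-pub.clorecloud.net/health; do
  echo "Still starting..."
  sleep 10
done
echo "Server is up"
```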

### Transcribe via API

```bash
# Transcribe audio file
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "response_format=json"

# With timestamps
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
```
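
Since the endpoint is OpenAI-compatible, the official `openai` Python client should also work when pointed at your server. A minimal sketch, assuming the default setup where no API key is enforced (any placeholder value works):

```python
from openai import OpenAI

# Point the client at your rented server instead of api.openai.com
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed",  # placeholder; no key is enforced by default
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="Systran/faster-whisper-large-v3",
        file=f,
    )

print(transcript.text)
```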

## Complete API Reference (Faster-Whisper-Server)

### Endpoints

| Endpoint                   | Method | Description                          |
| -------------------------- | ------ | ------------------------------------ |
| `/v1/audio/transcriptions` | POST   | Transcribe audio (OpenAI-compatible) |
| `/v1/audio/translations`   | POST   | Translate audio to English           |
| `/v1/models`               | GET    | List all available models            |
| `/v1/models/{model_name}`  | GET    | Get specific model info              |
| `/api/ps`                  | GET    | List currently loaded models         |
| `/api/ps/{model_name}`     | GET    | Check if specific model is loaded    |
| `/api/pull/{model_name}`   | POST   | Download and load a model            |
| `/health`                  | GET    | Health check endpoint                |
| `/docs`                    | GET    | Swagger UI documentation             |
| `/openapi.json`            | GET    | OpenAPI specification                |
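
For example, to check health and pre-load a model so the first transcription request doesn't block on the download (paths as listed above):

```bash
# Health check
curl https://your-http-pub.clorecloud.net/health

# Download and load a model ahead of time
curl -X POST https://your-http-pub.clorecloud.net/api/pull/Systran/faster-whisper-medium

# Confirm which models are currently loaded
curl https://your-http-pub.clorecloud.net/api/ps
```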

#### List Available Models

```bash
curl https://your-http-pub.clorecloud.net/v1/models
```

Response:

```json
{
  "data": [
    {"id": "Systran/faster-whisper-large-v3", "object": "model"},
    {"id": "Systran/faster-whisper-medium", "object": "model"}
  ]
}
```

#### Swagger Documentation

Open in browser for interactive API testing:

```
https://your-http-pub.clorecloud.net/docs
```

### Transcription Options

| Parameter                   | Type   | Description                                               |
| --------------------------- | ------ | --------------------------------------------------------- |
| `file`                      | File   | Audio file to transcribe                                  |
| `model`                     | String | Model to use (default: `Systran/faster-whisper-large-v3`) |
| `language`                  | String | Force specific language (e.g., `en`, `ja`, `ru`)          |
| `response_format`           | String | `json`, `text`, `srt`, `vtt`, `verbose_json`              |
| `temperature`               | Float  | Sampling temperature (0.0-1.0)                            |
| `timestamp_granularities[]` | Array  | `word` or `segment` for timestamps                        |
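
These options combine freely. For example, forcing Japanese and writing SRT subtitles straight to a file:

```bash
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "language=ja" \
  -F "response_format=srt" \
  -o subtitles.srt
```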

#### Response Formats

**JSON (default):**

```json
{"text": "Transcribed text here..."}
```

**Verbose JSON:**

```json
{
  "text": "Full transcription...",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "First segment"},
    {"start": 2.5, "end": 5.0, "text": "Second segment"}
  ],
  "language": "en"
}
```

**SRT:**

```
1
00:00:00,000 --> 00:00:02,500
First segment

2
00:00:02,500 --> 00:00:05,000
Second segment
```

## Alternative: Manual Installation

If you need more control, deploy with manual installation:

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install openai-whisper faster-whisper
```

{% hint style="info" %}
Manual installation takes 3-5 minutes. The pre-built image above is recommended for faster startup.
{% endhint %}
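
After installation, it's worth confirming that PyTorch can see the GPU before loading a model:

```python
import torch

# Should print True and your GPU's name on a correctly configured server
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```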

## Basic Usage (SSH)

```bash
ssh -p <port> root@<proxy>

# Transcribe audio file
whisper audio.mp3 --model large-v3 --device cuda

# Output to specific format
whisper audio.mp3 --model large-v3 --output_format txt

# Specify language
whisper audio.mp3 --model large-v3 --language Japanese
```

### Transcribe with Timestamps

```bash
whisper audio.mp3 --model large-v3 --word_timestamps True
```

## Upload Audio Files

```bash
# Upload single file
scp -P <port> interview.mp3 root@<proxy>:/workspace/

# Upload folder
scp -P <port> -r ./audio_files/ root@<proxy>:/workspace/
```
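
With a folder uploaded, a shell loop transcribes the whole batch (paths here are illustrative):

```bash
mkdir -p /workspace/transcripts
for f in /workspace/audio_files/*.mp3; do
  whisper "$f" --model large-v3 --device cuda \
    --output_format txt --output_dir /workspace/transcripts
done
```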

## Python API

```python
import whisper

# Load model (downloads on first use)
model = whisper.load_model("large-v3", device="cuda")

# Transcribe
result = model.transcribe("audio.mp3")

# Print text
print(result["text"])

# Print with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
```

## Faster-Whisper (Recommended)

Faster-Whisper reimplements Whisper on CTranslate2 and runs up to 4x faster than the original implementation while using less VRAM:

```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

# Load model with optimizations
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe("audio.mp3")

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

## Language Options

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

# Auto-detect language
segments, info = model.transcribe("audio.mp3")
print(f"Language: {info.language} ({info.language_probability:.0%})")

# Force specific language
segments, _ = model.transcribe("audio.mp3", language="ja")
```

## Translation to English

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
segments, _ = model.transcribe("japanese.mp3", task="translate")

for segment in segments:
    print(segment.text)
```

CLI:

```bash
whisper japanese.mp3 --model large-v3 --task translate
```

## Subtitle Generation

### SRT Format

```python
from faster_whisper import WhisperModel

def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

model = WhisperModel("large-v3", device="cuda")
segments, _ = model.transcribe("video.mp4")

with open("subtitles.srt", "w") as f:
    for i, segment in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n")
        f.write(f"{segment.text.strip()}\n\n")
```

### VTT Format

```bash
whisper video.mp4 --model large-v3 --output_format vtt
```

## Word-Level Timestamps

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s] {word.word}")
```

## Speaker Diarization

Identify who said what by combining Whisper with pyannote speaker diarization (requires `pyannote.audio` and a Hugging Face token, since the diarization model is gated):

```bash
pip install pyannote.audio
```

```python
import torch
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# Diarization (gated model: accept its terms on Hugging Face first)
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
diarization.to(torch.device("cuda"))  # the pipeline runs on CPU unless moved to the GPU
diarization_result = diarization("audio.mp3")

# Transcription
whisper = WhisperModel("large-v3", device="cuda")
segments, _ = whisper.transcribe("audio.mp3")

# Combine
for segment in segments:
    # Find speaker at segment time
    speaker = None
    for turn, _, spk in diarization_result.itertracks(yield_label=True):
        if turn.start <= segment.start <= turn.end:
            speaker = spk
            break
    print(f"[{speaker}] {segment.text}")
```

## REST API Server

Create a transcription API:

```python
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import os
import tempfile

app = FastAPI()
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    segments, info = model.transcribe(tmp_path)
    # transcribe() returns a generator; materialize it so it can be iterated twice
    segments = list(segments)
    os.unlink(tmp_path)  # clean up the temp file

    return {
        "language": info.language,
        "text": "".join(s.text for s in segments).strip(),
        "segments": [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in segments
        ]
    }

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
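
Once the server is running (see the `uvicorn` command above), test it with a quick upload. `localhost` works from the server itself; from another machine, use your forwarded address:

```bash
curl -X POST http://localhost:8000/transcribe -F "file=@audio.mp3"
```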

## Performance Benchmarks

| Model    | GPU      | 1hr Audio |
| -------- | -------- | --------- |
| large-v3 | RTX 3090 | \~5 min   |
| large-v3 | RTX 4090 | \~3 min   |
| large-v3 | A100     | \~2 min   |
| medium   | RTX 3090 | \~2 min   |
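
Throughput varies with the GPU, the audio content, and your settings. A minimal timing sketch to measure the realtime factor on your own server:

```python
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

start = time.time()
segments, info = model.transcribe("audio.mp3")
text = "".join(s.text for s in segments)  # decoding runs while the generator is consumed
elapsed = time.time() - start

print(f"{info.duration:.0f}s of audio in {elapsed:.0f}s "
      f"({info.duration / elapsed:.1f}x realtime)")
```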

## Memory-Efficient Processing

For very long audio:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Process in chunks
segments, _ = model.transcribe(
    "long_audio.mp3",
    vad_filter=True,  # Skip silence
    vad_parameters=dict(min_silence_duration_ms=500)
)
```

## Download Results

```bash
# Download transcripts
scp -P <port> -r root@<proxy>:/workspace/transcripts/ ./

# Download subtitles
scp -P <port> root@<proxy>:/workspace/subtitles.srt ./
```

## Troubleshooting

{% hint style="danger" %}
**CUDA out of memory**
{% endhint %}

* Use a smaller model (`medium` instead of `large-v3`)
* Use `compute_type="int8"` with faster-whisper
* Process shorter audio segments

### HTTP 502 on `http_pub` URL

The service is still starting. Wait 1-2 minutes and retry:

```bash
curl https://your-http-pub.clorecloud.net/
```

### Poor accuracy

* Use a larger model
* Specify the language: `--language English`
* Increase `beam_size` for faster-whisper (example below)
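
`beam_size` defaults to 5 in faster-whisper; raising it trades some speed for accuracy:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _ = model.transcribe("audio.mp3", beam_size=10)  # default is 5
```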

### Slow processing

* Confirm the GPU is actually in use: `nvidia-smi`
* Use faster-whisper instead of the original implementation
* Enable VAD filtering to skip silence (see Memory-Efficient Processing above)

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For            |
| -------- | ---- | ---------- | ------------------- |
| RTX 3060 | 12GB | $0.15–0.30 | small/medium models |
| RTX 3090 | 24GB | $0.30–1.00 | large-v3            |
| RTX 4090 | 24GB | $0.50–2.00 | large-v3, fast      |
| A100     | 40GB | $1.50–3.00 | batch processing    |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

