# Whisper Transcription

Transcribe audio and video files using OpenAI's Whisper on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum       | Recommended      |
| ------------ | ------------- | ---------------- |
| RAM          | 8GB           | 16GB+            |
| VRAM         | 4GB (small)   | 10GB+ (large-v3) |
| Network      | 200Mbps       | 500Mbps+         |
| Startup Time | \~1-2 minutes | -                |

## What is Whisper?

OpenAI Whisper is a speech recognition model that can:

* Transcribe audio in 99 languages
* Translate to English
* Generate timestamps
* Handle noisy audio

## Model Sizes

| Model              | VRAM    | Speed             | Quality  | Notes                                        |
| ------------------ | ------- | ----------------- | -------- | -------------------------------------------- |
| tiny               | 1GB     | \~32x realtime    | Basic    | Fastest, lowest accuracy                     |
| base               | 1GB     | \~16x realtime    | Good     | Good balance for quick tasks                 |
| small              | 2GB     | \~6x realtime     | Better   | Recommended for most use cases               |
| medium             | 5GB     | \~2x realtime     | Great    | High accuracy, moderate speed                |
| large-v3           | 10GB    | \~1x realtime     | Best     | Highest accuracy                             |
| **large-v3-turbo** | **6GB** | **\~8x realtime** | **Best** | **8x faster than large-v3, similar quality** |

> **💡 Recommendation:** Use `large-v3-turbo` for the best speed/quality tradeoff. It delivers comparable accuracy to `large-v3` at 8x the speed with lower VRAM requirements.
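
A realtime factor translates directly into wall-clock time: at \~8x realtime, one hour of audio is transcribed in roughly 60 / 8 ≈ 7.5 minutes. A back-of-the-envelope helper (illustrative only; actual throughput depends on GPU, audio, and settings):

```python
# Rough estimate of transcription time from the realtime factors above.
# Illustrative only - actual throughput depends on GPU, audio, and settings.
def estimated_minutes(audio_minutes: float, realtime_factor: float) -> float:
    return audio_minutes / realtime_factor

print(estimated_minutes(60, 8))  # large-v3-turbo: ~7.5 min for 1h of audio
print(estimated_minutes(60, 1))  # large-v3: ~60 min for the same hour
```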

### Using large-v3-turbo

```python
import whisper

# Load large-v3-turbo (8x faster than large-v3, similar quality)
model = whisper.load_model("large-v3-turbo", device="cuda")

result = model.transcribe("audio.mp3")
print(result["text"])
```

With Faster-Whisper:

```python
from faster_whisper import WhisperModel

# large-v3-turbo via faster-whisper
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

***

## WhisperX: Enhanced Alternative

For **word-level timestamps**, **speaker diarization**, and **up to 70x realtime** processing, consider [WhisperX](https://docs.clore.ai/guides/audio-and-voice/whisperx):

```bash
pip install whisperx
```

```python
import whisperx

model = whisperx.load_model("large-v3-turbo", device="cuda", compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device="cuda")
result = whisperx.align(result["segments"], model_a, metadata, audio, device="cuda")

# result["word_segments"] contains word-level timestamps
```

➡️ See the full [WhisperX guide](https://docs.clore.ai/guides/audio-and-voice/whisperx) for speaker diarization and advanced features.

## Quick Deploy (Recommended)

Use the pre-built Faster-Whisper server for instant deployment:

**Docker Image:**

```
fedirz/faster-whisper-server:latest-cuda
```

**Ports:**

```
22/tcp
8000/http
```

**No command needed** - server starts automatically.

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
curl https://your-http-pub.clorecloud.net/

# Expected: server info JSON response
```

{% hint style="warning" %}
If you get HTTP 502, wait 1-2 minutes - the service is still starting.
{% endhint %}

### Transcribe via API

```bash
# Transcribe audio file
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "response_format=json"

# With timestamps
curl -X POST https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
```

## Complete API Reference (Faster-Whisper-Server)

### Endpoints

| Endpoint                   | Method | Description                          |
| -------------------------- | ------ | ------------------------------------ |
| `/v1/audio/transcriptions` | POST   | Transcribe audio (OpenAI-compatible) |
| `/v1/audio/translations`   | POST   | Translate audio to English           |
| `/v1/models`               | GET    | List all available models            |
| `/v1/models/{model_name}`  | GET    | Get specific model info              |
| `/api/ps`                  | GET    | List currently loaded models         |
| `/api/ps/{model_name}`     | GET    | Check if specific model is loaded    |
| `/api/pull/{model_name}`   | POST   | Download and load a model            |
| `/health`                  | GET    | Health check endpoint                |
| `/docs`                    | GET    | Swagger UI documentation             |
| `/openapi.json`            | GET    | OpenAPI specification                |

#### List Available Models

```bash
curl https://your-http-pub.clorecloud.net/v1/models
```

Response:

```json
{
  "data": [
    {"id": "Systran/faster-whisper-large-v3", "object": "model"},
    {"id": "Systran/faster-whisper-medium", "object": "model"}
  ]
}
```

#### Swagger Documentation

Open in browser for interactive API testing:

```
https://your-http-pub.clorecloud.net/docs
```

### Transcription Options

| Parameter                   | Type   | Description                                               |
| --------------------------- | ------ | --------------------------------------------------------- |
| `file`                      | File   | Audio file to transcribe                                  |
| `model`                     | String | Model to use (default: `Systran/faster-whisper-large-v3`) |
| `language`                  | String | Force specific language (e.g., `en`, `ja`, `ru`)          |
| `response_format`           | String | `json`, `text`, `srt`, `vtt`, `verbose_json`              |
| `temperature`               | Float  | Sampling temperature (0.0-1.0)                            |
| `timestamp_granularities[]` | Array  | `word` or `segment` for timestamps                        |
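
As a sketch, the options above can be assembled into form fields before posting. The `build_fields` helper below is hypothetical (for illustration only); the field names match the table:

```python
# Hypothetical helper: assemble form fields for /v1/audio/transcriptions
# from the options in the table above. Only set fields are included.
def build_fields(model="Systran/faster-whisper-large-v3", language=None,
                 response_format="json", temperature=None,
                 word_timestamps=False):
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language  # e.g. "en", "ja", "ru"
    if temperature is not None:
        fields["temperature"] = str(temperature)
    if word_timestamps:
        fields["timestamp_granularities[]"] = "word"
    return fields

# Example: fields for a Japanese transcription with word timestamps
print(build_fields(language="ja", response_format="verbose_json",
                   word_timestamps=True))
```

With `requests`, these fields would go in `data=` alongside `files={"file": open("audio.mp3", "rb")}`.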

#### Response Formats

**JSON (default):**

```json
{"text": "Transcribed text here..."}
```

**Verbose JSON:**

```json
{
  "text": "Full transcription...",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "First segment"},
    {"start": 2.5, "end": 5.0, "text": "Second segment"}
  ],
  "language": "en"
}
```

**SRT:**

```
1
00:00:00,000 --> 00:00:02,500
First segment

2
00:00:02,500 --> 00:00:05,000
Second segment
```
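
SRT timestamps use the `HH:MM:SS,mmm` layout shown above; converting one back to seconds takes a few lines of Python (a minimal sketch, no error handling):

```python
# Parse an SRT timestamp ("HH:MM:SS,mmm") into seconds.
def srt_timestamp_to_seconds(ts: str) -> float:
    hms, millis = ts.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000

print(srt_timestamp_to_seconds("00:00:02,500"))  # 2.5
```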

## Alternative: Manual Installation

If you need more control, deploy with manual installation:

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install openai-whisper faster-whisper
```

{% hint style="info" %}
Manual installation takes 3-5 minutes. The pre-built image above is recommended for faster startup.
{% endhint %}

## Basic Usage (SSH)

```bash
ssh -p <port> root@<proxy>

# Transcribe audio file
whisper audio.mp3 --model large-v3 --device cuda

# Output to specific format
whisper audio.mp3 --model large-v3 --output_format txt

# Specify language
whisper audio.mp3 --model large-v3 --language Japanese
```

### Transcribe with Timestamps

```bash
whisper audio.mp3 --model large-v3 --word_timestamps True
```

## Upload Audio Files

```bash
# Upload single file
scp -P <port> interview.mp3 root@<proxy>:/workspace/

# Upload folder
scp -P <port> -r ./audio_files/ root@<proxy>:/workspace/
```

## Python API

```python
import whisper

# Load model (downloads on first use)
model = whisper.load_model("large-v3", device="cuda")

# Transcribe
result = model.transcribe("audio.mp3")

# Print text
print(result["text"])

# Print with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
```

## Faster-Whisper (Recommended)

Faster-Whisper is a CTranslate2 reimplementation of Whisper that runs up to 4x faster than the original with lower VRAM usage:

```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

# Load model with optimizations
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe("audio.mp3")

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

## Language Options

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

# Auto-detect language
segments, info = model.transcribe("audio.mp3")
print(f"Language: {info.language} ({info.language_probability:.0%})")

# Force specific language
segments, _ = model.transcribe("audio.mp3", language="ja")
```

## Translation to English

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
segments, _ = model.transcribe("japanese.mp3", task="translate")

for segment in segments:
    print(segment.text)
```

CLI:

```bash
whisper japanese.mp3 --model large-v3 --task translate
```

## Subtitle Generation

### SRT Format

```python
from faster_whisper import WhisperModel

def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

model = WhisperModel("large-v3", device="cuda")
segments, _ = model.transcribe("video.mp4")

with open("subtitles.srt", "w") as f:
    for i, segment in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n")
        f.write(f"{segment.text.strip()}\n\n")
```

### VTT Format

```bash
whisper video.mp4 --model large-v3 --output_format vtt
```

## Word-Level Timestamps

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s] {word.word}")
```

## Speaker Diarization

Identify who said what (requires `pyannote.audio` and a Hugging Face access token):

```bash
pip install pyannote.audio
```

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# Diarization
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
diarization_result = diarization("audio.mp3")

# Transcription
whisper = WhisperModel("large-v3", device="cuda")
segments, _ = whisper.transcribe("audio.mp3")

# Combine
for segment in segments:
    # Find speaker at segment time
    speaker = None
    for turn, _, spk in diarization_result.itertracks(yield_label=True):
        if turn.start <= segment.start <= turn.end:
            speaker = spk
            break
    print(f"[{speaker}] {segment.text}")
```

## REST API Server

Create a transcription API:

```python
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import os
import tempfile

app = FastAPI()
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        # transcribe() returns a generator - materialize it once,
        # since a second iteration would yield nothing
        segments, info = model.transcribe(tmp_path)
        segments = list(segments)
    finally:
        os.unlink(tmp_path)  # clean up the uploaded temp file

    return {
        "language": info.language,
        "text": " ".join(s.text for s in segments),
        "segments": [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in segments
        ]
    }

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```

## Performance Benchmarks

| Model    | GPU      | 1hr Audio |
| -------- | -------- | --------- |
| large-v3 | RTX 3090 | \~5 min   |
| large-v3 | RTX 4090 | \~3 min   |
| large-v3 | A100     | \~2 min   |
| medium   | RTX 3090 | \~2 min   |

## Memory-Efficient Processing

For very long audio:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Process in chunks
segments, _ = model.transcribe(
    "long_audio.mp3",
    vad_filter=True,  # Skip silence
    vad_parameters=dict(min_silence_duration_ms=500)
)
```

## Download Results

```bash
# Download transcripts
scp -P <port> -r root@<proxy>:/workspace/transcripts/ ./

# Download subtitles
scp -P <port> root@<proxy>:/workspace/subtitles.srt ./
```

## Troubleshooting

{% hint style="danger" %}
**CUDA out of memory**
{% endhint %}

* Use a smaller model (`medium` instead of `large-v3`)
* Use `compute_type="int8"` with faster-whisper
* Process shorter audio segments
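
The bullets above can be condensed into a rough selection rule. The thresholds below are guides taken from the Model Sizes table, not hard limits:

```python
# Rough model/compute_type choice by available VRAM (GB).
# Thresholds follow the Model Sizes table; treat them as guides, not limits.
def pick_config(vram_gb: float) -> tuple[str, str]:
    if vram_gb >= 10:
        return ("large-v3", "float16")
    if vram_gb >= 6:
        return ("large-v3-turbo", "float16")
    if vram_gb >= 5:
        return ("medium", "float16")
    if vram_gb >= 2:
        return ("small", "int8")
    return ("base", "int8")

print(pick_config(24))  # ('large-v3', 'float16')
print(pick_config(4))   # ('small', 'int8')
```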

### HTTP 502 on http\_pub URL

The service is still starting. Wait 1-2 minutes and retry:

```bash
curl https://your-http-pub.clorecloud.net/
```
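
The wait-and-retry can also be scripted. A minimal standard-library sketch (the `/health` path comes from the API table above; the URL is a placeholder):

```python
# Poll the server until it answers, tolerating the 502s seen during startup.
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str, timeout_s: float = 180, interval_s: float = 10) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status < 500:
                    return True
        except urllib.error.URLError:
            pass  # not up yet (connection refused, 502, DNS, ...)
        time.sleep(interval_s)
    return False

# wait_until_ready("https://your-http-pub.clorecloud.net/health")
```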

### Poor accuracy

* Use a larger model
* Specify the language explicitly: `--language English`
* Increase `beam_size` for faster-whisper

### Slow processing

* Ensure GPU is used: `nvidia-smi`
* Use faster-whisper instead of original
* Enable VAD to skip silence

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For            |
| -------- | ---- | ---------- | ------------------- |
| RTX 3060 | 12GB | $0.15–0.30 | small/medium models |
| RTX 3090 | 24GB | $0.30–1.00 | large-v3            |
| RTX 4090 | 24GB | $0.50–2.00 | large-v3, fast      |
| A100     | 40GB | $1.50–3.00 | batch processing    |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
