Local Transcription Guide
Why Local?
Cloud transcription services charge by audio duration:
- Google Speech-to-Text: ~$0.006/15 seconds = $1.44/hour
- AWS Transcribe: ~$0.024/minute = $1.44/hour
- OpenAI Whisper API: $0.006/minute = $0.36/hour
At scale (10K+ hours), this becomes significant:
- 10,000 hours × $0.36/hour = $3,600 (Whisper API)
- 10,000 hours × $0/hour = $0 in usage fees (local)
Local transcription also means:
- No data leaving your machine
- No rate limits
- Works offline
- One-time hardware investment
Hardware Requirements
Minimum (CPU-only)
- Any modern CPU
- 8GB RAM
- Whisper "tiny" or "base" model
- Speed: ~0.5x real-time (2 hours to transcribe 1 hour)
Recommended (GPU)
- NVIDIA GPU with 6GB+ VRAM
- 16GB RAM
- Whisper "medium" or "large" model
- Speed: 10-30x real-time (1 hour transcribed in 2-6 minutes)
Apple Silicon
- M1/M2/M3 Mac
- 16GB unified memory
- Uses Metal acceleration
- Speed: 5-15x real-time
Setup: Whisper
Installation
# Basic
pip install openai-whisper
# With GPU support (NVIDIA): install the CUDA build of PyTorch first, then Whisper
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper
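To confirm the CUDA build is actually being picked up, a quick sanity check:

python -c "import torch; print(torch.cuda.is_available())"  # should print True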
Basic Usage
# Transcribe single file
whisper audio.mp3 --model medium
# Specify output format
whisper audio.mp3 --model medium --output_format json
# Specify language (faster if known)
whisper audio.mp3 --model medium --language en
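The same options are available from Python if you'd rather script it. A minimal sketch using the openai-whisper API (the file name is just an example):

import whisper

model = whisper.load_model("medium")
# Passing language skips auto-detection, same as --language on the CLI
result = model.transcribe("audio.mp3", language="en")
print(result["text"])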
Batch Processing
#!/bin/bash
# transcribe_all.sh
for file in ./audio/*.mp3; do
    echo "Processing: $file"
    whisper "$file" --model medium --output_dir ./transcripts
done
Setup: Faster-Whisper
Up to 4x faster than the original Whisper implementation, with the same accuracy and lower memory use.
pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
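No NVIDIA GPU? faster-whisper also runs on CPU; 8-bit quantization keeps memory use modest:

# CPU fallback with int8 quantization
model = WhisperModel("medium", device="cpu", compute_type="int8")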
Setup: MacWhisper (macOS GUI)
- Download from goodsnooze.gumroad.com/l/macwhisper
- Drag to Applications
- Drop audio files to transcribe
- Uses Apple Silicon acceleration
Good for:
- One-off transcriptions
- Non-technical users
- Quick manual processing
Model Selection
| Model | Parameters | VRAM | Quality | Relative speed | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | 1GB | Basic | 32x | Quick drafts |
| base | 74M | 1GB | Good | 16x | Casual use |
| small | 244M | 2GB | Better | 6x | General purpose |
| medium | 769M | 5GB | Great | 2x | Recommended |
| large | 1.5G | 10GB | Best | 1x | Accuracy critical |
| large-v3 | 1.5G | 10GB | Best+ | 1x | Latest, most accurate |
Recommendation: Start with "medium". Use "large" only if accuracy is critical.
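If you want to choose the model programmatically, a rough heuristic based on the VRAM column above might look like this (pick_model is just an illustrative helper, not part of Whisper):

import torch

def pick_model():
    # No CUDA device: fall back to a CPU-friendly model
    if not torch.cuda.is_available():
        return "base"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    return "small"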
Optimization Tips
1. Audio Preprocessing
# Convert to 16kHz mono (optimal for Whisper)
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
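To preprocess a whole directory, a small wrapper around the same ffmpeg call works (directory names are placeholders; assumes ffmpeg is on PATH):

import subprocess
from pathlib import Path

def preprocess(src_dir="audio", dst_dir="audio_16k"):
    # Convert every .mp3 to 16kHz mono WAV before transcription
    Path(dst_dir).mkdir(exist_ok=True)
    for src in Path(src_dir).glob("*.mp3"):
        dst = Path(dst_dir) / (src.stem + ".wav")
        subprocess.run(["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)], check=True)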
2. Parallel Processing
from concurrent.futures import ProcessPoolExecutor
import whisper

_model = None  # loaded lazily, once per worker process

def transcribe(file):
    global _model
    if _model is None:
        _model = whisper.load_model("medium")
    return _model.transcribe(file)

# Parallel workers mainly help on CPU; on a single GPU they compete for VRAM
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(transcribe, audio_files))
3. GPU Memory Management
import torch
torch.cuda.empty_cache() # Clear GPU memory between files
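In a batch loop, the call goes between files, for example:

import gc
import torch
import whisper

model = whisper.load_model("medium", device="cuda")
for audio_file in audio_files:
    result = model.transcribe(audio_file)
    # ... save result ...
    gc.collect()
    torch.cuda.empty_cache()  # Release cached blocks before the next file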
4. Skip Already Processed
import os
for audio_file in audio_files:
    transcript_path = audio_file.replace('.mp3', '.json')
    if os.path.exists(transcript_path):
        continue  # Skip already processed
    # ... transcribe
Common Issues
Out of Memory (GPU)
- Use smaller model
- Process shorter segments (see the ffmpeg split example below)
- Clear cache between files
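One way to get shorter segments is to split long recordings with ffmpeg's segment muxer before transcribing, e.g. into 10-minute chunks without re-encoding:

ffmpeg -i long_recording.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3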
Slow Processing
- Check the GPU is actually being used: run nvidia-smi while transcribing
- Use faster-whisper instead
- Reduce audio quality (16kHz mono is sufficient)
Poor Accuracy
- Use larger model
- Specify language if known
- Check audio quality (noise, multiple speakers)
Output Formats
JSON (Recommended)
{
  "text": "Full transcript...",
  "segments": [
    {"start": 0.0, "end": 5.2, "text": "..."},
    {"start": 5.2, "end": 10.1, "text": "..."}
  ],
  "language": "en"
}
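Working with the JSON downstream is straightforward; a minimal sketch (the path is just an example):

import json

with open("transcripts/audio.json") as f:
    data = json.load(f)

for seg in data["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")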
SRT (Subtitles)
1
00:00:00,000 --> 00:00:05,200
Welcome to the tutorial
2
00:00:05,200 --> 00:00:10,100
Today we'll cover...
VTT (Web Video)
WEBVTT
00:00:00.000 --> 00:00:05.200
Welcome to the tutorial
00:00:05.200 --> 00:00:10.100
Today we'll cover...
Contribute your setup, benchmarks, or optimization tips.