Local Transcription Guide
Why Local?
Cloud transcription services charge by audio duration:
- Google Speech-to-Text: ~$0.006/15 seconds = $1.44/hour
- AWS Transcribe: ~$0.024/minute = $1.44/hour
- OpenAI Whisper API: $0.006/minute = $0.36/hour
At scale (10K+ hours), this becomes significant:
- 10,000 hours × $0.36/hour = $3,600 (Whisper API)
- 10,000 hours × $0/hour = $0 in usage fees (local)
Local transcription also means:
- No data leaving your machine
- No rate limits
- Works offline
- One-time hardware investment
Hardware Requirements
Minimum (CPU-only)
- Any modern CPU
- 8GB RAM
- Whisper "tiny" or "base" model
- Speed: ~0.5x real-time (2 hours to transcribe 1 hour)
Recommended (GPU)
- NVIDIA GPU with 6GB+ VRAM
- 16GB RAM
- Whisper "medium" or "large" model
- Speed: 10-30x real-time (1 hour transcribed in 2-6 minutes)
Apple Silicon
- M1/M2/M3 Mac
- 16GB unified memory
- Uses Metal acceleration
- Speed: 5-15x real-time
Setup: Whisper
Installation
# Basic
pip install openai-whisper
# With GPU support (NVIDIA): install the CUDA build of PyTorch first, then Whisper
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper
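To confirm the CUDA build is actually being picked up, a quick sanity check:

python -c "import torch; print(torch.cuda.is_available())"  # should print True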
Basic Usage
# Transcribe single file
whisper audio.mp3 --model medium
# Specify output format
whisper audio.mp3 --model medium --output_format json
# Specify language (faster if known)
whisper audio.mp3 --model medium --language en
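The same options are available from Python if you'd rather script it. A minimal sketch using the openai-whisper API (the file name is just an example):

import whisper

model = whisper.load_model("medium")
# Passing language skips auto-detection, same as --language on the CLI
result = model.transcribe("audio.mp3", language="en")
print(result["text"])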
Batch Processing
#!/bin/bash
# transcribe_all.sh
for file in ./audio/*.mp3; do
    echo "Processing: $file"
    whisper "$file" --model medium --output_dir ./transcripts
done
Setup: Faster-Whisper
Up to 4x faster than the original Whisper implementation, with the same accuracy and lower memory use.
pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
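No NVIDIA GPU? faster-whisper also runs on CPU; 8-bit quantization keeps memory use modest:

# CPU fallback with int8 quantization
model = WhisperModel("medium", device="cpu", compute_type="int8")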
Setup: MacWhisper (macOS GUI)
- Download from goodsnooze.gumroad.com/l/macwhisper
- Drag to Applications
- Drop audio files to transcribe
- Uses Apple Silicon acceleration
Good for:
- One-off transcriptions
- Non-technical users
- Quick manual processing
Model Selection
| Model | Parameters | VRAM | Quality | Relative speed | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | 1GB | Basic | 32x | Quick drafts |
| base | 74M | 1GB | Good | 16x | Casual use |
| small | 244M | 2GB | Better | 6x | General purpose |
| medium | 769M | 5GB | Great | 2x | Recommended |
| large | 1.5G | 10GB | Best | 1x | Accuracy critical |
| large-v3 | 1.5G | 10GB | Best+ | 1x | Latest, most accurate |
Recommendation: Start with "medium". Use "large" only if accuracy is critical.
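If you want to choose the model programmatically, a rough heuristic based on the VRAM column above might look like this (pick_model is just an illustrative helper, not part of Whisper):

import torch

def pick_model():
    # No CUDA device: fall back to a CPU-friendly model
    if not torch.cuda.is_available():
        return "base"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    return "small"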
Optimization Tips
1. Audio Preprocessing
# Convert to 16kHz mono (optimal for Whisper)
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
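To preprocess a whole directory, a small wrapper around the same ffmpeg call works (directory names are placeholders; assumes ffmpeg is on PATH):

import subprocess
from pathlib import Path

def preprocess(src_dir="audio", dst_dir="audio_16k"):
    # Convert every .mp3 to 16kHz mono WAV before transcription
    Path(dst_dir).mkdir(exist_ok=True)
    for src in Path(src_dir).glob("*.mp3"):
        dst = Path(dst_dir) / (src.stem + ".wav")
        subprocess.run(["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)], check=True)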
2. Parallel Processing
from concurrent.futures import ProcessPoolExecutor
import whisper

_model = None  # loaded lazily, once per worker process

def transcribe(file):
    global _model
    if _model is None:
        _model = whisper.load_model("medium")
    return _model.transcribe(file)

# Parallel workers mainly help on CPU; on a single GPU they compete for VRAM
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(transcribe, audio_files))
3. GPU Memory Management
import torch
torch.cuda.empty_cache() # Clear GPU memory between files
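In a batch loop, the call goes between files, for example:

import gc
import torch
import whisper

model = whisper.load_model("medium", device="cuda")
for audio_file in audio_files:
    result = model.transcribe(audio_file)
    # ... save result ...
    gc.collect()
    torch.cuda.empty_cache()  # Release cached blocks before the next file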
4. Skip Already Processed
import os
for audio_file in audio_files:
    transcript_path = audio_file.replace('.mp3', '.json')
    if os.path.exists(transcript_path):
        continue  # Skip already processed
    # ... transcribe
Common Issues
Out of Memory (GPU)
- Use smaller model
- Process shorter segments (see the ffmpeg split example below)
- Clear cache between files
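One way to get shorter segments is to split long recordings with ffmpeg's segment muxer before transcribing, e.g. into 10-minute chunks without re-encoding:

ffmpeg -i long_recording.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3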
Slow Processing
- Check the GPU is actually being used: run nvidia-smi while transcribing
- Use faster-whisper instead
- Reduce audio quality (16kHz mono is sufficient)
Poor Accuracy
- Use larger model
- Specify language if known
- Check audio quality (noise, multiple speakers)
Output Formats
JSON (Recommended)
{
  "text": "Full transcript...",
  "segments": [
    {"start": 0.0, "end": 5.2, "text": "..."},
    {"start": 5.2, "end": 10.1, "text": "..."}
  ],
  "language": "en"
}
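Working with the JSON downstream is straightforward; a minimal sketch (the path is just an example):

import json

with open("transcripts/audio.json") as f:
    data = json.load(f)

for seg in data["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")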
SRT (Subtitles)
1
00:00:00,000 --> 00:00:05,200
Welcome to the tutorial
2
00:00:05,200 --> 00:00:10,100
Today we'll cover...
VTT (Web Video)
WEBVTT
00:00:00.000 --> 00:00:05.200
Welcome to the tutorial
00:00:05.200 --> 00:00:10.100
Today we'll cover...
Contribute your setup, benchmarks, or optimization tips.