Local Transcription Guide

Why Local?

Cloud transcription services charge per minute. OpenAI's hosted Whisper API runs $0.006/minute, and managed services such as AWS Transcribe or Google Speech-to-Text typically land in the $0.015-0.025/minute range.

At scale (10K+ hours), this becomes significant: 10,000 hours is 600,000 minutes, which is roughly $3,600 at the $0.006/minute rate and $9,000-15,000 at typical managed-service rates, recurring every time you reprocess.

Local transcription also means:

- Privacy: audio never leaves your machine
- No rate limits, quotas, or upload time
- Works offline
- No per-minute metering, so you can rerun files freely as models improve


Hardware Requirements

Minimum (CPU-only)

- Any modern multi-core CPU
- 8GB+ RAM
- Expect slower-than-real-time processing; stick to the tiny or base models

Recommended (GPU)

- NVIDIA GPU with 8GB+ VRAM (medium fits in ~5GB; large needs ~10GB, per the table below)
- 16GB+ system RAM
- Recent NVIDIA driver with CUDA support

Apple Silicon

- Any M-series Mac handles the smaller models well
- 16GB+ unified memory is comfortable for medium and larger
- See MacWhisper below for a GUI that uses Apple Silicon acceleration


Setup: Whisper

Installation

# Basic
pip install openai-whisper

# With GPU support (NVIDIA): install CUDA-enabled PyTorch first, then Whisper
# (--index-url replaces PyPI, so openai-whisper must be installed separately)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper
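
To confirm the CUDA build is actually active before transcribing, a quick check (standard PyTorch call):

python -c "import torch; print(torch.cuda.is_available())"  # should print True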

Basic Usage

# Transcribe single file
whisper audio.mp3 --model medium

# Specify output format
whisper audio.mp3 --model medium --output_format json

# Specify language (faster if known)
whisper audio.mp3 --model medium --language en
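
The same functionality is available from Python when you want transcription inside a script. A minimal sketch using openai-whisper's load_model/transcribe API (the file name is a placeholder):

import whisper

model = whisper.load_model("medium")

# language is optional; passing it skips auto-detection, like --language en
result = model.transcribe("audio.mp3", language="en")

print(result["text"])            # full transcript
for seg in result["segments"]:   # per-segment timestamps
    print(f'[{seg["start"]:.2f} -> {seg["end"]:.2f}] {seg["text"]}')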

Batch Processing

#!/bin/bash
# transcribe_all.sh

for file in ./audio/*.mp3; do
  echo "Processing: $file"
  whisper "$file" --model medium --output_dir ./transcripts
done

Setup: Faster-Whisper

Up to 4x faster than the reference openai/whisper implementation at the same accuracy, with lower memory use (it reimplements inference on top of CTranslate2).

pip install faster-whisper

from faster_whisper import WhisperModel

# "medium" on an NVIDIA GPU with half precision
model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3")

# segments is a generator; transcription happens as you iterate
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")

Setup: MacWhisper (macOS GUI)

  1. Download from goodsnooze.gumroad.com/l/macwhisper
  2. Drag to Applications
  3. Drop audio files to transcribe
  4. Uses Apple Silicon acceleration

Good for:

- One-off files without touching the command line
- Non-technical users
- Quick drag-and-drop batches on a Mac


Model Selection

Model     Size   VRAM  Quality  Speed  Use Case
tiny      39M    1GB   Basic    32x    Quick drafts
base      74M    1GB   Good     16x    Casual use
small     244M   2GB   Better   6x     General purpose
medium    769M   5GB   Great    2x     Recommended
large     1.5G   10GB  Best     1x     Accuracy critical
large-v3  1.5G   10GB  Best+    1x     Latest, most accurate

Recommendation: Start with "medium". Use "large" only if accuracy is critical.
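
To check the speed/quality trade-off on your own hardware, a rough timing sketch (the model list and sample file are assumptions; uses the standard openai-whisper API):

import time
import whisper

for name in ["small", "medium"]:
    model = whisper.load_model(name)
    t0 = time.perf_counter()
    result = model.transcribe("sample.mp3")  # placeholder test file
    print(f"{name}: {time.perf_counter() - t0:.1f}s, {len(result['text'])} chars")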


Optimization Tips

1. Audio Preprocessing

# Convert to 16kHz mono (optimal for Whisper)
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
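
To preprocess a whole directory in the same style as the batch script above (directory names are assumptions):

#!/bin/bash
# preprocess_all.sh
mkdir -p ./audio_16k
for file in ./audio/*.mp3; do
  # 16kHz mono WAV, the sample rate Whisper resamples to internally
  ffmpeg -i "$file" -ar 16000 -ac 1 "./audio_16k/$(basename "${file%.mp3}").wav"
done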

2. Parallel Processing

from concurrent.futures import ProcessPoolExecutor
import glob
import whisper

# Each worker loads its own copy of the model: a model loaded in the
# parent process cannot be safely shared across process boundaries.
def init_worker():
    global model
    model = whisper.load_model("medium")

def transcribe(path):
    return model.transcribe(path)

if __name__ == "__main__":
    audio_files = sorted(glob.glob("./audio/*.mp3"))
    # 4 workers means 4 models in memory; size max_workers to your RAM
    with ProcessPoolExecutor(max_workers=4, initializer=init_worker) as executor:
        results = list(executor.map(transcribe, audio_files))

3. GPU Memory Management

import torch

# PyTorch caches freed allocations; clearing the cache between files
# prevents gradual VRAM creep during long batch runs
torch.cuda.empty_cache()
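
For context, a sketch of where the cache clear fits in a sequential batch loop (the file list is a placeholder):

import torch
import whisper

model = whisper.load_model("medium")

for path in ["a.mp3", "b.mp3"]:  # placeholder file list
    result = model.transcribe(path)
    # ... save result here ...
    torch.cuda.empty_cache()  # release cached blocks before the next file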

4. Skip Already Processed

import os

for audio_file in audio_files:
    transcript_path = audio_file.replace('.mp3', '.json')
    if os.path.exists(transcript_path):
        continue  # Skip already processed
    # ... transcribe

Common Issues

Out of Memory (GPU)

- Drop to a smaller model (large -> medium -> small)
- With faster-whisper, use compute_type="int8" or "int8_float16"
- Clear GPU memory between files (see Optimization Tips above)

Slow Processing

- Switch to faster-whisper (up to 4x)
- Confirm you are actually on the GPU (torch.cuda.is_available())
- Preprocess audio to 16kHz mono
- Use a smaller model if quality allows

Poor Accuracy

- Step up a model size (small -> medium -> large)
- Pass --language explicitly instead of relying on auto-detection
- Re-encode noisy or low-bitrate sources to 16kHz mono
- Prime domain vocabulary with an initial prompt (see the sketch below)
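
For the vocabulary case, Whisper accepts an initial prompt that biases decoding toward expected terms. A minimal sketch (the prompt text is an assumption; initial_prompt is a documented openai-whisper option):

import whisper

model = whisper.load_model("medium")

# Seed the decoder with spellings and jargon it should prefer
result = model.transcribe(
    "audio.mp3",
    language="en",
    initial_prompt="Topics: Kubernetes, PostgreSQL, gRPC.",
)
print(result["text"])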


Output Formats

JSON (Recommended)

{
  "text": "Full transcript...",
  "segments": [
    {"start": 0.0, "end": 5.2, "text": "..."},
    {"start": 5.2, "end": 10.1, "text": "..."}
  ],
  "language": "en"
}
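
Loading the JSON back for downstream processing is plain json (the path assumes the batch script's --output_dir):

import json

with open("./transcripts/audio.json") as f:
    data = json.load(f)

print(data["language"])
for seg in data["segments"]:
    print(f'{seg["start"]:.1f}s: {seg["text"]}')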

SRT (Subtitles)

1
00:00:00,000 --> 00:00:05,200
Welcome to the tutorial

2
00:00:05,200 --> 00:00:10,100
Today we'll cover...
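
Whisper writes SRT directly with --output_format srt, but if you only kept JSON, converting segments is mostly timestamp formatting. A sketch:

def srt_timestamp(seconds):
    # SRT uses HH:MM:SS,mmm with a comma before the milliseconds
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start, end = srt_timestamp(seg["start"]), srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)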

VTT (Web Video)

WEBVTT

00:00:00.000 --> 00:00:05.200
Welcome to the tutorial

00:00:05.200 --> 00:00:10.100
Today we'll cover...

Contribute your setup, benchmarks, or optimization tips.