Pipeline Architecture

Overview

A local-first system for converting video content into searchable knowledge.

INPUT                    PROCESS                    OUTPUT
─────────────────────────────────────────────────────────────
YouTube URL         →    yt-dlp              →    Audio file
Podcast feed        →    download            →    (.mp3/.m4a)
Local video         →                        →
                         ↓
                    Whisper/Parakeet         →    Transcript
                    (local ML)               →    + timestamps
                         ↓
                    LLM extraction           →    Topics
                    (optional)               →    Summary
                                             →    Key points
                         ↓
                    Database                 →    Searchable
                    + embeddings             →    knowledge base

Stage 1: Capture

Tool: yt-dlp

# Download audio only (smallest file)
yt-dlp -x --audio-format mp3 "https://youtube.com/watch?v=..."

# Download entire channel
yt-dlp -x --audio-format mp3 "https://youtube.com/@ChannelName"

# With metadata
yt-dlp -x --audio-format mp3 --write-info-json "URL"
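
The same capture step can be scripted through yt-dlp's Python API once the pipeline is automated. A minimal sketch (the archive file name is an assumption, used so repeat runs skip videos already downloaded):

from yt_dlp import YoutubeDL

# Audio-only download with metadata JSON, skipping anything already fetched
ydl_opts = {
    "format": "bestaudio/best",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    "writeinfojson": True,                 # same as --write-info-json
    "download_archive": "downloaded.txt",  # skip URLs already listed here
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://youtube.com/@ChannelName"])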

Considerations


Stage 2: Transcribe

Option A: Whisper (OpenAI)

# Install
pip install openai-whisper

# Transcribe
whisper audio.mp3 --model medium --output_format json

Models by accuracy/speed:

Model     VRAM    Speed   Accuracy
tiny      1 GB    32x     Good
base      1 GB    16x     Better
small     2 GB    6x      Good+
medium    5 GB    2x      Great
large     10 GB   1x      Best
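
When the transcription step is scripted rather than run from the shell, the same models are available through Whisper's Python API; a minimal sketch:

import whisper

# Load once and reuse across files; "medium" matches the CLI example above
model = whisper.load_model("medium")

result = model.transcribe("audio.mp3")
print(result["text"])         # full transcript
print(result["segments"][0])  # first timestamped segment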

Option B: Parakeet (NVIDIA)

Faster than Whisper on NVIDIA GPUs, with similar accuracy.

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")
transcript = model.transcribe(["audio.mp3"])  # one result per input file

Option C: MacWhisper (macOS)

GUI app using Apple Silicon acceleration. Good for manual processing.

Output Format

{
  "text": "Full transcript here...",
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "text": "Welcome to the tutorial..."
    }
  ]
}
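
Reading that output back for the later stages takes a few lines; a sketch assuming the CLI wrote audio.json next to the audio file:

import json

with open("audio.json") as f:
    data = json.load(f)

segments = data["segments"]  # each entry has start, end, text
for seg in segments:
    print(f'{seg["start"]:.1f}-{seg["end"]:.1f}: {seg["text"]}')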

Stage 3: Structure (Optional)

LLM Extraction

Use a local or API-hosted LLM to extract:

prompt = """
Extract from this transcript:
1. Main topics (list)
2. Key takeaways (3-5 bullets)
3. Tools/products mentioned
4. Timestamps for major sections

Transcript:
{transcript}
"""

Cost Consideration

Running this stage through a hosted LLM API adds a per-token cost for every transcript. Keeping extraction on a local model, or skipping the stage entirely since it is optional, keeps the pipeline's API cost at zero.

Stage 4: Store

Schema

CREATE TABLE videos (
  id UUID PRIMARY KEY,
  url TEXT,
  title TEXT,
  channel TEXT,
  duration INTEGER,
  watched_at TIMESTAMP,
  transcript TEXT,
  topics TEXT[],
  summary TEXT
);

CREATE TABLE segments (
  id UUID PRIMARY KEY,
  video_id UUID REFERENCES videos(id),
  start_time FLOAT,
  end_time FLOAT,
  text TEXT,
  embedding VECTOR(384)  -- dimension must match the embedding model (all-MiniLM-L6-v2 outputs 384)
);

Embeddings

Convert segments to vectors for semantic search:

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional vectors, matching the schema above
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(segment_text)
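
For a full transcript, the Stage 2 segments can be encoded in one batch; a sketch (segments is the parsed Whisper output from Stage 2):

# Encode all segment texts at once; all-MiniLM-L6-v2 yields 384-dim vectors
texts = [seg["text"] for seg in segments]
vectors = model.encode(texts)

for seg, vec in zip(segments, vectors):
    seg["embedding"] = vec.tolist()  # ready to insert into the segments table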

Stage 5: Retrieve

Semantic Search

-- Find segments similar to query
SELECT video_id, text, start_time
FROM segments
ORDER BY embedding <-> query_embedding
LIMIT 10;
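
Here query_embedding is the search string run through the same model used at indexing time. A sketch of the round trip with psycopg2 (the connection string is an assumption; model is the sentence-transformers model from Stage 4):

import psycopg2

query_vec = model.encode("how do I deploy to kubernetes")
# pgvector accepts a '[v1,v2,...]' literal cast to vector
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

conn = psycopg2.connect("postgresql://localhost/knowledge")  # assumption
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT video_id, text, start_time
        FROM segments
        ORDER BY embedding <-> %s::vector
        LIMIT 10
        """,
        (vec_literal,),
    )
    results = cur.fetchall()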

Full-Text Search

-- Keyword search
SELECT * FROM videos
WHERE to_tsvector(transcript) @@ to_tsquery('kubernetes & deployment');

Combined Interface
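
One way to combine the two is to let the keyword match narrow candidates and the vector distance order them. A sketch (table and column names follow the schema above; model is the sentence-transformers model from Stage 4):

HYBRID_SQL = """
SELECT s.video_id, s.text, s.start_time
FROM segments s
JOIN videos v ON v.id = s.video_id
WHERE to_tsvector(v.transcript) @@ plainto_tsquery(%(query)s)
ORDER BY s.embedding <-> %(embedding)s::vector
LIMIT 10
"""

def hybrid_search(cur, query: str):
    vec = model.encode(query)
    vec_literal = "[" + ",".join(str(x) for x in vec) + "]"
    cur.execute(HYBRID_SQL, {"query": query, "embedding": vec_literal})
    return cur.fetchall()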


Performance

Metric               Value
Videos processed     9,996
Total transcripts    15,955 files
Channels tracked     91
Search latency       <500ms
API cost             $0 (local ML)

Replication

Minimum Setup

  1. Install yt-dlp: pip install yt-dlp
  2. Install Whisper: pip install openai-whisper
  3. Download + transcribe in one script (see the sketch after this list)
  4. Store in SQLite or JSON files
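
An end-to-end sketch of that minimum setup, shelling out to the two CLIs above and storing the result in SQLite (file names and the table layout are assumptions):

import json, sqlite3, subprocess

def ingest(url: str, db_path: str = "videos.db"):
    # Stage 1: audio-only download to a predictable filename
    subprocess.run(["yt-dlp", "-x", "--audio-format", "mp3",
                    "-o", "audio.%(ext)s", url], check=True)

    # Stage 2: transcribe; writes audio.json next to the audio file
    subprocess.run(["whisper", "audio.mp3", "--model", "medium",
                    "--output_format", "json"], check=True)

    with open("audio.json") as f:
        result = json.load(f)

    # Stage 4 (minimal): one row per video in SQLite
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS videos (url TEXT, transcript TEXT)")
    conn.execute("INSERT INTO videos VALUES (?, ?)", (url, result["text"]))
    conn.commit()
    conn.close()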

Full Setup

  1. Supabase for database + vector search
  2. Batch processing with queue
  3. Web interface for search
  4. Automatic channel monitoring

Contribute improvements to the pipeline or share your own architecture.