Pipeline Architecture
Overview
A local-first system for converting video content into searchable knowledge.
INPUT               PROCESS                 OUTPUT
─────────────────────────────────────────────────────────────
YouTube URL    →    yt-dlp             →    Audio file
Podcast feed        download                (.mp3/.m4a)
Local video
                         ↓
                    Whisper/Parakeet   →    Transcript
                    (local ML)              + timestamps
                         ↓
                    LLM extraction     →    Topics
                    (optional)              Summary
                                            Key points
                         ↓
                    Database           →    Searchable
                    + embeddings            knowledge base
Stage 1: Capture
Tool: yt-dlp
# Download audio only (smallest file)
yt-dlp -x --audio-format mp3 "https://youtube.com/watch?v=..."
# Download entire channel
yt-dlp -x --audio-format mp3 "https://youtube.com/@ChannelName"
# With metadata
yt-dlp -x --audio-format mp3 --write-info-json "URL"
Considerations
- Storage: Audio-only is ~10-20MB per hour vs 500MB+ for video
- Rate limiting: YouTube may throttle; add delays between downloads for large batches (see the sketch below)
- Cookies: Some content requires authentication
- Alternatives: Podcast RSS feeds, local recordings
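For large batches, yt-dlp can also be driven from Python, which makes it easy to add the delays and metadata options noted above. A minimal sketch using the yt_dlp package (option names are yt-dlp's; verify them against your installed version):
import yt_dlp

# Roughly the CLI commands above, plus throttling for batch runs.
options = {
    "format": "bestaudio/best",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    "writeinfojson": True,            # same as --write-info-json
    "sleep_interval": 5,              # pause between downloads to avoid throttling
    "max_sleep_interval": 15,
    "outtmpl": "audio/%(id)s.%(ext)s",
}

with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download(["https://youtube.com/watch?v=...", "https://youtube.com/@ChannelName"])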
Stage 2: Transcribe
Option A: Whisper (OpenAI)
# Install (also requires ffmpeg on PATH)
pip install openai-whisper
# Transcribe
whisper audio.mp3 --model medium --output_format json
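Whisper can also be called from Python when scripting the pipeline; a minimal sketch with the openai-whisper package:
import whisper

model = whisper.load_model("medium")            # trade-offs in the table below
result = model.transcribe("audio.mp3")
print(result["text"])                           # full transcript
for seg in result["segments"]:                  # per-segment timestamps
    print(seg["start"], seg["end"], seg["text"])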
Models, by speed and accuracy (speed relative to large):
| Model | VRAM | Relative speed | Accuracy |
|---|---|---|---|
| tiny | ~1 GB | ~32x | Lowest |
| base | ~1 GB | ~16x | Good |
| small | ~2 GB | ~6x | Better |
| medium | ~5 GB | ~2x | Great |
| large | ~10 GB | 1x | Best |
Option B: Parakeet (NVIDIA)
Faster than Whisper on NVIDIA GPUs. Similar accuracy.
import nemo.collections.asr as nemo_asr

# Requires the NeMo toolkit: pip install "nemo_toolkit[asr]" (CUDA GPU recommended)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")
transcript = model.transcribe(["audio.mp3"])  # takes a list of audio file paths
Option C: MacWhisper (macOS)
GUI app using Apple Silicon acceleration. Good for manual processing.
Output Format
{
  "text": "Full transcript here...",
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "text": "Welcome to the tutorial..."
    }
  ]
}
Stage 3: Structure (Optional)
LLM Extraction
Use local or API LLM to extract:
- Topic list
- Key points
- Timestamps for major sections
- Named entities (tools, people, concepts)
prompt = """
Extract from this transcript:
1. Main topics (list)
2. Key takeaways (3-5 bullets)
3. Tools/products mentioned
4. Timestamps for major sections
Transcript:
{transcript}
"""
Cost Consideration
- Local LLM: Free, slower
- API (GPT-4, Claude): ~$0.01-0.10 per transcript
- Batch processing: Queue and process overnight
Stage 4: Store
Schema
CREATE TABLE videos (
  id UUID PRIMARY KEY,
  url TEXT,
  title TEXT,
  channel TEXT,
  duration INTEGER,
  watched_at TIMESTAMP,
  transcript TEXT,
  topics TEXT[],
  summary TEXT
);

CREATE TABLE segments (
  id UUID PRIMARY KEY,
  video_id UUID REFERENCES videos(id),
  start_time FLOAT,
  end_time FLOAT,
  text TEXT,
  embedding VECTOR(384)  -- must match the embedding model (all-MiniLM-L6-v2 → 384)
);
Embeddings
Convert segments to vectors for semantic search:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # outputs 384-dimensional vectors
embedding = model.encode(segment_text)
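Putting Stages 2 and 4 together, a sketch that embeds Whisper segments in batch and inserts them into the segments table. It assumes Postgres with the pgvector extension and the psycopg driver; names like store_segments are illustrative:
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def store_segments(conn, video_id, segments):
    # Encode all segment texts in one batch, then insert each with its vector.
    vectors = model.encode([s["text"] for s in segments])
    with conn.cursor() as cur:
        for seg, vec in zip(segments, vectors):
            cur.execute(
                "INSERT INTO segments (id, video_id, start_time, end_time, text, embedding) "
                "VALUES (gen_random_uuid(), %s, %s, %s, %s, %s::vector)",
                (video_id, seg["start"], seg["end"], seg["text"],
                 "[" + ",".join(str(x) for x in vec) + "]"),  # pgvector literal
            )
    conn.commit()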
Stage 5: Retrieve
Semantic Search
-- Find segments nearest to the query embedding (<-> is pgvector's distance operator)
SELECT video_id, text, start_time
FROM segments
ORDER BY embedding <-> query_embedding
LIMIT 10;
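In practice the query is embedded with the same model and passed as a parameter; a sketch reusing the model and psycopg connection from the storage sketch above:
def semantic_search(conn, query, limit=10):
    # Embed the query with the same model used for the segments.
    qvec = "[" + ",".join(str(x) for x in model.encode(query)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT video_id, text, start_time FROM segments "
            "ORDER BY embedding <-> %s::vector LIMIT %s",
            (qvec, limit),
        )
        return cur.fetchall()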
Full-Text Search
-- Keyword search
SELECT * FROM videos
WHERE to_tsvector(transcript) @@ to_tsquery('kubernetes & deployment');
Combined Interface
- Natural language query → semantic search
- Keyword query → full-text search
- Filter by channel, date, topic
Performance
| Metric | Value |
|---|---|
| Videos processed | 9,996 |
| Total transcripts | 15,955 files |
| Channels tracked | 91 |
| Search latency | <500ms |
| API cost | $0 (local ML) |
Replication
Minimum Setup
- Install yt-dlp: pip install yt-dlp
- Install Whisper: pip install openai-whisper
- Download + transcribe in one script (see the sketch below)
- Store in SQLite or JSON files
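A sketch of the one-script minimum (yt-dlp + Whisper, results dumped to a JSON file; filenames are illustrative):
import json
import subprocess
import whisper

def capture_and_transcribe(url, stem="audio"):
    # Stage 1: download audio only with yt-dlp.
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", f"{stem}.%(ext)s", url],
        check=True,
    )
    # Stage 2: transcribe the extracted mp3 locally with Whisper.
    result = whisper.load_model("base").transcribe(f"{stem}.mp3")
    # Minimum storage: transcript + segments as JSON next to the audio.
    with open(f"{stem}.json", "w") as f:
        json.dump({"url": url, "text": result["text"], "segments": result["segments"]}, f)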
Full Setup
- Supabase for database + vector search
- Batch processing with queue
- Web interface for search
- Automatic channel monitoring
Contribute improvements to the pipeline or share your own architecture.