
YouTube RAG System - Extract YouTube video transcripts, convert to OpenAI embeddings, and query content using semantic search + GPT. Supports individual videos and playlists. Built with SQLite, cosine similarity search, and contextual AI responses with timestamp references.

IramML/youtube-knowledge-base


YouTube RAG System

RAG (Retrieval-Augmented Generation) system that extracts YouTube video transcripts, converts them to embeddings, and enables intelligent queries about the content using LLMs.

Features

  • Automatic ingestion: Extracts transcripts from individual videos or complete playlists
  • OpenAI embeddings: Uses text-embedding-3-small to embed each transcript chunk for semantic search
  • Local database: SQLite for efficient storage and portability
  • Smart search: Finds relevant content using cosine similarity
  • Contextual responses: Generates answers with GPT grounded in the retrieved chunks
  • Temporal references: Includes timestamps and direct links to specific moments
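
As an illustration of the temporal references feature, the following minimal sketch builds a link that opens a video at a given offset. The function names are hypothetical and not taken from this repository; the only assumption about YouTube is that the `t` query parameter accepts a whole number of seconds.

```python
def timestamp_link(video_id: str, start_seconds: float) -> str:
    """Return a URL that opens the video at the given offset."""
    # The `t` parameter takes whole seconds, so fractions are dropped.
    return f"https://youtu.be/{video_id}?t={int(start_seconds)}"


def format_timestamp(start_seconds: float) -> str:
    """Format an offset as M:SS or H:MM:SS for display next to the link."""
    total = int(start_seconds)
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"
```

A retrieved chunk that starts 95.4 seconds into a video could then be cited as `format_timestamp(95.4)` alongside `timestamp_link(video_id, 95.4)`.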

Installation

git clone <repository-url>
cd youtube-knowledge-base
pip install -r requirements.txt

Configuration

  1. Create your .env file:
cp .env.example .env
  2. Edit .env with your API keys:
YOUTUBE_API_KEY=your_youtube_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

Getting API Keys

YouTube API Key:

  1. Go to Google Cloud Console
  2. Create a project or select an existing one
  3. Enable YouTube Data API v3
  4. Create credentials (API Key)

OpenAI API Key:

  1. Go to OpenAI API
  2. Create a new API key

Usage

1. Feed the database

Individual video:

python ingestion.py "https://youtu.be/aircAruvnKk"

Complete playlist:

python ingestion.py "https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi"

Options:

  • --db-path: Database path (default: youtube_embeddings.db)
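
Since ingestion accepts either a video URL or a playlist URL, the two have to be told apart somewhere. The sketch below shows one way to do that with the standard library. It is not the logic in ingestion.py, which is not shown here; the function name `classify_youtube_url` and the set of accepted hosts are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Hosts treated as YouTube in this sketch (an assumption, not exhaustive).
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com"}


def classify_youtube_url(url: str) -> tuple[str, str]:
    """Return ("playlist", id) or ("video", id); raise ValueError otherwise."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    query = parse_qs(parsed.query)
    # Playlist pages carry the playlist ID in the `list` parameter.
    if host in YOUTUBE_HOSTS and parsed.path == "/playlist" and "list" in query:
        return "playlist", query["list"][0]
    # Short links put the video ID in the path: https://youtu.be/<id>
    if host == "youtu.be" and parsed.path.strip("/"):
        return "video", parsed.path.strip("/")
    # Watch pages carry the video ID in the `v` parameter.
    if host in YOUTUBE_HOSTS and parsed.path == "/watch" and "v" in query:
        return "video", query["v"][0]
    raise ValueError(f"Unrecognized YouTube URL: {url}")
```

Raising on anything unrecognized keeps a mistyped URL from silently ingesting nothing.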

2. Query the system

Interactive mode:

python rag_query.py

Single query:

python rag_query.py --query "What is backpropagation in neural networks?"

Options:

  • --db-path: Database path
  • --top-k: Number of chunks to retrieve (default: 5)
  • --query: Query in non-interactive mode

Example queries

What is a neural network?
How does gradient descent work?
Explain the chain rule in calculus
What are the main components of a transformer?
What's the difference between supervised and unsupervised learning?

Recommended test videos

Machine Learning:

  • 3Blue1Brown Neural Networks: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
  • StatQuest: https://youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF

How it works

  1. Extraction: Gets automatic/manual transcripts from YouTube
  2. Chunking: Splits transcripts into ~500 character chunks
  3. Embeddings: Converts each chunk to a vector using OpenAI embeddings (text-embedding-3-small)
  4. Storage: Saves in SQLite with metadata (timestamps, titles)
  5. Search: Finds similar chunks using cosine similarity
  6. Generation: Uses GPT to generate answers based on relevant context
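
The chunking, storage, and search steps above can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: the table name `chunks`, its columns, and all function names are hypothetical, and the embeddings are assumed to arrive as plain lists of floats from whatever embedding call the project makes.

```python
import math
import sqlite3
from array import array

CHUNK_SIZE = 500  # target characters per chunk, as described above


def chunk_segments(segments, chunk_size=CHUNK_SIZE):
    """Group (start_seconds, text) segments into chunks of about chunk_size chars.

    Each chunk keeps the start time of its first segment.
    """
    chunks, buffer, start = [], [], None
    for seg_start, text in segments:
        if start is None:
            start = seg_start
        buffer.append(text)
        # Joined length = characters plus one space between segments.
        if sum(len(t) for t in buffer) + len(buffer) - 1 >= chunk_size:
            chunks.append((start, " ".join(buffer)))
            buffer, start = [], None
    if buffer:
        chunks.append((start, " ".join(buffer)))
    return chunks


def create_schema(conn):
    # Hypothetical schema: one row per chunk, embedding stored as a float32 blob.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "video_id TEXT, start_seconds REAL, text TEXT, embedding BLOB)"
    )


def store_chunk(conn, video_id, start, text, embedding):
    conn.execute(
        "INSERT INTO chunks (video_id, start_seconds, text, embedding) "
        "VALUES (?, ?, ?, ?)",
        (video_id, start, text, array("f", embedding).tobytes()),
    )


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(conn, query_embedding, top_k=5):
    """Score every stored chunk against the query and return the best top_k."""
    scored = []
    rows = conn.execute(
        "SELECT video_id, start_seconds, text, embedding FROM chunks"
    )
    for video_id, start, text, blob in rows:
        vec = array("f")
        vec.frombytes(blob)
        scored.append((cosine_similarity(query_embedding, vec), video_id, start, text))
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]
```

A full scan like this is linear in the number of chunks, which is usually fine for a handful of playlists. The text of the top results would then be passed to the model as context for the generation step.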

Troubleshooting

Error: No transcript available

  • The video has no automatic or manual captions available
  • Try another video with captions

Error: API key not found

  • Verify .env file exists
  • Confirm API keys are correctly configured
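
A quick way to confirm which key is missing is to check the process environment directly. The sketch below is a standalone helper, not part of this repository, and the name `missing_api_keys` is hypothetical. It only inspects environment variables, so it assumes the values from .env have already been exported or loaded into the environment.

```python
import os

# Key names as listed in the Configuration section.
REQUIRED_KEYS = ("YOUTUBE_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(environ=os.environ):
    """Return the names of required keys that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```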

Error: Quota exceeded

  • YouTube API: 10,000 quota units/day by default (free); each request costs one or more units depending on the method
  • OpenAI API: Check limits in your account
