A multi-source Retrieval-Augmented Generation pipeline that ingests AI/ML knowledge from Instagram video transcripts, ArXiv research papers, and GitHub repositories, then organizes it into a searchable knowledge system with concept-level understanding. The pipeline spans seven layers: collection, processing, storage, knowledge extraction, vector embedding, hybrid retrieval, and LLM-powered answer generation.
The architecture treats each content source as a first-class data stream with its own ingestion, processing, and normalization path, converging into a unified SQLite store with full-text search, vector embeddings, and a concept knowledge graph. Retrieval combines cosine-similarity vector search with FTS5 keyword matching through an adaptive weighting system that classifies queries and adjusts strategy in real time.
Tech stack: Languages · AI / LLM · ML / Embeddings · Data & Storage · Web & API
Every content source -- Instagram video transcripts, ArXiv research papers, GitHub repositories -- enters through its own specialized ingestion path but converges into a single unified schema (ai_content) with normalized metadata. Downstream systems (embedding, search, knowledge extraction) operate on this unified representation without knowledge of the original source.
Goal: Add a new content source by implementing one collector and one normalizer, with zero changes to retrieval, embedding, or knowledge graph logic.
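In sketch form, a new source plugs in by emitting the unified record. The class and field names below are illustrative assumptions, not the actual `ai_content` column names:

```python
from dataclasses import dataclass, field

@dataclass
class UnifiedContent:
    """Hypothetical shape of the unified record every normalizer emits."""
    source: str          # e.g. "instagram", "arxiv", "github", or a new source
    source_id: str       # stable identifier within that source
    title: str
    body: str            # normalized plain text used for FTS and embedding
    metadata: dict = field(default_factory=dict)

def normalize_arxiv(paper: dict) -> UnifiedContent:
    """Example normalizer: maps one raw ArXiv record into the unified schema.
    The input keys are assumptions about the collector's output."""
    return UnifiedContent(
        source="arxiv",
        source_id=paper["id"],
        title=paper["title"],
        body=paper["extracted_text"],
        metadata={"authors": paper.get("authors", [])},
    )
```

Because everything downstream consumes only `UnifiedContent`, retrieval, embedding, and graph code never need to branch on the source.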
Pure vector search misses exact-match terminology. Pure keyword search misses semantic similarity. The retrieval layer combines both through an adaptive weighting system that classifies each query (code, factual, conceptual) and adjusts the vector-to-keyword balance in real time. A feedback loop learns optimal weights from user search interactions.
Goal: Every query type -- exact code snippets, broad conceptual questions, specific factual lookups -- returns relevant results without manual tuning.
Raw documents are not just stored and embedded -- they are distilled into a structured concept graph using LLM-powered extraction. Concepts, their categories (algorithm, model, technique, framework), and weighted relationships form a NetworkX graph that supports centrality analysis, community detection, and concept-aware retrieval.
Goal: Answer questions about relationships between ideas, not just questions about individual documents.
The system includes a full evaluation framework computing precision, recall, F1, NDCG, and MRR across search strategies. Test queries are generated programmatically from knowledge graph concepts, ensuring evaluation coverage tracks the actual knowledge base.
Goal: Every retrieval change is measured against a reproducible benchmark, not validated by subjective impression.
Summarization uses the Claude Message Batches API (up to 100 items per batch, asynchronous polling), achieving approximately 50% cost reduction compared to sequential API calls. OCR uses Mistral AI for PDF text extraction with automatic chunking for large documents and PyPDF2 fallback for robustness.
Goal: Process large document collections at scale without proportional cost scaling.
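The batching step can be sketched as a simple grouper that respects the 100-request-per-batch limit; the request shape in the comment is an assumption about how items map onto batch requests, not the project's actual payload:

```python
from itertools import islice

BATCH_LIMIT = 100  # Message Batches API accepts up to 100 requests per batch

def batched(items, size=BATCH_LIMIT):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

# Each item would then become one batch request, roughly of the form
# (shape is an assumption): {"custom_id": doc_id, "params": {...}}
```

A collection of 250 documents would thus be submitted as three batches (100, 100, 50), each polled asynchronously until complete.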
The system operates as a seven-layer pipeline. Data flows from collection through processing, storage, knowledge extraction, and embedding before reaching the retrieval and generation layers.
┌───────────────────────────────────────────────────────────────────────────┐
│                              Data Collection                              │
│     Instagram (instaloader) · ArXiv (arxiv API) · GitHub (REST API)      │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────────────┐
│                              Data Processing                              │
│       Whisper Transcription · Mistral OCR · Claude Batch Summaries        │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────────────┐
│                           SQLite Unified Store                            │
│         ai_content · source-specific tables · FTS5 virtual tables         │
└───────────┬────────────────────┬────────────────────┬─────────────────────┘
            │                    │                    │
┌───────────▼──────┐  ┌──────────▼─────────┐  ┌──────▼────────────────┐
│  Knowledge       │  │  Vector            │  │  Hybrid Retrieval     │
│  Extraction      │  │  Embedding         │  │  Vector + FTS5 fusion │
│  Concepts +      │  │  768-dim, overlap  │  │  Adaptive weighting   │
│  Graph           │  │  chunking          │  │  Feedback learning    │
└───────────┬──────┘  └──────────┬─────────┘  └──────┬────────────────┘
            │                    │                    │
┌───────────▼────────────────────▼────────────────────▼─────────────────────┐
│                          LLM Response Generation                          │
│        Context selection · Source citations · Streaming responses         │
└───────────────────────────────────────────────────────────────────────────┘
The scraper uses instaloader with proxy rotation, account credential cycling, and rate-limit detection with configurable cooldown periods. Account state is tracked persistently in JSON files. Audio is extracted and transcribed locally using OpenAI Whisper, producing timestamped transcript segments.
Papers are collected via the arxiv API with configurable search queries and date ranges. PDF text is extracted using the Mistral AI OCR API with automatic chunking for large documents. A PyPDF2 fallback ensures extraction succeeds when the OCR API is unavailable. Papers enter a download-only mode for batch collection, followed by a separate processing phase.
Public repositories are collected via the GitHub REST API. Repository metadata, README content, file structure, and primary language information are normalized into the unified content schema.
Guarantee: Each source operates independently -- an Instagram rate-limit event does not block ArXiv paper processing or GitHub collection.
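The isolation guarantee reduces to a fault-boundary loop around the collectors; this is a minimal sketch, with the collector names purely illustrative:

```python
import logging

def run_collectors(collectors):
    """Run each source collector independently. A failure in one source
    (e.g. an Instagram rate limit) is logged and skipped rather than
    aborting collection for the remaining sources."""
    results = {}
    for name, collect in collectors.items():
        try:
            results[name] = collect()
        except Exception as exc:  # rate limits, network errors, API outages
            logging.warning("collector %s failed: %s", name, exc)
            results[name] = None
    return results
```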
The concept extraction pipeline uses Claude to identify concepts from processed content, classify them by category (algorithm, model, technique, framework, concept, tool, dataset, metric), and extract weighted relationships with confidence scores. The resulting graph is built and analyzed using NetworkX.
Graph capabilities:
- PageRank centrality analysis for identifying foundational concepts
- Community detection for discovering concept clusters
- Subgraph extraction around specific topics
- Interactive Plotly visualization and static Matplotlib rendering
- GEXF and JSON export for external analysis tools
Guarantee: The knowledge graph is a queryable, structured representation of the knowledge base -- not a visualization artifact.
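A minimal sketch of the graph-construction and centrality step, assuming the extraction stage yields `(concept_a, concept_b, weight)` triples; the concept names are illustrative, and `nx.pagerank` stands in for the fuller analysis described above:

```python
import networkx as nx

def build_concept_graph(triples):
    """Build a weighted concept graph from relationship triples."""
    g = nx.Graph()
    for a, b, w in triples:
        g.add_edge(a, b, weight=w)
    return g

g = build_concept_graph([
    ("attention mechanism", "transformer architecture", 0.9),
    ("transformer architecture", "BERT", 0.8),
    ("BERT", "fine-tuning", 0.6),
])

# Foundational concepts accumulate the most weighted connectivity.
central = nx.pagerank(g, weight="weight")
```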
Text is chunked with configurable size (default 1000 characters) and overlap (200 characters), using intelligent boundary detection that respects paragraph breaks, newlines, sentence endings, and word boundaries. Embeddings are generated using multi-qa-mpnet-base-dot-v1 (768 dimensions) from sentence-transformers, with a TF-IDF hash-based fallback when the model is unavailable.
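A simplified sketch of that chunking strategy, using the same defaults (1000-character chunks, 200-character overlap) and a cascading preference for paragraph, newline, sentence, and word boundaries; the real implementation in chunking.py may differ in detail:

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping chunks, preferring to break at the
    strongest boundary found in the latter half of each chunk window."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            window = text[start:end]
            # Boundary markers tried from strongest to weakest.
            for marker in ("\n\n", "\n", ". ", " "):
                cut = window.rfind(marker)
                if cut > size // 2:  # only accept a reasonably late break
                    end = start + cut + len(marker)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```

Each chunk repeats the last 200 characters of its predecessor, so an embedding never loses the context that straddles a chunk boundary.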
The hybrid search layer classifies each query and applies dynamic weights:
| Query Type | Vector Weight | Keyword Weight | Trigger |
|---|---|---|---|
| Code queries | 0.50 | 0.50 | Code-like tokens detected |
| Factual queries | 0.60 | 0.40 | Specific entity or fact pattern |
| Conceptual queries | 0.80 | 0.20 | Abstract or relationship question |
| Short queries (1--2 words) | -0.10 adjustment | +0.10 adjustment | Token count <= 2 |
| Exact-match (quoted) | -0.20 adjustment | +0.20 adjustment | Quoted phrase detected |
Weights are further refined by a feedback learning loop (search_query_log, search_feedback, weight_patterns tables) that tracks which weight configurations produce the best user-rated results.
Guarantee: Search quality improves over time without manual retuning, driven by observed user interactions.
Problem: A fixed vector-to-keyword weight ratio works well for some query types and poorly for others. Code queries need strong keyword matching; conceptual queries need strong semantic matching. Manual tuning does not scale.
Solution: The hybrid search system classifies each incoming query, applies a base weight configuration for the detected query type, then adjusts further based on query-specific signals (length, quoted phrases, code tokens). A feedback loop records user interactions and learns which weight patterns produce the best results for observed query distributions, progressively refining the default weights.
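The classify-then-adjust flow can be sketched as follows; the base weights and adjustments come from the table above, while the classifier heuristics here are deliberately simplified stand-ins for the real query classification:

```python
import re

BASE_WEIGHTS = {            # (vector, keyword) pairs from the table above
    "code":       (0.50, 0.50),
    "factual":    (0.60, 0.40),
    "conceptual": (0.80, 0.20),
}

def classify_query(query: str) -> str:
    """Toy classifier: real signal detection is richer than these regexes."""
    if re.search(r"[_(){};]|def |import |\.py\b", query):
        return "code"
    if re.search(r"\b(what is|how does|why|relationship|difference)\b", query, re.I):
        return "conceptual"
    return "factual"

def search_weights(query: str):
    vec, kw = BASE_WEIGHTS[classify_query(query)]
    if len(query.split()) <= 2:   # short queries lean on keyword matching
        vec, kw = vec - 0.10, kw + 0.10
    if '"' in query:              # quoted phrase signals exact-match intent
        vec, kw = vec - 0.20, kw + 0.20
    return round(vec, 2), round(kw, 2)
```

The feedback loop then nudges these base values over time as logged interactions reveal which configurations users actually reward.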
Problem: Video transcripts and research papers contain latent concept relationships invisible to keyword and vector search. "Attention mechanism" and "transformer architecture" are deeply related, but a document about one may never mention the other by name.
Solution: The concept extraction pipeline uses Claude to identify concepts, classify them into a controlled taxonomy, and extract explicit relationships with confidence scores and relationship types. The resulting NetworkX graph makes latent relationships queryable -- enabling graph-based retrieval that surfaces documents connected through concept chains, not just direct textual similarity.
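The concept-chain traversal amounts to a bounded breadth-first walk over the graph; this sketch uses a plain adjacency dict and illustrative concept names rather than the project's NetworkX structures:

```python
from collections import deque

def concepts_within(graph, start, max_hops=2):
    """Collect every concept within `max_hops` relationship edges of
    `start` -- the chains that let a query about one concept surface
    documents that only mention a related one."""
    seen, frontier = {start: 0}, deque([start])
    while frontier:
        node = frontier.popleft()
        if seen[node] == max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                frontier.append(neighbor)
    return seen  # concept -> hop distance
```

Documents tagged with any concept in the returned set become retrieval candidates, ranked with a distance penalty.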
| Domain | Responsibility | Key Modules |
|---|---|---|
| Ingestion | Source-specific collection, rate-limit handling, credential management | downloader.py, arxiv_collector.py, github_collector.py |
| Processing | Transcription, OCR, summarization, text normalization | transcriber.py, mistral_ocr.py, summarizer.py |
| Storage | Schema management, migrations, unified content table, FTS indexes | create_db.sql, db_migration.py, init_db.py |
| Knowledge | Concept extraction, graph construction, centrality analysis | concept_extractor.py, knowledge_graph.py |
| Embedding | Text chunking, vector generation, batch processing | chunking.py, embeddings.py, generate_embeddings.py |
| Retrieval | Vector search, keyword search, hybrid fusion, adaptive weighting | vector_search.py, hybrid_search.py, context_builder.py |
| Generation | LLM context assembly, response generation, source citation | llm_integration.py, context_builder.py |
| Evaluation | Retrieval metrics, answer quality, test generation | evaluation/*.py |
| Web | Flask interface, REST API, Swagger documentation | app.py, api/*.py |
| Document | Focus Area |
|---|---|
| RAG Pipeline | End-to-end RAG usage, CLI commands, query API |
| Knowledge Graph | Concept extraction, graph analysis, visualization |
| Vector and Hybrid Search | Embedding generation, search strategies, adaptive weighting |
| ArXiv Collector | Paper collection, OCR pipeline, batch processing |
| Application Guide | Installation, configuration, CLI usage, web interface |
| Pattern | Implementation |
|---|---|
| Source-Agnostic Schema | Unified ai_content table with source-specific metadata in dedicated tables; downstream consumers are source-blind |
| Adaptive Weighting | Query classification, base weights, signal adjustments, feedback-refined weights via weight_patterns |
| Concept Knowledge Graph | LLM extraction into typed nodes and weighted edges, NetworkX analysis, queryable graph structure |
| Batch LLM Processing | Claude Message Batches API with async polling, UUID tracking, 50% cost reduction over sequential calls |
| Graceful Degradation | Mistral OCR with PyPDF2 fallback; sentence-transformers with TF-IDF hash fallback; partial progress preservation |
| Evaluation-Driven Development | Programmatic test query generation from knowledge graph; precision, recall, NDCG, MRR benchmarks |
The evaluation suite generates test queries programmatically from knowledge graph concepts, ensuring coverage evolves with the knowledge base. Metrics are computed across search strategies:
| Metric | Purpose |
|---|---|
| Precision@k | Fraction of retrieved results that are relevant |
| Recall@k | Fraction of relevant results that are retrieved |
| F1@k | Harmonic mean of precision and recall |
| NDCG | Normalized discounted cumulative gain -- measures ranking quality |
| MRR | Mean reciprocal rank -- measures position of first relevant result |
Results are viewable through an interactive evaluation dashboard.
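For reference, the metrics in the table reduce to a few lines each; this sketch assumes binary relevance judgments, which the project's evaluator may extend with graded relevance:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(retrieved, relevant, k):
    """Binary-relevance NDCG@k: DCG of this ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```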
src/
├── run.py --- CLI entry point
├── app.py --- Flask web interface
├── downloader.py --- Instagram scraper, proxy rotation, rate limiting
├── transcriber.py --- Whisper audio transcription
├── summarizer.py --- Claude batch summarization
├── arxiv_collector.py --- ArXiv paper collection + Mistral OCR
├── github_collector.py --- GitHub repository collection
├── mistral_ocr.py --- Mistral AI OCR wrapper
│
├── embeddings.py --- Sentence-transformers embedding generation
├── generate_embeddings.py --- Batch embedding orchestration
├── vector_search.py --- Pure vector similarity search
├── hybrid_search.py --- Hybrid vector + keyword search
├── chunking.py --- Text chunking with overlap
├── context_builder.py --- RAG context selection and formatting
├── llm_integration.py --- Claude response generation
│
├── concept_extractor.py --- LLM-powered concept extraction
├── knowledge_graph.py --- Graph construction, analysis, visualization
├── concept_schema.sql --- Knowledge graph schema
│
├── create_db.sql --- Database schema
├── db_migration.py --- Schema migrations
├── init_db.py --- Database initialization
│
├── api/
│   ├── api.py --- REST API endpoints
│   ├── api_knowledge.py --- Knowledge graph API
│   └── swagger.py --- OpenAPI specification
│
├── evaluation/
│   ├── retrieval_metrics.py --- Precision, recall, NDCG, MRR
│   ├── answer_evaluator.py --- Answer quality evaluation
│   ├── test_queries.py --- Programmatic test generation
│   ├── test_runner.py --- Evaluation orchestration
│   └── dashboard.py --- Interactive results dashboard
│
├── templates/ --- Flask HTML templates
├── data/
│   ├── audio/ --- Transcribed audio files
│   ├── transcripts/ --- JSON transcript output
│   ├── papers/ --- ArXiv paper text
│   ├── visualizations/ --- Knowledge graph renders
│   └── summaries_cache/ --- Cached Claude summaries
│
└── requirements.txt --- Python dependencies