adi2355/multi-source-rag-pipeline

Overview

A multi-source Retrieval-Augmented Generation pipeline that ingests AI/ML knowledge from Instagram video transcripts, ArXiv research papers, and GitHub repositories, then organizes it into a searchable knowledge system with concept-level understanding. The pipeline spans seven layers: collection, processing, storage, knowledge extraction, vector embedding, hybrid retrieval, and LLM-powered answer generation.

The architecture treats each content source as a first-class data stream with its own ingestion, processing, and normalization path, converging into a unified SQLite store with full-text search, vector embeddings, and a concept knowledge graph. Retrieval combines cosine-similarity vector search with FTS5 keyword matching through an adaptive weighting system that classifies queries and adjusts strategy in real time.


Technology Stack

| Category | Technologies |
| --- | --- |
| Languages | Python, SQL, JavaScript |
| AI / LLM | Claude, Mistral, Whisper |
| ML / Embeddings | Sentence Transformers, PyTorch, NumPy, scikit-learn |
| Data & Storage | SQLite, FTS5, NetworkX |
| Web & API | Flask, Swagger, Plotly |

Engineering Principles

1. Source-Agnostic Convergence

Every content source -- Instagram video transcripts, ArXiv research papers, GitHub repositories -- enters through its own specialized ingestion path but converges into a single unified schema (ai_content) with normalized metadata. Downstream systems (embedding, search, knowledge extraction) operate on this unified representation without knowledge of the original source.

Goal: Add a new content source by implementing one collector and one normalizer, with zero changes to retrieval, embedding, or knowledge graph logic.
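The convergence contract can be sketched as a pair of normalizers emitting the same shape. Only the `ai_content` table name comes from this document; the field and function names below are illustrative, not the project's actual columns.

```python
# Sketch: source-specific payloads converge into one unified record shape.
# Downstream systems (embedding, search, extraction) only ever see this shape.

def normalize_github(repo: dict) -> dict:
    """Map a GitHub API repository payload onto a unified ai_content-style record."""
    return {
        "source_type": "github",
        "title": repo["full_name"],
        "content": repo.get("readme_text", ""),
        "url": repo["html_url"],
        "metadata": {"language": repo.get("language"),
                     "stars": repo.get("stargazers_count")},
    }

def normalize_arxiv(paper: dict) -> dict:
    """Map an ArXiv paper record onto the same unified shape."""
    return {
        "source_type": "arxiv",
        "title": paper["title"],
        "content": paper.get("extracted_text", ""),
        "url": paper["pdf_url"],
        "metadata": {"authors": paper.get("authors", []),
                     "published": paper.get("published")},
    }
```

Adding a new source then means writing one more `normalize_*` function plus a collector; retrieval and embedding code never changes.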

2. Hybrid Retrieval Over Single-Strategy Search

Pure vector search misses exact-match terminology. Pure keyword search misses semantic similarity. The retrieval layer combines both through an adaptive weighting system that classifies each query (code, factual, conceptual) and adjusts the vector-to-keyword balance in real time. A feedback loop learns optimal weights from user search interactions.

Goal: Every query type -- exact code snippets, broad conceptual questions, specific factual lookups -- returns relevant results without manual tuning.

3. Knowledge-First Architecture

Raw documents are not just stored and embedded -- they are distilled into a structured concept graph using LLM-powered extraction. Concepts, their categories (algorithm, model, technique, framework), and weighted relationships form a NetworkX graph that supports centrality analysis, community detection, and concept-aware retrieval.

Goal: Answer questions about relationships between ideas, not just questions about individual documents.

4. Measurable Retrieval Quality

The system includes a full evaluation framework computing precision, recall, F1, NDCG, and MRR across search strategies. Test queries are generated programmatically from knowledge graph concepts, ensuring evaluation coverage tracks the actual knowledge base.

Goal: Every retrieval change is measured against a reproducible benchmark, not validated by subjective impression.

5. Cost-Efficient LLM Integration

Summarization uses the Claude Message Batches API (up to 100 items per batch, asynchronous polling), achieving approximately 50% cost reduction compared to sequential API calls. OCR uses Mistral AI for PDF text extraction with automatic chunking for large documents and PyPDF2 fallback for robustness.

Goal: Process large document collections at scale without proportional cost scaling.
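The batching flow can be sketched as follows. The `custom_id`/`params` request shape follows the Anthropic Message Batches API, but the helper names, model string, and prompt are assumptions, and the submission call is shown without being invoked.

```python
# Sketch: group documents into Message Batches requests (max 100 per batch,
# per the batching limit described above). Helper names are illustrative.

def build_batch_requests(docs: list[dict], model: str = "claude-3-5-haiku-latest") -> list[dict]:
    """One batch request per document, keyed by a stable custom_id."""
    return [
        {
            "custom_id": f"doc-{doc['id']}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user",
                              "content": f"Summarize this document:\n\n{doc['text']}"}],
            },
        }
        for doc in docs
    ]

def chunk_batches(requests: list[dict], size: int = 100) -> list[list[dict]]:
    """Split requests into API-sized batches of at most `size` items."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

def submit_batches(client, docs):  # client: an anthropic.Anthropic() instance
    for batch in chunk_batches(build_batch_requests(docs)):
        client.messages.batches.create(requests=batch)  # then poll the batch id
```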


Pipeline Architecture

The system operates as a seven-layer pipeline. Data flows from collection through processing, storage, knowledge extraction, and embedding before reaching the retrieval and generation layers.

┌─────────────────────────────────────────────────────────────────────────┐
│                           Data Collection                               │
│   Instagram (instaloader)  ·  ArXiv (arxiv API)  ·  GitHub (REST API)   │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
┌────────────────────────────────────▼────────────────────────────────────┐
│                          Data Processing                                │
│   Whisper Transcription  ·  Mistral OCR  ·  Claude Batch Summaries      │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
┌────────────────────────────────────▼────────────────────────────────────┐
│                        SQLite Unified Store                             │
│   ai_content  ·  source-specific tables  ·  FTS5 virtual tables         │
└───────────┬────────────────────┬────────────────────┬───────────────────┘
            │                    │                    │
┌───────────▼──────┐  ┌──────────▼──────────┐  ┌──────▼────────────────┐
│  Knowledge       │  │  Vector             │  │  Hybrid Retrieval     │
│  Extraction      │  │  Embedding          │  │  Vector + FTS5 fusion │
│  Concepts +      │  │  768-dim, overlap   │  │  Adaptive weighting   │
│  Graph           │  │  chunking           │  │  Feedback learning    │
└───────────┬──────┘  └──────────┬──────────┘  └──────┬────────────────┘
            │                    │                    │
┌───────────▼────────────────────▼────────────────────▼───────────────────┐
│                      LLM Response Generation                            │
│   Context selection  ·  Source citations  ·  Streaming responses        │
└─────────────────────────────────────────────────────────────────────────┘

Multi-Source Ingestion

Platform Video Pipeline

The scraper uses instaloader with proxy rotation, account credential cycling, and rate-limit detection with configurable cooldown periods. Account state is tracked persistently in JSON files. Audio is extracted and transcribed locally using OpenAI Whisper, producing timestamped transcript segments.
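The transcription step can be sketched as a pure formatter over Whisper's output. The segment keys (`start`, `end`, `text`) match what Whisper's `transcribe()` returns; the function names and the `"base"` model choice are illustrative.

```python
# Sketch: flatten Whisper's timestamped segments into a readable transcript.

def format_segments(segments: list[dict]) -> str:
    """Render [MM:SS-MM:SS] stamps in front of each segment's text."""
    lines = []
    for seg in segments:
        start, end = int(seg["start"]), int(seg["end"])
        stamp = f"[{start // 60:02d}:{start % 60:02d}-{end // 60:02d}:{end % 60:02d}]"
        lines.append(f"{stamp} {seg['text'].strip()}")
    return "\n".join(lines)

def transcribe(audio_path: str) -> str:
    import whisper  # deferred so the formatter above stays dependency-free
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return format_segments(result["segments"])
```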

ArXiv Research Papers

Papers are collected via the arxiv API with configurable search queries and date ranges. PDF text is extracted using the Mistral AI OCR API with automatic chunking for large documents. A PyPDF2 fallback ensures extraction succeeds when the OCR API is unavailable. Papers enter a download-only mode for batch collection, followed by a separate processing phase.
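The fallback behavior can be sketched generically: try the primary extractor, fall back on any failure, and report which path was taken. The extractor callables here are stand-ins for the project's Mistral OCR and PyPDF2 functions.

```python
# Sketch of the OCR fallback pattern described above. Extractors are passed
# in as callables so the fallback logic stays independent of either library.

def extract_text(pdf_path: str, primary, fallback) -> tuple[str, str]:
    """Return (text, extractor_name); only raises if both extractors fail."""
    try:
        return primary(pdf_path), "mistral_ocr"
    except Exception:
        return fallback(pdf_path), "pypdf2"
```

Recording the extractor name alongside the text makes degraded extractions traceable later in the pipeline.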

GitHub Repositories

Public repositories are collected via the GitHub REST API. Repository metadata, README content, file structure, and primary language information are normalized into the unified content schema.

Guarantee: Each source operates independently -- an Instagram rate-limit event does not block ArXiv paper processing or GitHub collection.


Knowledge Graph Engine

The concept extraction pipeline uses Claude to identify concepts from processed content, classify them by category (algorithm, model, technique, framework, concept, tool, dataset, metric), and extract weighted relationships with confidence scores. The resulting graph is built and analyzed using NetworkX.

Graph capabilities:

  • PageRank centrality analysis for identifying foundational concepts
  • Community detection for discovering concept clusters
  • Subgraph extraction around specific topics
  • Interactive Plotly visualization and static Matplotlib rendering
  • GEXF and JSON export for external analysis tools

Guarantee: The knowledge graph is a queryable, structured representation of the knowledge base -- not a visualization artifact.
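The graph capabilities above can be sketched on a toy concept graph; the nodes, edges, and weights here are invented for illustration.

```python
# Sketch: PageRank centrality and community detection over a concept graph.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
G.add_edge("transformer", "attention mechanism", weight=0.9)
G.add_edge("transformer", "BERT", weight=0.8)
G.add_edge("attention mechanism", "self-attention", weight=0.7)
G.add_edge("gradient descent", "backpropagation", weight=0.9)

# Foundational concepts score highest under weighted PageRank.
ranks = nx.pagerank(G, weight="weight")

# Modularity-based communities surface concept clusters.
clusters = community.greedy_modularity_communities(G, weight="weight")
central = max(ranks, key=ranks.get)
```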


Hybrid Retrieval System

Embedding Layer

Text is chunked with configurable size (default 1000 characters) and overlap (200 characters), using intelligent boundary detection that respects paragraph breaks, newlines, sentence endings, and word boundaries. Embeddings are generated using multi-qa-mpnet-base-dot-v1 (768 dimensions) from sentence-transformers, with a TF-IDF hash-based fallback when the model is unavailable.
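A minimal sketch of that chunker, assuming the stated defaults (1000-character chunks, 200-character overlap); the exact boundary preferences in the real chunking module may differ.

```python
# Sketch: overlap chunking with boundary detection. Prefer to cut at a
# paragraph break, newline, sentence end, or space in the back half of
# each window rather than mid-word.

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            window = text[start:end]
            # Boundaries tried from most to least preferred.
            for sep in ("\n\n", "\n", ". ", " "):
                cut = window.rfind(sep)
                if cut > size // 2:
                    end = start + cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # next chunk re-reads the overlap region
    return chunks
```

Each chunk's first `overlap` characters repeat the previous chunk's tail, so an embedding never loses the context straddling a chunk boundary.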

Adaptive Search Weighting

The hybrid search layer classifies each query and applies dynamic weights:

| Query Type | Vector Weight | Keyword Weight | Trigger |
| --- | --- | --- | --- |
| Code queries | 0.50 | 0.50 | Code-like tokens detected |
| Factual queries | 0.60 | 0.40 | Specific entity or fact pattern |
| Conceptual queries | 0.80 | 0.20 | Abstract or relationship question |
| Short queries (1-2 words) | -0.10 adjustment | +0.10 adjustment | Token count <= 2 |
| Exact-match (quoted) | -0.20 adjustment | +0.20 adjustment | Quoted phrase detected |

Weights are further refined by a feedback learning loop (search_query_log, search_feedback, weight_patterns tables) that tracks which weight configurations produce the best user-rated results.

Guarantee: Search quality improves over time without manual retuning, driven by observed user interactions.
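The classification and adjustment rules in the table above can be sketched as follows; the trigger heuristics here are simplified stand-ins for the project's actual classifier.

```python
# Sketch: classify a query, apply base weights, then adjust on signals.
import re

BASE_WEIGHTS = {"code": (0.50, 0.50), "factual": (0.60, 0.40), "conceptual": (0.80, 0.20)}

def classify(query: str) -> str:
    """Crude stand-in triggers for code-like and factual queries."""
    if re.search(r"[_(){}\[\]=]|def |class |import ", query):
        return "code"
    if re.search(r"\b(what is|who|when|how many|version)\b", query.lower()):
        return "factual"
    return "conceptual"

def weights_for(query: str) -> tuple[float, float]:
    vec, kw = BASE_WEIGHTS[classify(query)]
    if len(query.split()) <= 2:   # short queries lean on keyword matching
        vec, kw = vec - 0.10, kw + 0.10
    if '"' in query:              # quoted phrase signals exact-match intent
        vec, kw = vec - 0.20, kw + 0.20
    return round(vec, 2), round(kw, 2)
```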


Hardest Problems Solved

1. Adaptive Retrieval Without Manual Tuning

Problem: A fixed vector-to-keyword weight ratio works well for some query types and poorly for others. Code queries need strong keyword matching; conceptual queries need strong semantic matching. Manual tuning does not scale.

Solution: The hybrid search system classifies each incoming query, applies a base weight configuration for the detected query type, then adjusts further based on query-specific signals (length, quoted phrases, code tokens). A feedback loop records user interactions and learns which weight patterns produce the best results for observed query distributions, progressively refining the default weights.

2. Structured Knowledge from Unstructured Text

Problem: Video transcripts and research papers contain latent concept relationships invisible to keyword and vector search. "Attention mechanism" and "transformer architecture" are deeply related, but a document about one may never mention the other by name.

Solution: The concept extraction pipeline uses Claude to identify concepts, classify them into a controlled taxonomy, and extract explicit relationships with confidence scores and relationship types. The resulting NetworkX graph makes latent relationships queryable -- enabling graph-based retrieval that surfaces documents connected through concept chains, not just direct textual similarity.


System Domains

| Domain | Responsibility | Key Modules |
| --- | --- | --- |
| Ingestion | Source-specific collection, rate-limit handling, credential management | downloader.py, arxiv_collector.py, github_collector.py |
| Processing | Transcription, OCR, summarization, text normalization | transcriber.py, mistral_ocr.py, summarizer.py |
| Storage | Schema management, migrations, unified content table, FTS indexes | create_db.sql, db_migration.py, init_db.py |
| Knowledge | Concept extraction, graph construction, centrality analysis | concept_extractor.py, knowledge_graph.py |
| Embedding | Text chunking, vector generation, batch processing | chunking.py, embeddings.py, generate_embeddings.py |
| Retrieval | Vector search, keyword search, hybrid fusion, adaptive weighting | vector_search.py, hybrid_search.py, context_builder.py |
| Generation | LLM context assembly, response generation, source citation | llm_integration.py, context_builder.py |
| Evaluation | Retrieval metrics, answer quality, test generation | evaluation/*.py |
| Web | Flask interface, REST API, Swagger documentation | app.py, api/*.py |

Deep Dive: Technical Documentation

| Document | Focus Area |
| --- | --- |
| RAG Pipeline | End-to-end RAG usage, CLI commands, query API |
| Knowledge Graph | Concept extraction, graph analysis, visualization |
| Vector and Hybrid Search | Embedding generation, search strategies, adaptive weighting |
| ArXiv Collector | Paper collection, OCR pipeline, batch processing |
| Application Guide | Installation, configuration, CLI usage, web interface |

Architectural Patterns

| Pattern | Implementation |
| --- | --- |
| Source-Agnostic Schema | Unified ai_content table with source-specific metadata in dedicated tables; downstream consumers are source-blind |
| Adaptive Weighting | Query classification, base weights, signal adjustments, feedback-refined weights via weight_patterns |
| Concept Knowledge Graph | LLM extraction into typed nodes and weighted edges, NetworkX analysis, queryable graph structure |
| Batch LLM Processing | Claude Message Batches API with async polling, UUID tracking, 50% cost reduction over sequential calls |
| Graceful Degradation | Mistral OCR with PyPDF2 fallback; sentence-transformers with TF-IDF hash fallback; partial progress preservation |
| Evaluation-Driven Development | Programmatic test query generation from knowledge graph; precision, recall, NDCG, MRR benchmarks |

Evaluation Framework

The evaluation suite generates test queries programmatically from knowledge graph concepts, ensuring coverage evolves with the knowledge base. Metrics are computed across search strategies:

| Metric | Purpose |
| --- | --- |
| Precision@k | Fraction of retrieved results that are relevant |
| Recall@k | Fraction of relevant results that are retrieved |
| F1@k | Harmonic mean of precision and recall |
| NDCG | Normalized discounted cumulative gain -- measures ranking quality |
| MRR | Mean reciprocal rank -- measures position of first relevant result |

Results are viewable through an interactive evaluation dashboard.
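Two of these metrics, NDCG and MRR, can be sketched from their standard definitions with binary relevance; this is a generic illustration, not the project's own metrics module.

```python
# Sketch: NDCG and MRR over binary relevance judgments.
import math

def mrr(ranked_lists: list[list[int]]) -> float:
    """Mean reciprocal rank; each inner list holds 0/1 relevance flags in rank order."""
    total = 0.0
    for rels in ranked_lists:
        total += next((1 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
    return total / len(ranked_lists)

def ndcg(rels: list[int], k: int) -> float:
    """NDCG@k: DCG of the observed ranking over DCG of the ideal ranking."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(rels, reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0
```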


Folder Structure


src/
├── run.py                          --- CLI entry point
├── app.py                          --- Flask web interface
├── downloader.py                   --- Instagram scraper, proxy rotation, rate limiting
├── transcriber.py                  --- Whisper audio transcription
├── summarizer.py                   --- Claude batch summarization
├── arxiv_collector.py              --- ArXiv paper collection + Mistral OCR
├── github_collector.py             --- GitHub repository collection
├── mistral_ocr.py                  --- Mistral AI OCR wrapper
│
├── embeddings.py                   --- Sentence-transformers embedding generation
├── generate_embeddings.py          --- Batch embedding orchestration
├── vector_search.py                --- Pure vector similarity search
├── hybrid_search.py                --- Hybrid vector + keyword search
├── chunking.py                     --- Text chunking with overlap
├── context_builder.py              --- RAG context selection and formatting
├── llm_integration.py              --- Claude response generation
│
├── concept_extractor.py            --- LLM-powered concept extraction
├── knowledge_graph.py              --- Graph construction, analysis, visualization
├── concept_schema.sql              --- Knowledge graph schema
│
├── create_db.sql                   --- Database schema
├── db_migration.py                 --- Schema migrations
├── init_db.py                      --- Database initialization
│
├── api/
│   ├── api.py                      --- REST API endpoints
│   ├── api_knowledge.py            --- Knowledge graph API
│   └── swagger.py                  --- OpenAPI specification
│
├── evaluation/
│   ├── retrieval_metrics.py        --- Precision, recall, NDCG, MRR
│   ├── answer_evaluator.py         --- Answer quality evaluation
│   ├── test_queries.py             --- Programmatic test generation
│   ├── test_runner.py              --- Evaluation orchestration
│   └── dashboard.py                --- Interactive results dashboard
│
├── templates/                      --- Flask HTML templates
├── data/
│   ├── audio/                      --- Transcribed audio files
│   ├── transcripts/                --- JSON transcript output
│   ├── papers/                     --- ArXiv paper text
│   ├── visualizations/             --- Knowledge graph renders
│   └── summaries_cache/            --- Cached Claude summaries
│
└── requirements.txt                --- Python dependencies
