hdviettt/mini-search-engine

VietSearch

A search engine built from scratch to understand how Google really works.

Live Demo · Blog Series

Screenshots: Canvas View (Explore) · Search View

This is a series where I learn SEO by building a mini search engine from scratch. It covers the core pipeline behind Google Search — Crawling, Indexing, Ranking — plus Neural Reranking, AI Overviews, and Sports OneBox.

As someone who works in SEO, I wanted to understand search at the engineering level. Not just what Google does, but how and why. It's no coincidence that the research problems search engines needed to solve — understanding language, ranking relevance across billions of documents — drove the breakthroughs that became modern AI. The transformer paper ("Attention Is All You Need") came out of Google. So did Word2Vec and BERT. Search is where it all started.

This project started in March 2026 and is still ongoing.

The Pipeline

This is a simplified version of what Google does every time you search for something. I built each piece.

                        BUILD (offline)                                    QUERY (online)
        ┌──────────────────────────────────────┐        ┌──────────────────────────────────────────────┐
        │                                      │        │                                              │
        │   Crawler ──→ Pages DB ──┬──→ Indexer │        │   Search Query                               │
        │   (BFS,       (1000+     │           │        │       │                                      │
        │   robots.txt,  pages)    ├──→ PageRank│        │       ├──→ Spell Check ──→ Tokenize          │
        │   rate limit)            │           │        │       │                      │               │
        │                          └──→ Chunker │        │       │              Index Lookup ──→ BM25   │
        │                               │      │        │       │                                │     │
        │                          Embedder     │        │       ├──→ PageRank Lookup             │     │
        │                               │      │        │       │        │                       │     │
        │                               ▼      │        │       │        ▼                       │     │
        │   ┌─────────┐ ┌──────────┐ ┌───────┐ │        │       │  Combine Scores ──→ Rerank (Top 5)   │
        │   │Inverted │ │PageRank  │ │Vector │ │        │       │                        │             │
        │   │ Index   │ │ Scores   │ │ Store │ │◄───────┤       │                    Results           │
        │   └─────────┘ └──────────┘ └───────┘ │        │       │                        │             │
        └──────────────────────────────────────┘        │       ├──→ Fan-out ──→ Hybrid Retrieval      │
                    ▲                                   │       │                    │             │   │
                    │             Databases are the     │       │               AI Overview        │   │
                    │                  bridge            │       │                    │             │   │
                    └───────────────────────────────────┤       └──→ Sports Detection (OneBox)    │   │
                                                        └──────────────────────────────────────────────┘
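As one illustration of the build side, the PageRank stage in the diagram can be sketched as a plain power iteration. This is a minimal sketch under stated assumptions, not the project's code: the function name, the dict-based link graph, and the page names are hypothetical. The damping factor (0.85) and iteration count (20) are the values listed in this README's stage table. Rank held by dangling pages (pages with no outgoing links) is spread evenly over all pages, which is one common way to handle them.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Power-iteration PageRank over a link graph.

    links maps each page to the list of pages it links to (hypothetical shape).
    Returns a dict of page -> score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Dangling pages have no outgoing links; share their rank with everyone.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Tiny hypothetical graph: two pages link to "c", and "c" links nowhere.
    graph = {"a": ["c"], "b": ["c"], "c": []}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Because the dangling mass is redistributed on every pass, the scores keep summing to 1, so they can be stored and compared across rebuilds.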

What each piece does

| Stage | What it does | How | Numbers |
| --- | --- | --- | --- |
| Crawler | Downloads web pages | BFS traversal, robots.txt compliance, 1.5s rate limiting, dead page tracking | ~1,000+ pages from Wikipedia, BBC Sport, ESPN, FBref, Transfermarkt |
| Indexer | Maps every word to the pages containing it | Tokenization (Porter stemmer) → stopword removal → inverted index via PostgreSQL COPY | 100K+ terms, 1M+ postings |
| PageRank | Scores page authority from link structure | Iterative algorithm (d=0.85, 20 iterations), handles dangling nodes | Scores for all live pages |
| Chunker + Embedder | Prepares pages for semantic search | Split into ~300-token chunks, embed with Voyage AI voyage-3-lite, store in pgvector | ~15,000+ chunks (512d vectors) |
| BM25 | Scores text relevance | BM25F with 4× title weight, term frequency × inverse document frequency × length normalization | k1=1.2, b=0.75 |
| Neural Reranker | Refines top results with a cross-encoder | ONNX inference with ms-marco-MiniLM-L-6-v2 (22M params), runs locally on CPU | Reranks top 5 candidates |
| Ranking | Combines signals | 80% BM25 + 20% PageRank, exponential freshness decay, 7-day recency bonus | Min-max normalized, tunable live in the UI |
| Spell correction | Fixes typos before searching | Levenshtein edit-distance ≤ 2, vocabulary from page titles + indexed stems | Proper nouns protected via terms table |
| AI Overview | Generates a summary with citations | Co-occurrence fan-out → hybrid retrieval (vector + keyword) → Groq streaming with retry | Llama 3.3 70B, cached 24h |
| AI Chat | Follow-up conversation with context | Multi-turn chat grounded in retrieved chunks, inline citations | Groq streaming |
| Sports OneBox | Live match cards above results | Keyword detection for teams/leagues → API-Football integration | Live scores, standings, fixtures |
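The Ranking row above can be made concrete with a small sketch of how two signals on different scales might be blended. It assumes that BM25 and PageRank scores are min-max normalized over the candidate set and then mixed 80/20, as the table describes. The function names and the sample scores are hypothetical, and the freshness decay and recency bonus are left out because their parameters are not given here.

```python
def min_max(scores):
    """Rescale a dict of scores to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie on this signal, so it contributes nothing.
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def combine(bm25, pagerank, bm25_weight=0.8, pagerank_weight=0.2):
    """Blend normalized BM25 and PageRank scores for the BM25 candidates.

    Returns (doc, score) pairs sorted from best to worst.
    """
    bm25_norm = min_max(bm25)
    # Only candidates retrieved by BM25 are ranked; a missing PageRank counts as 0.
    pr_norm = min_max({doc: pagerank.get(doc, 0.0) for doc in bm25})
    combined = {
        doc: bm25_weight * bm25_norm[doc] + pagerank_weight * pr_norm[doc]
        for doc in bm25
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for three candidate pages.
    bm25_scores = {"p1": 10.0, "p2": 5.0, "p3": 0.0}
    pagerank_scores = {"p1": 0.1, "p2": 0.3, "p3": 0.2}
    for doc, score in combine(bm25_scores, pagerank_scores):
        print(doc, round(score, 3))
```

Normalizing first matters because raw BM25 scores are unbounded while PageRank scores are small fractions. Without it, the 80/20 weights would not reflect the intended balance.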

The UI

The frontend is a React Flow canvas that visualizes the entire pipeline as an interactive node graph. Search a query and watch data flow through each stage in real time.

  • Left side: Build pipeline (crawler → indexer → stores)
  • Right side: Query pipeline (tokenize → lookup → rank → results)
  • Click any node to see real data — actual postings from the inverted index, PageRank scores, RAG chunks
  • Live WebSocket progress during crawl/index/embed jobs
  • Google-style results with score breakdowns, AI Overview with citations, and follow-up chat
  • DuckDuckGo-style hero with live dashboard charts on the landing page
  • Sports OneBox — live match cards, standings, and fixtures for sports queries

Tech Stack

| Layer | Tech |
| --- | --- |
| Frontend | Next.js 16, React 19, React Flow, Tailwind v4, TypeScript |
| Backend | FastAPI, Python 3.12+ |
| Database | PostgreSQL 16 + pgvector |
| Reranking | ONNX Runtime (ms-marco-MiniLM-L-6-v2, 22M params, CPU) |
| LLM | Groq API (Llama 3.3 70B via llama-3.3-70b-versatile) |
| Embeddings | Voyage AI API (voyage-3-lite, 512d) |
| Sports Data | API-Football |
| Hosting | Railway |

Project Structure

backend/
├── crawler/        # BFS web crawler (fetcher, parser, queue manager)
├── indexer/        # inverted index builder + tokenizer
│   └── docs/       # technical write-ups on indexing decisions
├── ranker/         # BM25F + PageRank + ONNX neural reranker
├── search/         # query engine, spell correction, pipeline explainer
├── rag/            # chunker, embedder, retriever, query fan-out
├── ai_overview/    # Groq streaming, response caching, follow-up chat
├── sports/         # sports query detection + API-Football integration
├── api/            # REST endpoints + WebSocket jobs + scheduling
└── scripts/        # CLI: crawl, index, pagerank, build_rag

frontend/
├── app/            # Next.js app router (search + explore + dashboard)
├── components/
│   ├── canvas/     # React Flow nodes, edges, detail panels
│   └── playground/ # control panels for live tuning
├── hooks/          # useSearchEngine, useWebSocket, useResizable
└── lib/            # API client, types, hooks

Run It Yourself

Prerequisites

  • Python 3.12+
  • Node.js 20.9+ (the minimum supported by Next.js 16)
  • PostgreSQL 16+ with pgvector
  • API keys: Groq, Voyage AI

Backend

cd backend
pip install -e .

# Start Postgres with pgvector
docker run -d --name search-pg \
  -e POSTGRES_USER=searchengine \
  -e POSTGRES_PASSWORD=searchengine \
  -e POSTGRES_DB=searchengine \
  -p 5432:5432 pgvector/pgvector:pg16

# Configure
cp .env.example .env  # add your GROQ_API_KEY and VOYAGE_API_KEY

# Initialize database
python db.py

# Build the entire search index (run in order)
python scripts/crawl.py        # ~25 min (rate limited)
python scripts/index.py        # ~2 sec
python scripts/pagerank.py     # ~1 sec
python scripts/build_rag.py    # ~5 min (API calls)

# Start
uvicorn main:app --reload

Frontend

cd frontend
npm install
npm run dev

Open http://localhost:3000.

Roadmap

Done

  • BFS crawler with robots.txt, rate limiting, dead page tracking
  • Inverted index with Porter stemmer, stopwords, bulk COPY ingestion
  • BM25F + PageRank hybrid ranking with min-max normalization
  • Neural reranking — ONNX cross-encoder (ms-marco-MiniLM-L-6-v2), local CPU inference on top 5
  • Scheduled auto-crawling — daily seed discovery + weekly top-PageRank refresh, resumes after restart
  • Query fan-out via index co-occurrence (no LLM needed, ~2ms)
  • Hybrid semantic retrieval — pgvector + BM25 chunks
  • AI Overviews with Groq streaming, inline citations, 24h cache
  • AI Overview retry logic (2× with 1s backoff) + "unavailable" UI state
  • AI Chat — follow-up conversation grounded in retrieved chunks
  • Sports OneBox — live match cards, standings, live scores above results
  • Spell correction + "Did you mean?" — Levenshtein edit-distance, proper noun protection via terms table
  • Autocomplete — debounced suggestions from query log, keyboard nav
  • Tokenizer: season strings ("2024/25" → "2024"), 4-digit year support
  • Ranking tuning — 80/20 BM25/PageRank, exponential freshness decay, 7-day recency bonus
  • Full error states — API failure banners, ErrorBoundary, AI Overview unavailable chip
  • DB connection leak fixes — all get_connection() wrapped in try/finally
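
The spell correction item in the list above can be illustrated with a short sketch: compute Levenshtein distance against a vocabulary and accept the closest word only if it is within 2 edits. This is a sketch under stated assumptions, not the project's implementation: the function names, the sample vocabulary, and the `protected` set (standing in for the proper nouns kept in the terms table) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, protected=frozenset(), max_distance=2):
    """Return the closest vocabulary word within max_distance edits, else the word itself.

    Known words and protected words (for example, proper nouns) are never rewritten.
    """
    if word in vocabulary or word in protected:
        return word
    # Break ties alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v), default=None)
    if best is not None and edit_distance(word, best) <= max_distance:
        return best
    return word


if __name__ == "__main__":
    vocab = {"arsenal", "chelsea", "liverpool"}  # hypothetical vocabulary
    print(suggest("arsenl", vocab))
    print(suggest("mbappe", vocab, protected={"mbappe"}))
```

Scanning the whole vocabulary is linear in its size, which is fine for a sketch. A larger vocabulary would need candidate pruning, for example by word length or shared prefixes.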

Planned

  • Query intent detection — classify sports vs general queries to trigger structured results
  • Incremental indexing (no full rebuild)
  • Knowledge Graph — entity understanding beyond text matching

Blog Series

  1. Why I'm Building a Search Engine
  2. Designing the Web Crawler
  3. Building the Inverted Index (coming soon)
  4. Ranking with BM25 + PageRank (coming soon)

Author

Built by Hoang Duc Viet — AI Leader at SEONGON, Vietnam's largest Google Ads & SEO agency.

About

A mini search engine for football that replicates the core technology of Google Search, including AI Overviews: crawling, indexing, ranking algorithms, and serving.
