Faareh-Ahmed/DSA-SearchEngine

DSA Search Engine

A search engine implementation built as part of a Data Structures and Algorithms project. This system demonstrates core search engine concepts including forward indexing, inverted indexing, barreling, query processing, and ranked retrieval using the NELA-GT-2022 dataset.

🔍 Overview

This search engine processes approximately 100,000 news articles from the NELA-GT-2022 dataset and provides efficient search functionality through a web interface. The system implements fundamental search engine algorithms and data structures to deliver fast and relevant search results.

✨ Features

  • Document Processing: Automated text preprocessing with tokenization, stemming, and stop-word removal
  • Forward Indexing: Efficient document-to-term mapping with frequency and position information
  • Inverted Indexing: Term-to-document mapping with barreling for scalable storage
  • Query Processing: Natural language query parsing and processing
  • Ranked Retrieval: Relevant document ranking based on term frequency and document matching
  • Web Interface: User-friendly search interface with document upload capability
  • Real-time Processing: Live document upload and indexing functionality
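
The indexing features above can be illustrated with a minimal, dependency-free sketch. This is not the code in forward_indexer.py or inverted_indexer.py: the function names, the tiny stop-word list, and the sample documents are all hypothetical, and stemming is left out so the example runs without NLTK. It only shows the shape of the two mappings, document to terms (with frequency and positions) and term to documents.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (the project uses NLTK).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_forward_index(docs):
    """Map doc_id -> {term: {"freq": int, "positions": [int, ...]}}."""
    forward = {}
    for doc_id, text in docs.items():
        entry = {}
        # Positions are counted after stop-word removal.
        for pos, term in enumerate(preprocess(text)):
            info = entry.setdefault(term, {"freq": 0, "positions": []})
            info["freq"] += 1
            info["positions"].append(pos)
        forward[doc_id] = entry
    return forward

def build_inverted_index(forward):
    """Invert the forward index: term -> {doc_id: freq}."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward.items():
        for term, info in terms.items():
            inverted[term][doc_id] = info["freq"]
    return dict(inverted)

# Made-up sample documents, not taken from NELA-GT-2022.
docs = {
    "d1": "The election results and the election coverage",
    "d2": "Coverage of the storm",
}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
```

Building the inverted index from the forward index, rather than from raw text, means each document is tokenized only once; the inversion step is then a plain regrouping of data that already exists.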

🏗️ System Architecture

Core Components

  1. Forward Indexer (forward_indexer.py)
    • Processes raw documents and creates forward index
    • Implements text preprocessing (tokenization, stemming, stop-word removal)
    • Generates word frequency and position mappings
    • Uses NLTK library for natural language processing

  2. Inverted Indexer (inverted_indexer.py)
    • Creates inverted index from forward index
    • Implements barreling strategy for distributed storage
    • Uses 2000 barrels for efficient data distribution
    • Provides incremental index updates

  3. Query Processor (querytest.py, newquerytest.py)
    • Handles user search queries
    • Implements query preprocessing and stemming
    • Performs document retrieval and ranking
    • Returns relevant results with document URLs

  4. Web Backend (backend.py)
    • Flask-based web server
    • RESTful API endpoints for search and upload
    • Real-time document processing with WebSocket support
    • Integration with indexing modules

  5. Barreling System (newbarreling.py)
    • Distributed storage strategy for large-scale indexing
    • Hash-based barrel assignment
    • Efficient retrieval from multiple index files
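
As a rough illustration of how hash-based barrel assignment and frequency-based ranking could fit together, here is a small sketch. The barrel count of 2000 comes from this README; everything else is an assumption made for the example: the use of CRC32 as the hash, the function names, the in-memory dictionaries standing in for barrel files, and the scoring rule (summed term frequency). The actual logic in newbarreling.py and the query scripts may differ.

```python
import zlib
from collections import defaultdict

NUM_BARRELS = 2000  # barrel count stated in this README

def barrel_id(term, num_barrels=NUM_BARRELS):
    """Assign a term to a barrel with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_barrels

def split_into_barrels(inverted, num_barrels=NUM_BARRELS):
    """Group postings by barrel: barrel_id -> {term: {doc_id: freq}}."""
    barrels = defaultdict(dict)
    for term, postings in inverted.items():
        barrels[barrel_id(term, num_barrels)][term] = postings
    return dict(barrels)

def search(query_terms, barrels, num_barrels=NUM_BARRELS):
    """Score documents by summed term frequency; break ties by doc id."""
    scores = defaultdict(int)
    for term in query_terms:
        # Only the barrel that can hold this term is consulted.
        postings = barrels.get(barrel_id(term, num_barrels), {}).get(term, {})
        for doc_id, freq in postings.items():
            scores[doc_id] += freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up postings: term -> {doc_id: term frequency}.
inverted = {
    "election": {"d1": 2, "d3": 1},
    "coverage": {"d1": 1, "d2": 1},
    "storm": {"d2": 1},
}
barrels = split_into_barrels(inverted)
results = search(["election", "coverage"], barrels)
```

A stable hash such as CRC32 is used here instead of Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default, which would send the same term to different barrels between the indexing run and a later query run.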

🛠️ Technology Stack

  • Backend: Python 3.x, Flask, Flask-SocketIO
  • Natural Language Processing: NLTK (Natural Language Toolkit)
  • Data Storage: JSON-based file system
  • Frontend: HTML5, CSS3, JavaScript
  • Text Processing: SnowballStemmer, WordNet Lemmatizer
  • Dataset: NELA-GT-2022 (News Landscape Dataset)

About

This is the semester project for Data Structures and Algorithms (Search Engine).
