A comprehensive search engine implementation built as part of a Data Structures and Algorithms project. This system demonstrates core search engine concepts including forward indexing, inverted indexing, barreling, query processing, and ranked retrieval using the NELA-GT-2022 dataset.
This search engine processes approximately 100,000 news articles from the NELA-GT-2022 dataset and provides efficient search functionality through a web interface. The system implements fundamental search engine algorithms and data structures to deliver fast and relevant search results.
- Document Processing: Automated text preprocessing with tokenization, stemming, and stop-word removal
- Forward Indexing: Efficient document-to-term mapping with frequency and position information
- Inverted Indexing: Term-to-document mapping with barreling for scalable storage
- Query Processing: Natural language query parsing and processing
- Ranked Retrieval: Relevant document ranking based on term frequency and document matching
- Web Interface: User-friendly search interface with document upload capability
- Real-time Processing: Live document upload and indexing functionality
Forward Indexer (forward_indexer.py)
- Processes raw documents and creates forward index
- Implements text preprocessing (tokenization, stemming, stop-word removal)
- Generates word frequency and position mappings
- Uses NLTK library for natural language processing
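
The preprocessing and forward-index steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in forward_indexer.py: the stop-word set is a tiny hand-picked sample, `naive_stem` is a crude stand-in for NLTK's SnowballStemmer, and the function names and the `{"freq", "positions"}` layout are assumptions made for this example.

```python
import re

# Tiny illustrative stop-word list (assumption); the project relies on NLTK instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to"}


def naive_stem(token):
    """Crude suffix stripper standing in for NLTK's SnowballStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_forward_index(documents):
    """Map doc_id -> {term: {"freq": int, "positions": [int, ...]}}.

    Positions are counted after stop-word removal.
    """
    forward_index = {}
    for doc_id, text in documents.items():
        terms = {}
        for position, term in enumerate(preprocess(text)):
            entry = terms.setdefault(term, {"freq": 0, "positions": []})
            entry["freq"] += 1
            entry["positions"].append(position)
        forward_index[doc_id] = terms
    return forward_index


docs = {"d1": "The markets are rising and the markets rallied"}
index = build_forward_index(docs)
```

Here `index["d1"]["market"]` holds a frequency of 2 and positions `[0, 2]`, since both occurrences of "markets" reduce to the same stem.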
Inverted Indexer (inverted_indexer.py)
- Creates inverted index from forward index
- Implements barreling strategy for distributed storage
- Uses 2000 barrels for efficient data distribution
- Provides incremental index updates
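
As a rough sketch of how an inverted index can be derived from a forward index and then updated incrementally, consider the following. The forward-index layout (`doc_id -> term -> {"freq", "positions"}`) and the function names are assumptions for illustration and may differ from inverted_indexer.py.

```python
def add_documents(inverted, forward_index):
    """Merge forward-index entries into an existing inverted index in place.

    The inverted index maps term -> {doc_id: term frequency}.
    """
    for doc_id, terms in forward_index.items():
        for term, info in terms.items():
            inverted.setdefault(term, {})[doc_id] = info["freq"]
    return inverted


def build_inverted_index(forward_index):
    """Build a fresh inverted index from a complete forward index."""
    return add_documents({}, forward_index)


forward = {
    "d1": {"market": {"freq": 2, "positions": [0, 2]},
           "stock": {"freq": 1, "positions": [1]}},
    "d2": {"market": {"freq": 1, "positions": [0]}},
}
inverted = build_inverted_index(forward)

# Incremental update: a newly uploaded document is merged without a rebuild.
add_documents(inverted, {"d3": {"stock": {"freq": 3, "positions": [0, 1, 2]}}})
```

Reusing one merge function for both the initial build and later updates keeps the two code paths consistent.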
Query Processor (querytest.py, newquerytest.py)
- Handles user search queries
- Implements query preprocessing and stemming
- Performs document retrieval and ranking
- Returns relevant results with document URLs
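
One plausible way to retrieve and rank documents is sketched below. The scoring rule (number of distinct query terms matched first, then summed term frequency) is an assumption chosen for illustration, not necessarily what querytest.py does. The `doc_urls` mapping and the example URLs are hypothetical, and stemming of the query is omitted here for brevity.

```python
def search(query, inverted_index, doc_urls, top_k=10):
    """Return up to top_k (doc_id, url) pairs for a query string.

    Documents are ordered by how many distinct query terms they contain,
    then by the summed frequency of those terms, then by doc_id.
    """
    # A real pipeline would apply the same stemming used at index time.
    query_terms = set(query.lower().split())

    scores = {}  # doc_id -> (matched term count, total term frequency)
    for term in query_terms:
        for doc_id, freq in inverted_index.get(term, {}).items():
            matched, total = scores.get(doc_id, (0, 0))
            scores[doc_id] = (matched + 1, total + freq)

    ranked = sorted(
        scores.items(),
        key=lambda item: (-item[1][0], -item[1][1], item[0]),
    )
    return [(doc_id, doc_urls.get(doc_id)) for doc_id, _ in ranked[:top_k]]


inverted_index = {
    "market": {"d1": 2, "d2": 1},
    "crash": {"d2": 1, "d3": 4},
}
doc_urls = {  # hypothetical URLs
    "d1": "https://example.com/1",
    "d2": "https://example.com/2",
    "d3": "https://example.com/3",
}
results = search("market crash", inverted_index, doc_urls)
```

In this example `d2` ranks first because it matches both query terms, followed by `d3` and then `d1`, which each match one term but differ in frequency.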
Web Backend (backend.py)
- Flask-based web server
- RESTful API endpoints for search and upload
- Real-time document processing with WebSocket support
- Integration with indexing modules
Barreling System (newbarreling.py)
- Distributed storage strategy for large-scale indexing
- Hash-based barrel assignment
- Efficient retrieval from multiple index files
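
Hash-based barrel assignment can be sketched as below, using the 2000 barrels mentioned above. The choice of CRC32 as the hash, the `barrel_<id>.json` file naming, and the function names are assumptions for this example; newbarreling.py may use a different hash and layout. A stable hash is used because Python's built-in `hash()` for strings varies between runs.

```python
import json
import tempfile
import zlib
from pathlib import Path

NUM_BARRELS = 2000


def barrel_id(term, num_barrels=NUM_BARRELS):
    """Map a term to a barrel number with a hash that is stable across runs."""
    return zlib.crc32(term.encode("utf-8")) % num_barrels


def write_barrels(inverted_index, directory, num_barrels=NUM_BARRELS):
    """Group postings by barrel and write one JSON file per non-empty barrel.

    Existing barrel files are overwritten, not merged, in this sketch.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    barrels = {}
    for term, postings in inverted_index.items():
        barrels.setdefault(barrel_id(term, num_barrels), {})[term] = postings
    for bid, contents in barrels.items():
        path = directory / f"barrel_{bid}.json"
        path.write_text(json.dumps(contents), encoding="utf-8")


def lookup(term, directory, num_barrels=NUM_BARRELS):
    """Load only the barrel that can contain the term and return its postings."""
    path = Path(directory) / f"barrel_{barrel_id(term, num_barrels)}.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8")).get(term, {})


with tempfile.TemporaryDirectory() as tmp:
    write_barrels({"market": {"d1": 2}, "crash": {"d2": 1}}, tmp)
    found = lookup("market", tmp)
    missing = lookup("unknown", tmp)
```

A query then reads a single small file per term instead of loading the whole index.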
- Backend: Python 3.x, Flask, Flask-SocketIO
- Natural Language Processing: NLTK (Natural Language Toolkit)
- Data Storage: JSON-based file system
- Frontend: HTML5, CSS3, JavaScript
- Text Processing: SnowballStemmer, WordNet Lemmatizer
- Dataset: NELA-GT-2022 (News Landscape Dataset)