A full-stack search engine with multi-threaded web crawling, TF-IDF ranking, and real-time search capabilities. Built with Spring Boot, React, and PostgreSQL, this project demonstrates modern search algorithms, efficient document indexing techniques, and a responsive user interface.
- Multi-threaded Web Crawler: Efficiently crawls websites with respect for robots.txt directives
- Text Processing Engine: Implements tokenization, stemming, and stop word removal
- Search API: Uses TF-IDF algorithms for accurate and fast search results
- PostgreSQL Database: Stores documents, indexes, and search metadata
- Modern React UI: Clean, responsive interface built with React 18
- Interactive Search: Real-time suggestions and highlighting
- Results Navigation: Pagination with relevance scoring
- Voice Search: Speech recognition for hands-free searching
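The TF-IDF ranking mentioned above can be sketched roughly as follows (a minimal illustration; the class and method names are assumptions for this sketch, not the project's actual Indexer/Search code):

```java
import java.util.List;

// Hypothetical sketch of TF-IDF scoring; all names here are illustrative.
class TfIdfScorer {

    // Term frequency: how often the term occurs relative to the document length.
    static double tf(List<String> docTokens, String term) {
        if (docTokens.isEmpty()) return 0.0;
        long count = docTokens.stream().filter(term::equals).count();
        return (double) count / docTokens.size();
    }

    // Inverse document frequency: terms that are rare across the corpus score higher.
    static double idf(List<List<String>> corpus, String term) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        if (docsWithTerm == 0) return 0.0;
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    // Combined TF-IDF weight of a term for one document in the corpus.
    static double tfIdf(List<String> docTokens, List<List<String>> corpus, String term) {
        return tf(docTokens, term) * idf(corpus, term);
    }
}
```

A document's relevance to a query is then the sum of TF-IDF weights of the query terms, which is what makes frequent-but-common words contribute less than rare, distinctive ones.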
- Backend: Java 21, Spring Boot 3.2.x, JPA/Hibernate, PostgreSQL
- Search Technology: Custom implementation with TF-IDF, PageRank, and vector space models
- Frontend: React 18, Styled Components, Axios
- DevOps: Docker, Docker Compose, GitHub Actions
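The PageRank component listed above can be illustrated with a basic power-iteration sketch (the damping factor, names, and dangling-page handling are assumptions, not the repository's implementation):

```java
import java.util.Arrays;

// Rough power-iteration PageRank; all names and parameters are illustrative.
class PageRank {

    // outLinks[i] holds the indices of the pages that page i links to.
    static double[] compute(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n); // random-jump contribution
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                } else {
                    for (int j : outLinks[i]) next[j] += damping * rank[i] / outLinks[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }
}
```

In a hybrid scorer, a link-based score like this is typically blended with the content-based TF-IDF score to rank results.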
- Docker and Docker Compose
- Git
- Java 17+ (for local development only)
- Node.js 16+ (for local development only)
1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/Search-Engine.git
   cd Search-Engine
   ```

2. Create the environment variables file

   ```bash
   cp .env.example .env
   # Edit .env with your preferred settings
   ```

3. Build and run with Docker Compose

   ```bash
   docker-compose up -d
   ```

4. Access the application

   - Frontend: http://localhost:3000
   - Backend API: http://localhost:8080
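For orientation, the compose setup assumes a `docker-compose.yml` along these lines (a hedged sketch only — service names, images, and credentials are illustrative, not the repository's actual file; the ports and database values mirror the ones used elsewhere in this README):

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: searchengine
      POSTGRES_USER: searchuser
      POSTGRES_PASSWORD: password
  backend:
    build: ./searchengine
    ports:
      - "8080:8080"
    depends_on:
      - db
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    depends_on:
      - backend
```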
1. Configure the PostgreSQL database

   ```sql
   CREATE DATABASE searchengine;
   CREATE USER searchuser WITH PASSWORD 'password';
   GRANT ALL PRIVILEGES ON DATABASE searchengine TO searchuser;
   ```

2. Configure application properties

   ```bash
   cd searchengine
   cp src/main/resources/application.properties.example src/main/resources/application.properties
   # Edit application.properties with your database settings
   ```

3. Run the Spring Boot application

   ```bash
   ./mvnw spring-boot:run
   ```
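A minimal set of database properties might look like the following (values mirror the SQL above; the host, port, and credentials are assumptions to adjust for your environment):

```properties
spring.datasource.url=jdbc:postgresql://localhost:5432/searchengine
spring.datasource.username=searchuser
spring.datasource.password=password
spring.jpa.hibernate.ddl-auto=update
```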
1. Install dependencies

   ```bash
   cd frontend
   npm install
   ```

2. Configure the API endpoint

   ```bash
   cp .env.example .env
   # Edit .env to set REACT_APP_API_URL
   ```

3. Start the development server

   ```bash
   npm start
   ```
1. Start a new crawl with a specified number of threads:

   ```bash
   curl -X POST "http://localhost:8080/crawler?thread_num=16"
   ```

2. Monitor crawling progress:

   ```bash
   curl -X GET "http://localhost:8080/crawler/status"
   ```

   This returns JSON with crawling statistics, including:

   - Total pages crawled
   - Pages in queue
   - Crawling rate (pages/second)
   - Elapsed time
   - Estimated completion time
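An illustrative status response covering those fields might look like this (the field names and values are hypothetical, not the API's actual schema):

```json
{
  "totalPagesCrawled": 15000,
  "pagesInQueue": 3200,
  "crawlRatePagesPerSecond": 42.5,
  "elapsedSeconds": 353,
  "estimatedCompletionSeconds": 75
}
```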
3. Stop an active crawl:

   ```bash
   curl -X POST "http://localhost:8080/crawler/stop"
   ```
1. Trigger manual indexing of crawled documents:

   ```bash
   curl -X POST "http://localhost:8080/reindex"
   ```

2. Check indexing status:

   ```bash
   curl -X GET "http://localhost:8080/index-status"
   ```

   This endpoint provides information about:

   - Number of documents indexed
   - Current indexing progress
   - Index statistics (unique words, document count)
   - Estimated time remaining
Access the web interface at http://localhost:3000 and enter your search query.
The API endpoint is also available for direct integration:
```bash
curl -X GET "http://localhost:8080/search?q=your+search+query&page=0&size=10"
```

- Use quotes for exact phrases: `"artificial intelligence"`
- Use operators: `machine AND learning`, `python OR java`, `programming NOT javascript`
- Voice search: click the microphone icon and speak your query
- Filter by domain: `site:github.com python`
- Filter by date: `after:2023-01-01 before:2023-12-31 machine learning`
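The AND / OR / NOT operators above can be evaluated against a document's token set roughly like this (a hypothetical helper for single-operator queries, not the project's actual query parser):

```java
import java.util.Set;

// Illustrative evaluation of AND / OR / NOT search operators; names are assumptions.
class QueryOperators {

    static boolean matches(String query, Set<String> docTokens) {
        String[] parts = query.trim().split("\\s+");
        if (parts.length == 3) {
            String left = parts[0].toLowerCase();
            String right = parts[2].toLowerCase();
            switch (parts[1]) {
                case "AND": return docTokens.contains(left) && docTokens.contains(right);
                case "OR":  return docTokens.contains(left) || docTokens.contains(right);
                case "NOT": return docTokens.contains(left) && !docTokens.contains(right);
            }
        }
        // No recognized operator: require every term to be present.
        for (String p : parts) {
            if (!docTokens.contains(p.toLowerCase())) return false;
        }
        return true;
    }
}
```

A full engine would first apply a boolean filter like this to narrow candidates, then rank the surviving documents by relevance score.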
```
Search-Engine/
├── searchengine/              # Spring Boot backend
│   ├── src/main/java/
│   │   └── com/example/searchengine/
│   │       ├── Crawler/       # Web crawler components
│   │       ├── Indexer/       # Document indexing
│   │       └── Search/        # Search functionality
│   └── pom.xml
├── frontend/                  # React frontend
│   ├── src/
│   │   ├── components/        # UI components
│   │   └── App.js             # Main application
├── docker-compose.yml         # Docker configuration
└── README.md                  # Project documentation
```
- Database Indexing: Custom PostgreSQL indices for optimized searches
- Connection Pooling: HikariCP for efficient database connections
- Caching: In-memory caching of frequent searches
- Pagination: All results are paginated to improve performance
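The in-memory caching of frequent searches can be sketched with a small LRU cache built on `LinkedHashMap` (a hedged illustration; the project's actual cache implementation is not shown in this README):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch for frequent search results; names are illustrative.
class SearchCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    SearchCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-accessed first,
        // which is exactly what LRU eviction needs.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once the capacity is exceeded.
        return size() > capacity;
    }
}
```

In practice the key would be the normalized query string and the value the serialized result page, so repeated popular queries skip the database entirely.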
- Docker container fails to start: check logs with `docker-compose logs backend` or `docker-compose logs frontend`
- Search results are not appearing: ensure you've started the crawler to populate the database with documents
- Frontend can't connect to backend: verify the API URL configuration in the frontend's `.env` file
- Database vacuum operation fails: this operation requires special permissions; if running locally, ensure your database user has the appropriate rights