RSeek is a powerful web crawler and search tool written in Rust. It allows you to crawl web pages and perform full-text search on the crawled content.
- Web crawling with configurable concurrency
- Full-text search using BM25 ranking algorithm
- HTML parsing and link extraction
- Support for both HTTP and HTTPS
- Concurrent request handling with Tokio
- Command-line interface with subcommands
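To make the BM25 ranking feature concrete, here is a minimal, standard-library-only sketch of how BM25 scores documents against a query. It is an illustration, not RSeek's implementation: per the dependency list, the project delegates search to probly-search. The names `tokenize` and `bm25_rank`, the whitespace-and-punctuation tokenizer, and the parameter values `k1 = 1.2` and `b = 0.75` (commonly used defaults) are all assumptions made for this example.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it on non-alphanumeric characters.
/// (Hypothetical tokenizer; a real search crate may tokenize differently.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against `query` with BM25.
/// Returns (document index, score) pairs sorted by descending score.
fn bm25_rank(docs: &[&str], query: &str) -> Vec<(usize, f64)> {
    const K1: f64 = 1.2; // term-frequency saturation (assumed default)
    const B: f64 = 0.75; // length normalization strength (assumed default)

    if docs.is_empty() {
        return Vec::new();
    }

    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    // Avoid dividing by zero when every document is empty.
    let avg_len = if total_len == 0 { 1.0 } else { total_len as f64 / n };

    // Document frequency: in how many documents each term occurs.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &tokenized {
        let mut unique: Vec<&str> = doc.iter().map(String::as_str).collect();
        unique.sort_unstable();
        unique.dedup();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms = tokenize(query);
    let mut scores: Vec<(usize, f64)> = tokenized
        .iter()
        .enumerate()
        .map(|(i, doc)| {
            let len = doc.len() as f64;
            let score: f64 = query_terms
                .iter()
                .map(|q| {
                    let tf = doc.iter().filter(|t| *t == q).count() as f64;
                    let n_q = *df.get(q.as_str()).unwrap_or(&0) as f64;
                    // Smoothed inverse document frequency; never negative.
                    let idf = ((n - n_q + 0.5) / (n_q + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum();
            (i, score)
        })
        .collect();

    // Highest score first; the sort is stable, so ties keep input order.
    scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scores
}
```

Calling `bm25_rank(&["rust programming language", "cooking recipes"], "rust programming")` ranks the first document above the second, which matches no query term and therefore scores zero.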
Since this is a Rust project, you'll need to have Rust and Cargo installed. You can install them from rustup.rs.
To build the project:
    cargo build --release
RSeek provides two main commands:
Crawl a webpage and extract its content:
    rseek crawl <url> [--concurrency <number>]
Options:
- url: The seed URL to start crawling from
- --concurrency or -c: Number of concurrent requests (default: 10)
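The concurrency option caps how many requests are in flight at once. The sketch below shows one way to bound concurrency with a fixed pool of workers pulling from a shared queue. It uses only standard-library threads so that it stays self-contained; RSeek itself runs on Tokio, so its actual mechanism is likely different. The function `crawl_bounded` and the `fetch` callback, which stands in for a real HTTP request, are hypothetical names introduced for this example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Process `urls` with at most `concurrency` workers running at once.
/// `fetch` is a hypothetical stand-in for an HTTP request; here it
/// returns a number (for example, a response size) for each URL.
fn crawl_bounded<F>(urls: Vec<String>, concurrency: usize, fetch: F) -> Vec<(String, usize)>
where
    F: Fn(&str) -> usize + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(urls)));
    let results = Arc::new(Mutex::new(Vec::new()));
    let fetch = Arc::new(fetch);

    // Spawn a fixed number of workers; this count is the concurrency limit.
    let workers: Vec<_> = (0..concurrency.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            let fetch = Arc::clone(&fetch);
            thread::spawn(move || loop {
                // Take the next URL; the lock is released before fetching.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(url) => {
                        let size = (*fetch)(&url);
                        results.lock().unwrap().push((url, size));
                    }
                    None => break, // queue drained, worker exits
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().expect("worker thread panicked");
    }

    // Sort so the output order does not depend on thread scheduling.
    let mut out = std::mem::take(&mut *results.lock().unwrap());
    out.sort();
    out
}
```

With `concurrency` set to 10 (the documented default), no more than ten calls to `fetch` run at the same time, however long the URL list is.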
Example:
    rseek crawl https://example.com -c 20
Search through the crawled content:
    rseek search <query>
Options:
- query: The search query to look for in the crawled content
Example:
    rseek search "rust programming"
Dependencies:
- hyper - HTTP client and server
- tokio - Async runtime
- scraper - HTML parsing
- probly-search - Full-text search functionality
- clap - Command-line argument parsing
- html_parser - HTML parsing utilities
- url - URL parsing and manipulation
Project structure:
- src/main.rs - Main application entry point
- src/page.rs - Page structure and parsing logic
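As a rough illustration of what page parsing and link extraction involve, the sketch below collects `href` values from raw HTML and stores them in a page record. Everything here is hypothetical: the `Page` struct, its fields, and the functions `extract_links` and `parse_page` are invented for this example, and the actual contents of src/page.rs may look quite different. The string scan is deliberately naive and only handles double-quoted attributes; the project itself lists scraper for HTML parsing.

```rust
/// Hypothetical page record; the real struct in src/page.rs may differ.
#[derive(Debug, PartialEq)]
struct Page {
    url: String,
    links: Vec<String>,
}

/// Naively collect the values of double-quoted `href` attributes.
/// A real HTML parser handles single quotes, entities, and malformed markup.
fn extract_links(html: &str) -> Vec<String> {
    const MARKER: &str = "href=\"";
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(MARKER) {
        // Skip past the marker to the start of the attribute value.
        rest = &rest[start + MARKER.len()..];
        match rest.find('"') {
            Some(end) => {
                links.push(rest[..end].to_string());
                rest = &rest[end + 1..];
            }
            None => break, // unterminated attribute, stop scanning
        }
    }
    links
}

/// Build a `Page`, keeping only absolute http(s) links.
/// Relative links are dropped here; a crawler would normally resolve
/// them against the page URL instead.
fn parse_page(url: &str, html: &str) -> Page {
    let links = extract_links(html)
        .into_iter()
        .filter(|l| l.starts_with("http://") || l.starts_with("https://"))
        .collect();
    Page {
        url: url.to_string(),
        links,
    }
}
```

The links kept by `parse_page` are the candidates a crawler would queue for the next round of requests. Resolving relative links such as `/about` is what a URL library is for; the url crate's `Url::join` is one option.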
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.