This is a command-line program that crawls all the HTML pages of a website. The Python script (crawler.py) is an asynchronous web crawler: starting from a given URL, it visits pages, extracts and follows links within the same domain, and saves the HTML content of each page to disk, respecting the domain's robots.txt rules. A sketch of the core loop follows the feature list below.
- Asynchronous requests using `aiohttp`
- Parsing HTML content with `BeautifulSoup`
- robots.txt compliance checking
- Saving HTML content to disk
- Delayed requests to avoid server overload
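As a rough illustration of how these pieces fit together, here is a minimal sketch. It is not the actual crawler.py code: the function signature, page limit, and file-naming scheme are invented for the example, and the robots.txt check is deferred to the sketch further below.

```python
import asyncio
from pathlib import Path
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

async def crawl(start_url: str, delay: float = 1.0, max_pages: int = 10) -> None:
    """Breadth-first crawl that stays on the start URL's domain."""
    domain = urlparse(start_url).netloc
    out_dir = Path("html_files")
    out_dir.mkdir(exist_ok=True)

    seen, queue, fetched = {start_url}, [start_url], 0
    async with aiohttp.ClientSession() as session:
        while queue and fetched < max_pages:
            url = queue.pop(0)
            async with session.get(url) as response:
                html = await response.text()
            fetched += 1

            # Save the page; this file-naming scheme is illustrative only.
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            (out_dir / f"{name}.html").write_text(html, encoding="utf-8")

            # Queue unseen links that stay on the same domain.
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup.find_all("a", href=True):
                link = urljoin(url, tag["href"])
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)

            await asyncio.sleep(delay)  # polite delay between requests

if __name__ == "__main__":
    asyncio.run(crawl("https://example.com"))
```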
Before you begin, ensure you have installed:
- Python 3.12 or later
- `aiohttp`
- `bs4` (BeautifulSoup)
These can be installed via poetry as detailed in the installation instructions.
- Clone the repository:

  ```
  git clone https://github.com/ZetiAi/spider
  ```

- Navigate to the cloned directory:

  ```
  cd spider
  ```

- Install dependencies using Poetry:

  ```
  poetry install
  ```
To run the program, use the following command:
```
poetry run python command_line.py --url [start URL]
```

Replace `[start URL]` with the URL you want to start crawling from.
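For example, to start crawling from example.com (a placeholder URL):

```
poetry run python command_line.py --url https://example.com
```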
- Initialization: The script starts at the provided URL.
- Robots.txt: Fetches `robots.txt` from the domain and parses it (see the sketch after this list).
- Crawling: Begins crawling from the start URL, following links within the same domain.
- Saving: Saves the HTML of each visited page in a directory named `html_files`.
- Respectful Crawling: Includes a delay between requests and checks `robots.txt` for permissions.
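The script's own robots.txt handling is not reproduced in this README. As an illustration, the same check can be written with Python's standard `urllib.robotparser`; this is a synchronous sketch under that assumption, not crawler.py's actual code:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(is_allowed("https://example.com/some/page"))
```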
The following command strips the HTML markup from the saved pages and writes the resulting plain text into a `text_files/` directory:

```
poetry run python filter.py
```
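filter.py's implementation is not shown in this README; the core transformation can be sketched with BeautifulSoup's `get_text()`. The directory names below come from this README, but the actual script may differ:

```python
from pathlib import Path

from bs4 import BeautifulSoup

source = Path("html_files")   # where crawler.py saves pages
target = Path("text_files")   # where the plain text ends up
target.mkdir(exist_ok=True)

for page in source.glob("*.html"):
    soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
    # get_text() drops all tags and keeps only the visible text.
    text = soup.get_text(separator="\n", strip=True)
    (target / page.with_suffix(".txt").name).write_text(text, encoding="utf-8")
```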
The following command concatenates all of the text files into a single file called `combined_text_files`:

```
poetry run python join.py
```

The script uses Python's `logging` module to log its activity. By default, it is set to the INFO level.
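To change the verbosity, adjust the level passed to the logging configuration. This is a generic Python logging snippet; the exact call inside the script may differ:

```python
import logging

# Log everything at DEBUG and above instead of the default INFO level.
logging.basicConfig(level=logging.DEBUG)
```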
Contributions to improve this script are welcome. Please follow standard practices for code contributions.