This is a command-line program that crawls all the HTML pages of a website. The Python script (crawler.py) is an asynchronous web crawler: starting from a given URL, it visits pages, extracts and follows links within the same domain, and saves the HTML content of each page to disk, respecting the domain's robots.txt rules. A sketch of the core loop follows the feature list below.
- Asynchronous requests using `aiohttp`
- Parsing HTML content with `BeautifulSoup`
- robots.txt compliance checking
- Saving HTML content to disk
- Delayed requests to avoid server overload
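As a rough illustration of how these pieces fit together, here is a minimal sketch. It is not the actual crawler.py code: the function signature, page limit, and file-naming scheme are invented for the example, and the robots.txt check is deferred to the sketch further below.

```python
import asyncio
from pathlib import Path
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

async def crawl(start_url: str, delay: float = 1.0, max_pages: int = 10) -> None:
    """Breadth-first crawl that stays on the start URL's domain."""
    domain = urlparse(start_url).netloc
    out_dir = Path("html_files")
    out_dir.mkdir(exist_ok=True)

    seen, queue, fetched = {start_url}, [start_url], 0
    async with aiohttp.ClientSession() as session:
        while queue and fetched < max_pages:
            url = queue.pop(0)
            async with session.get(url) as response:
                html = await response.text()
            fetched += 1

            # Save the page; this file-naming scheme is illustrative only.
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            (out_dir / f"{name}.html").write_text(html, encoding="utf-8")

            # Queue unseen links that stay on the same domain.
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup.find_all("a", href=True):
                link = urljoin(url, tag["href"])
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)

            await asyncio.sleep(delay)  # polite delay between requests

if __name__ == "__main__":
    asyncio.run(crawl("https://example.com"))
```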
Before you begin, ensure you have installed:
- Python 3.12 or later
- `aiohttp`
- `bs4` (BeautifulSoup)
These can be installed via poetry as detailed in the installation instructions.
- Clone the repository:

  ```
  git clone https://github.com/ZetiAi/spider
  ```

- Navigate to the cloned directory:

  ```
  cd spider
  ```

- Install dependencies using Poetry:

  ```
  poetry install
  ```
To run the program, use the following command:
```
poetry run python command_line.py --url [start URL]
```

Replace `[start URL]` with the URL you want to start crawling from.
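For example, to start crawling from example.com (a placeholder URL):

```
poetry run python command_line.py --url https://example.com
```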
- Initialization: The script starts at the provided URL.
- Robots.txt: Fetches `robots.txt` from the domain and parses it (see the sketch after this list).
- Crawling: Begins crawling from the start URL, following links within the same domain.
- Saving: Saves the HTML of each visited page in a directory named `html_files`.
- Respectful Crawling: Includes a delay between requests and checks `robots.txt` for permissions.
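The script's own robots.txt handling is not reproduced in this README. As an illustration, the same check can be written with Python's standard `urllib.robotparser`; this is a synchronous sketch under that assumption, not crawler.py's actual code:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(is_allowed("https://example.com/some/page"))
```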
The following command strips the HTML markup from the saved pages and writes the resulting plain text into a `text_files/` directory:

```
poetry run python filter.py
```
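filter.py's implementation is not shown in this README; the core transformation can be sketched with BeautifulSoup's `get_text()`. The directory names below come from this README, but the actual script may differ:

```python
from pathlib import Path

from bs4 import BeautifulSoup

source = Path("html_files")   # where crawler.py saves pages
target = Path("text_files")   # where the plain text ends up
target.mkdir(exist_ok=True)

for page in source.glob("*.html"):
    soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
    # get_text() drops all tags and keeps only the visible text.
    text = soup.get_text(separator="\n", strip=True)
    (target / page.with_suffix(".txt").name).write_text(text, encoding="utf-8")
```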
The following command concatenates all of the text files into a single file called `combined_text_files`:

```
poetry run python join.py
```

The script uses Python's `logging` module to log its activity. By default, it is set to the INFO level.
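To change the verbosity, adjust the level passed to the logging configuration. This is a generic Python logging snippet; the exact call inside the script may differ:

```python
import logging

# Log everything at DEBUG and above instead of the default INFO level.
logging.basicConfig(level=logging.DEBUG)
```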
Contributions to improve this script are welcome. Please follow standard practices for code contributions.