Crawly, created by Patryk 'UltiPro' Wójtowicz, is a Python web crawler and scraper that implements both BFS and DFS search strategies. It can be configured through options such as the search method, a time limit, a maximum search depth, whether to generate a full graph, and optional proxy server settings. Out of the box the application collects only URLs and the contents of "a" tags, but the code can easily be adapted to specific needs in the `_process_page` function. During execution the program launches a browser via the Playwright package; the browser navigates through web pages and, when necessary, pauses so the user can solve captchas and similar challenges. The output consists of a CSV file containing the URLs and "a" tag contents, as well as an HTML page with a graph of the connections between the visited websites.
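Since the scraping logic is confined to `_process_page`, customization usually means editing only that one spot. The sketch below is a hypothetical illustration of how such a crawl loop might fit together with Playwright and BeautifulSoup; apart from `_process_page`, every name in it (`crawl`, `use_bfs`, and so on) is illustrative rather than taken from the project:

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def _process_page(html, url):
    """Stand-in for Crawly's _process_page: collect the page URL and the
    text of every <a> tag; adapt this function to scrape other elements."""
    soup = BeautifulSoup(html, "html.parser")
    rows, links = [], []
    for a in soup.find_all("a", href=True):
        rows.append((url, a.get_text(strip=True)))
        links.append(urljoin(url, a["href"]))
    return rows, links


def crawl(start_url, max_depth=10, use_bfs=True):
    results = []
    frontier = deque([(start_url, 0)])
    seen = {start_url}
    with sync_playwright() as p:
        # A visible (non-headless) browser lets the user solve captchas;
        # a proxy could be supplied here via launch(proxy={"server": ...}).
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        while frontier:
            # BFS consumes the frontier as a queue, DFS as a stack.
            url, depth = frontier.popleft() if use_bfs else frontier.pop()
            if depth > max_depth:
                continue
            page.goto(url)
            rows, links = _process_page(page.content(), url)
            results.extend(rows)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        browser.close()
    return results
```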
Dependencies:
- beautifulsoup4 4.13.3
- bs4 0.0.2
- fake-useragent 2.0.3
- greenlet 3.1.1
- narwhals 1.28.0
- networkx 3.4.2
- numpy 2.2.3
- packaging 24.2
- playwright 1.50.0
- plotly 6.0.0
- pyee 12.1.1
- soupsieve 2.6
- typing_extensions 4.12.2
Installation:

    cd "/Crawly"
    pip install -r requirements.txt
    playwright install

Usage:

    python main.py [url-address] [options]
| Option | Short | Description | Default Value |
|---|---|---|---|
| --method | -m | Search method (bfs or dfs) | bfs |
| --time | -t | Execution time (s) | 60 |
| --depth | -d | Maximum search depth | 10 |
| --full_graph | -fg | Generate a full graph | False |
| --proxy_server | -ps | Proxy server IP/address | — |
| --proxy_username | -pu | Proxy username | — |
| --proxy_password | -pp | Proxy password | — |
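Taken together, a DFS crawl capped at 120 seconds and depth 5 might be started with `python main.py https://example.com -m dfs -t 120 -d 5`. As a rough guide to how these options fit together, the table translates into an argparse parser along the following lines; this is a hypothetical reconstruction, and the actual parser in main.py may differ:

```python
import argparse

# Hypothetical reconstruction of the CLI described in the options table.
parser = argparse.ArgumentParser(prog="Crawly")
parser.add_argument("url", help="start URL to crawl")
parser.add_argument("-m", "--method", choices=["bfs", "dfs"], default="bfs",
                    help="search method")
parser.add_argument("-t", "--time", type=int, default=60,
                    help="execution time in seconds")
parser.add_argument("-d", "--depth", type=int, default=10,
                    help="maximum search depth")
parser.add_argument("-fg", "--full_graph", action="store_true",
                    help="generate a full graph")
parser.add_argument("-ps", "--proxy_server", help="proxy server IP/address")
parser.add_argument("-pu", "--proxy_username", help="proxy username")
parser.add_argument("-pp", "--proxy_password", help="proxy password")
args = parser.parse_args()
```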


