netcrawler

A simple web crawler implemented in Python.

Returns the first 100 unique URLs found after starting the crawl at the input URL. The crawl is breadth-first: the program first records every unique URL on the page it is currently parsing. Once all the links on that page have been exhausted, it fetches one of the URLs it recorded previously and repeats the process.
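The breadth-first behavior described above can be sketched roughly as follows. This is an illustration only, not the code in crawler.py: the function name `crawl`, the `fetch_links` callback (a stand-in for whatever downloads a page and extracts its links), and the choice not to count the start URL toward the limit are all assumptions.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          limit: int = 100) -> List[str]:
    """Return up to `limit` unique URLs discovered breadth-first from start_url.

    `fetch_links` is a hypothetical callback that returns the links found on
    a page; in a real crawler it would download and parse the page.
    """
    seen = {start_url}          # every URL encountered so far (assumption: start URL is not counted)
    found: List[str] = []       # discovered URLs, in the order they were first seen
    queue = deque([start_url])  # pages waiting to be fetched; FIFO order gives breadth-first

    while queue and len(found) < limit:
        page = queue.popleft()
        for link in fetch_links(page):
            if link in seen:
                continue        # skip duplicates
            seen.add(link)
            found.append(link)
            queue.append(link)  # visit this page later, after the current level
            if len(found) == limit:
                break
    return found


# Usage with an in-memory link graph instead of real HTTP requests.
fake_site = {
    "a": ["b", "c", "a"],
    "b": ["d", "c"],
    "c": ["e"],
}
print(crawl("a", lambda url: fake_site.get(url, [])))
```

Passing the link extractor in as a callback keeps the traversal logic separate from networking, which also makes it easy to test without touching a live site.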

WARNING: Do not use for anything serious. This crawler does not honor robots.txt rules.

Usage

To install, run:

$ pip install -r requirements.txt

To test, run:

$ python -m unittest

To use, run:

$ python crawler.py 'http://www.yourwebsite.com'

Replace 'http://www.yourwebsite.com' with the URL at which you want the crawl to start.
