Labels
enhancement (New feature or request)
Description
Summary
Implement a feature in the web crawler that automatically discovers, fetches, parses, and enforces the rules specified in a website's robots.txt file before crawling any URLs from that domain.
This includes respecting Disallow, Allow, and Crawl-delay directives, and ensuring that the crawler neither accesses nor queues URLs forbidden by the site's robots.txt policy. The crawler should cache robots.txt files so each host's rules are fetched only once.
Affected Area(s)
Apps:
- Url Shortener (apps/url-shortener)
- Web Crawler (apps/web-crawler)
Libraries:
- Shared (libs/shared)
Other:
- Other (please specify):
Motivation
Respecting robots.txt prevents overloading servers and avoids crawling restricted areas, aligning with industry best practices and ethical standards.