Description
> crawler operators should expose the IP ranges they allocated for crawling in a standardized, machine readable format, and keep it reasonably up to date (i.e. shouldn't get older than 7 days).
-
Replace "shouldn't get older than 7 days" by "shouldn't get outdated for more than 7 days"?
The term "old" is unspecific: it could also mean that the file is required to be touched every 7 days, without keeping the information up-to-date. -
Is a 7-day update period sufficient?
There are at least the following use cases for the list of IP address ranges:

- (i) verify requests in web server access logs;
- (ii) configure IP blocking, or explicitly allow the specified IP ranges;
- (iii) IP targeting, which also includes "black hat" techniques such as "cloaking".
For (i), "historic" IP ranges are a requirement: otherwise verification may raise false alarms about faked user-agent strings in past requests. This also suggests attaching validity periods to IP address ranges.
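For illustration, a hypothetical `crawlerips.json` fragment with validity periods attached. The field names (`creationTime`, `prefixes`, `ipv4Prefix`, `validFrom`, `validUntil`) are assumptions for this sketch, not taken from any agreed format:

```json
{
  "creationTime": "2024-07-01T00:00:00+00:00",
  "prefixes": [
    {
      "ipv4Prefix": "192.0.2.0/24",
      "validFrom": "2024-01-01T00:00:00+00:00",
      "validUntil": "2024-06-30T00:00:00+00:00"
    },
    {
      "ipv4Prefix": "198.51.100.0/24",
      "validFrom": "2024-06-01T00:00:00+00:00"
    }
  ]
}
```

An entry without `validUntil` would be currently active; expired entries stay in the file so that older log lines remain verifiable.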
For (ii) and (iii), a 7-day update period seems overly long; new IP ranges should be announced even ahead of time. Since we have to assume that everybody plays nice anyway, "black hat" techniques are no counter-argument against announcing IP addresses ahead of time. However, they might be mentioned in the section "Security Considerations".
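To make use case (i) concrete, a minimal Python sketch that checks a logged request against one entry of the hypothetical format above (assuming `request_time` is a timezone-aware timestamp parsed from the access log):

```python
import ipaddress
from datetime import datetime

def matches_published_range(entry: dict, ip: str, request_time: datetime) -> bool:
    """Verify a logged request against one published prefix entry.
    Field names follow the hypothetical example above, not an agreed format."""
    prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
    if ipaddress.ip_address(ip) not in ipaddress.ip_network(prefix):
        return False
    # The request only verifies if it falls inside the range's validity period.
    if "validFrom" in entry and request_time < datetime.fromisoformat(entry["validFrom"]):
        return False
    if "validUntil" in entry and request_time > datetime.fromisoformat(entry["validUntil"]):
        return False
    return True

# e.g. matches_published_range(entry, "192.0.2.42",
#                              datetime.fromisoformat("2024-03-01T12:00:00+00:00"))
```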
-
An example JSON schema would help to keep variations of `crawlerips.json` to a minimum. At least, it should be mentioned what "standardized" means: "constant" for the crawler, with a published description/specification, or part of a globally defined standard (maybe upcoming).
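One possible sketch (JSON Schema draft 2020-12). The property names are assumptions: `creationTime` and a `prefixes` array with `ipv4Prefix`/`ipv6Prefix` loosely follow the shape some operators already publish (e.g. Google's `googlebot.json`), extended with the hypothetical validity fields from above:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "crawlerips.json (illustrative sketch, not a proposed standard)",
  "type": "object",
  "required": ["creationTime", "prefixes"],
  "properties": {
    "creationTime": {
      "type": "string",
      "format": "date-time",
      "description": "Generation time of this file; lets consumers check the 7-day freshness rule."
    },
    "prefixes": {
      "type": "array",
      "items": {
        "type": "object",
        "minProperties": 1,
        "properties": {
          "ipv4Prefix": { "type": "string" },
          "ipv6Prefix": { "type": "string" },
          "validFrom": { "type": "string", "format": "date-time" },
          "validUntil": { "type": "string", "format": "date-time" }
        }
      }
    }
  }
}
```

Making `creationTime` required would give consumers a machine-checkable way to apply whatever freshness rule the spec settles on.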
-
What about reverse DNS as an alternative or complementary technique for crawler verification? Should this be mentioned, or even added as a section?
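If it is added, the relevant technique is forward-confirmed reverse DNS (FCrDNS): resolve the PTR record of the requesting IP, check the hostname against the crawler's published domain suffixes, then resolve the hostname forward and confirm it maps back to the same IP. A minimal Python sketch (IPv4-oriented, since `socket.gethostbyname_ex` only returns IPv4 addresses; the suffixes are per-crawler assumptions):

```python
import socket

def verify_crawler_fcrdns(ip: str, allowed_suffixes: tuple) -> bool:
    """Forward-confirmed reverse DNS check for a single requesting IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except socket.herror:
        return False
    # The hostname must belong to the crawler's announced domain(s),
    # e.g. allowed_suffixes = (".googlebot.com", ".google.com")
    if not hostname.endswith(allowed_suffixes):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips  # confirmed only if it maps back to the same IP
```

Compared to a published IP list, this needs no file to keep fresh, but it costs two DNS lookups per check and does not help with use case (ii), where ranges must be known ahead of time to configure firewalls.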