Design Overview Interface Reference

cgray edited this page Sep 3, 2013 · 1 revision

Crawler

Interface that defines the crawl task as a whole. Construction requires an Http\Client, a DocumentFactory, and a UrlFilter, plus a URL that serves as the starting point of the crawl.
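
As a rough illustration of how these collaborators might fit together, here is a minimal sketch in Python (the namespace-style names on this page suggest the project itself is PHP; Python is used only for illustration). Everything in it is an assumption: the three collaborators are reduced to plain callables, the factory returns a list of links rather than a full Document, and the `crawl` method name and breadth-first order are hypothetical rather than taken from the actual interface.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set
from urllib.parse import urljoin

# Hypothetical stand-ins for the collaborators named on this page.
Client = Callable[[str], str]                  # URL -> response body
DocumentFactory = Callable[[str, str], List[str]]  # (URL, body) -> links found
UrlFilter = Callable[[str], bool]              # URL -> may it be crawled?


class Crawler:
    """Illustrative crawler built from a client, a factory, a filter, and a start URL."""

    def __init__(self, client: Client, document_factory: DocumentFactory,
                 url_filter: UrlFilter, start_url: str) -> None:
        self.client = client
        self.document_factory = document_factory
        self.url_filter = url_filter
        self.start_url = start_url

    def crawl(self) -> Dict[str, List[str]]:
        """Visit pages breadth-first and return the links found on each page."""
        queue: Deque[str] = deque([self.start_url])
        seen: Set[str] = {self.start_url}
        results: Dict[str, List[str]] = {}
        while queue:
            url = queue.popleft()
            body = self.client(url)
            links = self.document_factory(url, body)
            results[url] = links
            for link in links:
                absolute = urljoin(url, link)  # resolve relative links
                # Only queue URLs the filter allows and we have not seen yet.
                if absolute not in seen and self.url_filter(absolute):
                    seen.add(absolute)
                    queue.append(absolute)
        return results
```

Passing the collaborators in at construction time, as the description above requires, means a test can swap in an in-memory client instead of making real HTTP requests.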

Http\Client

Interface that describes the mechanism the crawler uses to fetch the documents it needs.

DocumentFactory

Component that takes a response returned from an Http\Client and uses it to create and return a Document object.

Processor

Component registered with a DocumentFactory that is responsible for parsing and annotating the information in an Http\Client response for a given MIME type.
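
The relationship between DocumentFactory and Processor can be sketched as a registry keyed by MIME type. This is a minimal Python illustration, not the project's actual API: the `Response` and `Document` fields, the `register` and `create` method names, and the regex-based link processor are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Response:
    """Hypothetical shape of what an HTTP client hands back."""
    url: str
    mime_type: str
    body: str


@dataclass
class Document:
    """Hypothetical value object describing a fetched resource."""
    url: str
    mime_type: str
    links: List[str] = field(default_factory=list)


# A processor inspects the response and annotates the document in place.
Processor = Callable[[Response, Document], None]


class DocumentFactory:
    def __init__(self) -> None:
        self._processors: Dict[str, List[Processor]] = {}

    def register(self, mime_type: str, processor: Processor) -> None:
        """Attach a processor to one MIME type."""
        self._processors.setdefault(mime_type, []).append(processor)

    def create(self, response: Response) -> Document:
        """Build a Document and run every processor registered for its MIME type."""
        document = Document(url=response.url, mime_type=response.mime_type)
        for processor in self._processors.get(response.mime_type, []):
            processor(response, document)
        return document


HREF = re.compile(r'href="([^"]+)"')


def html_link_processor(response: Response, document: Document) -> None:
    """Naive link extraction; a real processor would use an HTML parser."""
    document.links.extend(HREF.findall(response.body))
```

With this layout, a response whose MIME type has no registered processor still produces a Document, just without annotations.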

Document

A simple value object with some accessor and mutator functions that represents a resource.

UrlFilter

A container that holds zero or more UrlFilter\Rule objects. The crawler uses it to limit the scope of the crawl.

UrlFilter\Rule

A class that represents a condition under which a particular URL is allowed to be crawled.
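
One possible reading of the UrlFilter and UrlFilter\Rule pairing is sketched below in Python, for illustration only. The page does not say how multiple rules combine, so this sketch assumes a URL must satisfy every rule and that an empty filter allows everything; the `allows` and `add_rule` method names and the `same_host` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List
from urllib.parse import urlparse


@dataclass
class Rule:
    # Hypothetical stand-in for UrlFilter\Rule: one condition a URL must meet.
    description: str
    predicate: Callable[[str], bool]

    def allows(self, url: str) -> bool:
        return self.predicate(url)


@dataclass
class UrlFilter:
    # Container holding zero or more rules.
    rules: List[Rule] = field(default_factory=list)

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def allows(self, url: str) -> bool:
        # Assumption: every rule must pass; with no rules, all() returns True.
        return all(rule.allows(url) for rule in self.rules)


def same_host(host: str) -> Rule:
    """Example rule that keeps the crawl on a single host."""
    return Rule(f"host is {host}", lambda url: urlparse(url).hostname == host)
```

A filter built with `same_host("example.com")` would, under these assumptions, keep the crawler from following links to other sites.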

Report

A class that, when bound to the results of a crawl, allows the creation of some sort of (hopefully) useful formatted report.
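
What "bound to the results of a crawl" could look like is sketched below in Python, purely as an illustration. The shape of the results (a mapping from each crawled URL to the links found on it) and the `to_text` method are assumptions; the page does not specify either.

```python
from typing import Dict, List


class Report:
    """Illustrative report bound to crawl results at construction time."""

    def __init__(self, results: Dict[str, List[str]]) -> None:
        # Assumed shape: crawled URL -> links found on that page.
        self._results = results

    def to_text(self) -> str:
        """Render a plain-text summary, one line per crawled URL, sorted by URL."""
        lines = [f"Crawled {len(self._results)} page(s)"]
        for url in sorted(self._results):
            lines.append(f"{url}: {len(self._results[url])} link(s)")
        return "\n".join(lines)
```

Other output formats could be added as further methods on the same class without touching the crawl itself.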