Crawlers

High level crawling process

Gathering data from across the web is a dynamic and continually evolving challenge.

To make things manageable, the problem is broken up into four stages:

Collecting sitemaps
Crawling ad listings
Crawling individual ads
Processing ad data and crawling ad images

Each stage of the process feeds into the next.

A sitemap is a collection of all the ad listing urls on a page. It changes slowly, so we only check for new listings once per day.

Ad listings are pages that link to individual ads. We depth crawl them once an hour to find new ads to crawl. Depth crawling means we crawl an ad listing, find the url of the "next" ad listing, and then crawl that page too. We repeat this cycle of crawling a listing, gathering its ads, and following the "next" link until we hit a max depth or until we've already crawled "enough" of the ads on the page. Think of this as going back in time on sites like craigslist.
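The depth crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: `depth_crawl`, `fetch_listing`, `max_depth`, and `enough_seen` are hypothetical names, and "enough" is assumed here to mean a count of ads that were already crawled.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical signature: given a listing url, return the ad urls on that
# page and the url of the "next" listing page (None on the last page).
FetchListing = Callable[[str], Tuple[List[str], Optional[str]]]


def depth_crawl(
    start_url: str,
    fetch_listing: FetchListing,
    already_crawled: Set[str],
    max_depth: int = 5,
    enough_seen: int = 10,
) -> List[str]:
    """Follow "next" links and collect ad urls we have not crawled yet."""
    new_ads: List[str] = []
    seen_count = 0
    url: Optional[str] = start_url
    depth = 0
    while url is not None and depth < max_depth:
        ad_urls, next_url = fetch_listing(url)
        for ad_url in ad_urls:
            if ad_url in already_crawled:
                seen_count += 1
            elif ad_url not in new_ads:
                new_ads.append(ad_url)
        # Assumption: stop once "enough" already-crawled ads have shown up.
        if seen_count >= enough_seen:
            break
        url = next_url
        depth += 1
    return new_ads
```

Passing the page fetcher in as a callable keeps the loop itself free of networking, so it can be exercised with a plain dictionary of fake pages.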

Once we have the individual ad urls, we crawl those ads. Raw ads are saved in blobstore, and structured ad data is sent to the processing step.

In the final processing step we add the structured ad data to CosmosDB and save ad images to blobstore.

To be good web citizens, we also have a per-domain shared throttle that limits the rate at which we crawl any website.
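As a rough illustration of per-domain throttling, the sketch below enforces a minimum interval between requests to the same domain. It is an assumption-laden simplification: `DomainThrottle` and `min_interval` are hypothetical names, and the state lives in one process, whereas a throttle shared between function instances would need to keep it in some shared store.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which each domain may be requested again.
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the url's domain may be crawled; return seconds waited."""
        domain = urlparse(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(domain, now) - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed[domain] = now + delay + self.min_interval
        return delay
```

A crawler would call `throttle.wait(url)` right before each request. The clock and sleep functions are injectable so that tests do not have to sleep for real.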

Crawling is broken up into 3 different problems:

Collecting ads
Gathering ad listings (so we can get ads)
Mapping a site to get ad listing urls

Because we want reliable and highly scalable systems, we use a queue-based architecture. We've broken the problem down into several sub-apps inside this repo:

siteloader - loads domains into the site mapping crawl queue
sitemapper - fetches the page from which the ad listing urls can be parsed
adlistings - gathers pages with ad listings. Those pages are parsed to gather ad urls
ads - gathers ad pages.

All of the parsing is done by a separate app function. These crawlers only collect and add jobs to downstream parse queues.

One final thing to note - these jobs pass a 'metadata' dictionary along at each step. It provides context (for example, which domain the job came from).
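To make the metadata hand-off concrete, here is a small sketch of how a crawler might copy the incoming job's metadata onto every job it emits. The JSON layout (a `url` key next to a `metadata` key) and the name `make_downstream_jobs` are assumptions for illustration; the actual message format in this repo may differ.

```python
import json
from typing import Any, Dict, List


def make_downstream_jobs(job_body: str, urls: List[str]) -> List[str]:
    """Create one JSON job per url, carrying the incoming metadata forward."""
    job: Dict[str, Any] = json.loads(job_body)
    metadata = job.get("metadata", {})
    # Copy the dict so downstream jobs never share one mutable object.
    return [json.dumps({"url": url, "metadata": dict(metadata)}) for url in urls]
```

For example, a listing job tagged with `{"domain": "example.com"}` would produce ad jobs that each carry the same `domain` entry.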

It's also good to review the Azure Functions Python guide.

General:

Functions are split into __init__.py and main.py files. The __init__.py handles pulling messages from and putting messages on queues, and passes each message to the main.py function.

You don't strictly need to test the __init__.py, but be sure the main.py is well covered.
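The split between the two files might look something like the sketch below, shown in one block for brevity. All names (`crawl`, `handle_message`, `fetch`, `put`) and the payload fields are placeholders rather than the repo's actual code. In the deployed function the message body and the output would presumably come from the queue trigger and output bindings; plain bytes and a callable stand in for them here so the sketch runs without the Azure SDK.

```python
import json
from typing import Callable, Dict, List

# --- main.py (sketch): pure logic, the part worth covering with tests ---


def crawl(job: Dict, fetch: Callable[[str], str]) -> List[Dict]:
    """Fetch the job's url and build jobs for the downstream parse queue."""
    html = fetch(job["url"])
    # Placeholder payload: real jobs may reference stored content instead.
    return [
        {
            "url": job["url"],
            "content_length": len(html),
            "metadata": job.get("metadata", {}),
        }
    ]


# --- __init__.py (sketch): thin queue glue around main.py ---


def handle_message(
    body: bytes,
    put: Callable[[str], None],
    fetch: Callable[[str], str],
) -> None:
    """Decode a queue message, delegate to crawl(), and enqueue the results."""
    job = json.loads(body.decode("utf-8"))
    for out_job in crawl(job, fetch):
        put(json.dumps(out_job))
```

Because `crawl` takes the fetcher as an argument, it can be tested with a stub that returns canned HTML, which is why coverage effort is best spent on main.py rather than on the glue.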

There's also a utils/ folder for some common code (mostly networking).

General crawl issues:

TODO: docs on this

Siteloader

The siteloader is a timer based function that outputs jobs on the sitemappercrawling queue.

More details here

Sitemapper

The sitemapper fetches a page from which we can gather the ad listing urls.

It takes jobs from the sitemappercrawling queue and outputs jobs onto the sitemappingparser queue.

More details here

Data Definitions

We have a living document of the definitions of all fields that are pulled out of the ads in the parsers. Please review these before creating a new parser so that you have a good sense of what the data means. If you add a new field to the parsers, please also add it to this document.

Deployment

Deployment is triggered by a push to the master branch. This runs deploy.yml, a GitHub hook that deploys the function to Azure.
