Learning to scrape a website also means learning to recognize how websites are built.
We use a script to download all of the HTML that loads when we visit a URL. Then we look through that HTML, picking out the parts that are interesting to us and saving them. To do this, we need to be able to:
- Identify what part of the HTML we're interested in
- Tell our script what that HTML means
- Save our results somewhere
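The three steps above can be sketched with nothing but the Python standard library. The tutorial itself uses requests and BeautifulSoup, so treat this as a rough stand-in: the `PAGE` string below is a hypothetical snippet standing in for HTML you would actually download from a URL, and the `TableParser` class is just one way to pull cell text out of a table.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for HTML downloaded from a URL.
# In the real scraper this text would come from the live page.
PAGE = """
<html><body>
<table id="inmates">
  <tr><td>Smith</td><td>2015-01-01</td></tr>
  <tr><td>Jones</td><td>2015-01-02</td></tr>
</table>
</body></html>
"""

class TableParser(HTMLParser):
    """Step 1 and 2: walk the HTML and collect the text of every
    <td> cell, grouped into rows -- the part of the page we care about."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(PAGE)

# Step 3: save the results somewhere. Here that's a string buffer;
# a real scraper would write to a CSV file on disk instead.
buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
```

With BeautifulSoup, the parsing step collapses to a couple of `find`/`find_all` calls, which is why the tutorial reaches for it instead of hand-rolling a parser.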
Why scrape? Many reasons, including...
- Saving yourself the trouble of visiting the same website every day. Think live results, or daily updating graphics. Or maybe you want to monitor whether a website has changed from day to day.
- It's faster to scrape than to manually copy hundreds or thousands or millions of rows of data.
- You can't get the [fill in the blank] office to give you a clean version of the data to download.
- Go to this website.
- If you wanted to identify the table on the page, what is the most distinctive attribute you could use?
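The usual answer is an `id` attribute, since an id is meant to be unique on a page. A minimal stdlib sketch of that idea follows; the `PAGE` snippet and the id `"results"` are made up for illustration, not taken from the exercise page.

```python
from html.parser import HTMLParser

# Hypothetical page: two tables, only one with the id we want.
PAGE = (
    '<table id="results"><tr><td>x</td></tr></table>'
    '<table class="nav"><tr><td>y</td></tr></table>'
)

class IdFinder(HTMLParser):
    """Record which tag carries a given id attribute."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found_tag = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; an id match pins
        # down exactly one element on a well-formed page.
        if dict(attrs).get("id") == self.target_id:
            self.found_tag = tag

finder = IdFinder("results")
finder.feed(PAGE)
```

With BeautifulSoup the same lookup is a one-liner, `soup.find("table", id="results")`, which is the form you'll see in the full tutorial.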
Python basics refresher, which we did last session: Getting started with Python
The full scraper you should have by the end of the day: jailscrape.py
Again, the full tutorial: http://first-web-scraper.readthedocs.org/en/latest/