Week 8 - Web Scraping

Being able to scrape a website also means being able to recognize how websites are built.

How Does a Web Scraper Work?

We use a script to download all of the HTML that loads when we visit a URL. Then we look through that HTML, pick out the parts that are interesting to us, and save them. To do this, we need to be able to:

  • Identify what part of the HTML we're interested in
  • Tell our script what that HTML means
  • Save our results somewhere
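The steps above can be sketched with nothing but the Python standard library. This is only an illustration, not the scraper from the tutorial, which may use different libraries. The sample HTML, the `CellCollector` class, and the function names are all made up for this example, and `download_html` is defined but not called so the sketch runs without a network connection.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def download_html(url):
    """Fetch the raw HTML for a URL (defined for illustration, not called here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []              # start a new row
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")        # start a new cell in the current row

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:               # skip rows with no <td> cells (e.g. headers)
                self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Ada</td><td>36</td></tr>
  <tr><td>Grace</td><td>45</td></tr>
</table>
"""


def scrape_rows(html):
    """Identify the interesting HTML (table cells) and turn it into Python lists."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    """Save the results as CSV text; writing to a file works the same way."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = scrape_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To scrape a live page, you would pass the result of `download_html(...)` to `scrape_rows` instead of `SAMPLE_HTML`.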

Why Scrape?

Many reasons, including...

  • Saving yourself the trouble of visiting the same website every day. Think live results, or daily updating graphics. Or maybe you want to monitor whether or not a website has changed from day to day.
  • It's faster to scrape than to manually copy hundreds or thousands or millions of rows of data.
  • You can't get the [fill in the blank] office to give you a clean version of the data to download.

A Different Exercise for HTML/CSS

  1. Go to this website.
  2. If I wanted to identify the table on the page, which attribute would single it out most reliably?
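To see why the choice of attribute matters, here is a small sketch on made-up HTML (not the page from the exercise). The `id` and `class` values and the `TableFinder` helper are assumptions invented for this example. The point is that a `class` is often shared by several elements, while an `id` is meant to be unique on a page.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Hypothetical helper: records the attributes of every <table> tag."""

    def __init__(self):
        super().__init__()
        self.tables = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append(dict(attrs))


# Made-up page with two tables that share a class but not an id.
PAGE = """
<div class="content">
  <table class="layout"><tr><td>navigation</td></tr></table>
  <table id="results" class="layout striped"><tr><td>data</td></tr></table>
</div>
"""

finder = TableFinder()
finder.feed(PAGE)

# Matching on the shared class finds both tables ...
by_class = [t for t in finder.tables if "layout" in t.get("class", "").split()]
# ... while matching on the id finds exactly one.
by_id = [t for t in finder.tables if t.get("id") == "results"]

print(len(by_class), len(by_id))
```

On a real page, your browser's "Inspect" or "View Source" tool shows which attributes the table actually has.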

Your first web scraper

Python basics refresher, which we did last session: Getting started with Python

The full scraper you should have by the end of the day: jailscrape.py

Again, the full tutorial: http://first-web-scraper.readthedocs.org/en/latest/