Learning to scrape a website also means learning to recognize how websites are built.
We use a script to download all of the HTML that loads when we visit a URL. Then we look through that HTML, picking out the parts that are interesting to us and saving them. To do this, we need to be able to:
- Identify what part of the HTML we're interested in
- Tell our script what that HTML means
- Save our results somewhere
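The three steps above can be sketched with nothing but the Python standard library. The tutorial itself uses requests and BeautifulSoup, so treat this as a rough stand-in: the `PAGE` string below is a hypothetical snippet standing in for HTML you would actually download from a URL, and the `TableParser` class is just one way to pull cell text out of a table.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for HTML downloaded from a URL.
# In the real scraper this text would come from the live page.
PAGE = """
<html><body>
<table id="inmates">
  <tr><td>Smith</td><td>2015-01-01</td></tr>
  <tr><td>Jones</td><td>2015-01-02</td></tr>
</table>
</body></html>
"""

class TableParser(HTMLParser):
    """Step 1 and 2: walk the HTML and collect the text of every
    <td> cell, grouped into rows -- the part of the page we care about."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(PAGE)

# Step 3: save the results somewhere. Here that's a string buffer;
# a real scraper would write to a CSV file on disk instead.
buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
```

With BeautifulSoup, the parsing step collapses to a couple of `find`/`find_all` calls, which is why the tutorial reaches for it instead of hand-rolling a parser.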
Why scrape? Many reasons, including...
- Saving yourself the trouble of visiting the same website every day. Think live results, or daily updating graphics. Or maybe you want to monitor whether a website has changed from day to day.
- It's faster to scrape than to manually copy hundreds or thousands or millions of rows of data.
- You can't get the [fill in the blank] office to give you a clean version of the data to download.
- Go to this website.
- If you wanted to identify the table on the page, what is the most distinctive attribute you could use?
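The usual answer is an `id` attribute, since an id is meant to be unique on a page. A minimal stdlib sketch of that idea follows; the `PAGE` snippet and the id `"results"` are made up for illustration, not taken from the exercise page.

```python
from html.parser import HTMLParser

# Hypothetical page: two tables, only one with the id we want.
PAGE = (
    '<table id="results"><tr><td>x</td></tr></table>'
    '<table class="nav"><tr><td>y</td></tr></table>'
)

class IdFinder(HTMLParser):
    """Record which tag carries a given id attribute."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found_tag = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; an id match pins
        # down exactly one element on a well-formed page.
        if dict(attrs).get("id") == self.target_id:
            self.found_tag = tag

finder = IdFinder("results")
finder.feed(PAGE)
```

With BeautifulSoup the same lookup is a one-liner, `soup.find("table", id="results")`, which is the form you'll see in the full tutorial.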
Python basics refresher, which we did last session: Getting started with Python
The full scraper you should have by the end of the day: jailscrape.py
Again, the full tutorial: http://first-web-scraper.readthedocs.org/en/latest/