DATA512 A1: Data Curation
=========================

This repository serves as my submission for a homework assignment for the Fall 2018 DATA 512 Human Centered Data Science course at the University of Washington.

How does English Wikipedia page traffic trend over time? To answer this question, we plot Wikimedia traffic data spanning from 1 January 2008 to 30 September 2018. We combine data from two Wikimedia APIs, each covering a different period of time, so that we can show the trend over 10 years.

The first API is Legacy Pagecounts (https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), with aggregated data from January 2008 to July 2016. The second API is Pageviews (https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), with aggregated data starting from May 2015. Besides covering different periods, the two APIs differ in quality: Pagecounts did not distinguish traffic generated by people from traffic generated by automated robots and web crawlers, whereas the newer Pageviews API does make this distinction and lets us count only traffic from agents not identified as web spiders.

The data acquired from both API endpoints is available in the public domain under the CC0 1.0 license, according to Wikimedia's RESTBase documentation: https://wikimedia.org/api/rest_v1/. Use of these APIs is subject to the terms and conditions described at: https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions.

Data dictionary
---------------

Cleaned data from the Wikimedia APIs are saved to en-wikipedia_traffic_200712-201809.csv with the following fields:

| Column Name             | Description                                  | Format  |
|-------------------------|----------------------------------------------|---------|
| year                    | 4-digit year of the period                   | YYYY    |
| month                   | 2-digit month of the period                  | MM      |
| pagecount_all_views     | Total monthly views (Pagecounts API)         | integer |
| pagecount_desktop_views | Monthly desktop views (Pagecounts API)       | integer |
| pagecount_mobile_views  | Monthly mobile views (Pagecounts API)        | integer |
| pageview_all_views      | Total monthly views (Pageviews API)          | integer |
| pageview_desktop_views  | Monthly desktop views (Pageviews API)        | integer |
| pageview_mobile_views   | Monthly mobile views (Pageviews API)         | integer |

How to run the code
-------------------

The code is supplied in a Python 3 Jupyter notebook, named hcds-a1-data-curation.ipynb, in this folder. Open this folder in a Python 3 Jupyter environment, load the notebook, then choose Run All Cells. If the raw data JSON files are present in the "data_raw" folder, the local copies will be used; otherwise, the Wikimedia APIs will be called to download the data from the source. Cleaned data is output into the "data_clean" folder as en-wikipedia_traffic_200712-201809.csv. The result of this analysis, a plot image, is output into the "results" folder as en-wikipedia_traffic_200712-201809.png. The sketches below illustrate the main steps: querying the two APIs, caching the raw responses, and reshaping the combined data.
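As a point of reference, here is a minimal sketch of how the two endpoints described above can be queried with the requests library. The endpoint templates follow the AQS documentation linked earlier; the User-Agent header value and the exact date ranges are illustrative assumptions, not the notebook's literal code.

```python
# Minimal sketch: querying the two Wikimedia AQS endpoints with requests.
# The User-Agent value and the date ranges below are illustrative assumptions.
import requests

HEADERS = {"User-Agent": "https://github.com/EdmundTse/data-512-a1"}

LEGACY_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/"
    "aggregate/{project}/{access}/{granularity}/{start}/{end}")
PAGEVIEWS_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/"
    "aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}")

# Legacy Pagecounts has no agent parameter, so crawler traffic
# cannot be excluded from its counts.
legacy_url = LEGACY_ENDPOINT.format(
    project="en.wikipedia.org", access="desktop-site",
    granularity="monthly", start="2008010100", end="2016080100")

# Pageviews: agent="user" counts only traffic not identified as a web spider.
pageviews_url = PAGEVIEWS_ENDPOINT.format(
    project="en.wikipedia.org", access="desktop", agent="user",
    granularity="monthly", start="2015050100", end="2018100100")

legacy_desktop = requests.get(legacy_url, headers=HEADERS).json()
pageviews_desktop = requests.get(pageviews_url, headers=HEADERS).json()
```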
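The cache-or-download behavior described under "How to run the code" can be sketched as follows. The helper name and the example cache filename are hypothetical; only the behavior (reuse a raw JSON file from "data_raw" if present, otherwise call the API and save the response) comes from the description above.

```python
# Sketch of the cache-or-download behavior; get_json and the example
# cache filename are hypothetical, not the notebook's exact code.
import json
import os

import requests

def get_json(url, cache_path, headers=None):
    """Return parsed JSON from the local cache, downloading it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    data = requests.get(url, headers=headers).json()
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump(data, f)
    return data

# legacy_url as constructed in the previous sketch; the filename is illustrative.
legacy_desktop = get_json(
    legacy_url, "data_raw/pagecounts_desktop-site_200801-201607.json",
    headers=HEADERS)
```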
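Finally, a hedged sketch of the cleaning step that produces the CSV schema in the data dictionary, assuming the four API responses (legacy_desktop, legacy_mobile, pageviews_desktop, pageviews_mobile) have been loaded as in the sketches above. The per-record field names ("timestamp", "count" for Pagecounts, "views" for Pageviews) follow the AQS response format; the merging logic here is illustrative rather than the notebook's exact code.

```python
# Illustrative reshaping of four AQS responses (dicts with an "items" list)
# into the CSV schema documented above.
import pandas as pd

def to_series(items, value_key):
    """Index a list of AQS records by (year, month), keeping one value column."""
    df = pd.DataFrame(items)
    df["year"] = df["timestamp"].str[:4]
    df["month"] = df["timestamp"].str[4:6]
    return df.set_index(["year", "month"])[value_key]

# Legacy Pagecounts records carry "count"; Pageviews records carry "views".
combined = pd.DataFrame({
    "pagecount_desktop_views": to_series(legacy_desktop["items"], "count"),
    "pagecount_mobile_views": to_series(legacy_mobile["items"], "count"),
    "pageview_desktop_views": to_series(pageviews_desktop["items"], "views"),
    "pageview_mobile_views": to_series(pageviews_mobile["items"], "views"),
}).fillna(0).astype(int)

combined["pagecount_all_views"] = (
    combined["pagecount_desktop_views"] + combined["pagecount_mobile_views"])
combined["pageview_all_views"] = (
    combined["pageview_desktop_views"] + combined["pageview_mobile_views"])

combined.reset_index().to_csv(
    "data_clean/en-wikipedia_traffic_200712-201809.csv", index=False)
```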