DATA512 A1: Data Curation
=========================

How does English Wikipedia page traffic trend over time? To answer this
question, we plot Wikimedia traffic data spanning December 2007 through
September 2018. We combine data from two Wikimedia APIs, each covering a
different period of time, so that we can show the trend over more than
10 years.

The first API is Legacy Pagecounts
(https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), with
aggregated data from December 2007 to July 2016. The second API is Pageviews
(https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), with
aggregated data starting from May 2015.
Besides the difference in the periods for which data are available, there is
a quality difference between the two APIs: Pagecounts did not differentiate
between traffic generated by people and traffic generated by automated
robots and web crawlers. The newer Pageviews API does make this distinction,
letting us count only traffic from agents not identified as web spiders.
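
As a sketch of how these two endpoints can be queried with the requests
library (the endpoint templates come from the AQS documentation; the exact
parameter values and variable names here are illustrative, and the
notebook's actual calls may differ):

```python
import requests

# Endpoint templates from the Wikimedia AQS documentation.
LEGACY = ("https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/"
          "aggregate/{project}/{access}/{granularity}/{start}/{end}")
PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
             "aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}")

HEADERS = {"User-Agent": "https://github.com/EdmundTse/data-512-a1"}

# Legacy Pagecounts: monthly desktop traffic, December 2007 to July 2016.
legacy = requests.get(LEGACY.format(
    project="en.wikipedia.org", access="desktop-site",
    granularity="monthly", start="2007120100", end="2016080100"),
    headers=HEADERS).json()

# Pageviews: monthly desktop traffic from human users only (the "user"
# agent value excludes identified spiders), May 2015 to September 2018.
pageviews = requests.get(PAGEVIEWS.format(
    project="en.wikipedia.org", access="desktop", agent="user",
    granularity="monthly", start="2015050100", end="2018100100"),
    headers=HEADERS).json()
```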

The data acquired from these two API endpoints are available in the public
domain under the CC0 1.0 license, according to Wikimedia's RESTBase
documentation: https://wikimedia.org/api/rest_v1/. Use of these APIs is
subject to the terms and conditions described at
https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions.


Data dictionary
---------------

Cleaned data from the Wikimedia APIs are saved to
en-wikipedia_traffic_200712-201809.csv with the following fields:

| Column Name             | Description                                   | Format  |
|-------------------------|-----------------------------------------------|---------|
| year                    | 4-digit year of the period                    | YYYY    |
| month                   | 2-digit month of the period                   | MM      |
| pagecount_all_views     | All views from the Legacy Pagecounts API      | integer |
| pagecount_desktop_views | Desktop views from the Legacy Pagecounts API  | integer |
| pagecount_mobile_views  | Mobile views from the Legacy Pagecounts API   | integer |
| pageview_all_views      | All views from the Pageviews API              | integer |
| pageview_desktop_views  | Desktop views from the Pageviews API          | integer |
| pageview_mobile_views   | Mobile views from the Pageviews API           | integer |
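
For example, the cleaned file can be loaded for further analysis along
these lines (a sketch assuming the data_clean output location described
below):

```python
import pandas as pd

# Load the cleaned data and build a datetime column for time-series work.
df = pd.read_csv("data_clean/en-wikipedia_traffic_200712-201809.csv")
df["date"] = pd.to_datetime(dict(year=df["year"], month=df["month"], day=1))
print(df.set_index("date").head())
```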


How to run the code
-------------------

The code is supplied as a Python 3 Jupyter notebook named
hcds-a1-data-curation.ipynb in this folder. Open this folder in a Jupyter
environment with a Python 3 kernel, load the notebook, and run all cells.

If the raw data JSON files are present in the "data_raw" folder, the local
copies are used. Otherwise, the Wikimedia APIs are called to download the
data from the source.
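
This caching behaviour can be summarised by a helper along these lines (a
sketch; the notebook's actual function name and structure may differ):

```python
import json
import os

import requests

def load_or_fetch(filename, url):
    """Use the cached JSON in data_raw if present, else download and cache it."""
    path = os.path.join("data_raw", filename)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = requests.get(url, headers={"User-Agent": "data-512-a1"}).json()
    os.makedirs("data_raw", exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```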

Cleaned data is output into the "data_clean" folder as:
en-wikipedia_traffic_200712-201809.csv.

The result of this analysis, a plot image, is output into the "results"
folder as: en-wikipedia_traffic_200712-201809.png.
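
A minimal sketch of how such a plot could be produced from the cleaned CSV
(the column selection and labels are illustrative, not necessarily those
used for the actual figure):

```python
import os

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data_clean/en-wikipedia_traffic_200712-201809.csv")
df["date"] = pd.to_datetime(dict(year=df["year"], month=df["month"], day=1))

# Plot each traffic series against time on a single set of axes.
fig, ax = plt.subplots(figsize=(12, 6))
for col in ["pagecount_desktop_views", "pagecount_mobile_views",
            "pageview_desktop_views", "pageview_mobile_views"]:
    ax.plot(df["date"], df[col], label=col)
ax.set_xlabel("Month")
ax.set_ylabel("Monthly views")
ax.legend()

os.makedirs("results", exist_ok=True)
fig.savefig("results/en-wikipedia_traffic_200712-201809.png")
```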

About
-----

This repository serves as my submission for a homework assignment for the
Fall 2018 DATA 512 Human Centered Data Science course at the University of
Washington.
