Capstone project for the Udacity Data Engineering Nanodegree. In this project, we perform an ETL process on a set of NBA datasets.

Action52/NBADataEngineering

NBADataEngineering

As a huge NBA fan, I'd like to extract and organise some data and insights about my favourite teams and players. I will restructure the CSV files so that I can easily run queries that combine data from the original sources. The resulting database should serve as a good example of an analytics table. The data is extracted from 5 different CSV files:

  • historical_nba_performance.csv
    This dataset contains information about team performance per year.
  • nba_all_star_games.csv
    This dataset includes information about the players participating in the yearly All-Star games.
  • nba_shots_2000_to_2018.csv
    This dataset (the biggest one, with over 1 million rows) gives a detailed description of every shot taken in every game from 2000 to 2018.
  • player_data.csv
    Information about each player.
  • players.csv
    More information about players.

Installation

To run the project, first install the dependencies in a fresh Python environment.

# Run this from the project's source folder. This example uses conda.
conda create --name dec python=3.7 --no-default-packages
conda activate dec
pip install -e .

Exploratory Data Analysis

import pandas as pd

# Read only the first 100 rows of each file, since we just want a first look at the data ;)
historical = pd.read_csv("data/historical_nba_performance.csv", nrows=100)
all_star = pd.read_csv("data/all_star.csv", nrows=100)
shots = pd.read_csv("data/NBA_Shots_2000_to_2018.csv", nrows=100)
player_data_1 = pd.read_csv("data/player_data.csv", nrows=100)
player_data_2 = pd.read_csv("data/Players.csv", nrows=100)

Historical Teams table

historical.head()

|   | Year | Team | Record | Winning Percentage |
|---|---|---|---|---|
| 0 | 2016-17 | Celtics | 25-15 | 0.625 |
| 1 | 2015-16 | Celtics | 48-34 | 0.585 |
| 2 | 2014-15 | Celtics | 40-42 | 0.488 |
| 3 | 2013-14 | Celtics | 25-57 | 0.305 |
| 4 | 2012-13 | Celtics | 41-40 | 0.506 |

The output also contains 17 more columns, Unnamed: 4 through Unnamed: 20, which hold only NaN in these rows and are left out of the table above.

So this CSV has 4 named columns, plus a number of unnamed, empty columns that were probably reserved for extra values. The named columns are:

  • Year
  • Team
  • Record
  • Winning Percentage
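The cleanup this suggests can be sketched as follows. It is a minimal sketch, not the project's actual ETL code: the inline CSV text stands in for the real file, and the `wins` and `losses` columns are hypothetical additions that are not part of the original dataset.

```python
import io

import pandas as pd

# Stand-in for the real file: two rows from the sample above, with trailing
# empty fields that pandas reads as "Unnamed: N" columns.
raw = io.StringIO(
    "Year,Team,Record,Winning Percentage,,\n"
    "2016-17,Celtics,25-15,0.625,,\n"
    "2015-16,Celtics,48-34,0.585,,\n"
)
historical = pd.read_csv(raw)

# Drop only the unnamed columns that are completely empty.
empty_unnamed = [
    col
    for col in historical.columns
    if col.startswith("Unnamed") and historical[col].isna().all()
]
historical = historical.drop(columns=empty_unnamed)

# Hypothetical extra step: split "25-15" into numeric wins and losses.
record = historical["Record"].str.split("-", expand=True).astype(int)
historical["wins"] = record[0]
historical["losses"] = record[1]

print(historical)
```

Checking that a column is entirely empty before dropping it avoids silently discarding an unnamed column that does carry values further down the file.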

All-Star teams table

all_star.head()

|   | Year | Player | Pos | HT | WT | Team | Selection Type | NBA Draft Status | Nationality |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | Stephen Curry | G | 6-3 | 190 | Golden State Warriors | Western All-Star Fan Vote Selection | 2009 Rnd 1 Pick 7 | United States |
| 1 | 2016 | James Harden | SG | 6-5 | 220 | Houston Rockets | Western All-Star Fan Vote Selection | 2009 Rnd 1 Pick 3 | United States |
| 2 | 2016 | Kevin Durant | SF | 6-9 | 240 | Golden State Warriors | Western All-Star Fan Vote Selection | 2007 Rnd 1 Pick 2 | United States |
| 3 | 2016 | Kawhi Leonard | F | 6-7 | 230 | San Antonio Spurs | Western All-Star Fan Vote Selection | 2011 Rnd 1 Pick 15 | United States |
| 4 | 2016 | Anthony Davis | PF | 6-11 | 253 | New Orleans Pelicans | Western All-Star Fan Vote Selection | 2012 Rnd 1 Pick 1 | United States |

The output also contains 16 more columns, Unnamed: 9 through Unnamed: 24, which hold only NaN in these rows and are left out of the table above.

print([col for col in all_star.columns if 'Unnamed' not in col ])
['Year', 'Player', 'Pos', 'HT', 'WT', 'Team', 'Selection Type', 'NBA Draft Status', 'Nationality']
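The NBA Draft Status values in the sample rows follow a "year Rnd round Pick pick" pattern, which could be split into separate fields before loading. The sketch below assumes that pattern; `parse_draft_status` is a hypothetical helper, and the "Undrafted" string is only an illustrative non-matching value, not one taken from the dataset.

```python
import re
from typing import Optional, Tuple

# Matches values such as "2009 Rnd 1 Pick 7" seen in the sample rows.
DRAFT_PATTERN = re.compile(r"^(\d{4}) Rnd (\d+) Pick (\d+)$")


def parse_draft_status(status: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, round, pick), or None if the value has another shape."""
    match = DRAFT_PATTERN.match(status.strip())
    if match is None:
        return None
    year, draft_round, pick = (int(group) for group in match.groups())
    return year, draft_round, pick


print(parse_draft_status("2009 Rnd 1 Pick 7"))
print(parse_draft_status("Undrafted"))
```

Returning None for unexpected values keeps the load from failing on rows that do not follow the pattern, at the cost of having to handle missing draft fields later.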

NBA Shots

shots.head()

|   | Unnamed: 0 | X | ID | Player | Season | Top.px. (Location) | Left.px. (location) | Date | Team | Opponent | Location | Quarter | Game_Clock | Outcome (1 if made, 0 otherwise) | Shot_Value | Shot_Distance.ft. | Team_Score | Opponent_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 250 | 304 | 110600 | VAN | ATL | HOME | 3 | 00:38.4 | 0 | 2 | 21 | 69 | 55 |
| 1 | 2 | 2 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 147 | 241 | 111800 | VAN | DAL | HOME | 2 | 9:22 | 1 | 2 | 10 | 33 | 26 |
| 2 | 3 | 3 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 132 | 403 | 112400 | VAN | DET | AWAY | 3 | 6:42 | 0 | 2 | 18 | 60 | 78 |
| 3 | 4 | 4 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 177 | 129 | 112400 | VAN | DET | AWAY | 3 | 2:42 | 0 | 2 | 17 | 66 | 80 |
| 4 | 5 | 5 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 99 | 390 | 112400 | VAN | DET | AWAY | 3 | 2:18 | 0 | 2 | 16 | 66 | 80 |

print([col for col in shots.columns if 'Unnamed' not in col ])
['X', 'ID', 'Player', 'Season', 'Top.px. (Location)', 'Left.px. (location)', 'Date', 'Team', 'Opponent', 'Location', 'Quarter', 'Game_Clock', 'Outcome (1 if made, 0 otherwise)', 'Shot_Value', 'Shot_Distance.ft.', 'Team_Score', 'Opponent_Score']

The information in this table makes for a great fact table.

player_data and Players tables

player_data_1.head()

|   | name | year_start | year_end | position | height | weight | birth_date | college |
|---|---|---|---|---|---|---|---|---|
| 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240 | June 24, 1968 | Duke University |
| 1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235 | April 7, 1946 | Iowa State University |
| 2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225 | April 16, 1947 | University of California, Los Angeles |
| 3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162 | March 9, 1969 | Louisiana State University |
| 4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223 | November 3, 1974 | San Jose State University |

player_data_2.head()

|   | Unnamed: 0 | Player | height | weight | collage | born | birth_city | birth_state |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Curly Armstrong | 180 | 77 | Indiana University | 1918 | NaN | NaN |
| 1 | 1 | Cliff Barker | 188 | 83 | University of Kentucky | 1921 | Yorktown | Indiana |
| 2 | 2 | Leo Barnhorst | 193 | 86 | University of Notre Dame | 1924 | NaN | NaN |
| 3 | 3 | Ed Bartels | 196 | 88 | North Carolina State University | 1925 | NaN | NaN |
| 4 | 4 | Ralph Beard | 178 | 79 | University of Kentucky | 1927 | Hardinsburg | Kentucky |

print([col for col in player_data_1.columns if 'Unnamed' not in col ])
print([col for col in player_data_2.columns if 'Unnamed' not in col ])
['name', 'year_start', 'year_end', 'position', 'height', 'weight', 'birth_date', 'college']
['Player', 'height', 'weight', 'collage', 'born', 'birth_city', 'birth_state']

As you can see, both tables contain similar data, so the data model will combine these tables and keep the useful information from each.
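One way to combine the two tables is sketched below. It rests on assumptions that should be checked against the full files: that player_data.csv stores height as feet-inches and weight in pounds, that Players.csv stores centimetres and kilograms (the sample rows suggest this), and that player names match exactly across both files. The rows, `feet_inches_to_cm` and `pounds_to_kg` are hypothetical placeholders, not real data or project code.

```python
import pandas as pd


def feet_inches_to_cm(height: str) -> float:
    """Convert a "6-10" style height to centimetres."""
    feet, inches = height.split("-")
    return round((int(feet) * 12 + int(inches)) * 2.54, 1)


def pounds_to_kg(pounds: float) -> float:
    return round(pounds * 0.45359237, 1)


# Hypothetical rows that reuse the column names of the two source tables.
player_data_1 = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "position": ["F-C", "G"],
    "height": ["6-10", "6-1"],
    "weight": [240, 162],
    "college": ["College X", "College Y"],
})
player_data_2 = pd.DataFrame({
    "Player": ["Player A", "Player C"],
    "height": [208, 180],
    "weight": [109, 77],
    "collage": ["College X", "College Z"],
    "born": [1968, 1918],
})

# Bring both tables to the same column names and units.
left = player_data_1.assign(
    height_cm=player_data_1["height"].map(feet_inches_to_cm),
    weight_kg=player_data_1["weight"].map(pounds_to_kg),
).drop(columns=["height", "weight"])
right = player_data_2.rename(columns={
    "Player": "name",
    "height": "height_cm",
    "weight": "weight_kg",
    "collage": "college",
})

# Keep every player from either table; prefer the first table's values and
# fall back to the second table where the first has none.
players = left.merge(right, on="name", how="outer", suffixes=("", "_alt"))
for col in ["height_cm", "weight_kg", "college"]:
    players[col] = players[col].fillna(players[col + "_alt"])
players = players.drop(columns=[c for c in players.columns if c.endswith("_alt")])

print(players)
```

Joining on the name alone will merge distinct players who share a name, so a real pipeline would probably need an extra key such as the birth year.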

Proposed Data Model

Given the nature of the data, I am going to use a star schema on an Amazon Redshift cluster with 4 nodes. The idea is to replicate the player and team dimension tables on every node (distribution style ALL) and to distribute the shots fact table across all 4 nodes. This lets us scale the database horizontally if needed, and it speeds up queries because each node already holds the dimension data it needs to join against its share of the facts.
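The distribution choices described above could be expressed in Redshift DDL roughly as follows. This is only a sketch: the table names, column names and types are hypothetical, and DISTSTYLE EVEN for the fact table is an assumption (a DISTKEY on the player ID would be another option). The statements are kept as Python strings so they can be run through any DB-API cursor.

```python
# Hypothetical table and column names; only the DISTSTYLE clauses reflect
# the design described in the text.
DDL = {
    "dim_players": """
        CREATE TABLE IF NOT EXISTS dim_players (
            player_id        VARCHAR(16) NOT NULL,
            name             VARCHAR(128),
            position         VARCHAR(8),
            college          VARCHAR(128),
            nba_draft_status VARCHAR(64),
            nationality      VARCHAR(64)
        )
        DISTSTYLE ALL;
    """,
    "dim_teams": """
        CREATE TABLE IF NOT EXISTS dim_teams (
            season             VARCHAR(8) NOT NULL,
            team               VARCHAR(64) NOT NULL,
            wins               SMALLINT,
            losses             SMALLINT,
            winning_percentage DECIMAL(4, 3)
        )
        DISTSTYLE ALL;
    """,
    "fact_shots": """
        CREATE TABLE IF NOT EXISTS fact_shots (
            shot_id          BIGINT NOT NULL,
            player_id        VARCHAR(16) NOT NULL,
            season           SMALLINT,
            team             VARCHAR(8),
            opponent         VARCHAR(8),
            quarter          SMALLINT,
            game_clock       VARCHAR(8),
            made             BOOLEAN,
            shot_value       SMALLINT,
            shot_distance_ft SMALLINT
        )
        DISTSTYLE EVEN;
    """,
}


def create_tables(cursor) -> None:
    """Run every statement on a DB-API cursor, for example one from psycopg2."""
    for statement in DDL.values():
        cursor.execute(statement)
```

With DISTSTYLE ALL, joins between the fact table and a dimension need no data movement between nodes, which is the main reason to replicate small dimension tables.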

Note that the players table will be enriched with some information from the original "all_star" CSV, in particular the nba_draft_status and the nationality of the player.

This architecture will let us extract and augment useful information about the shots taken. Some of the queries it should be able to answer:

  • Teams with the most cumulative points scored per season.
  • Players with the most cumulative points scored.
  • Players with "clutch", meaning that they are able to score in the last 2 minutes of the 4th quarter.
  • Players with the best 3-point shooting percentage.
  • Colleges with the most winning players (players who have won the most games).
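Two of these questions can be prototyped in pandas before any warehouse exists. The sketch below uses the column names of the shots CSV, but the rows are hypothetical placeholders. It also assumes that Game_Clock holds the time remaining in the quarter as minutes:seconds, and it ignores overtime.

```python
import pandas as pd

OUTCOME = "Outcome (1 if made, 0 otherwise)"


def clock_to_seconds(clock: str) -> float:
    """Convert "9:22" or "00:38.4" to the seconds left in the quarter."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# Hypothetical shots that reuse the column names of the source file.
shots = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "B"],
    "Quarter": [4, 4, 2, 4, 4, 1],
    "Game_Clock": ["1:30", "00:38.4", "9:22", "1:59", "5:00", "0:10"],
    OUTCOME: [1, 0, 1, 1, 1, 0],
    "Shot_Value": [3, 2, 3, 2, 3, 3],
})

# "Clutch" points: made shots in the last 2 minutes of the 4th quarter.
shots["seconds_left"] = shots["Game_Clock"].map(clock_to_seconds)
clutch = shots[(shots["Quarter"] == 4) & (shots["seconds_left"] <= 120)]
clutch_points = (
    (clutch[OUTCOME] * clutch["Shot_Value"]).groupby(clutch["Player"]).sum()
)

# 3-point percentage: share of made shots among 3-point attempts.
threes = shots[shots["Shot_Value"] == 3]
three_point_pct = threes.groupby("Player")[OUTCOME].mean()

print(clutch_points.sort_values(ascending=False))
print(three_point_pct.sort_values(ascending=False))
```

The same filters translate directly into WHERE and GROUP BY clauses on the shots fact table once it is loaded.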
