Capstone project for the Udacity Data Engineering Nanodegree. In this project, we perform an ETL process on a set of NBA datasets.
As a huge NBA fan, I'd like to extract and organise data and insights about my favourite teams and players. I will restructure the CSVs so that I can easily query augmented information that combines data from the original sources. The resulting database will be a good example of an analytics table. The data will be extracted from 5 different CSVs:
- historical_nba_performance.csv
This dataset contains information about team performance per year.
- nba_all_star_games.csv
This dataset includes information about the players participating in the yearly All-Star games.
- nba_shots_2000_to_2018.csv
This dataset (the biggest one, with over 1M rows) gives a detailed description of every shot taken from 2000 to 2018 in every game.
- player_data.csv
Information about each player.
- players.csv
More information about players.
First things first. To run the project correctly, please install the dependencies in a fresh Python environment.
# Run this in the project's source folder. I am using conda in this example.
conda create --name dec python=3.7 --no-default-packages
# Activate the new environment so that pip installs into it.
conda activate dec
pip install -e .
import pandas as pd
# Setting the chunksize to 100 since we just want to take a first look into the data ;)
historical = pd.read_csv("data/historical_nba_performance.csv", chunksize=100).get_chunk()
all_star = pd.read_csv("data/all_star.csv", chunksize=100).get_chunk()
shots = pd.read_csv("data/NBA_Shots_2000_to_2018.csv", chunksize=100).get_chunk()
player_data_1 = pd.read_csv("data/player_data.csv", chunksize=100).get_chunk()
player_data_2 = pd.read_csv("data/Players.csv", chunksize=100).get_chunk()
historical.head()
| | Year | Team | Record | Winning Percentage | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-17 | Celtics | 25-15 | 0.625 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2015-16 | Celtics | 48-34 | 0.585 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2014-15 | Celtics | 40-42 | 0.488 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2013-14 | Celtics | 25-57 | 0.305 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2012-13 | Celtics | 41-40 | 0.506 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
OK, so this CSV has 4 named columns plus a number of unnamed columns, which contain only NaN in this sample and were probably reserved for extra values. The named columns are:
- Year
- Team
- Record
- Winning Percentage
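Since the unnamed columns carry no data in the sample, one option is to drop them right after loading. The snippet below is a minimal sketch of that idea, not the project's actual cleaning step: the inline CSV is a tiny stand-in for the real file, and the `wins` and `losses` columns are hypothetical names introduced here for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for historical_nba_performance.csv. In this stand-in, the
# empty trailing header fields are what pandas turns into "Unnamed: N" columns.
sample_csv = io.StringIO(
    "Year,Team,Record,Winning Percentage,,\n"
    "2016-17,Celtics,25-15,0.625,,\n"
    "2015-16,Celtics,48-34,0.585,,\n"
)
raw = pd.read_csv(sample_csv)

# Keep only the columns whose header does not start with "Unnamed".
named = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

# Split the "wins-losses" record into two integer columns
# (hypothetical column names, not part of the original dataset).
record_parts = named["Record"].str.split("-")
named = named.assign(
    wins=record_parts.str[0].astype(int),
    losses=record_parts.str[1].astype(int),
)
print(list(named.columns))
```

The same column filter should work unchanged on `all_star` and `shots`, which show the same kind of unnamed columns.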
all_star.head()
| | Year | Player | Pos | HT | WT | Team | Selection Type | NBA Draft Status | Nationality | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | Stephen Curry | G | 6-3 | 190 | Golden State Warriors | Western All-Star Fan Vote Selection | 2009 Rnd 1 Pick 7 | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2016 | James Harden | SG | 6-5 | 220 | Houston Rockets | Western All-Star Fan Vote Selection | 2009 Rnd 1 Pick 3 | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2016 | Kevin Durant | SF | 6-9 | 240 | Golden State Warriors | Western All-Star Fan Vote Selection | 2007 Rnd 1 Pick 2 | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2016 | Kawhi Leonard | F | 6-7 | 230 | San Antonio Spurs | Western All-Star Fan Vote Selection | 2011 Rnd 1 Pick 15 | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2016 | Anthony Davis | PF | 6-11 | 253 | New Orleans Pelicans | Western All-Star Fan Vote Selection | 2012 Rnd 1 Pick 1 | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
print([col for col in all_star.columns if 'Unnamed' not in col])
['Year', 'Player', 'Pos', 'HT', 'WT', 'Team', 'Selection Type', 'NBA Draft Status', 'Nationality']
shots.head()
| | Unnamed: 0 | X | ID | Player | Season | Top.px. (Location) | Left.px. (location) | Date | Team | Opponent | Location | Quarter | Game_Clock | Outcome (1 if made, 0 otherwise) | Shot_Value | Shot_Distance.ft. | Team_Score | Opponent_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 250 | 304 | 110600 | VAN | ATL | HOME | 3 | 00:38.4 | 0 | 2 | 21 | 69 | 55 |
| 1 | 2 | 2 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 147 | 241 | 111800 | VAN | DAL | HOME | 2 | 9:22 | 1 | 2 | 10 | 33 | 26 |
| 2 | 3 | 3 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 132 | 403 | 112400 | VAN | DET | AWAY | 3 | 6:42 | 0 | 2 | 18 | 60 | 78 |
| 3 | 4 | 4 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 177 | 129 | 112400 | VAN | DET | AWAY | 3 | 2:42 | 0 | 2 | 17 | 66 | 80 |
| 4 | 5 | 5 | abdulma02 | Mahmoud Abdul-Rauf | 2001 | 99 | 390 | 112400 | VAN | DET | AWAY | 3 | 2:18 | 0 | 2 | 16 | 66 | 80 |
print([col for col in shots.columns if 'Unnamed' not in col])
['X', 'ID', 'Player', 'Season', 'Top.px. (Location)', 'Left.px. (location)', 'Date', 'Team', 'Opponent', 'Location', 'Quarter', 'Game_Clock', 'Outcome (1 if made, 0 otherwise)', 'Shot_Value', 'Shot_Distance.ft.', 'Team_Score', 'Opponent_Score']
The information in this table makes for a great fact table.
player_data_1.head()
| | name | year_start | year_end | position | height | weight | birth_date | college |
|---|---|---|---|---|---|---|---|---|
| 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240 | June 24, 1968 | Duke University |
| 1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235 | April 7, 1946 | Iowa State University |
| 2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225 | April 16, 1947 | University of California, Los Angeles |
| 3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162 | March 9, 1969 | Louisiana State University |
| 4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223 | November 3, 1974 | San Jose State University |
player_data_2.head()
| | Unnamed: 0 | Player | height | weight | collage | born | birth_city | birth_state |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Curly Armstrong | 180 | 77 | Indiana University | 1918 | NaN | NaN |
| 1 | 1 | Cliff Barker | 188 | 83 | University of Kentucky | 1921 | Yorktown | Indiana |
| 2 | 2 | Leo Barnhorst | 193 | 86 | University of Notre Dame | 1924 | NaN | NaN |
| 3 | 3 | Ed Bartels | 196 | 88 | North Carolina State University | 1925 | NaN | NaN |
| 4 | 4 | Ralph Beard | 178 | 79 | University of Kentucky | 1927 | Hardinsburg | Kentucky |
print([col for col in player_data_1.columns if 'Unnamed' not in col])
print([col for col in player_data_2.columns if 'Unnamed' not in col])
['name', 'year_start', 'year_end', 'position', 'height', 'weight', 'birth_date', 'college']
['Player', 'height', 'weight', 'collage', 'born', 'birth_city', 'birth_state']
As you can see, both tables contain similar data, so the architecture will combine them into a single players table that keeps the useful information from each.
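A rough sketch of how the two player tables might be combined is shown below. It assumes that `player_data.csv` stores height as a feet-inches string (for example `6-10`) while `Players.csv` stores it in centimetres, which is what the samples suggest but is not confirmed. The rows are made-up placeholders, and `feet_inches_to_cm` and `height_cm` are hypothetical names.

```python
import pandas as pd


def feet_inches_to_cm(height):
    """Convert a 'feet-inches' string such as '6-10' to whole centimetres."""
    feet, inches = height.split("-")
    return round((int(feet) * 12 + int(inches)) * 2.54)


# Made-up placeholder rows that mimic the layout of the two CSVs.
player_data_1 = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "position": ["C", "G"],
    "height": ["7-2", "6-1"],
    "college": ["College X", None],
})
player_data_2 = pd.DataFrame({
    "Player": ["Player B", "Player C"],
    "height": [185, 178],  # assumed to be centimetres
    "collage": ["College Y", "College Z"],  # the source column is spelled this way
    "birth_city": ["City Y", "City Z"],
})

# Bring both tables to the same column names and units.
left = player_data_1.assign(height_cm=player_data_1["height"].map(feet_inches_to_cm))
left = left[["name", "position", "height_cm", "college"]]
right = player_data_2.rename(
    columns={"Player": "name", "height": "height_cm", "collage": "college"}
)

# An outer join keeps players that appear in only one of the two sources.
players = left.merge(right, on="name", how="outer", suffixes=("", "_alt"))

# Prefer the first source and fall back to the second where a value is missing.
for col in ("height_cm", "college"):
    players[col] = players[col].combine_first(players[col + "_alt"])
players = players.drop(columns=["height_cm_alt", "college_alt"])
print(players)
```

Joining on the player name alone is a simplification: two different players can share a name, so a real pipeline would probably need an extra key such as the birth year.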
Given the nature of the data, I am going to use a star schema on an Amazon Redshift cluster with 4 nodes. The idea is to keep a full copy of the player and team dimension tables on every node (distribution style ALL), while the shots fact table is distributed across all 4 nodes. This lets us scale the database horizontally if needed, and it speeds up queries because each node can join its slice of the fact table against its local copy of the dimensions.
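To make the distribution idea concrete, here is a hedged sketch of what the table definitions could look like. The table and column names are assumptions made for illustration, not the project's final schema. `DISTSTYLE EVEN` is one way to spread the fact table across nodes; `DISTSTYLE KEY` with a `DISTKEY` would be an alternative. The `create_tables` helper is hypothetical and expects any DB-API style cursor.

```python
# Hypothetical DDL for the star schema; all names and types are illustrative.
CREATE_TABLES = {
    # Dimension tables: a full copy on every node.
    "players": """
        CREATE TABLE IF NOT EXISTS players (
            player_id        VARCHAR(16)  NOT NULL,
            name             VARCHAR(128) NOT NULL,
            position         VARCHAR(8),
            college          VARCHAR(128),
            nba_draft_status VARCHAR(64),
            nationality      VARCHAR(64),
            PRIMARY KEY (player_id)
        ) DISTSTYLE ALL;
    """,
    "teams": """
        CREATE TABLE IF NOT EXISTS teams (
            team_id VARCHAR(8)  NOT NULL,
            name    VARCHAR(64) NOT NULL,
            PRIMARY KEY (team_id)
        ) DISTSTYLE ALL;
    """,
    # Fact table: rows are spread across the nodes.
    "shots": """
        CREATE TABLE IF NOT EXISTS shots (
            shot_id          BIGINT      NOT NULL,
            player_id        VARCHAR(16) NOT NULL,
            team_id          VARCHAR(8)  NOT NULL,
            opponent_id      VARCHAR(8)  NOT NULL,
            season           SMALLINT    NOT NULL,
            quarter          SMALLINT,
            game_clock_secs  REAL,
            made             SMALLINT,
            shot_value       SMALLINT,
            shot_distance_ft SMALLINT
        ) DISTSTYLE EVEN;
    """,
}


def create_tables(cursor):
    """Run every CREATE TABLE statement on a DB-API style cursor."""
    for ddl in CREATE_TABLES.values():
        cursor.execute(ddl)
```

Because the dimensions are replicated everywhere, the fact table does not need to be co-located with them, which is why an even distribution is enough in this sketch.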
Note that the players table will also take some information from the original "all_star" CSV, in particular each player's nba_draft_status and nationality.
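One possible way to pull those two fields into the players table is sketched below. The All-Star data has one row per player per year, so it is reduced to one row per player before joining. The rows are illustrative samples in the shape of the CSV, and the joined `players` frame is a placeholder for the real dimension table.

```python
import pandas as pd

# Illustrative rows in the shape of the All-Star CSV (one row per player per year).
all_star = pd.DataFrame({
    "Year": [2016, 2016, 2015],
    "Player": ["Stephen Curry", "James Harden", "Stephen Curry"],
    "NBA Draft Status": ["2009 Rnd 1 Pick 7", "2009 Rnd 1 Pick 3", "2009 Rnd 1 Pick 7"],
    "Nationality": ["United States", "United States", "United States"],
})

# Placeholder for the players dimension built from the two player CSVs.
players = pd.DataFrame({
    "name": ["Stephen Curry", "James Harden", "Mahmoud Abdul-Rauf"],
})

# Keep the most recent All-Star row per player, then only the two fields we need.
draft_info = (
    all_star.sort_values("Year")
    .drop_duplicates(subset="Player", keep="last")
    .loc[:, ["Player", "NBA Draft Status", "Nationality"]]
    .rename(columns={
        "Player": "name",
        "NBA Draft Status": "nba_draft_status",
        "Nationality": "nationality",
    })
)

# A left join keeps every player; non-All-Stars simply get missing values.
players = players.merge(draft_info, on="name", how="left")
print(players)
```

Players who never appeared in an All-Star game end up with empty draft status and nationality, so those columns have to be nullable in the final table.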
This architecture will let us extract and augment useful information about the shots taken. Some of the questions it should be able to answer:
- Teams with the most cumulative points scored per season.
- Players with the most cumulative points scored.
- "Clutch" players (players who score in the last 2 minutes of the 4th quarter).
- Players with the best 3-point scoring percentage.
- Colleges with the most winning players (players who have won the most games).
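As a sanity check on two of the questions above, here is a small pandas sketch computing the 3-point percentage and the "clutch" points per player. In the project these would more likely be SQL queries against Redshift; pandas is used here only to keep the example self-contained. The rows are made up, and the sketch assumes that `Game_Clock` holds the time remaining in the quarter as `minutes:seconds`, which the samples suggest but do not confirm.

```python
import pandas as pd

OUTCOME = "Outcome (1 if made, 0 otherwise)"


def clock_to_seconds(clock):
    """Convert a clock string such as '9:22' or '00:38.4' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# Made-up shots using the column names of the shots CSV.
shots = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "B"],
    "Quarter": [4, 4, 2, 4, 1, 4],
    "Game_Clock": ["1:30", "00:38.4", "9:22", "5:00", "2:18", "0:05"],
    OUTCOME: [1, 0, 1, 1, 1, 1],
    "Shot_Value": [3, 3, 2, 3, 3, 2],
})

# 3-point percentage: share of made shots among all 3-point attempts.
threes = shots[shots["Shot_Value"] == 3]
three_pt_pct = threes.groupby("Player")[OUTCOME].mean().sort_values(ascending=False)

# Clutch points: made shots in the 4th quarter with at most 2 minutes left
# (assuming Game_Clock counts down the time remaining).
remaining = shots["Game_Clock"].map(clock_to_seconds)
is_clutch = (shots["Quarter"] == 4) & (remaining <= 120) & (shots[OUTCOME] == 1)
clutch_points = shots[is_clutch].groupby("Player")["Shot_Value"].sum()

print(three_pt_pct)
print(clutch_points)
```

A production version would also want a minimum number of attempts before ranking players by percentage, otherwise a player with a single made 3-pointer tops the list.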
