NBADataEngineering

Capstone project for the Udacity Data Engineering Nanodegree. In this project, we perform an ETL process on a set of NBA datasets.

As a huge NBA fan, I'd like to extract and organise some data and insights about my favourite teams and players. I will reorganise the CSVs so that I can easily query for augmented information that combines data from the original sources. The resulting database is meant to serve as an example of an analytics table. The data will be extracted from 5 different CSV files:

historical_nba_performance.csv
This dataset contains information about team performance per year.
nba_all_star_games.csv
This dataset includes information about the players participating in the yearly All-Star games.
nba_shots_2000_to_2018.csv
This dataset (the biggest one, with over 1 million rows) gives a detailed description of every shot taken in every game from 2000 to 2018.
player_data.csv
Information about each player.
players.csv
More information about players.

Installation

First things first. To run the project correctly, please install the dependencies in a fresh Python environment.

# Run this from the project's source folder. I am using conda in this example.
conda create --name dec python=3.7 --no-default-packages
# Activate the new environment so that pip installs into it.
conda activate dec
pip install -e .

Exploratory Data Analysis

import pandas as pd

# Read only the first 100 rows of each file, since we just want to take a first look into the data ;)
historical = pd.read_csv("data/historical_nba_performance.csv", nrows=100)
all_star = pd.read_csv("data/all_star.csv", nrows=100)
shots = pd.read_csv("data/NBA_Shots_2000_to_2018.csv", nrows=100)
player_data_1 = pd.read_csv("data/player_data.csv", nrows=100)
player_data_2 = pd.read_csv("data/Players.csv", nrows=100)

Historical Teams table

historical.head()

	Year	Team	Record	Winning Percentage	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Unnamed: 15	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20
0	2016-17	Celtics	25-15	0.625	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2015-16	Celtics	48-34	0.585	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2014-15	Celtics	40-42	0.488	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2013-14	Celtics	25-57	0.305	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	2012-13	Celtics	41-40	0.506	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Ok, so apparently this csv has 4 named columns plus a number of unnamed columns, all NaN in the rows shown, that were probably reserved for extra values. The named columns are:

Year
Team
Record
Winning Percentage
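As a rough sketch of how the empty columns could be handled, the snippet below drops every column whose name starts with "Unnamed" and splits Record into wins and losses. The helper name `drop_unnamed_columns`, the `wins`/`losses` column names and the two-row sample frame are my own illustration, not part of the project code.

```python
import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without pandas' auto-generated 'Unnamed: N' columns."""
    keep = ~df.columns.str.startswith("Unnamed")
    return df.loc[:, keep].copy()


# Small stand-in for the first rows of historical_nba_performance.csv.
sample = pd.DataFrame({
    "Year": ["2016-17", "2015-16"],
    "Team": ["Celtics", "Celtics"],
    "Record": ["25-15", "48-34"],
    "Winning Percentage": [0.625, 0.585],
    "Unnamed: 4": [float("nan"), float("nan")],
    "Unnamed: 5": [float("nan"), float("nan")],
})

clean = drop_unnamed_columns(sample)

# Split "25-15" into separate integer columns (hypothetical column names).
parts = clean["Record"].str.split("-", expand=True)
clean["wins"] = parts[0].astype(int)
clean["losses"] = parts[1].astype(int)

print(list(clean.columns))
```

The same helper should work unchanged on the other CSVs that carry "Unnamed" columns.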

All-Star teams table

all_star.head()

	Year	Player	Pos	HT	WT	Team	Selection Type	NBA Draft Status	Nationality	Unnamed: 9	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Unnamed: 15	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24
0	2016	Stephen Curry	G	6-3	190	Golden State Warriors	Western All-Star Fan Vote Selection	2009 Rnd 1 Pick 7	United States	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2016	James Harden	SG	6-5	220	Houston Rockets	Western All-Star Fan Vote Selection	2009 Rnd 1 Pick 3	United States	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2016	Kevin Durant	SF	6-9	240	Golden State Warriors	Western All-Star Fan Vote Selection	2007 Rnd 1 Pick 2	United States	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2016	Kawhi Leonard	F	6-7	230	San Antonio Spurs	Western All-Star Fan Vote Selection	2011 Rnd 1 Pick 15	United States	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	2016	Anthony Davis	PF	6-11	253	New Orleans Pelicans	Western All-Star Fan Vote Selection	2012 Rnd 1 Pick 1	United States	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

print([col for col in all_star.columns if 'Unnamed' not in col ])

['Year', 'Player', 'Pos', 'HT', 'WT', 'Team', 'Selection Type', 'NBA Draft Status', 'Nationality']

NBA Shots

shots.head()

	Unnamed: 0	X	ID	Player	Season	Top.px. (Location)	Left.px. (location)	Date	Team	Opponent	Location	Quarter	Game_Clock	Outcome (1 if made, 0 otherwise)	Shot_Value	Shot_Distance.ft.	Team_Score	Opponent_Score
0	1	1	abdulma02	Mahmoud Abdul-Rauf	2001	250	304	110600	VAN	ATL	HOME	3	00:38.4	0	2	21	69	55
1	2	2	abdulma02	Mahmoud Abdul-Rauf	2001	147	241	111800	VAN	DAL	HOME	2	9:22	1	2	10	33	26
2	3	3	abdulma02	Mahmoud Abdul-Rauf	2001	132	403	112400	VAN	DET	AWAY	3	6:42	0	2	18	60	78
3	4	4	abdulma02	Mahmoud Abdul-Rauf	2001	177	129	112400	VAN	DET	AWAY	3	2:42	0	2	17	66	80
4	5	5	abdulma02	Mahmoud Abdul-Rauf	2001	99	390	112400	VAN	DET	AWAY	3	2:18	0	2	16	66	80

print([col for col in shots.columns if 'Unnamed' not in col ])

['X', 'ID', 'Player', 'Season', 'Top.px. (Location)', 'Left.px. (location)', 'Date', 'Team', 'Opponent', 'Location', 'Quarter', 'Game_Clock', 'Outcome (1 if made, 0 otherwise)', 'Shot_Value', 'Shot_Distance.ft.', 'Team_Score', 'Opponent_Score']

The information in this table makes for a great fact table.

player_data and Players tables

player_data_1.head()

	name	year_start	year_end	position	height	weight	birth_date	college
0	Alaa Abdelnaby	1991	1995	F-C	6-10	240	June 24, 1968	Duke University
1	Zaid Abdul-Aziz	1969	1978	C-F	6-9	235	April 7, 1946	Iowa State University
2	Kareem Abdul-Jabbar	1970	1989	C	7-2	225	April 16, 1947	University of California, Los Angeles
3	Mahmoud Abdul-Rauf	1991	2001	G	6-1	162	March 9, 1969	Louisiana State University
4	Tariq Abdul-Wahad	1998	2003	F	6-6	223	November 3, 1974	San Jose State University

player_data_2.head()

	Unnamed: 0	Player	height	weight	collage	born	birth_city	birth_state
0	0	Curly Armstrong	180	77	Indiana University	1918	NaN	NaN
1	1	Cliff Barker	188	83	University of Kentucky	1921	Yorktown	Indiana
2	2	Leo Barnhorst	193	86	University of Notre Dame	1924	NaN	NaN
3	3	Ed Bartels	196	88	North Carolina State University	1925	NaN	NaN
4	4	Ralph Beard	178	79	University of Kentucky	1927	Hardinsburg	Kentucky

print([col for col in player_data_1.columns if 'Unnamed' not in col ])
print([col for col in player_data_2.columns if 'Unnamed' not in col ])

['name', 'year_start', 'year_end', 'position', 'height', 'weight', 'birth_date', 'college']
['Player', 'height', 'weight', 'collage', 'born', 'birth_city', 'birth_state']

As you can see, both tables contain similar data, so in the architecture we will make sure to combine these tables and keep the useful information from each.
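One possible way to combine the two player tables is sketched below. It assumes that heights in player_data.csv are feet-inches strings such as "6-10" and that heights in Players.csv are centimetres, which is what the samples above suggest but is not confirmed anywhere. The players, colleges and cities in the sample frames are made up for illustration, and `feet_inches_to_cm` is a hypothetical helper.

```python
import pandas as pd


def feet_inches_to_cm(value: str) -> float:
    """Convert a 'feet-inches' string such as '6-10' to centimetres."""
    feet, inches = value.split("-")
    return round((int(feet) * 12 + int(inches)) * 2.54, 1)


# Made-up rows shaped like player_data.csv (height as feet-inches).
player_data = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "height": ["6-10", "6-1"],
    "college": ["College X", "College Y"],
})

# Made-up rows shaped like Players.csv (height assumed to be in cm).
players = pd.DataFrame({
    "Player": ["Player B", "Player C"],
    "height": [185, 180],
    "birth_city": ["City Y", "City Z"],
})

# Outer join keeps players that appear in only one of the two files.
merged = player_data.merge(
    players,
    how="outer",
    left_on="name",
    right_on="Player",
    suffixes=("_ft_in", "_cm"),
)

converted = merged["height_ft_in"].map(feet_inches_to_cm, na_action="ignore")
combined = pd.DataFrame({
    "name": merged["name"].fillna(merged["Player"]),
    # Prefer the centimetre value, fall back to the converted one.
    "height_cm": merged["height_cm"].fillna(converted),
    "college": merged["college"],
    "birth_city": merged["birth_city"],
})
print(combined)
```

Matching on the raw player name is the simplest option; players who share a name would need an extra key such as birth year.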

Proposed Data Model

Given the nature of the data, I am going to use a star schema on a Redshift cluster with 4 nodes. The idea is to keep a full copy of the player and team dimension tables on every node (distribution style ALL), while the shots fact table is distributed across all 4 nodes. This lets us scale the database horizontally if needed, and it should also make queries faster, since each node can join its slice of the fact table with the dimensions locally instead of moving rows between nodes.

Note that the players table will also pull some information from the original "all_star" csv, in particular the NBA draft status and the nationality of the player.
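To make the distribution idea more concrete, here is a minimal sketch of what the table definitions could look like, kept as SQL strings in Python so they can be handed to whatever Redshift client the pipeline uses. The table names, column names and types are assumptions of mine based on the CSV samples above, not the final schema. `DISTSTYLE ALL` and `DISTSTYLE EVEN` are standard Redshift options; EVEN is only one way to spread the fact table, and a `DISTKEY` would be another.

```python
# Hypothetical DDL for the proposed star schema (names and types are illustrative).
DDL = {
    # Dimension tables: a full copy on every node.
    "dim_players": """
        CREATE TABLE IF NOT EXISTS dim_players (
            player_id        VARCHAR(16) NOT NULL,
            name             VARCHAR(128),
            position         VARCHAR(8),
            college          VARCHAR(128),
            nba_draft_status VARCHAR(64),
            nationality      VARCHAR(64)
        ) DISTSTYLE ALL;
    """,
    "dim_teams": """
        CREATE TABLE IF NOT EXISTS dim_teams (
            team               VARCHAR(64) NOT NULL,
            season             VARCHAR(8)  NOT NULL,
            record             VARCHAR(8),
            winning_percentage DECIMAL(4, 3)
        ) DISTSTYLE ALL;
    """,
    # Fact table: rows spread evenly across the nodes.
    "fact_shots": """
        CREATE TABLE IF NOT EXISTS fact_shots (
            shot_id       BIGINT      NOT NULL,
            player_id     VARCHAR(16) NOT NULL,
            team          VARCHAR(8),
            opponent      VARCHAR(8),
            season        SMALLINT,
            quarter       SMALLINT,
            game_clock    VARCHAR(8),
            outcome       SMALLINT,
            shot_value    SMALLINT,
            shot_distance SMALLINT
        ) DISTSTYLE EVEN;
    """,
}

for table_name, statement in DDL.items():
    print(table_name, "->", statement.strip().splitlines()[-1].strip())
```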

This architecture will let us extract and augment useful information regarding the shots made. Possible queries to answer:

Teams with the most cumulative points scored per season.
Players with the most cumulative points scored.
Players who are "clutch" (meaning that they are able to score in the last 2 minutes of the 4th quarter).
Players with the best 3 point scoring percentage.
Extract the colleges with the most winning players (players who have won the most games).
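Two of the questions above can be prototyped in pandas before any database exists. The sketch below uses the column names from the shots CSV, a handful of made-up shots, and two assumptions that I have not verified against the full dataset: that Game_Clock is the time remaining in the quarter, and that overtime periods are not counted as clutch time. The function names are hypothetical.

```python
import pandas as pd

OUTCOME = "Outcome (1 if made, 0 otherwise)"


def clock_to_seconds(clock: str) -> float:
    """Convert 'M:SS' or 'MM:SS.s' into seconds (assumed time left in the quarter)."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def three_point_percentage(shots: pd.DataFrame) -> pd.DataFrame:
    """Attempts, makes and percentage on 3-point shots per player, best first."""
    threes = shots[shots["Shot_Value"] == 3]
    stats = threes.groupby("Player")[OUTCOME].agg(["count", "sum"])
    stats = stats.rename(columns={"count": "attempts", "sum": "made"})
    stats["pct"] = stats["made"] / stats["attempts"]
    return stats.sort_values("pct", ascending=False)


def clutch_points(shots: pd.DataFrame) -> pd.Series:
    """Points scored per player in the last 2 minutes of the 4th quarter."""
    seconds_left = shots["Game_Clock"].map(clock_to_seconds)
    clutch = shots[(shots["Quarter"] == 4) & (seconds_left <= 120)]
    points = clutch[OUTCOME] * clutch["Shot_Value"]
    return points.groupby(clutch["Player"]).sum().sort_values(ascending=False)


# Made-up shots, only to exercise the two functions.
shots_sample = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B", "Player B", "Player B"],
    "Shot_Value": [3, 3, 2, 3, 3, 2],
    OUTCOME: [1, 0, 1, 1, 1, 0],
    "Quarter": [4, 4, 4, 4, 1, 4],
    "Game_Clock": ["1:30", "5:00", "00:38.4", "0:10", "9:22", "1:00"],
})

three_pct = three_point_percentage(shots_sample)
clutch = clutch_points(shots_sample)
print(three_pct)
print(clutch)
```

On the real data a minimum number of attempts would probably be needed before ranking by percentage, otherwise players with one lucky shot end up at the top.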
