Customer churn is a critical concern for telecom companies, as retaining existing customers is often more cost-effective than acquiring new ones. This project analyzes customer churn in a telecom company, leveraging customer demographics, account details, and service subscriptions to identify key factors influencing churn. By exploring patterns in the data, we seek to uncover insights that can help predict and reduce customer attrition, ultimately improving customer retention strategies. Fasten your seat belt, and let's explore this ocean!
+
+
Loading libraries

Code

library(tidyverse)
library(readr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(visdat)
library(caret)
library(naniar)

Load data
+
+
Code

telco_data <- read.csv("https://raw.githubusercontent.com/dsrscientist/DSData/master/Telecom_customer_churn.csv")

# View the first few rows
head(telco_data)

  customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 7590-VHVEG Female 0 Yes No 1 No
2 5575-GNVDE Male 0 No No 34 Yes
3 3668-QPYBK Male 0 No No 2 Yes
4 7795-CFOCW Male 0 No No 45 No
5 9237-HQITU Female 0 No No 2 Yes
6 9305-CDSKC Female 0 No No 8 Yes
  MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1 No phone service DSL No Yes No
2 No DSL Yes No Yes
3 No DSL Yes Yes No
4 No phone service DSL Yes No Yes
5 No Fiber optic No No No
6 Yes Fiber optic No No Yes
  TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
1 No No No Month-to-month Yes
2 No No No One year No
3 No No No Month-to-month Yes
4 Yes No No One year No
5 No No No Month-to-month Yes
6 No Yes Yes Month-to-month Yes
  PaymentMethod MonthlyCharges TotalCharges Churn
1 Electronic check 29.85 29.85 No
2 Mailed check 56.95 1889.50 No
3 Mailed check 53.85 108.15 Yes
4 Bank transfer (automatic) 42.30 1840.75 No
5 Electronic check 70.70 151.65 Yes
6 Electronic check 99.65 820.50 Yes

Data Processing
+
+
Data inspection
+
+
+
Checking Data Quality
First, what percentage of our data is missing? We can use vis_miss to visualize the amount of missingness in each column.

We see that less than 0.01% of our data is missing.

Missing values are common in datasets and should be thoroughly examined and addressed early in the analysis process. This is where the naniar and visdat packages come in for creating the visualizations.

For more reading on how to visualize missing data, see here.
+
+
Code

gg_miss_var(telco_data) +
  labs(title = "Missing Values in Each Column", y = "Look at all the missing ones") +
  theme_bw()

From the chart, we can confirm that TotalCharges is the only column with missing values, with more than 9 missing entries.
+
+
+
Data Cleaning
+
+
Code

# Wrangle the data
telco_data <- telco_data |>
  # mutate(Churn = ifelse(Churn == "Yes", 1, 0)) |>
  mutate(Churn = as.factor(Churn),  # Convert 'Churn' to a factor variable
         tenure = as.numeric(tenure),
         MonthlyCharges = as.numeric(MonthlyCharges),
         TotalCharges = as.numeric(TotalCharges)) |>
  replace_na(list(tenure = 0, MonthlyCharges = 0, TotalCharges = 0)) |>
  distinct() |>
  filter(!is.na(Churn)) |>  # Explicitly handle NA values in 'Churn'
  select(-customerID)

# Check the summary of the dataset
summary(telco_data)

 gender           SeniorCitizen    Partner          Dependents
 Length:7043      Min.   :0.0000   Length:7043      Length:7043
 Class :character 1st Qu.:0.0000   Class :character Class :character
 Mode  :character Median :0.0000   Mode  :character Mode  :character
                  Mean   :0.1621
                  3rd Qu.:0.0000
                  Max.   :1.0000
 tenure          PhoneService     MultipleLines    InternetService
 Min.   : 0.00   Length:7043      Length:7043      Length:7043
 1st Qu.: 9.00   Class :character Class :character Class :character
 Median :29.00   Mode  :character Mode  :character Mode  :character
 Mean   :32.37
 3rd Qu.:55.00
 Max.   :72.00
 OnlineSecurity   OnlineBackup     DeviceProtection TechSupport
 Length:7043      Length:7043      Length:7043      Length:7043
 Class :character Class :character Class :character Class :character
 Mode  :character Mode  :character Mode  :character Mode  :character

 StreamingTV      StreamingMovies  Contract         PaperlessBilling
 Length:7043      Length:7043      Length:7043      Length:7043
 Class :character Class :character Class :character Class :character
 Mode  :character Mode  :character Mode  :character Mode  :character

 PaymentMethod    MonthlyCharges   TotalCharges    Churn
 Length:7043      Min.   : 18.25   Min.   :   0.0  No :5174
 Class :character 1st Qu.: 35.50   1st Qu.: 398.6  Yes:1869
 Mode  :character Median : 70.35   Median :1394.5
                  Mean   : 64.76   Mean   :2279.7
                  3rd Qu.: 89.85   3rd Qu.:3786.6
                  Max.   :118.75   Max.   :8684.8

+
From the summary, we gain useful insights about the data we are dealing with. For example, we now know the counts of churned versus active customers: 1869 and 5174 respectively. We also have quartile information on total charges, which can help us understand the business's revenue performance.
+
+
+
Plotting
+
+
Distribution of variables
+
+
Code

# Check the distribution of 'Churn'

# Summarized data frame with counts and percentages
churn_summary <- telco_data |>
  group_by(Churn) |>
  summarise(count = n()) |>
  mutate(percent = round(100 * count / sum(count), 1))

# Plot
ggplot(churn_summary, aes(x = Churn, y = count, fill = Churn)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = paste0(percent, "%")),
            vjust = 1.5, color = "white", size = 5) +
  labs(title = "Distribution of Churn",
       x = "Churn",
       y = "Count") +
  theme_gray()

The above chart helps us understand the split between active customers and those who churned: 73.5% of customers are active and 26.5% have already churned. At this point, it would be worthwhile for management to investigate why over a quarter of the customer base is churning.
+
Aside from that, we can also investigate how the charges are distributed.
+
+
Code

# Check the distribution of 'TotalCharges'
ggplot(telco_data, aes(x = TotalCharges)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Distribution of Total Charges",
       caption = "Distribution of charges in a telecommunication company",
       x = "Total Charges",
       y = "Count") +
  theme_minimal()

The above chart shows that the distribution is right-skewed, meaning most customers have lower total charges, while a smaller number have very high charges, stretching the tail to the right.

One factor that helps determine whether a customer is likely to stay or leave is their tenure with the business. Why this is important:
+
+
Short Tenure: Customers who have been with the company for a short time might have a higher churn rate, as they might still be exploring options or not fully integrated into the service.
+
+
+
+
Long Tenure: Customers who have been with the company for a long time may have a lower churn rate due to loyalty or a reluctance to change, especially if they’ve developed habits around the service.
+
+
+
Code

# Check the distribution of 'tenure'
ggplot(telco_data, aes(x = tenure)) +
  geom_histogram(bins = 30, fill = "green", color = "black", alpha = 0.5) +
  geom_density(aes(y = ..count..), fill = "blue", alpha = 0.2) +  # Add density curve
  labs(title = "Distribution of Tenure with Density", x = "Tenure (Months)", y = "Count") +
  theme_minimal()

Some of the reasons we need this analysis include creating features such as:
+
+
Customer Lifetime Value (CLV): Calculate the expected revenue based on tenure. This is a metric that predicts the total net profit a business can expect to earn from a customer throughout their entire relationship, helping businesses understand customer value for informed decisions on acquisition, retention, and marketing strategies.
+
+
+
+
Churn Risk by Tenure: Categorize customers into groups like “new,” “mid-term,” and “long-term” to examine how churn rates differ across these groups.
+
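As a rough illustration of the CLV idea above, a naive CLV can be sketched as monthly charge × expected remaining tenure × margin. The 30% margin rate and 24-month expected tenure below are assumptions for the sketch, not values estimated from this dataset:

```python
# Naive CLV sketch. The margin rate and expected tenure are
# illustrative assumptions, not values from the telco data.
def customer_lifetime_value(monthly_charge, expected_tenure_months, margin_rate=0.3):
    """Expected net profit over the customer's remaining relationship."""
    revenue = monthly_charge * expected_tenure_months
    return revenue * margin_rate

# A customer at the median monthly charge (70.35), assumed to stay 24 more months:
print(round(customer_lifetime_value(70.35, 24), 2))  # 506.52
```

In practice the expected tenure would itself come from a churn or survival model rather than a fixed constant.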
+
+
Code

# Create tenure categories
telco_data$TenureGroup <- cut(telco_data$tenure,
                              breaks = c(0, 6, 12, 36, 100),
                              labels = c("0-6 months", "6-12 months", "1-3 years", "3+ years"))

# Tenure distribution with categories
ggplot(telco_data, aes(x = TenureGroup, fill = TenureGroup)) +
  geom_bar(color = "black") +
  labs(title = "Customer Distribution by Tenure Group", x = "Tenure Group", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

From the above analysis of the distribution of tenure, we can conclude that customers who have been with the business for more than 3 years are likely to stay.

To go deeper, let's examine the proportion of customer churn within each tenure group. This gives us finer insight into where our focus should be, and shows at what stage we are losing the most customers.
+
+
Code

# Churn Rate by Tenure Group
churn_tenure_summary <- telco_data |>
  group_by(TenureGroup, Churn) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(TenureGroup) |>
  mutate(percent = round(100 * count / sum(count), 1)) |>
  filter(!is.na(TenureGroup))

# Plot with percentages
ggplot(churn_tenure_summary, aes(x = TenureGroup, y = count, fill = Churn)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = paste0(percent, "%")),
            position = position_stack(vjust = 0.5), size = 5, color = "white") +
  labs(title = "Churn Rate by Tenure Group",
       x = "Tenure Group",
       y = "Proportion of Churn") +
  theme_minimal() +
  theme(legend.position = "top")

From the graph, the business needs to focus more on the onboarding period of 0-12 months, since it is in this period that the churn rate is highest relative to the other tenure groups.

To switch gears, we now need to understand the distribution of monthly charges. It can provide key insights into customer behavior, churn, and pricing strategies.
+
+
Code

# Check the distribution of 'MonthlyCharges'
ggplot(telco_data, aes(x = MonthlyCharges)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black", alpha = 0.5) +
  geom_density(aes(y = ..count..), fill = "blue", alpha = 0.2) +
  labs(title = "Distribution of Monthly Charges",
       x = "Monthly Charges",
       y = "Count",
       tag = "Figure 1") +
  theme_minimal()

Code

ggplot(telco_data, aes(x = MonthlyCharges, fill = Churn)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity", color = "black") +
  labs(title = "Monthly Charges by Churn",
       x = "Monthly Charges",
       y = "Count",
       tag = "Figure 2") +
  theme_minimal()

This will help you see if there are higher churn rates at specific monthly charge levels.

MonthlyCharges Summary Statistics

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.25   35.50   70.35   64.76   89.85  118.75

Understanding the summary statistics of the monthly charges helps us see the central tendency and spread of the variable. From the output, we can conclude that:
+
+
The range of monthly charges is from 18.25 (minimum) to 118.75 (maximum), so there’s quite a bit of variance in pricing.
+
+
+
+
The first quartile (35.50) and third quartile (89.85) show that the middle 50% of customers have charges between 35.50 and 89.85.
+
+
+
+
The median (70.35) being higher than the mean (64.76) suggests a slight left skew in the data. This means there are some customers paying much lower charges that are pulling the average down.
+
+
+
+
At the same time, as we can see in Figure 2 above, many customers pay lower charges while fewer customers pay very high charges.
+
+
A large portion of customers seem to pay under 70 per month, with a few higher-paying customers. This could impact churn behavior and might suggest that low-cost customers are more price-sensitive, while higher-cost customers could be at risk if they perceive a lack of value.
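The mean-versus-median reading above can be sanity-checked with a tiny sketch. The numbers below are illustrative, not the telco data; they mimic the shape described: a cluster of low-priced customers plus a mid-to-high group, which pulls the mean below the median.

```python
import statistics

# Illustrative charges (not the telco dataset): a low-priced cluster
# plus a mid-to-high group, mimicking the shape described above.
charges = [20, 20, 20, 20, 25, 70, 75, 80, 85, 90, 95, 100]

mean_c = statistics.mean(charges)      # about 58.3
median_c = statistics.median(charges)  # 72.5
print(mean_c < median_c)  # True: the low cluster drags the mean down
```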
+
+
+
The Correlation Analysis
+
The correlation analysis provides insights into the relationships between the numerical variables in the customer churn dataset. By calculating the correlation matrix and visualizing it using a correlation plot, we can quickly identify which variables are positively or negatively related to each other. For instance, we may observe a strong positive correlation between MonthlyCharges and TotalCharges, which makes sense since total charges are typically accumulated from monthly payments.
+
+
Code

# Check the correlation between numerical variables
correlation_matrix <- cor(telco_data |> select_if(is.numeric))

# Plot the correlation matrix. Here we use the corrplot package
corrplot(correlation_matrix, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)

We notice a strong relationship between TotalCharges and tenure, which makes sense since total charges accumulate over time. There is a moderately positive relationship between TotalCharges and MonthlyCharges, which is also expected since customers who pay more each month accumulate higher total charges.

There is a weak correlation between SeniorCitizen and the other variables (tenure, MonthlyCharges, TotalCharges). We can therefore conclude that being a senior citizen does not correlate strongly with tenure, monthly charges, or total charges, and hence is not much of a factor in this case.
+
+
Code

# Check the relationship between 'Churn' and 'TotalCharges'
ggplot(telco_data, aes(x = TotalCharges, fill = Churn)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.5) +
  labs(title = "Total Charges vs Churn", x = "Total Charges", y = "Count") +
  theme_minimal()

A correlation analysis of Total Charges vs Churn is important because it helps the company understand whether there is a relationship between the total amount a customer has paid over time and their likelihood of leaving the service.
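One simple way to quantify that relationship is to encode churn as a 0/1 flag and compute a Pearson correlation, which for a binary variable is the point-biserial correlation. A minimal sketch with made-up numbers (not the telco dataset), assuming NumPy is available:

```python
import numpy as np

# Made-up data (not the telco dataset): churn encoded as 1 = Yes, 0 = No.
total_charges = np.array([50.0, 120.0, 300.0, 900.0, 1500.0, 2400.0, 3800.0, 5200.0])
churned = np.array([1, 1, 1, 0, 0, 1, 0, 0])

# Pearson correlation between a numeric variable and a 0/1 flag
# is the point-biserial correlation.
r = np.corrcoef(total_charges, churned)[0, 1]
print(r < 0)  # in this toy data, higher accumulated charges go with less churn
```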
+
+
+
Feature Engineering
+
Feature Engineering is one way we can create new input features to improve the performance of machine learning models, such as predicting which customers are most likely to churn.
+
+
Code

# Create a new feature 'AverageCharges' as the ratio of 'TotalCharges' to 'tenure'
# (guard against division by zero for customers with tenure 0)
telco_data <- telco_data |>
  mutate(AverageCharges = ifelse(tenure > 0, TotalCharges / tenure, 0))

# Check the distribution of 'AverageCharges'
ggplot(telco_data, aes(x = AverageCharges)) +
  geom_histogram(bins = 30, fill = "purple", color = "black") +
  labs(title = "Distribution of Average Charges", x = "Average Charges", y = "Count") +
  theme_minimal()

However, designing predictive models on this data set requires some care, since algorithms that assume normality (e.g., logistic regression, linear regression) may perform poorly on skewed data. We would need to perform some transformation first. Tree-based models such as random forests are less sensitive to skew, but modeling is beyond the scope of this project.
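As a hedged sketch of the kind of transformation meant here (in R it would be something like log1p(TotalCharges); the numbers below are illustrative, not the telco data), a log transform compresses a long right tail:

```python
import math

# Right-skewed illustrative charges; log1p(x) = log(1 + x) handles zeros safely.
charges = [0, 10, 20, 30, 40, 50, 100, 500, 5000]
log_charges = [math.log1p(x) for x in charges]

# The transform shrinks the gap between typical and extreme values.
print(max(charges) - min(charges))                    # 5000
print(round(max(log_charges) - min(log_charges), 2))  # 8.52
```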
+
+
+
+
+
+
+
+
+
Source Code
+
---
title: "Customer Churn Rate"
author: "Emmanuel Oduor Ochieng"
date: "`r format(Sys.time(), '%B %d, %Y')`"
execute:
  keep-md: true
  warning: false
format:
  html:
    code-fold: true
    code-tools: true
---
+
+## Customer Churn Rate Analysis in a Telecom Company
+
Customer churn is a critical concern for telecom companies, as retaining existing customers is often more cost-effective than acquiring new ones. This project analyzes customer churn in a telecom company, leveraging customer demographics, account details, and service subscriptions to identify key factors influencing churn. By exploring patterns in the data, we seek to uncover insights that can help predict and reduce customer attrition, ultimately improving customer retention strategies. Fasten your seat belt, and let's explore this ocean!
+
+#### Loading libraries
+
+```{r load-library, include=FALSE, warning=FALSE}
+library(tidyverse)
+library(readr)
+library(dplyr)
+library(corrplot)
+library(ggplot2)
+library(visdat)
+library(caret)
+library(naniar)
+```
+
+#### Load data
+
+```{r}
+telco_data <-read.csv("https://raw.githubusercontent.com/dsrscientist/DSData/master/Telecom_customer_churn.csv")
+
+# View the first few rows
+head(telco_data)
+```
+
+# Data Processing
+
+#### Data inspection
+
+```{r column-types, echo=FALSE, results="hide"}
+# Check the structure of the dataset
+str(telco_data)
+```
+
+#### Checking Data Quality
+
+```{r column-class, echo=FALSE, results="hide"}
+
+# Check for missing values
+vis_miss(telco_data)
+```
+
First, what percentage of our data is missing? We can use `vis_miss` to visualize the amount of missingness in each column.

We see that less than 0.01% of our data is missing.

Missing values are common in datasets and should be thoroughly examined and addressed early in the analysis process. This is where the `naniar` and `visdat` packages come in for creating the visualizations.

For more reading on how to visualize missing data, see [here](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html).
+
+```{r}
+gg_miss_var(telco_data) +
+labs(title ="Missing Values in Each Column", y ="Look at all the missing ones") +
+theme_bw()
+```
+
From the chart, we can confirm that `TotalCharges` is the only column with missing values, with more than 9 missing entries.
+
+#### Data Cleaning
+
+```{r}
+# Wrangle the data
+telco_data <- telco_data |>
+# mutate(Churn = ifelse(Churn == "Yes", 1, 0)) |>
+mutate(Churn =as.factor(Churn), # Convert 'Churn' to a factor variable
+tenure =as.numeric(tenure),
+MonthlyCharges =as.numeric(MonthlyCharges),
+TotalCharges =as.numeric(TotalCharges)) |>
+replace_na(list(tenure =0, MonthlyCharges =0, TotalCharges =0)) |>
+distinct() |>
+filter(!is.na(Churn)) |># Explicitly handle NA values in 'Churn'
+select(-customerID)
+
+# Check the summary of the dataset
+summary(telco_data)
+```
+
From the summary, we gain useful insights about the data we are dealing with. For example, we now know the counts of churned versus active customers: 1869 and 5174 respectively. We also have quartile information on total charges, which can help us understand the business's revenue performance.
+
+## Plotting
+
+#### Distribution of variables
+
+```{r}
+# Check the distribution of 'Churn'
+
+# Summarized data frame with counts and percentages
+churn_summary <- telco_data |>
+group_by(Churn) |>
+summarise(count =n()) |>
+mutate(percent =round(100* count /sum(count), 1))
+
+# Plot
+ggplot(churn_summary, aes(x = Churn, y = count, fill = Churn)) +
+geom_bar(stat ="identity", color ="black") +
+geom_text(aes(label =paste0(percent, "%")),
+vjust =1.5, color ="white", size =5) +
+labs(title ="Distribution of Churn",
+x ="Churn",
+y ="Count") +
+theme_minimal() +
+theme_gray()
+```
+
The above chart helps us understand the split between active customers and those who churned: 73.5% of customers are active and 26.5% have already churned. At this point, it would be worthwhile for management to investigate why over a quarter of the customer base is churning.
+
+Aside from that, we can also investigate how the charges are distributed.
+
+```{r}
+# Check the distribution of 'TotalCharges'
+ggplot(telco_data, aes(x = TotalCharges)) +
+geom_histogram(bins =30, fill ="blue", color ="black") +
+labs(title ="Distribution of Total Charges",
caption = "Distribution of charges in a telecommunication company",
+x ="Total Charges",
+y ="Count") +
+theme_minimal()
+```
+
The above chart shows that the distribution is right-skewed, meaning most customers have lower total charges, while a smaller number have very high charges, stretching the tail to the right.

One factor that helps determine whether a customer is likely to stay or leave is their tenure with the business. Why this is important:
+
+- **Short Tenure**: Customers who have been with the company for a short time might have a higher churn rate, as they might still be exploring options or not fully integrated into the service.
+
+<!-- -->
+
+- **Long Tenure**: Customers who have been with the company for a long time may have a **lower churn rate** due to loyalty or a reluctance to change, especially if they’ve developed habits around the service.
+
+```{r}
+# Check the distribution of 'tenure'
+ggplot(telco_data, aes(x = tenure)) +
+geom_histogram(bins =30, fill ="green", color ="black", alpha =0.5) +
+geom_density(aes(y = ..count..), fill ="blue", alpha =0.2) +# Add density curve
+labs(title ="Distribution of Tenure with Density", x ="Tenure (Months)", y ="Count") +
+theme_minimal()
+```
+
Some of the reasons we need this analysis include creating features such as:
+
+- [**Customer Lifetime Value (CLV)**](https://www.qualtrics.com/en-au/experience-management/customer/customer-lifetime-value/#:~:text=with%20our%20guide.-,What%20is%20customer%20lifetime%20value%20(CLV)?,useful%20for%20tracking%20business%20success.): Calculate the expected revenue based on tenure. This is a metric that predicts the total net profit a business can expect to earn from a customer throughout their entire relationship, helping businesses understand customer value for informed decisions on acquisition, retention, and marketing strategies
+
+<!-- -->
+
+- **Churn Risk by Tenure**: Categorize customers into groups like "new," "mid-term," and "long-term" to examine how churn rates differ across these groups.
+
+```{r}
+# Create tenure categories
+telco_data$TenureGroup <-cut(telco_data$tenure,
+breaks =c(0, 6, 12, 36, 100),
+labels =c("0-6 months", "6-12 months", "1-3 years", "3+ years"))
+
+# Tenure distribution with categories
+ggplot(telco_data, aes(x = TenureGroup, fill = TenureGroup)) +
+geom_bar(color ="black") +
+labs(title ="Customer Distribution by Tenure Group", x ="Tenure Group", y ="Count") +
+theme_minimal() +
+theme(legend.position ="none")
+
+```
+
From the above analysis of the distribution of tenure, we can conclude that customers who have been with the business for more than 3 years are likely to stay.

To go deeper, let's examine the proportion of customer churn within each tenure group. This gives us finer insight into where our focus should be, and shows at what stage we are losing the most customers.
+
+```{r}
+# Churn Rate by Tenure Group
+churn_tenure_summary <- telco_data |>
+group_by(TenureGroup, Churn) |>
+summarise(count =n(), .groups ='drop') |>
+group_by(TenureGroup) |>
+mutate(percent =round(100* count /sum(count), 1)) |>
+filter(!is.na(TenureGroup))
+
+# Plot with percentages
+ggplot(churn_tenure_summary, aes(x = TenureGroup, y = count, fill = Churn)) +
+geom_bar(stat ="identity", color ="black") +
+geom_text(aes(label =paste0(percent, "%")),
+position =position_stack(vjust =0.5), size =5, color ="white") +
+labs(title ="Churn Rate by Tenure Group",
+x ="Tenure Group",
+y ="Proportion of Churn") +
+theme_minimal() +
+theme(legend.position ="top")
+```
+
From the graph, the business needs to focus more on the onboarding period of 0-12 months, since it is in this period that the churn rate is highest relative to the other tenure groups.

To switch gears, we now need to understand the distribution of monthly charges. It can provide key insights into customer behavior, churn, and pricing strategies.
+
+```{r}
+# Check the distribution of 'MonthlyCharges'
+ggplot(telco_data, aes(x = MonthlyCharges)) +
+geom_histogram(bins =30, fill ="lightblue", color ="black", alpha =0.5) +
+geom_density(aes(y = ..count..), fill ="blue", alpha =0.2) +
+labs(title ="Distribution of Monthly Charges",
+x ="Monthly Charges",
+y ="Count",
+tag ="Figure 1",) +
+theme_minimal()
+
+ggplot(telco_data, aes(x = MonthlyCharges, fill = Churn)) +
+geom_histogram(bins =30, alpha =0.6, position ="identity", color ="black") +
+labs(title ="Monthly Charges by Churn",
+x ="Monthly Charges",
+y ="Count",
+tag ="Figure 2",) +
+theme_minimal()
+
+```
+
+This will help you see if there are higher churn rates at specific monthly charge levels.
+
+#### `MonthlyCharges` Summary Statistics
+
+```{r}
+#Summary statistics
+summary(telco_data$MonthlyCharges)
+```
+
+Understanding the summary statistics of the monthly charges helps us understand the central tendencies and spread of the monthly charge. From the output, we can conclude that:
+
+- The **range** of monthly charges is from **18.25** (minimum) to **118.75** (maximum), so there's quite a bit of **variance** in pricing.
+
+<!-- -->
+
+- The **first quartile (35.50)** and **third quartile (89.85)** show that the **middle 50% of customers** have charges between **35.50 and 89.85**.
+
+<!-- -->
+
+- The **median (70.35)** being higher than the **mean (64.76)** suggests a **slight left skew** in the data. This means there are some customers paying much lower charges that are pulling the average down.
+
+<!-- -->
+
- At the same time, as we can see in `Figure 2` above, **many customers pay lower charges** while fewer customers pay very high charges.
+
A large portion of customers seem to pay **under 70** per month, with a few higher-paying customers. This could impact **churn behavior** and might suggest that low-cost customers are more price-sensitive, while higher-cost customers could be at risk if they perceive a lack of value.
+
+#### The Correlation Analysis
+
+The correlation analysis provides insights into the relationships between the numerical variables in the customer churn dataset. By calculating the correlation matrix and visualizing it using a correlation plot, we can quickly identify which variables are positively or negatively related to each other. For instance, we may observe a strong positive correlation between `MonthlyCharges` and `TotalCharges`, which makes sense since total charges are typically accumulated from monthly payments.
+
+```{r}
+# Check the correlation between numerical variables
+correlation_matrix <-cor(telco_data |>select_if(is.numeric))
+# Plot the correlation matrix. Here we use the corrplot package
+corrplot(correlation_matrix, method ="circle", type ="upper", tl.col ="black", tl.srt =45)
+```
+
We notice a strong relationship between `TotalCharges` and `tenure`, which makes sense since total charges accumulate over time. There is a moderately positive relationship between `TotalCharges` and `MonthlyCharges`, which is also expected since customers who pay more each month accumulate higher total charges.

There is a weak correlation between `SeniorCitizen` and the other variables (`tenure`, `MonthlyCharges`, `TotalCharges`). We can therefore conclude that being a senior citizen does not correlate strongly with tenure, monthly charges, or total charges, and hence is not much of a factor in this case.
+
+```{r}
+# Check the relationship between 'Churn' and 'TotalCharges'
+ggplot(telco_data, aes(x = TotalCharges, fill = Churn)) +
+geom_histogram(bins =30, position ="identity", alpha =0.5) +
+labs(title ="Total Charges vs Churn", x ="Total Charges", y ="Count") +
+theme_minimal()
+
+
+
+
+```
+
A correlation analysis of Total Charges vs Churn is important because it helps the company understand whether there is a relationship between the total amount a customer has paid over time and their likelihood of leaving the service.
+
+#### Feature Engineering
+
[Feature Engineering](https://www.tmwr.org/recipes) is one way we can create new input features to improve the performance of machine learning models, such as predicting which customers are most likely to churn.
+
+```{r}
# Create a new feature 'AverageCharges' as the ratio of 'TotalCharges' to 'tenure'
# (guard against division by zero for customers with tenure 0)
telco_data <- telco_data |>
  mutate(AverageCharges = ifelse(tenure > 0, TotalCharges / tenure, 0))
+# Check the distribution of 'AverageCharges'
+ggplot(telco_data, aes(x = AverageCharges)) +
+geom_histogram(bins =30, fill ="purple", color ="black") +
+labs(title ="Distribution of Average Charges", x ="Average Charges", y ="Count") +
+theme_minimal()
+```
+
However, designing predictive models on this data set requires some care, since algorithms that assume normality (e.g., logistic regression, linear regression) may perform poorly on skewed data. We would need to perform some transformation first. **Tree-based models** such as random forests are less sensitive to skew, but modeling is beyond the scope of this project.
+
In this snippet, we will import tables from the web and perform some analysis on the data. To do this, we will use the pandas library to read the tables from a given URL, using the NBA 2018 per-game statistics page as an example. The pandas read_html function reads the tables from the URL, and we display the number of tables read.
+
+
+
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# NBA
url = "https://www.basketball-reference.com/leagues/NBA_2018_per_game.html"
tables = pd.read_html(url)

+
Understanding the structure of the data.
+
+
The data contains multiple tables, and we can access each table by its index. We can display the column names of the first table to understand the structure of the data. But how do we know how many tables there are? We can use the len function to find out.
+
+
+
# Number of tables
len(tables)

2
+
+
+
+
Now that we know the number of tables in the data, we can access each table by its index. Let’s display the column names of the first table to understand the structure of the data.
+
+
+
# The column names of the first table
tables[0].columns

Now that we have an understanding of the structure of the data, we can start analyzing the data. We can perform various operations such as filtering, sorting, and visualization to gain insights from the data. Let’s start by displaying the data from the first table.
+
In the data, we have information about the players in the NBA, including their age, team, points per game, and awards. We can filter the data based on different criteria to find interesting patterns and insights.
+
To start, who is the oldest player in the NBA?
+
+
+
# Oldest players in the NBA
players = tables[0][["Rk", "Player", "Age", "Team"]]
players[players["Age"] == players["Age"].max()]

        Rk        Player   Age Team
391  326.0  Vince Carter  41.0  SAC

Now we know that Vince Carter is the oldest player in the NBA, and which team he plays for. Let’s find out who the youngest player is.
+
+
+
# Youngest players in the NBA
players = tables[0][["Rk", "Player", "Age", "Team"]]
players[players["Age"] == players["Age"].min()]

        Rk             Player   Age Team
106   91.0       Jayson Tatum  19.0  BOS
259  217.0      Jarrett Allen  19.0  BRK
300  254.0     Markelle Fultz  19.0  PHI
325  275.0         Malik Monk  19.0  CHO
371  308.0    Frank Ntilikina  19.0  NYK
526  431.0  Terrance Ferguson  19.0  OKC
623  504.0       Ike Anigbogu  19.0  IND

Here, we find that there are multiple players of the same age; we could sort them by points to see who has scored the most, or find which positions and teams they play for, but we are not going to do that here.

Remember that for the oldest player we got a single result, while for the youngest we got multiple players. We can go deeper and find the top 10 oldest players in the NBA, with their points per game and teams.
+
+
+
# Top 10 oldest players in the NBA, with their points per game
players = tables[0][["Rk", "Player", "Age", "Team", "PTS"]]
players = players.sort_values(by=["Age"], ascending=[False]).head(10)
players

        Rk             Player   Age Team   PTS
391  326.0       Vince Carter  41.0  SAC   5.4
517  424.0        Jason Terry  40.0  MIL   3.3
223  191.0      Manu Ginóbili  40.0  SAS   8.9
150  131.0      Dirk Nowitzki  39.0  DAL  12.0
603  488.0     Damien Wilkins  38.0  IND   1.7
646  525.0      Udonis Haslem  37.0  MIA   0.6
611  496.0  Richard Jefferson  37.0  DEN   1.5
317  269.0         David West  37.0  GSW   6.8
188  162.0     Jamal Crawford  37.0  MIN  10.3
576  467.0      Nick Collison  37.0  OKC   2.1

But wait, how many players are there in the NBA? Let’s find out.
# Count the number of players in the NBA
players = tables[0]
print(f"Number of players in the NBA: {len(players)}")

Number of players in the NBA: 665
Now that we know how many players are in the NBA, let's find the top 10 youngest players, ranked by points per game. (We didn't strictly need the player count for this, but it adds some fun context.)
# Top 10 youngest players in the NBA, ranked by points per game
players = tables[0][["Rk", "Player", "Age", "Team", "PTS"]]
players[players["Age"] == players["Age"].min()].sort_values(by="PTS", ascending=False).head(10)
        Rk             Player   Age Team   PTS
106   91.0       Jayson Tatum  19.0  BOS  13.9
259  217.0      Jarrett Allen  19.0  BRK   8.2
300  254.0     Markelle Fultz  19.0  PHI   7.1
325  275.0         Malik Monk  19.0  CHO   6.7
371  308.0    Frank Ntilikina  19.0  NYK   5.9
526  431.0  Terrance Ferguson  19.0  OKC   3.1
623  504.0       Ike Anigbogu  19.0  IND   1.2
We can also determine which players have won awards. Let's find out who they are by filtering the data to show only players with a non-null Awards entry.
# Only players with awards
players = tables[0][["Rk", "Player", "Age", "Team", "PTS", "Awards"]]
players[players["Awards"].notnull()]
        Rk                 Player   Age Team   PTS                        Awards
0      1.0           James Harden  28.0  HOU  30.4                 MVP-1,AS,NBA1
1      2.0          Anthony Davis  24.0  NOP  28.1          MVP-3,DPOY-3,AS,NBA1
2      3.0           LeBron James  33.0  CLE  27.5                 MVP-2,AS,NBA1
3      4.0  Giannis Antetokounmpo  23.0  MIL  26.9                 MVP-6,AS,NBA2
4      5.0         Damian Lillard  27.0  POR  26.9                 MVP-4,AS,NBA1
5      6.0          Stephen Curry  29.0  GSW  26.4                MVP-10,AS,NBA3
6      7.0           Kevin Durant  29.0  GSW  26.4          MVP-7,DPOY-9,AS,NBA1
7      8.0      Russell Westbrook  29.0  OKC  25.4                 MVP-5,AS,NBA2
8      9.0       DeMarcus Cousins  27.0  NOP  25.2                            AS
10    11.0           Kyrie Irving  25.0  BOS  24.4                            AS
11    12.0      LaMarcus Aldridge  32.0  SAS  23.1                 MVP-9,AS,NBA2
12    13.0         Victor Oladipo  25.0  IND  23.1  MVP-13,DPOY-15,MIP-1,AS,NBA3
13    14.0          DeMar DeRozan  28.0  TOR  23.0                 MVP-8,AS,NBA2
14    15.0            Joel Embiid  23.0  PHI  22.9         MVP-12,DPOY-2,AS,NBA2
15    16.0     Kristaps Porziņģis  22.0  NYK  22.7                            AS
16    17.0           Bradley Beal  24.0  WAS  22.6                            AS
17    18.0           Lou Williams  31.0  LAC  22.6                        6MOY-1
18    19.0           Jimmy Butler  28.0  MIN  22.2        MVP-10,DPOY-12,AS,NBA3
19    20.0           Kemba Walker  27.0  CHO  22.1                            AS
20    21.0            Paul George  27.0  OKC  21.9                DPOY-4,AS,NBA3
25    24.0     Karl-Anthony Towns  22.0  MIN  21.3                       AS,NBA3
26    25.0       Donovan Mitchell  21.0  UTA  20.5                         ROY-2
29    28.0          Klay Thompson  27.0  GSW  20.0                    DPOY-11,AS
33    32.0              John Wall  27.0  WAS  19.4                            AS
34    33.0           Jrue Holiday  27.0  NOP  19.0                        DPOY-7
41    38.0            Eric Gordon  29.0  HOU  18.0                        6MOY-2
49    44.0             Kevin Love  29.0  CLE  17.6                            AS
52    47.0           Goran Dragić  31.0  MIA  17.3                            AS
62    57.0             Kyle Lowry  31.0  TOR  16.2                            AS
64    59.0             Kyle Kuzma  22.0  LAL  16.1                  ROY-4,6MOY-9
66    61.0            Ben Simmons  21.0  PHI  15.8                         ROY-1
67    62.0            Will Barton  27.0  DEN  15.7                        6MOY-4
68    63.0         Nikola Mirotić  26.0  2TM  15.6                        6MOY-7
77    68.0       Dennis Smith Jr.  20.0  DAL  15.2                         ROY-5
79    70.0         Andre Drummond  24.0  DET  15.0                    DPOY-15,AS
80    71.0            Rodney Hood  25.0  2TM  14.7                       6MOY-12
87    76.0           Jaylen Brown  21.0  BOS  14.5                       DPOY-10
93    80.0           Jusuf Nurkić  23.0  POR  14.3                       DPOY-15
99    86.0        Jordan Clarkson  25.0  2TM  13.9                        6MOY-7
102   87.0           Steven Adams  24.0  OKC  13.9                       DPOY-12
103   88.0           Clint Capela  23.0  HOU  13.9                       DPOY-14
106   91.0           Jayson Tatum  19.0  BOS  13.9                         ROY-3
109   94.0            Rudy Gobert  25.0  UTA  13.5                        DPOY-1
121  106.0             Al Horford  31.0  BOS  12.9                     DPOY-5,AS
132  115.0       Robert Covington  27.0  PHI  12.6                        DPOY-8
142  125.0        Marco Belinelli  31.0  2TM  12.1                       6MOY-12
157  138.0             J.J. Barea  33.0  DAL  11.6                       6MOY-12
164  145.0           Kelly Olynyk  26.0  MIA  11.5                        6MOY-6
165  146.0            Dwyane Wade  36.0  2TM  11.4                       6MOY-10
169  148.0           Terry Rozier  23.0  BOS  11.3                       6MOY-10
170  149.0        Wayne Ellington  30.0  MIA  11.2                        6MOY-5
173  152.0         Draymond Green  27.0  GSW  11.0                     DPOY-6,AS
243  205.0          Fred VanVleet  23.0  TOR   8.6                        6MOY-3
289  245.0       Luc Mbah a Moute  31.0  HOU   7.5                       6MOY-12
296  252.0       Tomáš Satoranský  26.0  WAS   7.2                       6MOY-12
311  265.0           Jakob Poeltl  22.0  TOR   6.9                       6MOY-12
We see that a number of players have won different awards. Counting how often each award appears shows that the AS (All-Star) award is the most common. We can also filter the data to show only players who have won particular awards, such as AS or 6MOY-12.
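One way to count award frequencies, sketched in plain Python on a small made-up sample (the real analysis would split the comma-separated entries of `tables[0]["Awards"]` the same way):

```python
from collections import Counter

# Hypothetical sample mirroring the Awards column: comma-separated award
# tokens, with None standing in for NaN (players without awards).
awards = ["MVP-1,AS,NBA1", "AS", None, "6MOY-12", "DPOY-1,AS"]

counts = Counter(
    token
    for entry in awards
    if entry is not None
    for token in entry.split(",")
)
print(counts.most_common(1))  # AS is the most frequent award in this sample
```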
# Players with the AS or 6MOY-12 award
players = tables[0][["Rk", "Player", "Age", "Team", "PTS", "Awards"]]
players = players[players["Awards"].str.contains("AS|6MOY-12", regex=True, na=False)]

print(players)

        Rk                 Player   Age Team   PTS  \
0      1.0           James Harden  28.0  HOU  30.4
1      2.0          Anthony Davis  24.0  NOP  28.1
2      3.0           LeBron James  33.0  CLE  27.5
3      4.0  Giannis Antetokounmpo  23.0  MIL  26.9
4      5.0         Damian Lillard  27.0  POR  26.9
5      6.0          Stephen Curry  29.0  GSW  26.4
6      7.0           Kevin Durant  29.0  GSW  26.4
7      8.0      Russell Westbrook  29.0  OKC  25.4
8      9.0       DeMarcus Cousins  27.0  NOP  25.2
10    11.0           Kyrie Irving  25.0  BOS  24.4
11    12.0      LaMarcus Aldridge  32.0  SAS  23.1
12    13.0         Victor Oladipo  25.0  IND  23.1
13    14.0          DeMar DeRozan  28.0  TOR  23.0
14    15.0            Joel Embiid  23.0  PHI  22.9
15    16.0     Kristaps Porziņģis  22.0  NYK  22.7
16    17.0           Bradley Beal  24.0  WAS  22.6
18    19.0           Jimmy Butler  28.0  MIN  22.2
19    20.0           Kemba Walker  27.0  CHO  22.1
20    21.0            Paul George  27.0  OKC  21.9
25    24.0     Karl-Anthony Towns  22.0  MIN  21.3
29    28.0          Klay Thompson  27.0  GSW  20.0
33    32.0              John Wall  27.0  WAS  19.4
49    44.0             Kevin Love  29.0  CLE  17.6
52    47.0           Goran Dragić  31.0  MIA  17.3
62    57.0             Kyle Lowry  31.0  TOR  16.2
79    70.0         Andre Drummond  24.0  DET  15.0
80    71.0            Rodney Hood  25.0  2TM  14.7
121  106.0             Al Horford  31.0  BOS  12.9
142  125.0        Marco Belinelli  31.0  2TM  12.1
157  138.0             J.J. Barea  33.0  DAL  11.6
173  152.0         Draymond Green  27.0  GSW  11.0
289  245.0       Luc Mbah a Moute  31.0  HOU   7.5
296  252.0       Tomáš Satoranský  26.0  WAS   7.2
311  265.0           Jakob Poeltl  22.0  TOR   6.9

                           Awards
0                   MVP-1,AS,NBA1
1            MVP-3,DPOY-3,AS,NBA1
2                   MVP-2,AS,NBA1
3                   MVP-6,AS,NBA2
4                   MVP-4,AS,NBA1
5                  MVP-10,AS,NBA3
6            MVP-7,DPOY-9,AS,NBA1
7                   MVP-5,AS,NBA2
8                              AS
10                             AS
11                  MVP-9,AS,NBA2
12   MVP-13,DPOY-15,MIP-1,AS,NBA3
13                  MVP-8,AS,NBA2
14          MVP-12,DPOY-2,AS,NBA2
15                             AS
16                             AS
18         MVP-10,DPOY-12,AS,NBA3
19                             AS
20                 DPOY-4,AS,NBA3
25                        AS,NBA3
29                     DPOY-11,AS
33                             AS
49                             AS
52                             AS
62                             AS
79                     DPOY-15,AS
80                        6MOY-12
121                     DPOY-5,AS
142                       6MOY-12
157                       6MOY-12
173                     DPOY-6,AS
289                       6MOY-12
296                       6MOY-12
311                       6MOY-12
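Note that `str.contains` performs a substring (regex) match, not a whole-token match. A plain-Python sketch of the same filter with the `re` module, on made-up entries:

```python
import re

# Hypothetical entries; the real code filters tables[0]["Awards"] the same way.
pattern = re.compile(r"AS|6MOY-12")
awards = ["MVP-1,AS,NBA1", "6MOY-12", "DPOY-7", None]

# Keep entries containing "AS" or "6MOY-12"; None plays the role of NaN
# (which str.contains skips when na=False).
kept = [a for a in awards if a is not None and pattern.search(a)]
print(kept)  # ['MVP-1,AS,NBA1', '6MOY-12']
```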
On awards, can we tell which top-ranked players have won awards and which have not? We can even go further and determine the players with the highest number of awards. But wait: going straight to coding raises errors. It is not that simple!

The best first step is to display the Awards column to understand the structure of the data. That will help us when counting awards to find the players with the most of them.
# Display the Awards column
tables[0]["Awards"]

0             MVP-1,AS,NBA1
1      MVP-3,DPOY-3,AS,NBA1
2             MVP-2,AS,NBA1
3             MVP-6,AS,NBA2
4             MVP-4,AS,NBA1
               ...
660                     NaN
661                     NaN
662                     NaN
663                     NaN
664                     NaN
Name: Awards, Length: 665, dtype: object
Aha, we can see that the awards column contains multiple awards separated by commas. We can create a new column to count the number of awards for each player. We will name it “Awards count.” Let’s do that!
# The number of awards for each player
players_awards = tables[0][["Rk", "Player", "Age", "Team", "PTS", "Awards"]].copy()
# Replace NaN in the Awards column with an empty string
players_awards["Awards"] = players_awards["Awards"].fillna("")

players_awards["Awards count"] = players_awards["Awards"].apply(
    lambda x: 0 if x == "" else x.count(",") + 1)
players_awards
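The comma-count rule is worth checking in isolation. Here is a hypothetical helper mirroring the lambda above, with its edge cases:

```python
def award_count(entry: str) -> int:
    """Number of comma-separated awards; 0 for an empty string."""
    return 0 if entry == "" else entry.count(",") + 1

print(award_count(""))               # 0: no awards
print(award_count("AS"))             # 1: a single award has no commas
print(award_count("MVP-1,AS,NBA1"))  # 3: two commas means three awards
```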
We can also explore the relationship between age and points scored by the players. We can create a scatter plot to visualize this relationship.
# Relationship between age and points
players = tables[0][["Rk", "Player", "Age", "Team", "PTS", "Awards"]]

ax = players.plot.scatter(x="Age", y="PTS", title="Relationship between Age and Points")
ax.figure.set_facecolor("red")
plt.show()
# Creating age groups, finding the average points for each group, and plotting the results
players_age_grp = tables[0][["Rk", "Player", "Age", "Team", "PTS", "Awards"]].copy()
players_age_grp["Age group"] = pd.cut(
    players_age_grp["Age"],
    bins=[15, 20, 25, 30, 35, 40, 45],
    labels=["15-20", "21-25", "26-30", "31-35", "36-40", "41-45"],
    include_lowest=True
)
# Group by age group and calculate average points
avg_points_by_age = players_age_grp.groupby("Age group", observed=False)["PTS"].mean()

# Plot the results
plt.figure(facecolor="grey")
avg_points_by_age.plot(kind="bar", title="Average Points by Age Group", color="blue")
plt.xlabel("Age Group")
plt.ylabel("Average Points")
plt.tight_layout()
plt.show()
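By default `pd.cut` uses right-closed bins, so an age of exactly 20 lands in the 15-20 group. A stdlib sketch of that binning rule, using a hypothetical helper that mirrors the bins above:

```python
import bisect

bins = [15, 20, 25, 30, 35, 40, 45]
labels = ["15-20", "21-25", "26-30", "31-35", "36-40", "41-45"]

def age_group(age: float) -> str:
    # bisect_left over the right edges reproduces right-closed (a, b] bins
    return labels[bisect.bisect_left(bins[1:], age)]

print(age_group(19))  # 15-20
print(age_group(20))  # 15-20 (the right edge is inclusive)
print(age_group(21))  # 21-25
print(age_group(41))  # 41-45
```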
Also, who are the top 10 players with the most awards? We can sort the data by the “Awards count” column in descending order to find out.
# Top 10 players with the most awards
players_awards.sort_values(by="Awards count", ascending=False).head(10)
      Rk             Player   Age Team   PTS                        Awards  Awards count
12  13.0     Victor Oladipo  25.0  IND  23.1  MVP-13,DPOY-15,MIP-1,AS,NBA3             5
18  19.0       Jimmy Butler  28.0  MIN  22.2        MVP-10,DPOY-12,AS,NBA3             4
14  15.0        Joel Embiid  23.0  PHI  22.9         MVP-12,DPOY-2,AS,NBA2             4
6    7.0       Kevin Durant  29.0  GSW  26.4          MVP-7,DPOY-9,AS,NBA1             4
1    2.0      Anthony Davis  24.0  NOP  28.1          MVP-3,DPOY-3,AS,NBA1             4
0    1.0       James Harden  28.0  HOU  30.4                 MVP-1,AS,NBA1             3
13  14.0      DeMar DeRozan  28.0  TOR  23.0                 MVP-8,AS,NBA2             3
11  12.0  LaMarcus Aldridge  32.0  SAS  23.1                 MVP-9,AS,NBA2             3
20  21.0        Paul George  27.0  OKC  21.9                DPOY-4,AS,NBA3             3
7    8.0  Russell Westbrook  29.0  OKC  25.4                 MVP-5,AS,NBA2             3
Summary
We imported tables from the web and analyzed NBA player statistics: the oldest and youngest players, the top 10 oldest and youngest ranked by points per game, the players who have won awards, the number of awards per player, and the relationship between age and points. Along the way we built visualizations to understand the data better and identified the top 10 players with the most awards. These insights into NBA player statistics can serve as a starting point for further analysis and decision-making.