132 changes: 40 additions & 92 deletions 2110ACDS_VM13(final)_notebook.ipynb
Expand Up @@ -87,14 +87,7 @@
" <a id=\"one\"></a>\n",
"## 1. Importing Packages\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Importing Packages ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -152,14 +145,7 @@
"## 2. Loading the Data\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Loading the data ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to load the data from the `df_train` file into a DataFrame. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -988,13 +974,7 @@
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Exploratory data analysis ⚡ |\n",
"| :--------------------------- |\n",
"| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |\n",
"\n",
"---\n"
"\n"
]
},
{
Expand Down Expand Up @@ -1704,18 +1684,6 @@
},
"id": "ZJFPf6ndaIkF"
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fb74182",
"metadata": {
"id": "2fb74182"
},
"outputs": [],
"source": [
"# plot relevant feature interactions"
]
},
{
"cell_type": "markdown",
"source": [
Expand Down Expand Up @@ -3122,7 +3090,7 @@
"> * the error variance (i.e. the difference between the predicted values and observed values) reducing the power of statistical tests. Kurtosis tests for outliers.\n",
"\n",
"2. Skewness: tests by how much the overall shape of a distribution deviates from the shape of the normal distribution. Variables can be negatively or postively skewed.\n",
"\n",
"*italicized text*\n",
"\n"
],
"metadata": {
Expand Down Expand Up @@ -3174,7 +3142,7 @@
{
"cell_type": "markdown",
"source": [
"The load shortfall decreased between the months of January-February, July-August and November-December. There was a huge increase May-june and another increase between October-November."
"*The load shortfall decreased between the months of January-February, July-August and November-December. There was a huge increase May-june and another increase between October-November.*"
],
"metadata": {
"id": "nSfx6i6pgE4t"
Expand Down Expand Up @@ -3935,7 +3903,7 @@
{
"cell_type": "markdown",
"source": [
"The scatter plot has a cone shape around the regression line indicating heteroscedasticity(i.e. when the residuals are observed to have unequal variance). It could be as a result wide range of values which are more prone to heteroskedasticity because the differences between the smallest and largest values are so significant."
"*The scatter plot has a cone shape around the regression line indicating heteroscedasticity(i.e. when the residuals are observed to have unequal variance). It could be as a result wide range of values which are more prone to heteroskedasticity because the differences between the smallest and largest values are so significant.*"
],
"metadata": {
"id": "7FTF2LYyD5qr"
Expand Down Expand Up @@ -4029,7 +3997,7 @@
{
"cell_type": "markdown",
"source": [
"The residuals are normally distributed meaning that the assumption of linear model is valid"
"*The residuals are normally distributed meaning that the assumption of linear model is valid*"
],
"metadata": {
"id": "nT5QbceAH3ic"
Expand All @@ -4047,14 +4015,7 @@
"## 4. Data Engineering\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Data engineering ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -4900,7 +4861,7 @@
"cell_type": "markdown",
"source": [
"**Test Data**:\n",
"Formatting the df_test data to conform to df_train data"
"*Formatting the df_test data to conform to df_train data*"
],
"metadata": {
"id": "LBACTH67eT8h"
Expand Down Expand Up @@ -6094,15 +6055,7 @@
"<a id=\"five\"></a>\n",
"## 5. Modelling\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Modelling ⚡ |\n",
"| :--------------------------- |\n",
"| In this section, you are required to create one or more regression models that are able to accurately predict the thee hour load shortfall. |\n",
"\n",
"---"
"<a href=#cont>Back to Table of Contents</a>\n"
]
},
{
Expand Down Expand Up @@ -6596,7 +6549,7 @@
{
"cell_type": "code",
"source": [
"pred_test_truerf = model_rf.predict(df_standard)"
"pred_test_truerf = model_rf.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "sqaDKb8HJUyH"
Expand Down Expand Up @@ -6961,6 +6914,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_rf = df1[['time','load_shortfall_3h']]\n",
"df_rf.head()"
],
Expand Down Expand Up @@ -7167,7 +7121,7 @@
{
"cell_type": "code",
"source": [
"y_pred_gb = model_gb.predict(df_standard)"
"y_pred_gb = model_gb.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "IIvaPkJ-LvQG"
Expand Down Expand Up @@ -7532,6 +7486,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_gb = df1[['time','load_shortfall_3h']]\n",
"df_gb.head()"
],
Expand Down Expand Up @@ -7723,7 +7678,7 @@
{
"cell_type": "code",
"source": [
"y_pred_rfe_gb = rfe_gb.predict(df_standard)"
"y_pred_rfe_gb = rfe_gb.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "G8BvCho4e8k-"
Expand All @@ -7735,6 +7690,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df1['load_shortfall_3h'] = y_pred_rfe_gb\n",
"df1.head()"
],
Expand Down Expand Up @@ -8088,6 +8044,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_rfe_gb = df1[['time','load_shortfall_3h']]\n",
"df_rfe_gb.head()"
],
Expand Down Expand Up @@ -8290,7 +8247,7 @@
{
"cell_type": "code",
"source": [
"y_pred_xgb =model_xgb.predict(df_standard)"
"y_pred_xgb =model_xgb.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "IbRDjY2YuVnw"
Expand Down Expand Up @@ -8655,6 +8612,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_xgb = df1[['time','load_shortfall_3h']]\n",
"df_xgb.head()"
],
Expand Down Expand Up @@ -8845,7 +8803,7 @@
{
"cell_type": "code",
"source": [
"y_pred_xgb1 =model_xgb1.predict(df_standard)"
"y_pred_xgb1 =model_xgb1.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "YtETtoEp1DuC"
Expand Down Expand Up @@ -9210,6 +9168,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_xgb1 = df1[['time','load_shortfall_3h']]\n",
"df_xgb1.head()"
],
Expand Down Expand Up @@ -9400,7 +9359,7 @@
{
"cell_type": "code",
"source": [
"y_pred_gb2 = model_gb2.predict(df_standard)"
"y_pred_gb2 = model_gb2.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "_v0VOzCvZieA"
Expand Down Expand Up @@ -9765,6 +9724,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_gb2 = df1[['time','load_shortfall_3h']]\n",
"df_gb2.head()"
],
Expand Down Expand Up @@ -10244,14 +10204,7 @@
"## 6. Model Performance\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Model performance ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -10289,6 +10242,7 @@
],
"source": [
"# Compare model performance\n",
"# loading a dataframe with the performance indicators of each model\n",
"performance_df = pd.read_csv('model_perf.csv', encoding='cp1252', names = ['Model', 'Train_MSE',\t'Train_RMSE',\t'Train_R_squared',\t'Test_MSE',\t'Test_RMSE',\t'Test_R_squared'])\n",
"performance_df = performance_df.drop(labels=[0], axis=0)\n",
"performance_df.info()"
Expand All @@ -10297,6 +10251,7 @@
{
"cell_type": "code",
"source": [
"#Converting the datatype from object to float\n",
"performance_df['Train_MSE'] = performance_df['Train_MSE'].map(lambda x: float(x))\n",
"performance_df['Train_RMSE'] = performance_df['Train_RMSE'].map(lambda x: float(x))\n",
"performance_df['Train_R_squared'] = performance_df['Train_R_squared'].map(lambda x: float(x))\n",
Expand Down Expand Up @@ -10351,6 +10306,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the RMSE of the train and test datasets\n",
"train_perf = performance_df[['Model','Train_RMSE','Test_RMSE']]\n",
"train_perf"
],
Expand Down Expand Up @@ -10547,7 +10503,6 @@
"# Performance based on Train_RMSE\n",
"train_perf_sorted= train_perf.sort_values('Train_RMSE', ascending=False)\n",
"train_perf_sorted.plot(kind='barh' , x='Model', y='Train_RMSE')\n",
"#plt.xticks(rotation=None, horizontalalignment=\"center\")\n",
"plt.title(\"Model Performance based on Train_RMSE\")\n",
"plt.xlabel(\"Root Mean Square Error\")\n",
"plt.ylabel(\"Model\")\n"
Expand Down Expand Up @@ -10592,7 +10547,6 @@
"sns.set_style(\"dark\")\n",
"test_perf_sorted = train_perf.sort_values('Test_RMSE', ascending=False)\n",
"test_perf_sorted.plot(kind='barh' , x='Model', y='Test_RMSE',)\n",
"#plt.xticks(rotation=None, horizontalalignment=\"center\")\n",
"plt.title(\"Model Performance based on Test_RMSE\")\n",
"plt.xlabel(\"Root Mean Square Error\")\n",
"plt.ylabel(\"Model\")"
Expand Down Expand Up @@ -10633,7 +10587,7 @@
{
"cell_type": "markdown",
"source": [
"Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data. The lower the RMSE the better the model. Light Gradient Model and the Random forest model have the lowest RMSE when it come to Training data however they are among the model with the Highest RMSE in the Test data indicating that there may be overfitting."
"*Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data. The lower the RMSE the better the model. Light Gradient Model and the Random forest model have the lowest RMSE when it come to Training data however they are among the model with the Highest RMSE in the Test data indicating that there may be overfitting.*"
],
"metadata": {
"id": "XSOOwpfsRIMZ"
Expand Down Expand Up @@ -10687,7 +10641,9 @@
{
"cell_type": "markdown",
"source": [
"The best model is the Gradient Boosting Model2(loss = 'ls', learning_rate = 0.15) as it has the higest R_square at about 30% mean that the model explains about 30% of the variation in the target variable which is the load_shortfall_3h. The model also has the lowest RMSE at 3959.45"
"The best model is the Gradient Boosting Model2 with least squares loss function and a learning rate of 15% as it has the higest R_square at about 30% mean that the model explains about 30% of the variation in the target variable which is the load_shortfall_3h. The model also has the lowest RMSE at 3959.45\n",
"The loss function is a measure indicating how good are model’s coefficients are at fitting the underlying data.\n",
"The learning rate simply means how fast the model is learning."
],
"metadata": {
"id": "xDu9__EbuDDw"
Expand All @@ -10705,32 +10661,24 @@
"## 7. Model Explanations\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Model explanation ⚡ |\n",
"| :--------------------------- |\n",
"| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |\n",
"\n",
"---"
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff741c2",
"cell_type": "markdown",
"source": [
"**The best model logic**"
],
"metadata": {
"id": "5ff741c2"
"id": "RcwKxvr1rXnc"
},
"outputs": [],
"source": [
"# discuss chosen methods logic"
]
"id": "RcwKxvr1rXnc"
},
{
"cell_type": "markdown",
"source": [
"Gradient boosting method relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. It combines the efforts of multiple weak models(decision trees) to create a strong model, and each additional weak model reduces the mean squared error (which the average of the square of the difference between the true targets and the predicted values from a set of observations, such as a training or validation set)of the overall model."
"Gradient boosting method relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. It combines the efforts of multiple weak models(decision trees) to create a strong model, and each additional weak model reduces the mean squared error (which the average of the square of the difference between the true targets and the predicted values from a set of observations, such as a training or validation set)of the overall model.\n",
"A decision tree is a machine learning model that builds upon iteratively asking questions to partition data and reach a solution. "
],
"metadata": {
"id": "C5ijzUwCwNQf"
Expand Down