132 changes: 40 additions & 92 deletions 2110ACDS_VM13(final)_notebook.ipynb
Expand Up @@ -87,14 +87,7 @@
" <a id=\"one\"></a>\n",
"## 1. Importing Packages\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Importing Packages ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -152,14 +145,7 @@
"## 2. Loading the Data\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Loading the data ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to load the data from the `df_train` file into a DataFrame. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -988,13 +974,7 @@
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Exploratory data analysis ⚡ |\n",
"| :--------------------------- |\n",
"| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |\n",
"\n",
"---\n"
"\n"
]
},
{
Expand Down Expand Up @@ -1704,18 +1684,6 @@
},
"id": "ZJFPf6ndaIkF"
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fb74182",
"metadata": {
"id": "2fb74182"
},
"outputs": [],
"source": [
"# plot relevant feature interactions"
]
},
{
"cell_type": "markdown",
"source": [
Expand Down Expand Up @@ -3122,7 +3090,7 @@
"> * the error variance (i.e. the difference between the predicted values and observed values) reducing the power of statistical tests. Kurtosis tests for outliers.\n",
"\n",
"2. Skewness: tests by how much the overall shape of a distribution deviates from the shape of the normal distribution. Variables can be negatively or postively skewed.\n",
"\n",
"*italicized text*\n",
"\n"
],
"metadata": {
Expand Down Expand Up @@ -3174,7 +3142,7 @@
{
"cell_type": "markdown",
"source": [
"The load shortfall decreased between the months of January-February, July-August and November-December. There was a huge increase May-june and another increase between October-November."
"*The load shortfall decreased between the months of January-February, July-August and November-December. There was a huge increase May-june and another increase between October-November.*"
],
"metadata": {
"id": "nSfx6i6pgE4t"
Expand Down Expand Up @@ -3935,7 +3903,7 @@
{
"cell_type": "markdown",
"source": [
"The scatter plot has a cone shape around the regression line indicating heteroscedasticity(i.e. when the residuals are observed to have unequal variance). It could be as a result wide range of values which are more prone to heteroskedasticity because the differences between the smallest and largest values are so significant."
"*The scatter plot has a cone shape around the regression line indicating heteroscedasticity(i.e. when the residuals are observed to have unequal variance). It could be as a result wide range of values which are more prone to heteroskedasticity because the differences between the smallest and largest values are so significant.*"
],
"metadata": {
"id": "7FTF2LYyD5qr"
Expand Down Expand Up @@ -4029,7 +3997,7 @@
{
"cell_type": "markdown",
"source": [
"The residuals are normally distributed meaning that the assumption of linear model is valid"
"*The residuals are normally distributed meaning that the assumption of linear model is valid*"
],
"metadata": {
"id": "nT5QbceAH3ic"
Expand All @@ -4047,14 +4015,7 @@
"## 4. Data Engineering\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Data engineering ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -4900,7 +4861,7 @@
"cell_type": "markdown",
"source": [
"**Test Data**:\n",
"Formatting the df_test data to conform to df_train data"
"*Formatting the df_test data to conform to df_train data*"
],
"metadata": {
"id": "LBACTH67eT8h"
Expand Down Expand Up @@ -6094,15 +6055,7 @@
"<a id=\"five\"></a>\n",
"## 5. Modelling\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Modelling ⚡ |\n",
"| :--------------------------- |\n",
"| In this section, you are required to create one or more regression models that are able to accurately predict the thee hour load shortfall. |\n",
"\n",
"---"
"<a href=#cont>Back to Table of Contents</a>\n"
]
},
{
Expand Down Expand Up @@ -6596,7 +6549,7 @@
{
"cell_type": "code",
"source": [
"pred_test_truerf = model_rf.predict(df_standard)"
"pred_test_truerf = model_rf.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "sqaDKb8HJUyH"
Expand Down Expand Up @@ -6961,6 +6914,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_rf = df1[['time','load_shortfall_3h']]\n",
"df_rf.head()"
],
Expand Down Expand Up @@ -7167,7 +7121,7 @@
{
"cell_type": "code",
"source": [
"y_pred_gb = model_gb.predict(df_standard)"
"y_pred_gb = model_gb.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "IIvaPkJ-LvQG"
Expand Down Expand Up @@ -7532,6 +7486,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_gb = df1[['time','load_shortfall_3h']]\n",
"df_gb.head()"
],
Expand Down Expand Up @@ -7723,7 +7678,7 @@
{
"cell_type": "code",
"source": [
"y_pred_rfe_gb = rfe_gb.predict(df_standard)"
"y_pred_rfe_gb = rfe_gb.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "G8BvCho4e8k-"
Expand All @@ -7735,6 +7690,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df1['load_shortfall_3h'] = y_pred_rfe_gb\n",
"df1.head()"
],
Expand Down Expand Up @@ -8088,6 +8044,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_rfe_gb = df1[['time','load_shortfall_3h']]\n",
"df_rfe_gb.head()"
],
Expand Down Expand Up @@ -8290,7 +8247,7 @@
{
"cell_type": "code",
"source": [
"y_pred_xgb =model_xgb.predict(df_standard)"
"y_pred_xgb =model_xgb.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "IbRDjY2YuVnw"
Expand Down Expand Up @@ -8655,6 +8612,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_xgb = df1[['time','load_shortfall_3h']]\n",
"df_xgb.head()"
],
Expand Down Expand Up @@ -8845,7 +8803,7 @@
{
"cell_type": "code",
"source": [
"y_pred_xgb1 =model_xgb1.predict(df_standard)"
"y_pred_xgb1 =model_xgb1.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "YtETtoEp1DuC"
Expand Down Expand Up @@ -9210,6 +9168,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_xgb1 = df1[['time','load_shortfall_3h']]\n",
"df_xgb1.head()"
],
Expand Down Expand Up @@ -9400,7 +9359,7 @@
{
"cell_type": "code",
"source": [
"y_pred_gb2 = model_gb2.predict(df_standard)"
"y_pred_gb2 = model_gb2.predict(df_standard) # predicting yhat from the df_test data"
],
"metadata": {
"id": "_v0VOzCvZieA"
Expand Down Expand Up @@ -9765,6 +9724,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the time and predicted yhats\n",
"df_gb2 = df1[['time','load_shortfall_3h']]\n",
"df_gb2.head()"
],
Expand Down Expand Up @@ -10244,14 +10204,7 @@
"## 6. Model Performance\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Model performance ⚡ |\n",
"| :--------------------------- |\n",
"| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |\n",
"\n",
"---"
"\n"
]
},
{
Expand Down Expand Up @@ -10289,6 +10242,7 @@
],
"source": [
"# Compare model performance\n",
"# loading a dataframe with the performance indicators of each model\n",
"performance_df = pd.read_csv('model_perf.csv', encoding='cp1252', names = ['Model', 'Train_MSE',\t'Train_RMSE',\t'Train_R_squared',\t'Test_MSE',\t'Test_RMSE',\t'Test_R_squared'])\n",
"performance_df = performance_df.drop(labels=[0], axis=0)\n",
"performance_df.info()"
Expand All @@ -10297,6 +10251,7 @@
{
"cell_type": "code",
"source": [
"#Converting the datatype from object to float\n",
"performance_df['Train_MSE'] = performance_df['Train_MSE'].map(lambda x: float(x))\n",
"performance_df['Train_RMSE'] = performance_df['Train_RMSE'].map(lambda x: float(x))\n",
"performance_df['Train_R_squared'] = performance_df['Train_R_squared'].map(lambda x: float(x))\n",
Expand Down Expand Up @@ -10351,6 +10306,7 @@
{
"cell_type": "code",
"source": [
"# subsetting to get only the RMSE of the train and test datasets\n",
"train_perf = performance_df[['Model','Train_RMSE','Test_RMSE']]\n",
"train_perf"
],
Expand Down Expand Up @@ -10547,7 +10503,6 @@
"# Performance based on Train_RMSE\n",
"train_perf_sorted= train_perf.sort_values('Train_RMSE', ascending=False)\n",
"train_perf_sorted.plot(kind='barh' , x='Model', y='Train_RMSE')\n",
"#plt.xticks(rotation=None, horizontalalignment=\"center\")\n",
"plt.title(\"Model Performance based on Train_RMSE\")\n",
"plt.xlabel(\"Root Mean Square Error\")\n",
"plt.ylabel(\"Model\")\n"
Expand Down Expand Up @@ -10592,7 +10547,6 @@
"sns.set_style(\"dark\")\n",
"test_perf_sorted = train_perf.sort_values('Test_RMSE', ascending=False)\n",
"test_perf_sorted.plot(kind='barh' , x='Model', y='Test_RMSE',)\n",
"#plt.xticks(rotation=None, horizontalalignment=\"center\")\n",
"plt.title(\"Model Performance based on Test_RMSE\")\n",
"plt.xlabel(\"Root Mean Square Error\")\n",
"plt.ylabel(\"Model\")"
Expand Down Expand Up @@ -10633,7 +10587,7 @@
{
"cell_type": "markdown",
"source": [
"Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data. The lower the RMSE the better the model. Light Gradient Model and the Random forest model have the lowest RMSE when it come to Training data however they are among the model with the Highest RMSE in the Test data indicating that there may be overfitting."
"*Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data. The lower the RMSE the better the model. Light Gradient Model and the Random forest model have the lowest RMSE when it come to Training data however they are among the model with the Highest RMSE in the Test data indicating that there may be overfitting.*"
],
"metadata": {
"id": "XSOOwpfsRIMZ"
Expand Down Expand Up @@ -10687,7 +10641,9 @@
{
"cell_type": "markdown",
"source": [
"The best model is the Gradient Boosting Model2(loss = 'ls', learning_rate = 0.15) as it has the higest R_square at about 30% mean that the model explains about 30% of the variation in the target variable which is the load_shortfall_3h. The model also has the lowest RMSE at 3959.45"
"The best model is the Gradient Boosting Model2 with least squares loss function and a learning rate of 15% as it has the higest R_square at about 30% mean that the model explains about 30% of the variation in the target variable which is the load_shortfall_3h. The model also has the lowest RMSE at 3959.45\n",
"The loss function is a measure indicating how good are model’s coefficients are at fitting the underlying data.\n",
"The learning rate simply means how fast the model is learning."
],
"metadata": {
"id": "xDu9__EbuDDw"
Expand All @@ -10705,32 +10661,24 @@
"## 7. Model Explanations\n",
"<a class=\"anchor\" id=\"1.1\"></a>\n",
"<a href=#cont>Back to Table of Contents</a>\n",
"\n",
"---\n",
" \n",
"| ⚡ Description: Model explanation ⚡ |\n",
"| :--------------------------- |\n",
"| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |\n",
"\n",
"---"
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff741c2",
"cell_type": "markdown",
"source": [
"**The best model logic**"
],
"metadata": {
"id": "5ff741c2"
"id": "RcwKxvr1rXnc"
},
"outputs": [],
"source": [
"# discuss chosen methods logic"
]
"id": "RcwKxvr1rXnc"
},
{
"cell_type": "markdown",
"source": [
"Gradient boosting method relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. It combines the efforts of multiple weak models(decision trees) to create a strong model, and each additional weak model reduces the mean squared error (which the average of the square of the difference between the true targets and the predicted values from a set of observations, such as a training or validation set)of the overall model."
"Gradient boosting method relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. It combines the efforts of multiple weak models(decision trees) to create a strong model, and each additional weak model reduces the mean squared error (which the average of the square of the difference between the true targets and the predicted values from a set of observations, such as a training or validation set)of the overall model.\n",
"A decision tree is a machine learning model that builds upon iteratively asking questions to partition data and reach a solution. "
],
"metadata": {
"id": "C5ijzUwCwNQf"
Expand Down