Workearly · LabrosKarabalis · Oct 27, 2022 · Oct 27, 2022 · Oct 28, 2022 · Oct 31, 2022
diff --git a/Assignment_solution_notebook.ipynb b/Assignment_solution_notebook.ipynb
@@ -0,0 +1,154 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Matplotlib is building the font cache; this may take a moment.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import numpy as np, pandas as pd\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data=pd.read_csv('/home/lampros/Desktop/git/Final-Assignment/Final-Assignment-Answers/booze.csv')\n",
+    "data1=data.copy(True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "temp1=data1.get(['zip_code','store_name','category_name','bottles_sold'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "temp1=temp1.sort_values(by=['bottles_sold'],ascending=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "per_zip=temp1.drop_duplicates(subset='zip_code')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "store_total=(temp1.sort_values(by='store_name'))\n",
+    "temp2=store_total.set_index('store_name')\n",
+    "temp3=store_total.drop_duplicates(subset='store_name');names=temp3.iloc[:,1]\n",
+    "dic=pd.DataFrame(columns=['total','percentage'],index=names)\n",
+    "for store in names[:]:\n",
+    "    a=(temp2.loc[store,'bottles_sold'])\n",
+    "    \n",
+    "    if type(a)!=pd.Series:\n",
+    "        dic.loc[store,'total']=a\n",
+    "    else:\n",
+    "        dic.loc[store,'total']=sum(a)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "per_zip_new=per_zip.dropna()\n",
+    "per_zip_new=per_zip_new.merge(right=dic,left_on='store_name',right_on=dic.index)\n",
+    "per_zip_new['percentage']=per_zip_new['total']/sum(per_zip_new.get('total'))*100"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "new_csv=per_zip_new.to_csv('final_csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "plot1=plt.plot(per_zip_new['zip_code'],per_zip_new['bottles_sold'],'o')\n",
+    "plt.title('Most popular item sold per zip-code')\n",
+    "plt.xlabel('Zip_codes')\n",
+    "plt.ylabel('Number of bottles sold')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "from matplotlib import ticker\n",
+    "per_zip_new=per_zip_new.sort_values(by='percentage',ascending=True)\n",
+    "plot2=plt.barh(per_zip_new['store_name'],per_zip_new['percentage'])\n",
+    "plt.title('Percentage of stores\\' sales out of total sales')\n",
+    "plt.xlabel('Percentage')\n",
+    "plt.ylabel('Stores')\n",
+    "plt.show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.10.6 ('dataenv')",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  },
+  "orig_nbformat": 4,
+  "vscode": {
+   "interpreter": {
+    "hash": "0af086dde17faf73389197b1245c24daca1f51df388fefe03ceaf9f55e6322a1"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
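
The loop in the `execution_count: 8` cell, which totals `bottles_sold` per store, could probably be expressed with `groupby`. The following is only a minimal sketch under stated assumptions: the sample frame is made up for illustration (it is not taken from booze.csv), and the per-zip step sums bottles per category, which is a different definition of "most popular" from the notebook's largest single row per zip code.

```python
import pandas as pd

# Tiny made-up sample with the four columns the notebook selects;
# the values are illustrative only, not taken from booze.csv.
sales = pd.DataFrame({
    "zip_code": ["50010", "50010", "50010", "52240", "52240"],
    "store_name": ["Store A", "Store A", "Store B", "Store C", "Store C"],
    "category_name": ["VODKA", "RUM", "VODKA", "RUM", "RUM"],
    "bottles_sold": [10, 4, 6, 3, 7],
})

# Total bottles per store, replacing the explicit loop over store names.
store_total = sales.groupby("store_name")["bottles_sold"].sum()

# Each store's share of all bottles sold, in percent.
store_pct = store_total / store_total.sum() * 100

# Bottles per (zip_code, category_name), then the top category in each zip code.
per_zip_cat = sales.groupby(
    ["zip_code", "category_name"], as_index=False
)["bottles_sold"].sum()
top_rows = per_zip_cat.groupby("zip_code")["bottles_sold"].idxmax()
top_per_zip = per_zip_cat.loc[top_rows]

print(store_pct.round(2))
print(top_per_zip)
```

Because the percentages here are computed over every store rather than only the stores that survive the per-zip filter, the numbers would likely differ from those in the notebook's final CSV.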
diff --git a/README.md b/README.md
@@ -1,51 +1,19 @@
 # Final-Assignment
-This project is designed to simulate a full workflow of a Data Analyst from getting data off the Database to manipulate it with the use of Python and Pandas module to present it through matplotlib module or Tableau.
+My solutionest of solutions: 
 
+Roughly following the indicated steps,
 
-The concept is that we are given a dataset that contains Liquor Sales in the state of Iowa in USA between 2012-2020 and we are asked to find the most popular item per zipcode and the percentage of sales per store in the period between 2016-2019.
+1) booze_info.csv is the CSV file produced by running the given SQL file in SQLWorkbench.
 
-We are also asked to visualize the Data and present them in either a matplotlib format or in Tableau Public.
+2) The Jupyter notebook cells show the steps taken in the Python/Pandas part of the assignment fairly well. The main difficulty was my limited
+knowledge of the Pandas library; I could probably have made things much simpler had I known it better. Nevertheless, it was not too difficult.
 
-Every calculation and transformation of Data has to happen through a Python Script. 
+3) The final .csv file contains the results of the analysis.
 
-## The following steps are just a recommendation, we suggest you trying and think outside the box while working with this data and maybe take different paths.
+4) The two .png figures are the product of the Matplotlib part of the assignment, created from the dataset mentioned above. The results are correct in content,
+but they do look like a proper mess. I had trouble figuring out how to assign unique colors to every zip-code in fig1, and I spent a whole lot more time than I
+should have trying to figure out how to format the y-axis store names in fig2 so that they do not overlap.
 
 
-- ###### Step 1.
-
-Add the Dataset provided to Workbench.
-
-- ###### Step 2.
-
-Use a Query to get all the columns of the table between the years 2016-2019
-
-- ###### Step 3.
-
-Export the data to an CSV file like shown below
-
-![image](https://user-images.githubusercontent.com/84134316/184128259-8ce76a57-d31a-4fdb-86d2-e38d46fc253c.png)
-
-- ###### Step 4.
-
-Use Python and Pandas to Aggregate the CSV data so we can get the most popular item sold based on zip code and percentage of sales per store.
-
-- ###### Step 5.
-
-Use Matplotlib or Tableau with the newly made CSV file and present your Data.
-
-- ###### Step 6.
-
-Write a report of the steps you did and what difficulties you faced.
-
-## The visualization should look similar to something like this, but you are free to experiment. :
-
-
-### 1.  If you are using Matplotlib
-
-
-<img src="https://user-images.githubusercontent.com/84134316/183881562-1bbd2503-1ebd-47a1-a396-97af4acebc46.png" width="450">
-
-### 2. If you are using Tableau
-
-
-<img src="https://user-images.githubusercontent.com/84134316/183916100-85c98b3b-5de7-40dd-bbc1-cefdaacb0619.png" width="450">
+Final thoughts: all in all, this assignment would have taken a fraction of the time had we used PowerBI; but personally, I prefer these clunky results to looking
+straight at the blazing sun that is PowerBI's light mode interface.
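
On the two plotting difficulties mentioned in point 4, one possible approach is sketched below. It is a sketch under assumptions, not the notebook's code: the sample frame is made up, `pd.factorize` is used to turn each zip code into an integer that a colormap can color, and the bar chart's height is scaled with the number of stores so the y-axis labels have room.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the final CSV; the values are illustrative only.
df = pd.DataFrame({
    "zip_code": ["50010", "52240", "50314"],
    "store_name": ["Store A", "Store B", "Store C"],
    "bottles_sold": [16, 10, 8],
    "percentage": [47.1, 29.4, 23.5],
})

# fig1: one integer code per distinct zip code, mapped through a colormap.
codes = pd.factorize(df["zip_code"])[0]
fig1, ax1 = plt.subplots()
ax1.scatter(df["zip_code"], df["bottles_sold"], c=codes, cmap="tab20")
ax1.set_title("Most popular item sold per zip-code")
ax1.tick_params(axis="x", labelrotation=90)  # rotate zip codes so they do not collide

# fig2: make the figure taller as the number of stores grows.
height = max(2.0, 0.3 * len(df))
fig2, ax2 = plt.subplots(figsize=(8, height))
ax2.barh(df["store_name"], df["percentage"])
ax2.set_xlabel("Percentage")
ax2.tick_params(axis="y", labelsize=8)  # smaller font for long store names
fig2.tight_layout()  # keep long labels inside the figure
```

The `tab20` colormap has only 20 distinct colors, so with more zip codes than that some colors repeat; a continuous colormap avoids repeats at the cost of shades that are hard to tell apart.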