154 changes: 154 additions & 0 deletions Assignment_solution_notebook.ipynb
@@ -0,0 +1,154 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Matplotlib is building the font cache; this may take a moment.\n"
]
}
],
"source": [
"import numpy as np, pandas as pd\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# read_csv already returns a DataFrame, so no extra pd.DataFrame wrapper is needed\n",
"data=pd.read_csv('/home/lampros/Desktop/git/Final-Assignment/Final-Assignment-Answers/booze.csv')\n",
"data1=data.copy(True)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"temp1=data1.get(['zip_code','store_name','category_name','bottles_sold'])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"temp1=temp1.sort_values(by=['bottles_sold'],ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"per_zip=temp1.drop_duplicates(subset='zip_code')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Total bottles sold per store; a groupby replaces the manual loop over store names\n",
"dic=temp1.groupby('store_name')['bottles_sold'].sum().to_frame('total')\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"per_zip_new=per_zip.dropna()\n",
"per_zip_new=per_zip_new.merge(right=dic,left_on='store_name',right_on=dic.index)\n",
"# Divide by the total over all stores so each value is a share of total sales\n",
"per_zip_new['percentage']=per_zip_new['total']/dic['total'].sum()*100"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# to_csv returns None when given a path, so there is nothing to assign\n",
"per_zip_new.to_csv('final_csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"plot1=plt.plot(per_zip_new['zip_code'],per_zip_new['bottles_sold'],'o')\n",
"plt.title('Most popular item sold per zip-code')\n",
"plt.xlabel('Zip_codes')\n",
"plt.ylabel('Number of bottles sold')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from matplotlib import ticker\n",
"per_zip_new=per_zip_new.sort_values(by='percentage',ascending=True)\n",
"plot2=plt.barh(per_zip_new['store_name'],per_zip_new['percentage'])\n",
"plt.title('Percentage of stores\\' sales out of total sales')\n",
"plt.xlabel('Percentage')\n",
"plt.ylabel('Stores')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('dataenv')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "0af086dde17faf73389197b1245c24daca1f51df388fefe03ceaf9f55e6322a1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
54 changes: 11 additions & 43 deletions README.md
@@ -1,51 +1,19 @@
# Final-Assignment
This project is designed to simulate the full workflow of a Data Analyst: pulling data out of the database, manipulating it with Python and the Pandas module, and presenting it through the matplotlib module or Tableau.
My solutionest of solutions:

Roughly following the indicated steps,

The concept is that we are given a dataset that contains liquor sales in the state of Iowa, USA, between 2012 and 2020, and we are asked to find the most popular item per zip code and the percentage of sales per store in the period 2016-2019.
1) booze_info.csv is the CSV export of the given SQL file, produced using SQLWorkbench

We are also asked to visualize the Data and present them in either a matplotlib format or in Tableau Public.
2) The Jupyter notebook cells document the steps taken on the Python/Pandas part of the assignment fairly well. The main difficulty was my lack of deeper
knowledge of the Pandas library. I could probably have made things much simpler had I known it better. Nevertheless, it was not too difficult.

Every calculation and transformation of Data has to happen through a Python Script.
3) The final .csv file contains the results of the analysis.

## The following steps are just a recommendation; we suggest you try to think outside the box while working with this data and maybe take different paths.
4) The two .png figures are the product of the Matplotlib-related problems, created using the aforementioned dataset. The results are satisfying contextually,
but do look like a proper mess. I had trouble figuring out how to assign unique colors to every zip-code in fig1, and I spent a whole lot more time than I
should have trying to figure out how to format the y-axis store names in fig2 so that they do not overlap.
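
One possible way around the two plotting problems mentioned in point 4 is sketched below. It is only an illustration under stated assumptions: the zip codes, store names, and numbers are made-up stand-ins for the final CSV, the `tab20` colormap is an arbitrary choice, and the `Agg` backend is selected only so the sketch runs without a display. The idea is to draw one color per zip code from a colormap, and to let the figure height grow with the number of stores so the y-axis labels have room.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: no display is available)
import matplotlib.pyplot as plt

# Hypothetical values standing in for the final CSV.
zip_codes = ["50010", "50314", "52240", "52402"]
bottles = [120, 300, 90, 210]
stores = ["Store A", "Store B", "Store C", "Store D"]
percentages = [10.0, 40.0, 15.0, 35.0]

# fig1: pick one color per zip code from a qualitative colormap.
cmap = plt.get_cmap("tab20")
colors = [cmap(i % cmap.N) for i in range(len(zip_codes))]
fig1, ax1 = plt.subplots()
ax1.scatter(zip_codes, bottles, c=colors)
ax1.set_title("Most popular item sold per zip-code")
ax1.set_xlabel("Zip_codes")
ax1.set_ylabel("Number of bottles sold")

# fig2: scale the figure height with the number of bars and shrink the
# label font so that store names on the y-axis do not overlap.
height = max(2.0, 0.3 * len(stores))
fig2, ax2 = plt.subplots(figsize=(8, height))
ax2.barh(stores, percentages)
ax2.tick_params(axis="y", labelsize=8)
ax2.set_xlabel("Percentage")
ax2.set_ylabel("Stores")
fig2.tight_layout()
```

With many more than twenty zip codes the colors start to repeat, so a continuous colormap or a legend-free plot may be the more practical choice there.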


- ###### Step 1.

Add the Dataset provided to Workbench.

- ###### Step 2.

Use a Query to get all the columns of the table between the years 2016-2019

- ###### Step 3.

Export the data to a CSV file as shown below

![image](https://user-images.githubusercontent.com/84134316/184128259-8ce76a57-d31a-4fdb-86d2-e38d46fc253c.png)

- ###### Step 4.

Use Python and Pandas to Aggregate the CSV data so we can get the most popular item sold based on zip code and percentage of sales per store.

- ###### Step 5.

Use Matplotlib or Tableau with the newly made CSV file and present your Data.

- ###### Step 6.

Write a report of the steps you did and what difficulties you faced.
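
Step 4 could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`zip_code`, `store_name`, `category_name`, `bottles_sold`) are the ones used in the notebook, while the tiny in-memory frame and the function names `top_item_per_zip` and `store_share` are made up for the example. It also assumes that "most popular item" means the category with the highest summed `bottles_sold` within each zip code.

```python
import pandas as pd

def top_item_per_zip(sales: pd.DataFrame) -> pd.DataFrame:
    """Return, for each zip code, the category with the most bottles sold."""
    totals = (
        sales.groupby(["zip_code", "category_name"], as_index=False)["bottles_sold"]
        .sum()
    )
    # idxmax gives the row label of the largest total within each zip code.
    best_rows = totals.groupby("zip_code")["bottles_sold"].idxmax()
    return totals.loc[best_rows].reset_index(drop=True)

def store_share(sales: pd.DataFrame) -> pd.Series:
    """Return each store's percentage of all bottles sold."""
    per_store = sales.groupby("store_name")["bottles_sold"].sum()
    return per_store / per_store.sum() * 100

# Hypothetical sample rows standing in for the exported CSV.
sales = pd.DataFrame({
    "zip_code": [50010, 50010, 50010, 52240],
    "store_name": ["A", "A", "B", "C"],
    "category_name": ["VODKA", "RUM", "VODKA", "RUM"],
    "bottles_sold": [10, 30, 25, 35],
})
top = top_item_per_zip(sales)
share = store_share(sales)
```

Summing before picking the maximum matters: taking the single largest row per zip code would report the biggest individual sale rather than the best-selling category.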

## The visualization should look something like this, but you are free to experiment:


### 1. If you are using Matplotlib


<img src="https://user-images.githubusercontent.com/84134316/183881562-1bbd2503-1ebd-47a1-a396-97af4acebc46.png" width="450">

### 2. If you are using Tableau


<img src="https://user-images.githubusercontent.com/84134316/183916100-85c98b3b-5de7-40dd-bbc1-cefdaacb0619.png" width="450">
Final thoughts: all in all, this assignment would have taken a fraction of the time had we used PowerBI; but personally, I prefer these clunky results to looking
straight at the blazing sun that is PowerBI's light mode interface.