- End-to-end project classifying real and fake bank notes.
- Apache Spark's Python API (PySpark) used for its speed benefits.
- Resources
- Data Collection
- Data Pre-processing
- Data Warehousing
- Exploratory data analysis
- Feature Engineering
- ML/DL Model Building
- Model performance
- Model Evaluation
- Project Management (Agile/Scrum/Kanban)
- Project Evaluation
- Looking Ahead
- Questions & Contact me
PySpark, Python 3, PostgreSQL
Anaconda Packages: pandas numpy scikit-learn matplotlib seaborn sqlalchemy kaggle psycopg2 ipykernel pyspark
PowerShell command for installing the Anaconda packages used for this project
pip install pandas numpy scikit-learn matplotlib seaborn sqlalchemy kaggle psycopg2 ipykernel pyspark
Initiating the Spark session
from pyspark.sql import SparkSession

# Creating spark session
spark = SparkSession.builder.appName('P10').getOrCreate()
PowerShell command for data import using the kaggle API
!kaggle datasets download -d ritesaluja/bank-note-authentication-uci-data -p ..\Data --unzip
- Rows: 1372 / Columns: 5
- variance
- skewness
- curtosis
- entropy
- class
With all the data collected, I checked it was ready for exploration and later modelling. I made the following changes and created the following variables:
- General NULL and data validity checks
- The datatypes needed correcting, and one column needed renaming because 'class' is a reserved word in Python
# Correcting data types
data = data.selectExpr("cast(variance as double) variance",
                       "cast(skewness as double) skewness",
                       "cast(curtosis as double) curtosis",
                       "cast(entropy as double) entropy",
                       "cast(class as double) class")
# Renaming class column
data = data.withColumnRenamed("class","outcome")I warehouse all data in a Postgre database for later use and reference.
- ETL in python to PostgreSQL Database.
- Formatted column headers to SQL compatibility.
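The SQL-compatibility formatting amounts to lower-casing names and replacing spaces with underscores. A standalone sketch of that normalisation (the helper name `format_tablename` is illustrative; the `store_data` function below inlines the same two steps):

```python
def format_tablename(tablename):
    """Lower-case a name and replace spaces with underscores for PostgreSQL."""
    return tablename.lower().replace(' ', '_')

print(format_tablename('P10 Bank Note Authentication'))
# p10_bank_note_authentication
```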
from sqlalchemy import create_engine

# Function to warehouse data in a PostgreSQL database and save the cleaned
# data in the Data folder. As this is PySpark, .toPandas() is needed wherever
# a pandas DataFrame is expected
def store_data(data, tablename):
    """
    :param data: variable, enter name of dataset you'd like to warehouse
    :param tablename: str, enter name of table for data
    """
    # SQL table header format
    tablename = tablename.lower()
    tablename = tablename.replace(' ', '_')
    # Saving cleaned data as csv
    data.toPandas().to_csv(f'../Data/{tablename}_clean.csv', index=False)
    # Engine to access PostgreSQL
    engine = create_engine('postgresql+psycopg2://postgres:password@localhost:5432/projectsdb')
    # Loads dataframe into PostgreSQL and replaces table if it exists
    data.toPandas().to_sql(f'{tablename}', engine, if_exists='replace', index=False)
    # Confirmation of ETL
    return "ETL successful, {num} rows loaded into table: {tb}.".format(num=data.count(), tb=tablename)
# Calling store_data function to warehouse cleaned data
store_data(data, "P10 Bank Note Authentication")
I looked at the distributions of the data and the value counts of the variables that would be fed into the model. Below are a few highlights from the analysis.
- 55.54% of the bank notes in the data are real notes.
- Boxplots were used to visualise features with outliers. These features were not scaled as this was a POC project; LogisticRegression models are sensitive to the range of the data points, so scaling is advised.
- I visualised the distribution of features for the fake bank notes
- The correlation matrix shows those features with strong and particularly weak correlations
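The 55.54% figure above follows directly from the raw class counts (assuming the UCI distribution of 762 genuine to 610 forged notes, with 0 denoting genuine):

```python
# Class counts in the UCI bank note data (assumption: 0 = genuine, 1 = forged)
genuine, forged = 762, 610
total = genuine + forged  # 1372 rows, matching the dataset shape

share_real = round(genuine / total * 100, 2)
print(f'{share_real}% of notes are genuine')  # 55.54% of notes are genuine
```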
There was no need to transform the categorical variable(s) into dummy variables as they are all numeric. I also split the data into train and tests sets with a test size of 20%.
- I used randomSplit instead of train_test_split from sklearn
# Splitting data into train and test data
train_data, test_data = model_data.randomSplit([0.80, 0.20], seed=23)
I used a LogisticRegression model and evaluated it initially with accuracy_score and a confusion matrix.
- I fed the independent and dependent features to the model and trained it on the training data
from pyspark.ml.classification import LogisticRegression

# Calling LogisticRegression algorithm and applying features and outcome
regressor = LogisticRegression(featuresCol='features', labelCol='outcome')
# Training model on training data
model = regressor.fit(train_data)
The Logistic Regression model performed well on both the train and test sets.
- Logistic Regression : Accuracy = 100%
- The ROC curve reflects this accuracy on both the train and test datasets
- A confusion matrix shows the True and False predictions achieved by the model.
- The model performed perfectly, with no False Positives or False Negatives (FP/FN).
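The link between the confusion matrix and the accuracy score can be made explicit. A sketch with hypothetical cell counts (the 20% test set holds roughly 275 of the 1372 rows; only the zero FP/FN counts are taken from the result above):

```python
# Hypothetical confusion-matrix cells for a perfect classifier on the test set
tp, tn, fp, fn = 120, 155, 0, 0

# Accuracy = correct predictions over all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 1.0
```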
- Resources used
- Jira
- Confluence
- Trello
- WWW
- The end-to-end process
- Use of Spark and Databricks
- EBI
- Look at productionising using MLlib
- What next
- More PySpark???
For questions, feedback, and contribution requests, contact me






