- End-to-end project classifying real and fake bank notes.
- Apache Spark's Python API (PySpark) used for its speed benefits.
- Resources
- Data Collection
- Data Pre-processing
- Data Warehousing
- Exploratory data analysis
- Feature Engineering
- ML/DL Model Building
- Model performance
- Model Evaluation
- Project Management (Agile/Scrum/Kanban)
- Project Evaluation
- Looking Ahead
- Questions & Contact me
PySpark, Python 3, PostgreSQL
Anaconda Packages: pandas numpy scikit-learn matplotlib seaborn sqlalchemy kaggle psycopg2 ipykernel pyspark
PowerShell command for installing the Anaconda packages used for this project
pip install pandas numpy scikit-learn matplotlib seaborn sqlalchemy kaggle psycopg2 ipykernel pyspark
Initiating the Spark session
from pyspark.sql import SparkSession

# Creating spark session
spark = SparkSession.builder.appName('P10').getOrCreate()
PowerShell command for data import using the kaggle API
!kaggle datasets download -d ritesaluja/bank-note-authentication-uci-data -p ..\Data --unzip
- Rows: 1372 / Columns: 5
- variance
- skewness
- curtosis
- entropy
- class
With all the data collected, I checked it was ready for exploration and later modelling. I made the following changes and created the following variables:
- General NULL and data validity checks
- The datatypes needed correcting, and one column needed renaming because 'class' is a reserved word in Python
# Correcting data types
data = data.selectExpr("cast(variance as double) variance",
                       "cast(skewness as double) skewness",
                       "cast(curtosis as double) curtosis",
                       "cast(entropy as double) entropy",
                       "cast(class as double) class")
# Renaming class column
data = data.withColumnRenamed("class","outcome")I warehouse all data in a Postgre database for later use and reference.
- ETL in python to PostgreSQL Database.
- Formatted column headers to SQL compatibility.
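The SQL-compatibility formatting amounts to lower-casing names and replacing spaces with underscores. A standalone sketch of that normalisation (the helper name `format_tablename` is illustrative; the `store_data` function below inlines the same two steps):

```python
def format_tablename(tablename):
    """Lower-case a name and replace spaces with underscores for PostgreSQL."""
    return tablename.lower().replace(' ', '_')

print(format_tablename('P10 Bank Note Authentication'))
# p10_bank_note_authentication
```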
from sqlalchemy import create_engine

# Function to warehouse data in a PostgreSQL database and save the cleaned
# data in the Data folder. As this is PySpark, .toPandas() is needed wherever
# a pandas DataFrame is expected
def store_data(data, tablename):
    """
    :param data: variable, enter name of dataset you'd like to warehouse
    :param tablename: str, enter name of table for data
    """
    # SQL table header format
    tablename = tablename.lower()
    tablename = tablename.replace(' ', '_')
    # Saving cleaned data as csv
    data.toPandas().to_csv(f'../Data/{tablename}_clean.csv', index=False)
    # Engine to access PostgreSQL
    engine = create_engine('postgresql+psycopg2://postgres:password@localhost:5432/projectsdb')
    # Loads dataframe into PostgreSQL and replaces table if it exists
    data.toPandas().to_sql(f'{tablename}', engine, if_exists='replace', index=False)
    # Confirmation of ETL
    return "ETL successful, {num} rows loaded into table: {tb}.".format(num=data.count(), tb=tablename)
# Calling store_data function to warehouse cleaned data
store_data(data, "P10 Bank Note Authentication")
I looked at the distributions of the data and the value counts of the variables that would be fed into the model. Below are a few highlights from the analysis.
- 55.54% of the bank notes in the data are real notes.
- Boxplots were used to visualise features with outliers. These features were not scaled as this was a POC project; LogisticRegression models are sensitive to the range of the data points, so scaling is advised.
- I visualised the distribution of features for the fake bank notes
- The correlation matrix shows those features with strong and particularly weak correlations
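The 55.54% figure above follows directly from the raw class counts (assuming the UCI distribution of 762 genuine to 610 forged notes, with 0 denoting genuine):

```python
# Class counts in the UCI bank note data (assumption: 0 = genuine, 1 = forged)
genuine, forged = 762, 610
total = genuine + forged  # 1372 rows, matching the dataset shape

share_real = round(genuine / total * 100, 2)
print(f'{share_real}% of notes are genuine')  # 55.54% of notes are genuine
```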
There was no need to transform the categorical variable(s) into dummy variables as they are all numeric. I also split the data into train and tests sets with a test size of 20%.
- I used randomSplit instead of train_test_split from sklearn
# Splitting data into train and test data
train_data, test_data = model_data.randomSplit([0.80, 0.20], seed=23)
I used a LogisticRegression model and evaluated it initially with accuracy_score and a confusion matrix.
- I fed the independent and dependent features to the model and trained it on the training data
from pyspark.ml.classification import LogisticRegression

# Calling LogisticRegression algorithm and applying features and outcome
regressor = LogisticRegression(featuresCol='features', labelCol='outcome')
# Training model on training data
model = regressor.fit(train_data)
The Logistic Regression model performed well on both the train and test sets.
- Logistic Regression : Accuracy = 100%
- The ROC curve reflects this accuracy on both the train and test datasets
- A confusion matrix shows the True and False predictions achieved by the model.
- The model performed perfectly, with no False Positives or False Negatives (FP/FN).
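The link between the confusion matrix and the accuracy score can be made explicit. A sketch with hypothetical cell counts (the 20% test set holds roughly 275 of the 1372 rows; only the zero FP/FN counts are taken from the result above):

```python
# Hypothetical confusion-matrix cells for a perfect classifier on the test set
tp, tn, fp, fn = 120, 155, 0, 0

# Accuracy = correct predictions over all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 1.0
```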
- Resources used
- Jira
- Confluence
- Trello
- WWW
- The end-to-end process
- Use of Spark and Databricks
- EBI
- Look at productionising using MLlib
- What next
- More PySpark???
For questions, feedback, and contribution requests, contact me






