💵 Bank Note Authentication: Project Overview

  • End-to-end project classifying real and fake bank notes.
  • Apache Spark's Python API (PySpark) used for its speed benefits.

Resources Used

PySpark, Python 3, PostgreSQL

Anaconda packages: pandas, numpy, scikit-learn, matplotlib, seaborn, sqlalchemy, kaggle, psycopg2, ipykernel, pyspark

PowerShell command for installing the packages used for this project (note that sklearn is installed under the package name scikit-learn):

pip install pandas numpy scikit-learn matplotlib seaborn sqlalchemy kaggle psycopg2 ipykernel pyspark



Initiating the Spark session

# Creating the Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('P10').getOrCreate()

Command for data import using the Kaggle API (the leading ! runs it as a shell command from a Jupyter notebook):

!kaggle datasets download -d ritesaluja/bank-note-authentication-uci-data -p ..\Data --unzip

Data source: the Kaggle dataset ritesaluja/bank-note-authentication-uci-data

  • Rows: 1372 / Columns: 5
    • variance
    • skewness
    • curtosis
    • entropy
    • class
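
A minimal sketch of loading the downloaded CSV into Spark; the file name is an assumption based on the Kaggle download above:

# Reading the raw CSV into a Spark DataFrame (file name assumed from the Kaggle dataset);
# without inferSchema the columns load as strings, hence the casts below
data = spark.read.csv('../Data/BankNote_Authentication.csv', header=True)
data.show(5)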

Once I had all the data I needed, I checked that it was ready for exploration and later modelling. I made the following changes and created the following variables:

  • General NULL and data-validity checks
  • The datatypes needed correcting, and the class column needed renaming because class is a reserved word in Python
# Correcting data types 
data = data.selectExpr("cast(variance as double) variance",
    "cast(skewness as double) skewness",
    "cast(curtosis as double) curtosis",
    "cast(entropy as double) entropy",
    "cast(class as double) class")

# Renaming class column 
data = data.withColumnRenamed("class","outcome")
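
# Verifying the corrected schema (a quick check; all columns should now be doubles)
data.printSchema()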

I warehouse all the data in a PostgreSQL database for later use and reference.

  • ETL in Python to a PostgreSQL database.
  • Formatted column headers for SQL compatibility.
# Function to warehouse data in a PostgreSQL database and save the cleaned data in the Data folder.
# As this is PySpark, the DataFrame is converted with .toPandas() before the pandas/SQLAlchemy calls
from sqlalchemy import create_engine

def store_data(data, tablename):
    """
    :param data: Spark DataFrame, the dataset you'd like to warehouse
    :param tablename: str, name of the table for the data
    """

    # SQL table header format
    tablename = tablename.lower().replace(' ', '_')

    # Converting to pandas once and reusing the result
    pdf = data.toPandas()

    # Saving the cleaned data as csv
    pdf.to_csv(f'../Data/{tablename}_clean.csv', index=False)

    # Engine to access PostgreSQL
    engine = create_engine('postgresql+psycopg2://postgres:password@localhost:5432/projectsdb')

    # Loads the dataframe into PostgreSQL, replacing the table if it exists
    pdf.to_sql(tablename, engine, if_exists='replace', index=False)

    # Confirmation of ETL
    return f"ETL successful, {len(pdf)} rows loaded into table: {tablename}."
 
# Calling store_data function to warehouse cleaned data
store_data(data,"P10 Bank Note Authentication")
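
For later reference, the warehoused table can be read back; a minimal sketch, assuming the same connection string as above:

# Reading the warehoused table back out of PostgreSQL
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://postgres:password@localhost:5432/projectsdb')
stored = pd.read_sql('p10_bank_note_authentication', engine)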

I looked at the distributions of the data and the value counts for the variables that would be fed into the model. Below are a few highlights from the analysis.

  • 55.54% of the bank notes in the data are real notes (see the sketch after this list).

  • Boxplots were used to visualise features with outliers. These features were not scaled as this was a proof-of-concept project; logistic regression models are sensitive to the range of the data points, so scaling is advised.

  • I visualised the distribution of each feature for the fake bank notes.

  • The correlation matrix highlights which features are strongly and which only weakly correlated.
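
The class balance quoted above can be checked directly in Spark; a minimal sketch using the renamed outcome column:

# Value counts per class (55.54% of rows are real notes per the analysis above)
data.groupBy('outcome').count().show()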

There was no need to transform the categorical variable(s) into dummy variables as they are all numeric. I also split the data into train and test sets with a test size of 20%.
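
PySpark's LogisticRegression expects a single vector column of features; a minimal sketch of how model_data (used below) can be assembled, since its construction isn't shown here:

# Assembling the four numeric features into a single 'features' vector column
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['variance', 'skewness', 'curtosis', 'entropy'],
    outputCol='features')
model_data = assembler.transform(data).select('features', 'outcome')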

  • I used PySpark's randomSplit instead of train_test_split from sklearn
# Splitting data into train and test data
train_data,test_data=model_data.randomSplit([0.80,0.20], seed=23)

I used the LogisticRegression model and evaluated it initially using accuracy_score and a confusion matrix.

  • I fed the independent and dependent features to the model and trained it on the training data
# Calling the LogisticRegression algorithm and applying the features and outcome columns
from pyspark.ml.classification import LogisticRegression

regressor = LogisticRegression(featuresCol='features', labelCol='outcome')

# Training model on training data 
model=regressor.fit(train_data)

The Logistic Regression model performed well on the train and test sets.

  • Logistic Regression: Accuracy = 100%

  • The ROC curve reflects the accuracy on both the train and test datasets

  • A confusion matrix shows the True and False predictions achieved by the model (see the evaluation sketch below).
  • The model performed perfectly, with no False Positives or False Negatives
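
A minimal evaluation sketch, assuming the model and test_data from above; the metric names are the standard MLlib ones:

# Generating predictions on the held-out test data
predictions = model.transform(test_data)

# Accuracy on the test set
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
accuracy = MulticlassClassificationEvaluator(labelCol='outcome',
    predictionCol='prediction', metricName='accuracy').evaluate(predictions)
print(f'Accuracy: {accuracy:.4f}')

# Area under the ROC curve
auc = BinaryClassificationEvaluator(labelCol='outcome',
    rawPredictionCol='rawPrediction', metricName='areaUnderROC').evaluate(predictions)
print(f'AUC: {auc:.4f}')

# Confusion matrix as a crosstab of actual vs predicted labels
predictions.crosstab('outcome', 'prediction').show()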

  • Resources used
    • Jira
    • Confluence
    • Trello

  • WWW (What Went Well)
    • The end-to-end process
    • Use of Spark and Databricks
  • EBI (Even Better If)
    • Look at productionising using MLlib

Looking Ahead

  • What next
  • More PySpark???

Questions & Contact me

For questions, feedback, and contribution requests, contact me
