RayHYAN/INST0060_GroupWork

ML group project: Wine Quality

Introduction

In this project, Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM) and K-Nearest Neighbours (KNN) models were developed and tested to classify wine quality within each wine type (red and white). The dataset contains wine properties such as fixed acidity, density, pH and many more, all of which could influence how the quality of a wine is classified.

Software Implementation

The data used in this study is provided in the winequalityN.csv file. Plots generated by the code are saved in the folder where main.py is stored.

The structure of the code involves 4 main files of the project as described below:

(1) processing.py: This file contains all of the functions required to process and partition the dataset, create a new derived representation and apply the feature mapping.

(2) models.py: This file contains all of the functions for the models used. It also contains the grid search method that was created to test the hyperparameters for each model.

(3) evaluation.py: This file contains all of the functions required for creating the confusion matrices as well as the classification report for each model including the accuracy of the model.

(4) main.py: In this file, all of the previous functions are imported and combined to process the dataset and obtain the model predictions, accuracies and cross-validation results.
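To make the role of evaluation.py more concrete, the following is a minimal sketch of how a confusion matrix and an accuracy score can be computed from true and predicted labels. The function names, the label names and the sample data are all assumptions made for illustration; they are not taken from the repository.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels (hypothetical helper)."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (hypothetical helper)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only; not taken from winequalityN.csv.
y_true = ["low", "low", "high", "high", "high", "low"]
y_pred = ["low", "high", "high", "high", "low", "low"]
labels = ["low", "high"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(round(accuracy(y_true, y_pred), 3))
```

The diagonal of the matrix counts correct predictions per class, so the accuracy is the sum of the diagonal divided by the number of samples.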

Code

The dataset was downloaded from https://www.kaggle.com/rajyellow46/wine-quality.

Reproducing the Results

To run this project, a working Python environment is required. It is recommended to set one up through the Anaconda terminal using conda packages. To run the code in full, follow the steps below:

  • Install the required modules using pip: pip install -r requirements.txt

  • To see how to run experiments for each model including the parameters that you can change: python main.py -h

  • To run the results for a given model, pass its name with the --model argument (see below).
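The commands in this README pass the dataset and the model through the --dataset and --model flags. As a rough sketch of how such a command line could be parsed with argparse, consider the following. Only the two flag names and the four model names come from this README; the --folds flag, the default values and the help strings are assumptions, and the real main.py may differ (python main.py -h shows the actual options).

```python
import argparse


def build_parser():
    """Build a parser resembling the commands shown in this README (a sketch)."""
    parser = argparse.ArgumentParser(
        description="Wine quality classification experiments"
    )
    # --dataset and --model appear in the README commands.
    parser.add_argument("--dataset", default="winequalityN.csv",
                        help="path to the CSV dataset (default is an assumption)")
    parser.add_argument("--model", required=True,
                        choices=["Logistic", "SVM", "KNN", "RF"],
                        help="which model to train and evaluate")
    # Hypothetical flag: the README mentions a default of 4 folds but does
    # not name the option that changes it.
    parser.add_argument("--folds", type=int, default=4,
                        help="number of cross-validation folds (hypothetical flag)")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--dataset", "winequalityN.csv", "--model", "SVM"])
print(args.dataset, args.model, args.folds)
```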

Machine Learning Models

Four machine learning models were used in this project. Note that the running times listed below depend on the number of hyperparameters tested and on the number of folds used for cross-validation (4 by default). Specifying a lower number of folds will reduce run times.
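To see why the number of folds affects run time, it helps to count how many times a model is trained during a grid search with cross-validation: once per hyperparameter combination per fold. The sketch below does this count for a hypothetical grid; the parameter names and values are assumptions and are not the grids used in models.py.

```python
from itertools import product


def count_fits(param_grid, folds):
    """Number of model fits for an exhaustive grid search with k-fold CV."""
    combinations = list(product(*param_grid.values()))
    return len(combinations) * folds


# Hypothetical grid for illustration only.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

fits_default = count_fits(param_grid, folds=4)  # 6 combinations x 4 folds
fits_reduced = count_fits(param_grid, folds=2)  # 6 combinations x 2 folds
print(fits_default, fits_reduced)
```

Halving the number of folds halves the number of fits, which is why a lower fold count shortens the runs, at the cost of a noisier estimate of each configuration's performance.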

  • Logistic Regression:

python main.py --dataset winequalityN.csv --model Logistic

(Running time: 3 minutes 30 seconds)

  • Support Vector Machine: Uses the scikit-learn API to implement the classifier.

python main.py --dataset winequalityN.csv --model SVM

(Running time: 7 minutes 43 seconds)

  • K-Nearest Neighbours: Uses the scikit-learn API to implement the classifier.

python main.py --dataset winequalityN.csv --model KNN

(Running time: 3 minutes 23 seconds)

  • Random Forest Classifier: Uses the scikit-learn API to implement the classifier.

python main.py --dataset winequalityN.csv --model RF

(Running time: 6 minutes 50 seconds)

Authors

Foundations of Machine Learning (INST0060) - Group A
