Introduction: This project performs binary classification on a dataset using three models, each built with a different algorithm: Random Forest, K-Nearest Neighbors, and Long Short-Term Memory (LSTM), a deep learning architecture. The dataset contains over 5,000 entries, each describing a TV show by title, ratings, age range, year, type, and the streaming platforms that carry it. Each model predicts whether a TV show is available on Netflix. The goal is to evaluate the models with K-Fold Cross Validation, calculate the necessary evaluation metrics, and plot the ROC curve for each model. These results are then used to determine which model classifies the given data best.
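To make the evaluation procedure concrete, here is a minimal sketch of K-Fold Cross Validation and the usual binary classification metrics, written with the Python standard library only. It is an illustration, not the project's actual code: the helper names k_fold_indices and binary_metrics, the toy labels, and the majority-class "model" are all assumptions made for this example. In practice, Scikit-Learn's KFold class and metric functions would normally do this work.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering range(n_samples)."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, test))
        start = stop
    return folds


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): 1 = on Netflix, 0 = not on Netflix.
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    # Stand-in "model": always predict the majority class of the training fold.
    train_labels = [labels[i] for i in train_idx]
    majority = 1 if sum(train_labels) * 2 >= len(train_labels) else 0
    y_true = [labels[i] for i in test_idx]
    y_pred = [majority] * len(test_idx)
    fold_scores.append(binary_metrics(y_true, y_pred))

# Average each metric across the folds to get one score per model.
average = {name: mean(s[name] for s in fold_scores) for name in fold_scores[0]}
print(average)
```

Each fold serves as the test set exactly once, and the per-fold metrics are averaged so the three models can be compared on the same footing.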
Prerequisites: I used Python 3.10 for this project, along with the following libraries: Pandas, NumPy, Scikit-Learn, Matplotlib, Seaborn, TensorFlow, and Keras. For the full list of required packages, see the requirements.txt file in the zip file. Any package that is not already on your machine can be installed by running pip install library_name (e.g. pip install pandas) in the environment where you will run the code. Before running the program, make sure the dataset is in the same directory as the Python script. The dataset is from Kaggle and can be downloaded here: https://www.kaggle.com/datasets/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney The CSV file is also included in the zip file.
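As an optional sanity check before running the program, the short sketch below reports which of the libraries listed above are not importable and whether the dataset sits in the expected directory. It is only an illustration: the helper names are made up for this example, and the file name tv_shows.csv is a placeholder assumption, so replace it with the actual name of the CSV file from the zip file.

```python
from importlib.util import find_spec
from pathlib import Path

# pip package name -> import name (they differ for scikit-learn).
REQUIRED_MODULES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "tensorflow": "tensorflow",
    "keras": "keras",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of packages that cannot be found for import."""
    return [pip_name for pip_name, import_name in modules.items()
            if find_spec(import_name) is None]


def dataset_present(directory, filename="tv_shows.csv"):
    """Check that the CSV file exists in the given directory.

    The default filename is a placeholder, not the real dataset name.
    """
    return (Path(directory) / filename).is_file()


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages were found.")
    # Assumes the check is run from the directory that holds the script.
    if not dataset_present(Path.cwd()):
        print("Dataset not found in the current directory.")
```

Running the check from the script's directory prints a ready-to-use pip install command for anything that is missing.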