This repository contains the code, processed data, and final report for a project that conducts a comparative analysis of unsupervised machine learning models for detecting zero-day threats in Internet of Things (IoT) network traffic.
The objective of this project was to simulate a real-world security challenge: identifying malicious network activity without relying on pre-existing attack signatures. To achieve this, I evaluated three distinct families of unsupervised anomaly detection algorithms:
- Isolation Forest (Ensemble-based)
- One-Class SVM (Boundary-based)
- DBSCAN (Density-based)
These models were trained exclusively on benign network data from the CIC-IoT-2023 dataset to test their ability to flag never-before-seen attacks. A supervised Logistic Regression model was also implemented to serve as a performance benchmark.
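The benign-only training setup described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn and NumPy; the feature values, the `contamination` and `nu` settings, and all variable names are placeholders, not the configuration used in this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-ins for traffic features (assumption: 4 numeric features).
benign_train = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
benign_test = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
attack_test = rng.normal(loc=8.0, scale=1.0, size=(100, 4))

# Fit the scaler on benign training data only, then reuse it for the test set.
scaler = StandardScaler().fit(benign_train)
X_train = scaler.transform(benign_train)
X_test = scaler.transform(np.vstack([benign_test, attack_test]))
y_test = np.array([0] * 100 + [1] * 100)  # 1 = attack

# Placeholder hyperparameters, chosen only for this illustration.
models = {
    "IsolationForest": IsolationForest(contamination=0.05, random_state=42),
    "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
}

recalls = {}
for name, model in models.items():
    model.fit(X_train)  # no labels and no attack samples are seen here
    # scikit-learn returns -1 for outliers and +1 for inliers.
    y_pred = (model.predict(X_test) == -1).astype(int)
    detected = ((y_pred == 1) & (y_test == 1)).sum()
    recalls[name] = float(detected / (y_test == 1).sum())
    print(f"{name}: recall on unseen attacks = {recalls[name]:.2f}")
```

The synthetic attacks are deliberately far from the benign distribution, so both detectors flag them easily; real traffic is much less cleanly separated.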
This project demonstrates a complete machine learning workflow, from the ingestion and processing of a large-scale (16GB) dataset to model training, evaluation, and in-depth analysis of the results.
- High Recall, High Alert Fatigue: Both Isolation Forest and One-Class SVM achieved very high recall, identifying over 99% of attacks. However, this sensitivity came with a high false-positive rate, which would make them impractical in a real-world security operations center (SOC).
- DBSCAN Performance Failure: The density-based DBSCAN model failed catastrophically, missing over 99% of anomalies. This indicates that the underlying assumption of attack traffic being "sparse noise" does not apply to this dataset.
- Isolation Forest as the Viable Candidate: Due to its high detection rate and superior computational efficiency, Isolation Forest emerged as the most practical unsupervised model among the three tested.
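The DBSCAN finding can be illustrated with a small synthetic example. DBSCAN only labels a point as noise when it sits in a low-density region, so attack traffic that forms its own dense cluster is assigned a cluster label rather than being flagged. The sketch below assumes scikit-learn; the data, `eps`, and `min_samples` are made up for illustration, and how this project applies DBSCAN is not shown here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-ins: a benign cluster, a dense attack burst, and a few sparse outliers.
benign = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
dense_attack = rng.normal(loc=5.0, scale=0.3, size=(200, 2))
sparse_outliers = np.array([[10.0, -10.0], [-10.0, 10.0], [12.0, 12.0]])

X = np.vstack([benign, dense_attack, sparse_outliers])

# Placeholder parameters chosen for this toy data only.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1; all other labels are cluster ids.
attack_flagged = float(np.mean(labels[200:400] == -1))
outliers_flagged = float(np.mean(labels[400:] == -1))

print(f"dense attack points flagged as noise: {attack_flagged:.2%}")
print(f"sparse outliers flagged as noise:     {outliers_flagged:.2%}")
```

In this toy setup the isolated points are caught while almost none of the dense attack burst is, which mirrors the failure mode described above.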
- main.py: The main entry point to run the entire analysis pipeline from start to finish.
- src/: A folder containing the core logic for the project.
  - data_processor.py: Script for loading, preprocessing, and scaling the data.
  - model_trainer.py: Script containing functions to train and evaluate all machine learning models.
- processed-data/: Contains the pre-sampled and processed datasets used for the analysis.
- reports/: Contains the final academic report and saved figures.
  - Report-AI-IoT-Intrusion-Detection.pdf: The detailed project report.
  - figures/: Directory where confusion matrix plots are automatically saved.
- requirements.txt: A list of all Python libraries required to run the project.
The processed datasets used for this analysis are too large to be hosted directly in this Git repository. The full dataset can be downloaded from the CIC-IoT-2023 dataset page.
To replicate the analysis and results, please follow these steps:
- Clone the Repository

- Set up the Python Environment: It is highly recommended to use a virtual environment.

      # Create and activate the virtual environment
      python -m venv venv
      source venv/bin/activate    # On macOS/Linux
      .\venv\Scripts\activate     # On Windows

      # Install all required libraries
      pip install -r requirements.txt

- Run the Analysis: Execute the main script from the terminal. This will load the data, preprocess it, train all four models, and print their evaluation reports to the console, saving the confusion matrices to the reports/figures/ directory.

      python main.py
Please Note: Training the OCSVM and DBSCAN models may take several minutes to complete, depending on your system's hardware.
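When reading the saved confusion matrices, recall and false-positive rate are the two ratios most relevant to the findings above. The helper below is a minimal sketch in plain Python; `detection_metrics` is a hypothetical name, not a function from this repository, and the counts are invented for illustration rather than taken from the project's results.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Summarize a binary confusion matrix where 'positive' means attack."""

    def ratio(numerator: int, denominator: int) -> float:
        # Guard against empty classes to avoid division by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "recall": ratio(tp, tp + fn),               # share of attacks detected
        "precision": ratio(tp, tp + fp),            # share of alerts that are real attacks
        "false_positive_rate": ratio(fp, fp + tn),  # share of benign traffic flagged
    }


# Hypothetical counts for illustration only, not results from this project.
metrics = detection_metrics(tp=990, fp=400, tn=600, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A detector can reach 99% recall in this example while still flagging 40% of benign traffic, which is the alert-fatigue trade-off noted in the findings.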
The CSV files are not included because they are too large for this repository.