Machine Learning Engineering Assignment

This repository contains implementations for three core machine learning engineering tasks, demonstrating fundamental ML concepts, modern fine-tuning techniques, and data engineering best practices.

🎯 Overview

This project showcases three distinct aspects of machine learning engineering:

  1. Fundamentals: Building neural networks from scratch using only NumPy
  2. Modern Techniques: Implementing parameter-efficient fine-tuning with LoRA
  3. Data Engineering: Creating robust data cleaning and deduplication pipelines

Each task is implemented in a separate Jupyter notebook with detailed explanations and visualizations.

🚀 Installation

Prerequisites

  • Python 3.8+
  • Jupyter Notebook or JupyterLab

Setup

  1. Clone the repository:

```bash
git clone https://github.com/TheODDYSEY/ML-Assignment
cd ML-Assignment
```

💻 Usage

Running the Notebooks

Each task can be run independently:

```bash
# Task 1: Neural Network
jupyter notebook neural_network_from_scratch.ipynb

# Task 2: LoRA
jupyter notebook lora_implementation.ipynb

# Task 3: Data Cleaning
jupyter notebook sms_spam_cleaning.ipynb
```

📦 Requirements

Core Dependencies

```text
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
torch>=1.9.0
jupyter>=1.0.0
```

Task-Specific Requirements

Task 1: NumPy, Matplotlib

Task 2: PyTorch, NumPy

Task 3: Pandas, NumPy, Matplotlib, Seaborn

Install all dependencies:

```bash
pip install numpy pandas matplotlib seaborn torch jupyter
```

📚 Tasks

Task 1: Neural Network from Scratch

File: neural_network_from_scratch.ipynb

Objective: Implement a modular feedforward neural network using only NumPy (no PyTorch/TensorFlow/JAX).

Key Components:

  • Linear layer class with weight initialization
  • ReLU activation function
  • MSELoss for regression tasks
  • Complete forward and backward propagation
  • Gradient computation for weights, biases, and inputs

Features:

  • ✅ Fully functional backpropagation from scratch
  • ✅ Training loop with loss visualization
  • ✅ Demonstrates learning on synthetic data
  • ✅ Mathematical derivations included

Results: Trains successfully on a synthetic dataset, with the loss decreasing steadily over 100 epochs.
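The notebook's exact code isn't reproduced here, but a minimal NumPy sketch of the components above could look like the following (class names, layer sizes, and the learning rate are illustrative, not necessarily those used in the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Fully connected layer y = x @ W + b, with gradients for backprop."""
    def __init__(self, n_in, n_out):
        # He-style init keeps activations well scaled through ReLU
        self.W = rng.normal(scale=np.sqrt(2 / n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                        # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad):
        self.dW = self.x.T @ grad         # gradient w.r.t. weights
        self.db = grad.sum(axis=0)        # gradient w.r.t. biases
        return grad @ self.W.T            # gradient w.r.t. inputs

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad):
        return grad * self.mask

def mse_loss(pred, target):
    diff = pred - target
    return (diff ** 2).mean(), 2 * diff / diff.size   # loss and its gradient

# Train on synthetic data: y is a noiseless linear function of x
X = rng.normal(size=(64, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]])

net = [Linear(3, 8), ReLU(), Linear(8, 1)]
losses = []
for epoch in range(100):
    out = X
    for layer in net:
        out = layer.forward(out)
    loss, grad = mse_loss(out, y)
    losses.append(loss)
    for layer in reversed(net):
        grad = layer.backward(grad)
    for layer in net:                     # plain SGD update
        if isinstance(layer, Linear):
            layer.W -= 0.05 * layer.dW
            layer.b -= 0.05 * layer.db
```

Each layer returns the gradient with respect to its input from `backward`, which is exactly what the previous layer needs: chaining these calls in reverse order is backpropagation.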


Task 2: LoRA Implementation

File: lora_implementation.ipynb

Objective: Implement Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of large language models.

Key Components:

  • LoRALinear custom PyTorch module
  • Low-rank matrix decomposition (A and B matrices)
  • Frozen pretrained weights with trainable adapters
  • Weight merging for zero-latency inference

Features:

  • ✅ Complete LoRA implementation in PyTorch
  • ✅ Parameter efficiency analysis (99.8% reduction for GPT-3 scale)
  • ✅ Proper initialization strategies (Gaussian for A, zero for B)
  • ✅ Merge functionality for deployment

Results: Demonstrates massive parameter reduction while maintaining model capacity.

Reference: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
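The mechanics described above can be sketched framework-agnostically in NumPy (the layer size, rank, `alpha`, and init scales below are illustrative assumptions, not the notebook's values):

```python
import numpy as np

# Illustrative sizes: a single d x d layer with rank r << d
d, r, alpha = 1024, 8, 16
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))             # frozen pretrained weight (not trained)
A = rng.normal(scale=0.02, size=(d, r))  # trainable, Gaussian-initialized
B = np.zeros((r, d))                     # trainable, zero-initialized

def lora_forward(x):
    # Frozen path plus the low-rank update, scaled by alpha / r
    return x @ W0 + (x @ A @ B) * (alpha / r)

# Zero-init B means the adapter contributes nothing at the start of training
x = rng.normal(size=(2, d))
assert np.allclose(lora_forward(x), x @ W0)

# Parameter efficiency: the adapter is a tiny fraction of the full matrix
full, adapter = W0.size, A.size + B.size
print(f"trainable fraction: {adapter / full:.4%}")  # prints "trainable fraction: 1.5625%"

# After (simulated) training, merge the adapter into W0 for zero-latency inference
B = rng.normal(scale=0.01, size=(r, d))
W_merged = W0 + (A @ B) * (alpha / r)
assert np.allclose(x @ W_merged, lora_forward(x))
```

The trainable fraction shrinks further as `d` grows, which is where the GPT-3-scale reduction quoted above comes from: the adapter adds `2 * d * r` parameters against `d * d` frozen ones.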


Task 3: SMS Spam Data Cleaning

File: sms_spam_cleaning.ipynb

Objective: Build a comprehensive data cleaning and deduplication pipeline for the SMS Spam Collection Dataset.

Key Components:

1. Exploratory Analysis

  • Dataset loading with Latin-1 encoding
  • Label distribution analysis (Ham vs Spam)
  • Data quality assessment

2. Cleaning Pipeline

  • Unicode normalization (NFKC)
  • Phone number obfuscation (privacy protection with <PHONE> token)
  • Artifact removal (encoding issues, broken characters)
  • Whitespace normalization
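These cleaning steps can be sketched as follows (the phone-number pattern below is an illustrative assumption, not necessarily the notebook's exact regex):

```python
import re
import unicodedata

# Loose phone pattern: an optional +, then 9+ digits with optional spaces/dashes
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def clean_message(text):
    text = unicodedata.normalize("NFKC", text)   # Unicode normalization (NFKC)
    text = PHONE_RE.sub("<PHONE>", text)         # obfuscate phone numbers
    text = re.sub(r"\s+", " ", text).strip()     # collapse/normalize whitespace
    return text

print(clean_message("Call  07732584351  now!"))  # prints "Call <PHONE> now!"
```

Note the ordering: phone obfuscation runs before whitespace normalization so that numbers split by internal spaces are still caught by the pattern.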

3. Fuzzy Deduplication

  • MinHash algorithm implementation from scratch
  • Jaccard similarity estimation
  • Character n-gram (trigram) based comparison
  • Threshold-based duplicate detection (80% similarity)
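A from-scratch MinHash along these lines can be sketched as follows (the hash count and example messages are illustrative):

```python
import hashlib

def trigrams(text):
    """Character 3-gram shingles of a lowercased message."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def minhash_signature(shingles, num_hashes=64):
    """One minimum per seeded hash function approximates a random permutation."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(trigrams("free entry in 2 a wkly comp to win"))
b = minhash_signature(trigrams("free entry in 2 a wkly comp to win!!"))
c = minhash_signature(trigrams("ok lar joking with u oni"))

print(estimated_jaccard(a, b))  # high: near-duplicate messages
print(estimated_jaccard(a, c))  # low: unrelated messages
```

Pairs whose estimated similarity exceeds the 80% threshold would then be flagged as fuzzy duplicates.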

4. Visualizations

  • Professional pipeline architecture diagram
  • Metrics dashboard with data flow visualization
  • Label distribution charts
  • Deduplication breakdown

Features:

  • ✅ Handles 5,572 messages → 4,995 after cleaning (10.4% reduction)
  • ✅ Removes 426 exact duplicates
  • ✅ Identifies 151 fuzzy duplicates
  • ✅ Privacy-preserving phone number detection
  • ✅ Publication-ready visualizations

Dataset: SMS Spam Collection Dataset


📝 Documentation

Each task includes:

  • Inline comments explaining implementation details
  • Markdown cells with theoretical background
  • Visualizations showing results and metrics
  • References to relevant papers and resources

📄 License

This project is for assessment and educational purposes.


🙏 Acknowledgments

  • SMS Spam Dataset: UCI Machine Learning Repository
  • LoRA Paper: Hu et al. (2021) - arXiv:2106.09685
  • MinHash Algorithm: Broder (1997)

No packages published