Machine Learning Engineering Assignment

This repository contains implementations for three core machine learning engineering tasks, demonstrating fundamental ML concepts, modern fine-tuning techniques, and data engineering best practices.

🎯 Overview

This project showcases three distinct aspects of machine learning engineering:

  1. Fundamentals: Building neural networks from scratch using only NumPy
  2. Modern Techniques: Implementing parameter-efficient fine-tuning with LoRA
  3. Data Engineering: Creating robust data cleaning and deduplication pipelines

Each task is implemented in a separate Jupyter notebook with detailed explanations and visualizations.

🚀 Installation

Prerequisites

  • Python 3.8+
  • Jupyter Notebook or JupyterLab

Setup

  1. Clone the repository:

```bash
git clone https://github.com/TheODDYSEY/ML-Assignment
cd ML-Assignment
```

💻 Usage

Running the Notebooks

Each task can be run independently:

```bash
# Task 1: Neural Network
jupyter notebook neural_network_from_scratch.ipynb

# Task 2: LoRA
jupyter notebook lora_implementation.ipynb

# Task 3: Data Cleaning
jupyter notebook sms_spam_cleaning.ipynb
```

📦 Requirements

Core Dependencies

```text
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
torch>=1.9.0
jupyter>=1.0.0
```

Task-Specific Requirements

Task 1: NumPy, Matplotlib

Task 2: PyTorch, NumPy

Task 3: Pandas, NumPy, Matplotlib, Seaborn

Install all dependencies:

```bash
pip install numpy pandas matplotlib seaborn torch jupyter
```

📚 Tasks

Task 1: Neural Network from Scratch

File: neural_network_from_scratch.ipynb

Objective: Implement a modular feedforward neural network using only NumPy (no PyTorch/TensorFlow/JAX).

Key Components:

  • Linear layer class with weight initialization
  • ReLU activation function
  • MSELoss for regression tasks
  • Complete forward and backward propagation
  • Gradient computation for weights, biases, and inputs

Features:

  • ✅ Fully functional backpropagation from scratch
  • ✅ Training loop with loss visualization
  • ✅ Demonstrates learning on synthetic data
  • ✅ Mathematical derivations included

Results: Trains successfully on a synthetic dataset, with the loss decreasing steadily over 100 epochs.
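The notebook's exact code isn't reproduced here, but a minimal NumPy sketch of the components above could look like the following (class names, layer sizes, and the learning rate are illustrative, not necessarily those used in the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Fully connected layer y = x @ W + b, with gradients for backprop."""
    def __init__(self, n_in, n_out):
        # He-style init keeps activations well scaled through ReLU
        self.W = rng.normal(scale=np.sqrt(2 / n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                        # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad):
        self.dW = self.x.T @ grad         # gradient w.r.t. weights
        self.db = grad.sum(axis=0)        # gradient w.r.t. biases
        return grad @ self.W.T            # gradient w.r.t. inputs

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad):
        return grad * self.mask

def mse_loss(pred, target):
    diff = pred - target
    return (diff ** 2).mean(), 2 * diff / diff.size   # loss and its gradient

# Train on synthetic data: y is a noiseless linear function of x
X = rng.normal(size=(64, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]])

net = [Linear(3, 8), ReLU(), Linear(8, 1)]
losses = []
for epoch in range(100):
    out = X
    for layer in net:
        out = layer.forward(out)
    loss, grad = mse_loss(out, y)
    losses.append(loss)
    for layer in reversed(net):
        grad = layer.backward(grad)
    for layer in net:                     # plain SGD update
        if isinstance(layer, Linear):
            layer.W -= 0.05 * layer.dW
            layer.b -= 0.05 * layer.db
```

Each layer returns the gradient with respect to its input from `backward`, which is exactly what the previous layer needs: chaining these calls in reverse order is backpropagation.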


Task 2: LoRA Implementation

File: lora_implementation.ipynb

Objective: Implement Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of large language models.

Key Components:

  • LoRALinear custom PyTorch module
  • Low-rank matrix decomposition (A and B matrices)
  • Frozen pretrained weights with trainable adapters
  • Weight merging for zero-latency inference

Features:

  • ✅ Complete LoRA implementation in PyTorch
  • ✅ Parameter efficiency analysis (99.8% reduction for GPT-3 scale)
  • ✅ Proper initialization strategies (Gaussian for A, zero for B)
  • ✅ Merge functionality for deployment

Results: Demonstrates massive parameter reduction while maintaining model capacity.

Reference: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
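The mechanics described above can be sketched framework-agnostically in NumPy (the layer size, rank, `alpha`, and init scales below are illustrative assumptions, not the notebook's values):

```python
import numpy as np

# Illustrative sizes: a single d x d layer with rank r << d
d, r, alpha = 1024, 8, 16
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))             # frozen pretrained weight (not trained)
A = rng.normal(scale=0.02, size=(d, r))  # trainable, Gaussian-initialized
B = np.zeros((r, d))                     # trainable, zero-initialized

def lora_forward(x):
    # Frozen path plus the low-rank update, scaled by alpha / r
    return x @ W0 + (x @ A @ B) * (alpha / r)

# Zero-init B means the adapter contributes nothing at the start of training
x = rng.normal(size=(2, d))
assert np.allclose(lora_forward(x), x @ W0)

# Parameter efficiency: the adapter is a tiny fraction of the full matrix
full, adapter = W0.size, A.size + B.size
print(f"trainable fraction: {adapter / full:.4%}")  # prints "trainable fraction: 1.5625%"

# After (simulated) training, merge the adapter into W0 for zero-latency inference
B = rng.normal(scale=0.01, size=(r, d))
W_merged = W0 + (A @ B) * (alpha / r)
assert np.allclose(x @ W_merged, lora_forward(x))
```

The trainable fraction shrinks further as `d` grows, which is where the GPT-3-scale reduction quoted above comes from: the adapter adds `2 * d * r` parameters against `d * d` frozen ones.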


Task 3: SMS Spam Data Cleaning

File: sms_spam_cleaning.ipynb

Objective: Build a comprehensive data cleaning and deduplication pipeline for the SMS Spam Collection Dataset.

Key Components:

1. Exploratory Analysis

  • Dataset loading with Latin-1 encoding
  • Label distribution analysis (Ham vs Spam)
  • Data quality assessment

2. Cleaning Pipeline

  • Unicode normalization (NFKC)
  • Phone number obfuscation (privacy protection with <PHONE> token)
  • Artifact removal (encoding issues, broken characters)
  • Whitespace normalization
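These cleaning steps can be sketched as follows (the phone-number pattern below is an illustrative assumption, not necessarily the notebook's exact regex):

```python
import re
import unicodedata

# Loose phone pattern: an optional +, then 9+ digits with optional spaces/dashes
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def clean_message(text):
    text = unicodedata.normalize("NFKC", text)   # Unicode normalization (NFKC)
    text = PHONE_RE.sub("<PHONE>", text)         # obfuscate phone numbers
    text = re.sub(r"\s+", " ", text).strip()     # collapse/normalize whitespace
    return text

print(clean_message("Call  07732584351  now!"))  # prints "Call <PHONE> now!"
```

Note the ordering: phone obfuscation runs before whitespace normalization so that numbers split by internal spaces are still caught by the pattern.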

3. Fuzzy Deduplication

  • MinHash algorithm implementation from scratch
  • Jaccard similarity estimation
  • Character n-gram (trigram) based comparison
  • Threshold-based duplicate detection (80% similarity)
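A from-scratch MinHash along these lines can be sketched as follows (the hash count and example messages are illustrative):

```python
import hashlib

def trigrams(text):
    """Character 3-gram shingles of a lowercased message."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def minhash_signature(shingles, num_hashes=64):
    """One minimum per seeded hash function approximates a random permutation."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(trigrams("free entry in 2 a wkly comp to win"))
b = minhash_signature(trigrams("free entry in 2 a wkly comp to win!!"))
c = minhash_signature(trigrams("ok lar joking with u oni"))

print(estimated_jaccard(a, b))  # high: near-duplicate messages
print(estimated_jaccard(a, c))  # low: unrelated messages
```

Pairs whose estimated similarity exceeds the 80% threshold would then be flagged as fuzzy duplicates.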

4. Visualizations

  • Professional pipeline architecture diagram
  • Metrics dashboard with data flow visualization
  • Label distribution charts
  • Deduplication breakdown

Features:

  • ✅ Handles 5,572 messages → 4,995 after cleaning (10.4% reduction)
  • ✅ Removes 426 exact duplicates
  • ✅ Identifies 151 fuzzy duplicates
  • ✅ Privacy-preserving phone number detection
  • ✅ Publication-ready visualizations

Dataset: SMS Spam Collection Dataset


📝 Documentation

Each task includes:

  • Inline comments explaining implementation details
  • Markdown cells with theoretical background
  • Visualizations showing results and metrics
  • References to relevant papers and resources

📄 License

This project is for assessment and educational purposes.


🙏 Acknowledgments

  • SMS Spam Dataset: UCI Machine Learning Repository
  • LoRA Paper: Hu et al. (2021) - arXiv:2106.09685
  • MinHash Algorithm: Broder (1997)

No packages published