This repository contains implementations for three core machine learning engineering tasks, demonstrating fundamental ML concepts, modern fine-tuning techniques, and data engineering best practices.
This project showcases three distinct aspects of machine learning engineering:
- Fundamentals: Building neural networks from scratch using only NumPy
- Modern Techniques: Implementing parameter-efficient fine-tuning with LoRA
- Data Engineering: Creating robust data cleaning and deduplication pipelines
Each task is implemented in a separate Jupyter notebook with detailed explanations and visualizations.
- Python 3.8+
- Jupyter Notebook or JupyterLab
- Clone the repository:
git clone https://github.com/TheODDYSEY/ML-Assignment
cd ML-Assignment
Each task can be run independently:
# Task 1: Neural Network
jupyter notebook neural_network_from_scratch.ipynb
# Task 2: LoRA
jupyter notebook lora_implementation.ipynb
# Task 3: Data Cleaning
jupyter notebook sms_spam_cleaning.ipynb
Dependencies:
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
torch>=1.9.0
jupyter>=1.0.0
Task 1: NumPy, Matplotlib
Task 2: PyTorch, NumPy
Task 3: Pandas, NumPy, Matplotlib, Seaborn
Install all dependencies:
pip install numpy pandas matplotlib seaborn torch jupyter
File: neural_network_from_scratch.ipynb
Objective: Implement a modular feedforward neural network using only NumPy (no PyTorch/TensorFlow/JAX).
Key Components:
- `Linear` layer class with weight initialization
- `ReLU` activation function
- `MSELoss` for regression tasks
- Complete forward and backward propagation
- Gradient computation for weights, biases, and inputs
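For illustration, here is a minimal sketch of how these pieces can fit together in pure NumPy (class names, hyperparameters, and the synthetic data below are assumptions, not necessarily the notebook's exact code):

```python
import numpy as np

# Illustrative from-scratch building blocks (names are assumptions,
# not necessarily those used in the notebook).
class Linear:
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * 0.01
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x                      # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out, lr=0.01):
        # Gradients w.r.t. weights, biases, and inputs
        grad_W = self.x.T @ grad_out
        grad_b = grad_out.sum(axis=0)
        grad_x = grad_out @ self.W.T
        self.W -= lr * grad_W           # simple SGD update
        self.b -= lr * grad_b
        return grad_x

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask

class MSELoss:
    def forward(self, pred, target):
        self.diff = pred - target
        return np.mean(self.diff ** 2)

    def backward(self):
        return 2 * self.diff / self.diff.size

# Tiny training loop on synthetic data
np.random.seed(0)
X = np.random.randn(64, 3)
y = X @ np.array([[1.5], [-2.0], [0.7]]) + 0.1

layer1, act, layer2, loss_fn = Linear(3, 8), ReLU(), Linear(8, 1), MSELoss()
for epoch in range(100):
    out = layer2.forward(act.forward(layer1.forward(X)))
    loss = loss_fn.forward(out, y)
    grad = layer2.backward(loss_fn.backward())
    layer1.backward(act.backward(grad))
print(f"final loss: {loss:.4f}")
```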
Features:
- ✅ Fully functional backpropagation from scratch
- ✅ Training loop with loss visualization
- ✅ Demonstrates learning on synthetic data
- ✅ Mathematical derivations included
Results: Successfully trains on a synthetic dataset with decreasing loss over 100 epochs.
File: lora_implementation.ipynb
Objective: Implement Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of large language models.
Key Components:
- `LoRALinear` custom PyTorch module
- Low-rank matrix decomposition (A and B matrices)
- Frozen pretrained weights with trainable adapters
- Weight merging for zero-latency inference
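A compact sketch of the idea (module structure, argument names, and defaults below are illustrative assumptions, not necessarily the notebook's exact code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA adapter around a frozen linear layer
    (names and defaults are illustrative assumptions)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Pretrained weight stays frozen
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Low-rank adapters: A is Gaussian-initialized, B starts at zero,
        # so the adapter contributes nothing before training.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self):
        # Fold the adapter into the frozen weight for zero-latency inference
        self.base.weight += self.scale * (self.B @ self.A)

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```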
Features:
- ✅ Complete LoRA implementation in PyTorch
- ✅ Parameter efficiency analysis (99.8% reduction for GPT-3 scale)
- ✅ Proper initialization strategies (Gaussian for A, zero for B)
- ✅ Merge functionality for deployment
Results: Demonstrates massive parameter reduction while maintaining model capacity.
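As a rough worked example of where the savings come from: for a single d × d weight matrix, full fine-tuning updates d² parameters, while LoRA trains only 2·r·d. At a GPT-3-scale hidden size of d = 12,288 with rank r = 8, that is roughly 151M versus about 197K parameters per adapted matrix, a reduction of over 99.8% (the notebook's exact figure depends on its chosen rank and which matrices are adapted).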
Reference: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
File: sms_spam_cleaning.ipynb
Objective: Build a comprehensive data cleaning and deduplication pipeline for the SMS Spam Collection Dataset.
Key Components:
- Dataset loading with Latin-1 encoding
- Label distribution analysis (Ham vs Spam)
- Data quality assessment
- Unicode normalization (NFKC)
- Phone number obfuscation (privacy protection with a `<PHONE>` token)
- Artifact removal (encoding issues, broken characters)
- Whitespace normalization (see the cleaning sketch after this list)
- MinHash algorithm implementation from scratch
- Jaccard similarity estimation
- Character n-gram (trigram) based comparison
- Threshold-based duplicate detection (80% similarity; see the MinHash sketch below)
- Professional pipeline architecture diagram
- Metrics dashboard with data flow visualization
- Label distribution charts
- Deduplication breakdown
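A condensed sketch of the cleaning steps above (the regex pattern, step order, and example message are illustrative assumptions, not the notebook's exact code):

```python
import re
import unicodedata

# Hypothetical simplified cleaner; the notebook's actual patterns may differ.
PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{7,}\d")

def clean_message(text):
    text = unicodedata.normalize("NFKC", text)     # Unicode normalization
    text = PHONE_RE.sub("<PHONE>", text)           # obfuscate phone numbers
    text = text.replace("\ufffd", "")              # drop broken characters
    text = re.sub(r"\s+", " ", text).strip()       # normalize whitespace
    return text

print(clean_message("Call  08001234567 now!!  \u00a0Ｗin a prize"))
# -> "Call <PHONE> now!! Win a prize"
```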
Features:
- ✅ Handles 5,572 messages → 4,995 after cleaning (10.4% reduction)
- ✅ Removes 426 exact duplicates
- ✅ Identifies 151 fuzzy duplicates
- ✅ Privacy-preserving phone number detection
- ✅ Publication-ready visualizations
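The fuzzy-duplicate detection can be sketched roughly as follows (hash function, signature size, and example messages are illustrative assumptions, not the notebook's exact implementation):

```python
import hashlib

NUM_HASHES = 64  # illustrative signature size

def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def minhash_signature(shingles):
    # One seeded hash per signature slot; keep the minimum hash value
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots approximates the true Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = minhash_signature(trigrams("Free entry in 2 a wkly comp to win FA Cup tickets"))
b = minhash_signature(trigrams("Free entry in 2 a weekly comp to win FA Cup tickets"))
sim = estimated_jaccard(a, b)
print(f"estimated Jaccard: {sim:.2f}", "-> fuzzy duplicate" if sim >= 0.8 else "")
```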
Dataset: SMS Spam Collection Dataset
Each task includes:
- Inline comments explaining implementation details
- Markdown cells with theoretical background
- Visualizations showing results and metrics
- References to relevant papers and resources
This project is for assessment and educational purposes.
- SMS Spam Dataset: UCI Machine Learning Repository
- LoRA Paper: Hu et al. (2021) - arXiv:2106.09685
- MinHash Algorithm: Broder (1997)