This repository contains a portfolio of data science projects that I completed for academic, self-learning, and hobby purposes. The projects are presented as Jupyter notebooks.
Create a virtual environment and install the project dependencies:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
Alternatively, you can automate the above steps with:
make install
After setting up the environment, install the pre-commit hooks:
pre-commit install
- Use descriptive branch names prefixed with `feature/`, `fix/`, or `docs/`.
- Separate words with hyphens, e.g., `feature/add-model-evaluation`.
- Use the imperative mood in the subject line.
- Keep the subject line under 50 characters.
- Open PRs against the `main` branch.
- Ensure all checks pass and request at least one review.
- Address review feedback promptly.
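The branch and commit conventions above could be checked automatically, for example from a local hook. The following is a minimal sketch and not part of this repository: the function names `is_valid_branch_name` and `is_valid_subject` are hypothetical, and restricting branch words to lowercase letters and digits is an assumption. The imperative-mood rule is not checked, since a simple pattern cannot verify it reliably.

```python
import re

# Allowed prefixes come from the branch naming rules above.
# Assumption: words consist of lowercase letters and digits only.
BRANCH_PATTERN = re.compile(r"(feature|fix|docs)/[a-z0-9]+(-[a-z0-9]+)*")

# "Under 50 characters" means the longest allowed subject has 49 characters.
MAX_SUBJECT_LENGTH = 49


def is_valid_branch_name(name: str) -> bool:
    """Return True if the branch has an allowed prefix and hyphen-separated words."""
    return BRANCH_PATTERN.fullmatch(name) is not None


def is_valid_subject(subject: str) -> bool:
    """Return True if the commit subject is non-empty and under 50 characters."""
    length = len(subject.strip())
    return 0 < length <= MAX_SUBJECT_LENGTH


if __name__ == "__main__":
    print(is_valid_branch_name("feature/add-model-evaluation"))
    print(is_valid_branch_name("feature/add_model_evaluation"))
    print(is_valid_subject("Add model evaluation notebook"))
```

A check like this could run before a push, but the existing pre-commit configuration remains the authoritative source for which checks are enforced.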
Build the image:
docker build -t dsp .
Run the container:
docker run -p 8888:8888 dsp
This starts a Jupyter Notebook server at http://localhost:8888.
Predicting Boston Housing Prices: Used a linear regression model to predict house values in the Boston area, applied various statistical tools for analysis, and evaluated the model using the R² score.
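To illustrate what the R² score measures, here is a minimal sketch of a one-feature least-squares fit and its R² evaluation in plain Python. It is not the notebook's code: the helper names are hypothetical (`r2_score` only mirrors the name of the scikit-learn metric), and the numbers are made up rather than taken from the Boston housing data.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up illustrative values (e.g. rooms vs. price), not real data.
    rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
    prices = [15.0, 19.5, 24.0, 31.0, 35.5]
    slope, intercept = fit_simple_linear_regression(rooms, prices)
    predictions = [slope * x + intercept for x in rooms]
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
    print(f"R^2={r2_score(prices, predictions):.3f}")
```

An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean.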
Finding Donors for Charity with ML: I used three different supervised learning algorithms to build a model that predicts whether an individual earns more than $50,000, in order to identify likely donors for a fictional non-profit organisation.
Arrhythmia Detection: This project focuses on detecting arrhythmias using the MIT-BIH (Massachusetts Institute of Technology - Beth Israel Hospital) ECG (Electrocardiogram) dataset. Arrhythmias are irregular heart rhythms that can have serious health implications, and early detection is essential.
Building a Simple Language Model with Flask and NLTK: This project details the implementation, testing, and evaluation of a simple n-gram language model using the NLTK library. The model predicts the next word in a sentence from the preceding words, using n-gram statistics. We explored both bigram (n=2) and trigram (n=3) models to compare their performance.
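The idea of next-word prediction from n-gram counts can be sketched as follows. This is an illustration only: it uses the Python standard library rather than NLTK, the function names `train_ngram_model` and `predict_next` are hypothetical, and the toy corpus is made up.

```python
from collections import Counter, defaultdict


def train_ngram_model(tokens, n):
    """Count which word follows each (n-1)-word context."""
    if n < 2:
        raise ValueError("n must be at least 2 (bigram or higher)")
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        model[context][next_word] += 1
    return model


def predict_next(model, context_words, n):
    """Return the most frequent next word for the last n-1 words, or None if unseen."""
    if n < 2:
        raise ValueError("n must be at least 2 (bigram or higher)")
    context = tuple(context_words[-(n - 1):])
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up toy corpus for illustration.
    tokens = "the cat sat on the mat and the cat slept".split()
    bigram_model = train_ngram_model(tokens, 2)
    trigram_model = train_ngram_model(tokens, 3)
    print(predict_next(bigram_model, ["the"], 2))
    print(predict_next(trigram_model, ["on", "the"], 3))
```

A trigram model conditions on more context than a bigram model, so its predictions can be more specific, but it also encounters unseen contexts more often on small corpora.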
Synthetic Oncology Data: Generate de-identified cancer patient records with Synthea and matching VCF variant profiles.