StanChiangTW/CS---MLOps---Data-Science-Platform

DSBA Platform

A toy MLOps Project

Project Structure

  • A pyproject.toml file contains the project metadata, including the dependencies. It is common to see a "setup.py" file in Python projects, but we use this more modern approach to define the project metadata.
  • The src folder contains the core library code (dsba) as well as the code for the CLI, the API, the web app, the notebooks, and the Dockerfiles.
  • The tests folder contains some unit and integration tests
  • .gitignore is a special file name that git detects automatically. It lists files and folders that should not be committed to the repository. For example (see below for setup), the .env file is specific to your own deployment, so it should not be committed: it may contain file paths that are only meaningful on your machine, and it may contain secrets like API keys. API keys and passwords should never be stored in a git repository.

Installation (dev mode)

Requirements

Your machine should have the following software installed:

  • Python 3.12
  • git
  • to use the model training notebook (optional), you may need to install OpenMP (libomp), which xgboost requires. Alternatively, you can skip the model_training module from this example, or adapt it to use scikit-learn rather than xgboost.

Clone the repository

  • The first thing to do is to copy this repository so that you have a copy you own on GitHub, because you are not allowed to push directly to the main repository owned by Joachim. Copying a repository into your own GitHub account is called a "fork". Note that "forking" and "cloning" are not the same: forking is a GitHub concept (copying a repository into your own account), while cloning means downloading a repository to your computer for the first time. Just click the fork button above when viewing this document on GitHub.

  • Move into the folder you want to work in (many students skip this step and end up working in their home directory; you don't want to do that).

  • To check that everything is in order, type:

git status

This should fail and tell you there is no repository at this location. Many students have tried to clone a repository inside another repository; you don't want to be in that situation either.

Now you can clone the repository:

git clone <the address of your fork>

Installing the project

cd into the repository folder.

Create a virtual environment with the following command (on Windows, use python instead of python3). Naming your virtual environment ".venv" is recommended: it is the de facto standard, and tools like VS Code will find it automatically.

python3 -m venv .venv

Activate the virtual environment (source .venv/bin/activate on Linux/macOS, .venv\Scripts\activate on Windows), then install the dependencies (as specified in pyproject.toml):

pip install -e .

This will install the project in editable mode, meaning that any changes you make to the code will be reflected in your local environment.

Running the tests

To run the tests, you can use the following command:

pytest

This will run all the tests in the tests folder.
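Tests follow the usual pytest conventions: any function whose name starts with test_ inside a test_*.py file in the tests folder is discovered and run. A minimal sketch of what such a test looks like (the helper and file name here are illustrative, not taken from this repository's actual test suite):

```python
# tests/test_example.py -- illustrative pytest-style test.
# pytest discovers this automatically because the file and function
# names start with "test_".

def normalize_column_name(name: str) -> str:
    # Hypothetical helper: turn a CSV column header into snake_case.
    return name.strip().lower().replace(" ", "_")


def test_normalize_column_name():
    assert normalize_column_name(" Customer Age ") == "customer_age"
```

Running pytest from the repository root picks this up without any extra configuration.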

Usage

Before you can use the platform, you must set the environment variable DSBA_MODELS_ROOT_PATH to the path where you want to store the models.

For example, as a macOS user, I set it to /Users/joachim/dev/dsba/models_registry.

There are many ways to set environment variables depending on the context.

In a python notebook, you can use the following code:

import os
os.environ["DSBA_MODELS_ROOT_PATH"] = "/path/to/your/models"

In a terminal or shell script, you can use the following code (Linux and MacOS):

export DSBA_MODELS_ROOT_PATH="/path/to/your/models"

For Windows, use one of the following. In cmd.exe, omit the quotes (they would become part of the value):

set DSBA_MODELS_ROOT_PATH=C:\path\to\your\models

In PowerShell:

$env:DSBA_MODELS_ROOT_PATH = "C:\path\to\your\models"
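However the variable is set, code that uses it benefits from failing early with a clear message when it is missing. A small illustrative helper (not part of the dsba package itself) showing the pattern:

```python
import os
from pathlib import Path


def models_root() -> Path:
    """Resolve the models registry folder from DSBA_MODELS_ROOT_PATH.

    Raises a clear error when the variable is missing, instead of
    failing later with a confusing file-path error.
    """
    raw = os.environ.get("DSBA_MODELS_ROOT_PATH")
    if not raw:
        raise RuntimeError(
            "Set DSBA_MODELS_ROOT_PATH to the folder where models are stored."
        )
    # expanduser lets users write paths like ~/dev/dsba/models_registry
    return Path(raw).expanduser()
```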

MLOps student project

This project aimed to create an interface for bank employees that lets them choose a prediction model and a customer dataset, and then returns a prediction of whether each customer will churn.

To do this, we based our work on a machine learning project on bank churners that one of us had previously created. That project used the dataset of a Kaggle challenge, available at: https://www.kaggle.com/datasets/thedevastator/predicting-credit-card-customer-attrition-with-m

Important Folders and Files

Data

data/

This contains:

  • the original BankChurners.csv file of the Kaggle challenge used for the original ML project
  • the X_test.csv and y_test.csv files created by preprocessing the original dataset; these files are the ones used for the rest of the MLOps project
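The split behind those files can be sketched with the standard library alone. This is an illustrative reconstruction, not the notebook's actual preprocessing: the real split (and any scaling or encoding) lives in Bank_MLOps.ipynb, and the 20% test fraction and fixed seed here are assumptions.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and return (train, test) lists.

    A fixed seed makes the split reproducible, which is what allows
    X_test.csv / y_test.csv to be saved once and reused afterwards.
    """
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```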

Models

models/

This contains the different trained models available for prediction. The 4 models are:

  • lgbm_model.pkl
  • rf_model.pkl
  • svm_model.pkl
  • xgb_model.pkl
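Each .pkl file is a pickled estimator, so loading one is a plain pickle round-trip. A minimal sketch; the DummyModel below stands in for a real estimator so the snippet runs without the repository's model files:

```python
import pickle


class DummyModel:
    """Stand-in for a real scikit-learn / xgboost / lightgbm estimator."""

    def predict(self, rows):
        # Pretend every customer stays (0 = no churn).
        return [0 for _ in rows]


def load_model(path):
    """Load a pickled model from the models/ folder."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

With a real model file in place you would call, e.g., load_model("models/xgb_model.pkl"). Note that unpickling requires the matching library (xgboost, lightgbm, scikit-learn) to be installed, and that you should only unpickle files you trust, since pickle can execute arbitrary code.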

Src

src/

This is the main source directory where the code resides.

CLI

The CLI is not used in this MLOps project; normally it is used to list the models registered on your system:

src/cli/dsba_cli list

Use a model to predict on a file:

src/cli/dsba_cli predict --input /path/to/your/data/file.csv --output /path/to/your/output/file.csv --model-id your_model_id
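Conceptually, the predict command reads the input CSV, scores each row with the selected model, and writes the rows back out with an extra prediction column. A sketch of that flow; the real CLI loads a registered model by its --model-id, whereas here the model is any object with a predict method:

```python
import csv


def predict_file(input_path, output_path, model):
    """Read a CSV, score every row, write rows plus a 'prediction' column."""
    with open(input_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames) + ["prediction"]

    predictions = model.predict(rows)
    for row, pred in zip(rows, predictions):
        row["prediction"] = pred

    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```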

API

An API is provided; it allows you to interact with the models. You can start the API by running:

uvicorn api:app --reload

Dockerized API

To run the API in a Docker container, follow these steps:

  1. Build the Docker image:

docker build -f Dockerfile.api -t fastapi .

  2. Run the Docker container:

docker run -d -p 8000:80 fastapi

The API will be available at http://127.0.0.1:8000/
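From Python, the running API can be called with the standard library alone. A minimal sketch, assuming the API exposes a JSON POST endpoint; the /predict endpoint name and the payload shape here are assumptions, so check the routes defined in src for the actual interface:

```python
import json
import urllib.request


def call_predict(base_url: str, payload: dict) -> dict:
    """POST a JSON payload to the API and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/predict",   # hypothetical endpoint name
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Usage would then look like call_predict("http://127.0.0.1:8000", {...}) against the container started above.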

Note: Ensure Docker is installed on your machine.

  3. Tag the image:

docker tag fastapi stanchiangtw/fastapi

Note: Ensure Docker Desktop is logged in by running:

docker login

  4. Push the tagged image to Docker Hub:

docker push stanchiangtw/fastapi:latest

AWS ECS & EC2

We successfully deployed and scaled Docker containers on AWS ECS. The process involved the following steps:

  • Creating an ECS cluster
  • Defining a Task Definition
  • Configuring a Security Group
  • Setting up a service
  • Accessing the running service

The API was successfully running at http://13.37.241.233:8000/.

However, the service quickly exceeded the free tier quota, resulting in costs ($12.46). Before stopping the service, we took a screenshot to document the progress made.

Templates

The templates/ directory contains HTML templates used for rendering web views in our application. The main file dashboard.html serves as the user interface for our ML model evaluation dashboard.

static

The static/ directory stores static resources such as generated plots (.png files). These images are served directly to the client browser and do not change during runtime. The visualization image created during model evaluation is stored here so that it can be displayed in the dashboard.

Dockerfile

Dockerfile.api includes all the instructions needed to build the API image, organized so that Docker executes them in the correct order.
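The typical layout of such a file looks like the sketch below. This is an illustrative reconstruction, not the repository's actual Dockerfile.api: the base image and copied paths are assumptions, while the internal port 80 matches the docker run -p 8000:80 mapping shown above.

```dockerfile
# Illustrative sketch -- see the real Dockerfile.api in the repository.
FROM python:3.12-slim
WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code.
COPY . .

# uvicorn serves the FastAPI app on port 80 inside the container.
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "80"]
```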

Requirements

requirements.txt lists all the Python packages required to process the data, train the models, and run the API.

DSBA

This contains the core functionality of the MLOps project, including model handling, data preprocessing, and utilities for training and predicting.

Notebooks

This contains:

  • model_training_example.ipynb, the original example notebook of the MLOps platform project
  • Bank_MLOps.ipynb, the notebook of the ML project on which we based our MLOps project. The original code from the ML project has not been modified; it may contain some LLM-generated elements. To use the notebook, navigate to the notebooks/ folder and open the file. You can use the provided utilities to train models, preprocess data, and evaluate performance.

REMARKS

No remarks
