This repository is created for a job interview home assessment.
The only input received was a Jupyter Notebook containing an imaginary Data Scientist's image classification model training scripts. The task was to make those training scripts reproducible in a professional(ish) way.
For more information, see the original Jupyter Notebook (notebooks/mlops_project_classifier.ipynb).
The repository structure is based on the Cookiecutter Data Science guidelines.
```
mlops-home-assessment
├── .git # Git related things
│ ├── hooks # Installed pre-commit hooks
│ └── ...
├── .gitattributes # Git LFS config
├── .github # CODEOWNERS, GitHub Actions Workflow definitions
│ ├── CODEOWNERS
│ └── workflows
│ ├── pre-commit.yaml
│ └── training.yaml
├── .gitignore # gitignore
├── .pre-commit-config.yaml # Pre-commit hooks related settings
├── MANIFEST.in # Python packaging related
├── README.md # This file
├── data # Folder containing the different datasets
│ └── cifar_dataset
│ ├── cifar-10-batches-py
│ │ ├── batches.meta
│ │ ├── data_batch_1
│ │ ├── data_batch_2
│ │ ├── data_batch_3
│ │ ├── data_batch_4
│ │ ├── data_batch_5
│ │ ├── readme.html
│ │ └── test_batch
│ └── metadata.json
├── docker # Folder containing the Docker image build context
│ ├── Dockerfile
│ ├── build.sh
│ ├── cleanup_build_context.sh
│ ├── configure.sh
│ └── create_build_context.sh
├── experiments # Folder for tracking the experiments
│ └── local_exp_test
│ └── run_1
│ ├── metrics.png
│ └── model.pth
├── image_classifier # The Python package folder (src)
│ ├── __init__.py
│ ├── data # Folder containing data loader/handling scripts
│ │ ├── __init__.py
│ │ ├── data.py
│ │ └── db_connect.yaml
│ ├── main.py # Main script for training
│ ├── model # Folder containing scripts for model training
│ │ ├── __init__.py
│ │ ├── model.py
│ │ ├── training.py
│ │ └── validation.py
│ └── visualization # Folder containing visualization scripts
│ └── plot.py
├── notebooks # Folder containing Jupyter Notebooks
│ └── mlops_project_classifier.ipynb
├── pyproject.toml # Python packaging / pre-commit hooks related
├── requirements # Folder containing requirements files
│ ├── requirements-build.in
│ ├── requirements-build.sh
│ ├── requirements-build.txt
│ ├── requirements-cq.in
│ ├── requirements-cq.sh
│ ├── requirements-cq.txt
│ ├── requirements-dev.in
│ ├── requirements-dev.sh
│ ├── requirements-dev.txt
│ ├── requirements-prod.in
│ ├── requirements-prod.sh
│ └── requirements-prod.txt
└── tox.ini # Pre-commit hooks related
```

Note that Ubuntu 22.04 (WSL) and VSCode were used for development.
Currently, Python 3.10 is used. Please use this version in your Dev Containers and virtualenvs.
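For example, a matching virtualenv can be created as follows (assuming `python3.10` is on your `PATH`):

```bash
# Create and activate a Python 3.10 virtual environment for this repository
python3.10 -m venv .venv
source .venv/bin/activate
python --version  # should print Python 3.10.x
```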
If you wish to build and run the Docker images on your local machine, you will need Docker installed. See the official Docker installation guide for more information.
pip-tools is a package management CLI tool that helps keep pip-based packages consistent and compatible with one another. When generating a requirements.txt file, pip-tools pins all the transitive dependencies of the specified packages to the latest versions that avoid conflicts. Furthermore, it annotates the output with verbose, human-readable comments.
See the pip-tools documentation for more info.
Currently, pip-tools==7.3.0 is used for generating the requirements-*.txt files from the requirements-*.in files.
To do so, specify all the required packages in the requirements-*.in files (pin package versions only if absolutely necessary) and then generate the requirements-*.txt files as follows:
```bash
# With Python3.10
cd <PATH_TO_THIS_REPO>/requirements
pip install pip-tools==<CURRENTLY_USED_VERSION>
# Use build/cq/dev/prod instead of "*"
pip-compile --output-file requirements-*.txt requirements-*.in
```

Currently, four different requirements sets are being used:
- build: containing the necessary packages for Python package building
- code quality (cq): containing the packages for the code quality checking related pre-commit hooks
- development (dev): containing packages necessary for model development
- production (prod): containing only the packages that are necessary for training and production
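Since the four sets share the same naming scheme, they can all be (re)compiled in one go; this is just a convenience loop around the command shown above:

```bash
cd <PATH_TO_THIS_REPO>/requirements
pip install pip-tools==7.3.0
for reqset in build cq dev prod; do
    pip-compile --output-file "requirements-${reqset}.txt" "requirements-${reqset}.in"
done
```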
For each requirements set, there is a requirements-*.sh shell script with which one can install the relevant requirements via pip. The reason for using shell scripts is that this way multiple requirements-*.txt files can be installed at once; e.g., requirements-dev.sh installs:
- requirements-cq.txt
- requirements-build.txt
- requirements-dev.txt
- the repository in editable mode (`pip install -v -e <PATH_TO_THIS_REPO>`)
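For reference, here is a minimal sketch of what such a script could look like; this is an assumption based on the list above, not the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of requirements/requirements-dev.sh; the real script may differ
REQ_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
pip install -r "${REQ_DIR}/requirements-cq.txt"
pip install -r "${REQ_DIR}/requirements-build.txt"
pip install -r "${REQ_DIR}/requirements-dev.txt"
# Install the repository itself in editable mode
pip install -v -e "${REQ_DIR}/.."
```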
To ensure code quality, the following tools/packages are being used as pre-commit hooks:
- pre-commit: file checkers (json, yaml, executables), end-of-file and whitespace checkers
- flake8: linter
- black: code formatter
- isort: Python imports formatter
The pre-commit hooks are configured via the .pre-commit-config.yaml file.
Furthermore, the following files are used to configure the tools/packages:
- pyproject.toml: configures `black` and `isort` properties
- tox.ini: configures `flake8` properties (`flake8` does not yet support `pyproject.toml`-based configuration by default)
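Once installed (see below), the hooks run automatically on git commit; they can also be invoked manually with the standard pre-commit CLI:

```bash
pre-commit run --all-files  # run every configured hook against the whole repository
pre-commit run flake8       # run a single hook by its id
```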
To install the pre-commit hooks locally, simply run the requirements/requirements-dev.sh script from anywhere via:
```bash
# Wherever you are
source <PATH_TO_THIS_REPO>/requirements/requirements-dev.sh
```

setuptools is being used for building a Python package from the image_classifier src folder, which is required for both development and production (currently this means GitHub Actions Workflow based model training). The currently used package versions are:
```
setuptools>=69.0
build>=1.0.3
```
To install these packages locally (along with the repository in editable mode), simply run the requirements/requirements-dev.sh script from anywhere via:
```bash
# Wherever you are
source <PATH_TO_THIS_REPO>/requirements/requirements-dev.sh
```

If you want to build a wheel locally, run the following command within the repository; it will create a *.whl and a *.tar.gz file under the <PATH_TO_THIS_REPO>/dist/ folder:
```bash
python -m build
```

The following files are used by setuptools for building the package:
- `pyproject.toml`: contains general information, build requirements, and default logic on which Python modules (`.py` files / folders with an `__init__.py` file in them) and non-Python files (package data) must be contained in the package
- `MANIFEST.in`: contains a set of commands with glob-like pattern matching to include (or exclude) relevant files that `setuptools` does not catch by default with the `pyproject.toml` settings
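To verify that both the Python modules and the package data (e.g. image_classifier/data/db_connect.yaml) actually end up in the wheel, the archive contents can be listed; the wheel filename pattern below is an assumption:

```bash
cd <PATH_TO_THIS_REPO>
python -m build
# A wheel is just a zip archive; list its contents to check what got packaged
unzip -l dist/image_classifier-*.whl
```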
For more information on setuptools based packaging, check the following links:
- https://setuptools.pypa.io/en/latest/userguide/quickstart.html
- https://packaging.python.org/en/latest/guides/writing-pyproject-toml/
To ensure reproducibility, model trainings are run in Docker containers in the training GitHub Actions Workflow runs.
The docker folder contains all the files that are necessary to build the Docker image both locally and in the GitHub Actions Workflow runs:
- `Dockerfile`: the Docker image configuration file
- `configure.sh`: shell script for setting up local variables for the Docker build
- `create_build_context.sh`: shell script for composing the Docker build context:
  - copying the `requirements/requirements-prod.(txt/sh)` files to the `docker` folder
  - building the Python package and copying the resulting `dist/*.whl` file to the `docker` folder
- `cleanup_build_context.sh`: shell script for cleaning up the repository after the Docker build:
  - deleting the `docker/requirements-prod.(txt/sh)` and `docker/*.whl` files
  - deleting the `dist` and `*.egg-info` folders from the project root
- `build.sh`: shell script for executing the `configure.sh` and `create_build_context.sh` shell scripts, the `docker build` command, and finally the `cleanup_build_context.sh` shell script (a sketch of this orchestration follows the list)
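As a hedged sketch of how these pieces fit together (the variable names IMAGE_NAME and IMAGE_TAG are assumptions, not taken from the actual scripts):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of docker/build.sh; the real script may differ
DOCKER_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${DOCKER_DIR}/configure.sh"             # e.g. sets IMAGE_NAME and IMAGE_TAG
source "${DOCKER_DIR}/create_build_context.sh"  # copies prod requirements and the wheel
docker build -t "${IMAGE_NAME}:${IMAGE_TAG}" "${DOCKER_DIR}"
source "${DOCKER_DIR}/cleanup_build_context.sh" # removes the copied build artifacts
```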
To build the Docker image locally, simply run the build.sh shell script from anywhere:
```bash
source <PATH_TO_THIS_REPO>/docker/build.sh
```

The image_classifier folder is the SOURCE / PACKAGE folder that contains all the scripts/files relevant for model training.
To run a model training locally, simply run the image_classifier/main.py script and set the arguments as needed:
```bash
# With Python3.10
# Required arguments: --experiment_name, --run_name, --db_pass (a mocked password)
# Optional arguments: --dataset_folder (defaults to data/cifar_dataset),
#   --learning_rate, --momentum, --num_epochs, --num_workers, --batch_size,
#   --pretrained_model (path relative to the experiments folder,
#   e.g.: local_exp_test/run_1/model.pth)
python -m image_classifier.main \
    --experiment_name <your_experiment_name> \
    --run_name <your_run_name> \
    --dataset_folder <your_dataset> \
    --learning_rate <your_learning_rate> \
    --momentum <your_momentum> \
    --num_epochs <your_num_epochs> \
    --num_workers <your_num_workers> \
    --batch_size <your_batch_size> \
    --pretrained_model <path_to_pretrained_model_relative_to_experiments_folder> \
    --db_pass <your_mocked_password>
```

Note that instead of `python -m image_classifier.main` you can simply use `image_classifier_trainer` as an alias (see pyproject.toml).
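For example, using the alias (the run name and password below are placeholders):

```bash
image_classifier_trainer \
    --experiment_name local_exp_test \
    --run_name run_2 \
    --db_pass not_a_real_password
```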
The experiments are tracked under the experiments folder. Each experiment run outputs a trained model.pth file and a corresponding metrics.png of the validation accuracy metrics.
Every *.pth and *.png file in the repository is versioned via Git LFS configured in .gitattributes.
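Such rules are typically registered with git lfs track, which writes the corresponding patterns into .gitattributes, e.g.:

```bash
git lfs install                # one-off Git LFS setup on a machine
git lfs track "*.pth" "*.png"  # writes the tracking patterns into .gitattributes
```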
Currently, two GitHub Actions Workflows are implemented and being used:

- Code Quality checks with pre-commit (`.github/workflows/pre-commit.yaml`):
  - description: runs the configured pre-commit hooks against the modified files
  - triggers: automatic trigger at every git push event on all branches
- Image Classifier Training (`.github/workflows/training.yaml`):
  - description: builds the Docker image and runs the model training in a Docker container. The experiment run outputs (`model.pth`, `metrics.png`) are published as build artifacts. If you wish to make a build artifact model available to later workflow runs as `pretrained_model`, then simply:
    1. download the build artifact (e.g. with the GitHub CLI, as sketched after this list)
    2. unzip it
    3. copy the files into your local `<PATH_TO_THIS_REPO>/experiments/<your_experiment_name>/<your_run_name>/` folder
    4. commit and push the new files
    5. pass `<your_experiment_name>/<your_run_name>/model.pth` as the `pretrained_model` argument at the next workflow trigger
  - triggers: manual (`workflow_dispatch`); model training input arguments can be configured on the UI
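The download and unzip steps can be scripted with the GitHub CLI; in this sketch, the run id and artifact name are placeholders, and the actual artifact name depends on how training.yaml uploads it:

```bash
# gh extracts the artifact zip automatically on download
gh run download <run_id> --name <artifact_name> --dir /tmp/training_artifact
cp /tmp/training_artifact/* \
   <PATH_TO_THIS_REPO>/experiments/<your_experiment_name>/<your_run_name>/
```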