Designing a Scalable and Reproducible Machine Learning Workflow Thesis

Music Genre Classification Use Case

This implementation is part of my thesis, conducted in collaboration with the University of West Attica. It combines academic research with practical application in the field of machine learning.

Introduction

End-to-end machine learning (ML) experimentation pipelines need to be scalable, robust, and reproducible. This project aims to give Data Scientists and ML Engineers a pipeline execution experience free of common obstacles such as downtime, hardware unavailability, OS conflicts, and dependency issues. The goals are robust execution in a highly available environment, reproducibility that can be revisited at any time, minimal manual intervention through automation, easy extensibility, and the capacity to run larger tasks concurrently.

To support these goals, the project combines several tools: containerization/virtualization for consistent environment management, experiment monitoring to surface training information, data and model tracking to trace model and data versions, scalable object storage for secure data and model storage, and a workflow engine for automation, scheduling, and monitoring. Beyond addressing the challenges of designing scalable and reproducible ML workflows, the project also offers practical insights through industrial case studies that show the benefits of adopting such workflows.

Project Structure

|-- airflow/
|   |-- dags/
|   |-- logs/
|-- checkpoints/
|-- evidently_ai/
|-- Data/
|   |-- .dvc
|-- Dockerfile
|-- docs/
|-- dvc_pull_files.sh
|-- dvc.yaml
|-- genre_classification/
|   |-- data_model/
|   |   |-- (dir with data models used in the project)
|   |-- entrypoints.py
|   |-- feature_extraction/
|   |   |-- (dir with abstract feature extraction class, implemented class, and factory)
|   |-- __init__.py
|   |-- model/
|   |   |-- (dir with abstract model and implemented models)
|   |-- preprocessor/
|   |   |-- (dir with audio and image preprocessing classes, and the factory)
|   |-- trainer/
|   |   |-- (dir with the optimizer)
|   |-- utils/
|       |-- evaluation_metrics.py
|       |-- metadata.py
|       |-- model_selection.py
|       |-- save_load.py
|       |-- save_mel_spec_img.py
|-- __main__.py
|-- mlruns/
|-- playground.py
|-- settings.py

Project Overview

airflow/: Airflow DAGs for workflow orchestration, plus logs for task monitoring (see Internal Airflow Documentation).
checkpoints/: Trained model checkpoints.
evidently_ai/: Code and reports from Evidently (see Internal Evidently AI Documentation).
Data/: Directory with a DVC file used to synchronize data.
Dockerfile: Build instructions for the project's Docker image (see Internal Docker Documentation).
docs/: Extended documentation for the project.
dvc_pull_files.sh: Script to download DVC-tracked files, i.e. models and data (see Internal DVC Documentation).
dvc.yaml: DVC pipeline definitions (see Internal DVC Documentation).
genre_classification/: Main package containing the core functionality of the project.
    data_model/: Data models used in the project (see DataModel Overview).
    entrypoints.py: Script with the developed ML pipelines (see Pipeline Overview).
    feature_extraction/: Abstract feature extraction class, implemented class, and factory (see Feature Extraction Documentation).
    model/: Abstract model and implemented models (see DL Model Overview).
    preprocessor/: Audio and image preprocessing classes and the factory (see Audio Preprocess Documentation).
    trainer/: The optimizer (see DL Model Overview).
    utils/: Supporting scripts (e.g., evaluation metrics, metadata handling, model selection, and save/load utilities).
__main__.py: Command-line interface to execute the ML pipelines developed in the project (see CLI Implementation Overview).
mlruns/: MLflow directory holding metadata (see Internal MLflow Documentation).
playground.py: Script to test codebase functions.
settings.py: File containing project settings and global variables.
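
The overview describes `__main__.py` as a command-line interface that executes the pipelines defined in `entrypoints.py`. The following is only a minimal sketch of how such a dispatch could look, not the project's actual code. The pipeline names (`preprocess`, `train`), the `--data-dir` option, and the function bodies are hypothetical placeholders chosen for illustration.

```python
import argparse
from typing import Callable, Dict, List, Optional


# Hypothetical stand-ins for pipelines that would live in
# genre_classification/entrypoints.py; the real names and arguments may differ.
def preprocess(data_dir: str) -> str:
    return f"preprocess:{data_dir}"


def train(data_dir: str) -> str:
    return f"train:{data_dir}"


# Map each CLI choice to the pipeline function it should run.
PIPELINES: Dict[str, Callable[[str], str]] = {
    "preprocess": preprocess,
    "train": train,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run an ML pipeline (illustrative).")
    parser.add_argument("pipeline", choices=sorted(PIPELINES), help="pipeline to run")
    parser.add_argument("--data-dir", default="Data", help="assumed data directory")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    # argv=None makes argparse read sys.argv; passing a list eases testing.
    args = build_parser().parse_args(argv)
    return PIPELINES[args.pipeline](args.data_dir)


# Demonstration call with explicit arguments instead of sys.argv.
print(main(["train", "--data-dir", "Data"]))
```

Keeping the dispatch table separate from the parser means a new pipeline only needs one new dictionary entry, which fits the stated goal of easy extensibility.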

Software Engineering patterns and principles
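
The project overview mentions an abstract class, an implemented class, and a factory for both feature extraction and preprocessing. A minimal sketch of that abstract-class-plus-factory pattern is shown below. The class names, the registry keys, and the toy feature computations are assumptions made for illustration and are not taken from the project's code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class FeatureExtractor(ABC):
    """Abstract interface: every extractor maps raw samples to a feature vector."""

    @abstractmethod
    def extract(self, samples: List[float]) -> List[float]:
        ...


class MeanEnergyExtractor(FeatureExtractor):
    """Hypothetical toy extractor: mean of the squared samples."""

    def extract(self, samples: List[float]) -> List[float]:
        if not samples:
            return [0.0]
        return [sum(s * s for s in samples) / len(samples)]


class PeakExtractor(FeatureExtractor):
    """Hypothetical toy extractor: largest absolute sample value."""

    def extract(self, samples: List[float]) -> List[float]:
        return [max((abs(s) for s in samples), default=0.0)]


class FeatureExtractorFactory:
    """Creates extractors by name so callers never import concrete classes."""

    _registry: Dict[str, Type[FeatureExtractor]] = {
        "mean_energy": MeanEnergyExtractor,
        "peak": PeakExtractor,
    }

    @classmethod
    def create(cls, name: str) -> FeatureExtractor:
        try:
            return cls._registry[name]()
        except KeyError:
            raise ValueError(f"Unknown feature extractor: {name!r}") from None


extractor = FeatureExtractorFactory.create("mean_energy")
print(extractor.extract([1.0, -1.0, 2.0]))
```

With this arrangement, pipeline code depends only on the abstract interface and a configuration string, so swapping in a different extractor does not require changes to the calling code.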

Repository: ipolychr/content-mlops
