bioai-esol-platform

Learning hands-on how to take an applied ML research model to production

Goal of this Project

Main Focus = Taking research into production.

The aim is NOT research specialisation in molecular ML. It is to gain familiarity with an established benchmark in order to better understand the field, and to practise making well-thought-out design decisions when taking the model to production.

Note: This project is documented as an open-source-style guide: this README walks readers step by step, along with me, through the process of taking research skills toward production. It explains how to run the software tools used in the context of this project, but does not provide in-depth tutorials on the tools themselves, as they are widely adopted and well documented on their respective websites. My learning of how to use these tools was guided by this well-curated free online course: MLOps Zoomcamp

Roadmap

Current progress highlighted in italics

(Dataset + model) → (Experiment tracking + hyperparameter sweep) → (Analyse metrics to choose best model) → (Model inference API + deployment) → (Monitoring model performance with Evidently AI + Grafana) → (CI and Testing)

Task

Regression Task: Predicting log solubility, i.e. the base-10 logarithm of aqueous solubility measured in mol per litre (mol/L)
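
Because the target is a base-10 logarithm, a prediction converts back to a concentration by exponentiation, and regression errors are expressed in log units. A minimal sketch of both ideas follows; the helper names are hypothetical and are not taken from the repository.

```python
import math


def log_s_to_mol_per_litre(log_s: float) -> float:
    """Convert a log10 solubility value back to a concentration in mol/L."""
    return 10.0 ** log_s


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root-mean-square error, measured in log units for this task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

A prediction that is off by 1.0 log unit is off by a factor of 10 in concentration, which is worth keeping in mind when reading error metrics later on.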

Dataset

ESOL (Estimated SOLubility): Small molecular dataset introduced by Delaney (2004) for predicting the aqueous solubility of small organic molecules. Imported from torch_geometric using code in data/dataset.py.
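
For orientation, PyTorch Geometric ships ESOL as part of its `MoleculeNet` dataset class. The sketch below shows one way the import could look; the `load_esol` wrapper is a hypothetical name, and the actual code in data/dataset.py may be organised differently.

```python
def load_esol(root: str = "data"):
    """Download (on first call) and load ESOL as a PyTorch Geometric dataset."""
    # Imported lazily so this module can be imported without torch_geometric.
    from torch_geometric.datasets import MoleculeNet

    return MoleculeNet(root=root, name="ESOL")


# Usage (requires torch_geometric and network access on the first run):
# dataset = load_esol()
# print(len(dataset))  # number of molecular graphs
# print(dataset[0])    # a single molecule as a graph Data object
```

The `root` default mirrors the `data.root` value in configs/config.yaml.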

Rationale for Choosing ESOL

Gaining familiarity with molecular ML benchmarking while keeping the task simple and the dataset small.

Small dataset (~1100 molecules) → trains fast
Real-world relevance of solubility in drug discovery (drugs need to dissolve in bodily fluids)
Common benchmark: a good starting point, and a regression target is well suited to simulating monitoring and distribution shifts later in the project
Want familiarity with: Basic molecular ML (molecules + Graph NN)

Set up

Clone the repository

git clone https://github.com/nishkakhendry/bioai-esol-platform.git
cd bioai-esol-platform

Create conda environment & install dependencies

Note: Python 3.10 chosen for stability with libraries like rdkit

conda create -n besol python=3.10
conda activate besol

The source code is structured as an installable Python package using a pyproject.toml file to provide standardized dependency management, reproducible builds, and clean installation via modern Python packaging tools.

pip install -e .
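
A quick way to confirm that the editable install worked is to check that Python can locate the package. This sketch assumes the import name is `bioai_esol` (inferred from the src/bioai_esol directory); adjust it if the package is named differently.

```python
import importlib.util


def is_installed(package: str = "bioai_esol") -> bool:
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(package) is not None


print(is_installed())
```

If this prints `False` inside the activated `besol` environment, re-running the install command from the repository root is the first thing to try.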

Baseline Configuration: configs/config.yaml

seed: 42

data:
  root: "data"
  dataset_name: "ESOL"
  train_ratio: 0.8
  val_ratio: 0.1

model:
  hidden_dim: 64
  num_layers: 2

training:
  batch_size: 32
  lr: 0.001
  epochs: 50

Hidden dimension (64): sufficient representational capacity without overfitting on ~1100 samples.
Two GCN layers: captures local chemical structure while avoiding graph oversmoothing.
Batch size (32): balances gradient stability and generalisation on small datasets.
Learning rate (1e-3): stable Adam default for small GNNs.
Epochs (50): enough to observe convergence without excessive overfitting.
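
To make the baseline numbers concrete, the sketch below works out how the split ratios, batch size and epoch count translate into dataset sizes and optimiser steps. It rests on assumptions: the dataset size of 1128 is the commonly cited ESOL size, the test set is taken to be the remainder after the train and validation splits, and sizes are truncated to integers. The repository's own split code may round differently.

```python
import math


def split_sizes(n: int, train_ratio: float, val_ratio: float) -> tuple[int, int, int]:
    """Return (train, val, test) counts; the test set takes the remainder."""
    n_train = int(train_ratio * n)
    n_val = int(val_ratio * n)
    return n_train, n_val, n - n_train - n_val


N_MOLECULES = 1128  # assumption: commonly cited ESOL size
n_train, n_val, n_test = split_sizes(N_MOLECULES, train_ratio=0.8, val_ratio=0.1)
steps_per_epoch = math.ceil(n_train / 32)  # batch_size: 32
total_updates = steps_per_epoch * 50       # epochs: 50
print(n_train, n_val, n_test, steps_per_epoch, total_updates)
```

Under these assumptions the split is 902 / 112 / 114 molecules, giving 29 batches per epoch and 1450 gradient updates over a full run, which is small enough to train quickly on a CPU-sized budget.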

Training the ESOL Model

Update configs/config.yaml with desired parameters

Run:

python train.py

Hyperparameter Sweep with MLflow

The hyperparameter sweep (with MLflow experiment tracking), automatic retraining, and the model promotion policy in the MLflow model registry are covered in: documentation/hyperpara-sweeps_metric-analysis.md
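
As a rough picture of what a sweep involves, a grid search expands a set of candidate values into one configuration per combination, and each configuration becomes one tracked run. The search space below is hypothetical and purely illustrative; the sweep actually used is described in the linked document.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
SEARCH_SPACE = {
    "hidden_dim": [32, 64, 128],
    "num_layers": [2, 3],
    "lr": [1e-3, 1e-4],
}


def expand_grid(space: dict[str, list]) -> list[dict]:
    """Expand a dict of candidate values into one dict per combination."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


configs = expand_grid(SEARCH_SPACE)
print(len(configs))  # 3 * 2 * 2 = 12 combinations
print(configs[0])
```

In a tracked sweep, each of these dictionaries would typically be logged as the parameters of one MLflow run, alongside the validation metric that run achieves.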

Outcome of this step = "champion" model ready for production deployment in MLflow model registry
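
The promotion idea can be sketched independently of MLflow: a candidate run replaces the current champion only if it is strictly better on the chosen metric. The function name, the `val_rmse` key and the "strictly better" rule are assumptions made for illustration; the actual policy is defined in the linked document.

```python
from typing import Optional


def pick_champion(
    candidates: list[dict],
    current: Optional[dict] = None,
    metric: str = "val_rmse",
) -> Optional[dict]:
    """Return the run that should hold the champion role.

    The best candidate (lowest metric) replaces the current champion only
    if it is strictly better; otherwise the current champion is kept.
    """
    if not candidates:
        return current
    best = min(candidates, key=lambda run: run[metric])
    if current is None or best[metric] < current[metric]:
        return best
    return current
```

Keeping the comparison in a small pure function like this makes the policy easy to unit test before it is wired to the model registry.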
