alexgcsa/auto-admet

Explainable and Evolutionary AutoML for Biochemical Property Prediction

AutoML for Biochemical Property Prediction

This repository contains the code and resources for an evolutionary AutoML method for biochemical property prediction. It focuses on interpreting the machine learning pipelines that the method generates, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties.

📦 Project Structure

├── datasets/             # Raw and processed datasets for biochemical property prediction
├── grammar/              # Context-free grammar (CFG) describing the AutoML search space, as a Backus–Naur form (.bnf) file
├── automl_biochem.py     # Python implementation of the evolutionary AutoML method (i.e., Bayesian Optimisation Algorithm) for biochemical property prediction
├── requirements.yml      # Conda environment specification
└── README.md             # Project documentation

🛠️ Environment Setup

To set up the project environment using Conda, follow the steps below:

1. Create the conda environment

Make sure you have Anaconda or Miniconda installed.

Then run:

conda env create -f requirements.yml

2. Activate the environment

conda activate automl_biochem

3. Deactivate the environment (when you're done)

conda deactivate

📖 How to use the proposed AutoML method?

After activating the automl_biochem environment, run:

python automl_biochem.py training_file.csv testing_file.csv grammar output_directory

For example, using:

  • "datasets/01_caco2_train.csv" as the training file
  • "datasets/01_caco2_blindtest.csv" as the testing file
  • "grammar/automl.bnf" as the grammar defining the AutoML search space
  • "." as the output directory

the command is:

python automl_biochem.py datasets/01_caco2_train.csv datasets/01_caco2_blindtest.csv grammar/automl.bnf .

Other parameters are available, including:

  • "-s": define the seed that controls the method's pseudorandom number generation. Default: 1.
  • "-m": define the optimisation metric to be used. Options: "auc", "mcc", "recall", "precision", "auprc", "accuracy". Default: "auc".
  • "-e": define the experiment name. Default: "Exp_ADMET".
  • "-t": define the time budget (in minutes) to run the method. Default: 5 (min).
  • "-n": define the number of cores used to run the evolutionary AutoML method. Default: 20.
  • "-ta": define the time budget (in minutes) to run each individual (i.e., each ML pipeline). Default: 1 (min).
  • "-p": define the population size. Default: 100.
  • "-mr": define the mutation rate. Default: 0.15.
  • "-cr": define the crossover rate. Default: 0.80.
  • "-cmr": define the rate of applying crossover followed by mutation. Default: 0.05.
  • "-es": define the elitism size. Default: 1.
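
The options above can be summarised as a command-line parser. The following is a minimal sketch only, not the parser used in automl_biochem.py: the flag names, metric choices and defaults come from the list above, while the `dest` names, the argument types and the `build_parser` function are assumptions made for illustration.

```python
import argparse

# Metric names as documented for the "-m" option.
METRICS = ("auc", "mcc", "recall", "precision", "auprc", "accuracy")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the script's own)."""
    parser = argparse.ArgumentParser(prog="automl_biochem.py")

    # Positional arguments, in the documented order.
    parser.add_argument("training_file")
    parser.add_argument("testing_file")
    parser.add_argument("grammar")
    parser.add_argument("output_directory")

    # Optional arguments with the documented defaults.
    # The dest names and types are illustrative assumptions.
    parser.add_argument("-s", dest="seed", type=int, default=1)
    parser.add_argument("-m", dest="metric", choices=METRICS, default="auc")
    parser.add_argument("-e", dest="exp_name", default="Exp_ADMET")
    parser.add_argument("-t", dest="time_budget_min", type=int, default=5)
    parser.add_argument("-n", dest="n_cores", type=int, default=20)
    parser.add_argument("-ta", dest="time_per_pipeline_min", type=int, default=1)
    parser.add_argument("-p", dest="population_size", type=int, default=100)
    parser.add_argument("-mr", dest="mutation_rate", type=float, default=0.15)
    parser.add_argument("-cr", dest="crossover_rate", type=float, default=0.80)
    parser.add_argument("-cmr", dest="crossover_mutation_rate", type=float, default=0.05)
    parser.add_argument("-es", dest="elitism_size", type=int, default=1)
    return parser
```

Parsing only the four positional arguments with this sketch yields the documented defaults, which is a quick way to check a planned invocation before spending the time budget on it.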

To run with every option set explicitly (here, at its default value):

python automl_biochem.py datasets/01_caco2_train.csv datasets/01_caco2_blindtest.csv grammar/automl.bnf . -s 1 -m auc -e Exp_ADMET -t 5 -n 20 -ta 1 -p 100 -mr 0.15 -cr 0.80 -cmr 0.05 -es 1
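
Because the method is stochastic and controlled by the "-s" seed, it can be convenient to launch the command above once per seed. The following is a minimal sketch of such a wrapper, assuming it is run from the repository root; the helper names `build_command` and `run_seeds` and the per-seed output layout (`seed_<n>` subdirectories) are hypothetical and not part of this repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(train, test, grammar, out_dir, **options):
    """Return the argv list for one run.

    `options` maps documented flag names without the leading dash
    (e.g. s=1, m="auc", ta=1) to their values.
    """
    cmd = [sys.executable, "automl_biochem.py",
           str(train), str(test), str(grammar), str(out_dir)]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


def run_seeds(seeds, train, test, grammar, out_root, **options):
    """Run the script once per seed, each in its own output directory.

    The seed_<n> directory layout is an assumption made for this sketch.
    Do not pass `s` in `options`; the seed is set from `seeds`.
    """
    for seed in seeds:
        out_dir = Path(out_root) / f"seed_{seed}"
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = build_command(train, test, grammar, out_dir, s=seed, **options)
        # check=True raises CalledProcessError if a run exits with an error.
        subprocess.run(cmd, check=True)
```

For instance, `run_seeds(range(1, 4), "datasets/01_caco2_train.csv", "datasets/01_caco2_blindtest.csv", "grammar/automl.bnf", "results", m="mcc", t=5)` would launch three runs with seeds 1 to 3, where "results" is a placeholder output root.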

📚 Publication

This work is associated with a paper accepted at the Evolutionary Computing and Explainable Artificial Intelligence workshop of the GECCO conference.

  • de Sá, A. G. C., Pappa, G. L., Freitas, A. A., & Ascher, D. B. (2025). Interpreting machine learning pipelines produced by evolutionary AutoML for biochemical property prediction. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’25 Companion) (pp. 1–9). ACM. https://doi.org/10.1145/3712255.3734339

📬 Contact

For questions or contributions, please open an issue.
