This repository contains the code and resources for Evolutionary AutoML for Biochemical Property Prediction. It focuses on interpreting the machine learning pipelines that the AutoML method generates, particularly for predicting ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties.
├── datasets/ # Raw and processed datasets for biochemical property prediction
├── grammar/ # The context-free grammar (CFG) describing the AutoML search space in a Backus–Naur form (.bnf) file.
├── automl_biochem.py # The Python code implementing the evolutionary AutoML method (i.e., Bayesian Optimisation Algorithm) for biochemical property prediction
├── requirements.yml # Conda environment specification
└── README.md # Project documentation
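To illustrate the role of the grammar in grammar/, here is a minimal Python sketch of how a BNF grammar can define a search space of pipelines and how one candidate pipeline can be derived from it at random. The toy grammar, its component names, and the helper functions parse_bnf and derive are hypothetical and chosen only for illustration; they are not taken from grammar/automl.bnf or automl_biochem.py.

```python
import random

# Hypothetical toy grammar for illustration only; the real search space
# is defined in grammar/automl.bnf and is not reproduced here.
TOY_BNF = """
<pipeline> ::= <scaler> <classifier> | <classifier>
<scaler> ::= StandardScaler | MinMaxScaler
<classifier> ::= RandomForest | LogisticRegression
"""


def parse_bnf(text):
    """Map each non-terminal to its list of productions (lists of symbols)."""
    rules = {}
    for line in text.strip().splitlines():
        head, body = line.split("::=")
        rules[head.strip()] = [alt.split() for alt in body.split("|")]
    return rules


def derive(rules, symbol, rng):
    """Expand a symbol by picking a random production at each non-terminal."""
    if symbol not in rules:  # terminal symbol: nothing left to expand
        return [symbol]
    production = rng.choice(rules[symbol])
    result = []
    for part in production:
        result.extend(derive(rules, part, rng))
    return result


rules = parse_bnf(TOY_BNF)
rng = random.Random(1)  # fixed seed so the derivation is reproducible
pipeline = derive(rules, "<pipeline>", rng)
print(pipeline)
```

Each derivation yields a list of terminal symbols, which an evolutionary method could treat as one individual (one candidate ML pipeline) in its population.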
To set up the project environment using Conda, follow the steps below:
Make sure you have Anaconda or Miniconda installed.
Then run:

    conda env create -f requirements.yml
    conda activate automl_biochem

To deactivate the environment when you are done, run:

    conda deactivate

After activating the automl_biochem environment, run:

    python automl_biochem.py training_file.csv testing_file.csv grammar output_directory

For example, using:
- "datasets/01_caco2_train.csv" as the training file (training_file.csv)
- "datasets/01_caco2_blindtest.csv" as the testing file (testing_file.csv)
- "grammar/automl.bnf" as the grammar defining the AutoML search space
- "." as the output directory

the command becomes:

    python automl_biochem.py datasets/01_caco2_train.csv datasets/01_caco2_blindtest.csv grammar/automl.bnf .

Other parameters are available, including:
- "-s": define the seed that controls the method's pseudorandom number generation. Default: 1.
- "-m": define the optimisation metric to be used. Options: "auc", "mcc", "recall", "precision", "auprc", "accuracy". Default: "auc".
- "-e": define the experiment name. Default: "Exp_ADMET".
- "-t": define the time budget (in minutes) to run the method. Default: 5 (min).
- "-n": define the number of CPU cores used to run the evolutionary AutoML method. Default: 20.
- "-ta": define the time budget (in minutes) to run each individual (i.e., each ML pipeline). Default: 1 (min).
- "-p": define the population size. Default: 100.
- "-mr": define the mutation rate. Default: 0.15.
- "-cr": define the crossover rate. Default: 0.80.
- "-cmr": define the rate of applying crossover followed by mutation. Default: 0.05.
- "-es": define the elitism size. Default: 1.
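
The options above can be summarised as an argparse sketch. This is not the parser used in automl_biochem.py; it only mirrors the flags and defaults documented in this README. The destination names (seed, metric, time_budget, and so on) and the argument types (for example, whole minutes for "-t" and "-ta") are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults.

    Destination names and types are assumptions, not the script's own.
    """
    parser = argparse.ArgumentParser(
        description="Sketch of the documented command-line interface."
    )
    # Positional arguments, in the documented order.
    parser.add_argument("training_file")
    parser.add_argument("testing_file")
    parser.add_argument("grammar")
    parser.add_argument("output_directory")
    # Optional arguments with the defaults listed in this README.
    parser.add_argument("-s", dest="seed", type=int, default=1)
    parser.add_argument(
        "-m",
        dest="metric",
        default="auc",
        choices=["auc", "mcc", "recall", "precision", "auprc", "accuracy"],
    )
    parser.add_argument("-e", dest="experiment_name", default="Exp_ADMET")
    parser.add_argument("-t", dest="time_budget", type=int, default=5)
    parser.add_argument("-n", dest="n_cores", type=int, default=20)
    parser.add_argument("-ta", dest="time_per_individual", type=int, default=1)
    parser.add_argument("-p", dest="population_size", type=int, default=100)
    parser.add_argument("-mr", dest="mutation_rate", type=float, default=0.15)
    parser.add_argument("-cr", dest="crossover_rate", type=float, default=0.80)
    parser.add_argument("-cmr", dest="crossover_mutation_rate", type=float, default=0.05)
    parser.add_argument("-es", dest="elitism_size", type=int, default=1)
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    [
        "datasets/01_caco2_train.csv",
        "datasets/01_caco2_blindtest.csv",
        "grammar/automl.bnf",
        ".",
        "-t", "10",
        "-mr", "0.2",
    ]
)
print(args.metric, args.time_budget, args.mutation_rate)
```

Options that are not passed keep their documented defaults, so in this example only the time budget and the mutation rate differ from a plain run.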

To run with every option set explicitly (here, to its default value), use:

    python automl_biochem.py datasets/01_caco2_train.csv datasets/01_caco2_blindtest.csv grammar/automl.bnf . -s 1 -m auc -e Exp_ADMET -t 5 -n 20 -ta 1 -p 100 -mr 0.15 -cr 0.80 -cmr 0.05 -es 1

This work is associated with a paper accepted at the Evolutionary Computing and Explainable Artificial Intelligence workshop at the GECCO conference.
- de Sá, A. G. C., Pappa, G. L., Freitas, A. A., & Ascher, D. B. (2025). Interpreting machine learning pipelines produced by evolutionary AutoML for biochemical property prediction. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’25 Companion) (pp. 1–9). ACM. https://doi.org/10.1145/3712255.3734339
For questions or contributions, please open an issue.