Matthieu Haeberle, Puck van Gerwen, Ruben Laplaza, Ksenia R Briling, Jan Weinreich, Friedrich Eisenbrand and Clémence Corminboeuf,
“Integer linear programming for unsupervised training set selection in molecular machine learning”
Mach. Learn.: Sci. Technol. 6 025030 (2025)
The code was run on Python 3.10.4. The following modules are required: gurobipy, numpy, pandas, qml, sklearn, skmatter, plotly, kaleido, json, qmllib.
- lapack and blas (needed to install qmllib)
conda env create -f environment.yml
conda activate ilpselect
(If this does not work, install step by step:)
conda create -n ilpselect python=3.10.4
conda activate ilpselect
conda install numpy=1.26.4 pandas=2.2.3 scikit-learn=1.3.0 skmatter=0.2.0 plotly=5.24.1
conda install blas=1.1 lapack=3.9.0
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$LD_LIBRARY_PATH
pip install qmllib==1.1.5 kaleido==0.2.1 gurobipy==10.0.2
Gurobi is used to solve integer linear programs (FPS and ILP). A license is required. Academic licenses are available, and clusters need special licenses. The following environment variable should point towards the license file in case gurobipy cannot find it on its own.
export GRB_LICENSE_FILE=/ssoft/spack/external/gurobi/gurobi.lic
Note that different results will be obtained with different versions of Gurobi.
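Before launching a long run, it can help to check that the license variable is set to something sensible. The following is a minimal sketch using only the standard library; the helper name find_gurobi_license is hypothetical and is not part of this repository or of gurobipy.

```python
import os
from pathlib import Path


def find_gurobi_license(env=None):
    """Return the license path named by GRB_LICENSE_FILE, or None.

    None means the variable is unset or points to a missing file.
    (Hypothetical helper, not part of gurobipy.)
    """
    env = os.environ if env is None else env
    value = env.get("GRB_LICENSE_FILE")
    if not value:
        return None
    path = Path(value).expanduser()
    return path if path.is_file() else None


if __name__ == "__main__":
    found = find_gurobi_license()
    if found is None:
        print("GRB_LICENSE_FILE is unset or does not point to a file")
    else:
        print(f"Using Gurobi license file: {found}")
```

The check only looks at the file path; it does not verify that the license itself is valid.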
- Create the folders models, rankings, solutions, and learning_curves. The .mps files they contain are large and are therefore listed in the .gitignore.
mkdir models rankings solutions learning_curves
- Verify that the folder qm7 exists, and that it contains the energies described in an energies.csv file (with columns file and energy / Ha).
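The check above can be automated. The sketch below assumes that energies.csv sits directly inside the database folder, is comma-separated, and uses exactly the header names file and energy / Ha; the function name check_database is hypothetical.

```python
import csv
from pathlib import Path

# Column names taken from the description above.
REQUIRED_COLUMNS = ("file", "energy / Ha")


def check_database(folder):
    """Return a list of problems found in `folder`; an empty list means OK."""
    folder = Path(folder)
    if not folder.is_dir():
        return [f"folder {folder} does not exist"]
    csv_path = folder / "energies.csv"
    if not csv_path.is_file():
        return [f"{csv_path} does not exist"]
    problems = []
    with csv_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        for column in REQUIRED_COLUMNS:
            if column not in header:
                problems.append(f"missing column {column!r}")
        if problems:
            return problems
        # Data rows start on line 2, after the header line.
        for line_number, row in enumerate(reader, start=2):
            try:
                float(row["energy / Ha"])
            except (TypeError, ValueError):
                problems.append(f"line {line_number}: energy is not a number")
    return problems
```

Calling check_database("qm7") from the repository root returns an empty list when the folder and the file look as described.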
The main.py file runs everything based on a Python config file. The default config file config.py is used when running main.py with no argument.
To use a custom config file config-foo.py, run the command python3 main.py "config-foo".
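For illustration, selecting a config module by name could look like the sketch below. This is an assumption about the mechanism, not a copy of main.py: the function load_config is hypothetical. It relies on importlib.import_module, which accepts module names containing dashes (a plain import statement would not).

```python
import importlib


def load_config(argv, default="config"):
    """Import the config module named by the first command-line argument.

    `argv` is the argument list without the program name. The name is given
    without the .py suffix, e.g. "config-foo" for config-foo.py. With no
    argument, the module named by `default` is imported instead.
    """
    name = argv[0] if argv else default
    # The module must be importable, e.g. located in the working directory.
    return importlib.import_module(name)
```

A script would typically call load_config(sys.argv[1:]) and then read the parameters as attributes of the returned module.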
The main.py file combines all files in the folder scripts to do the following.
- Read target names from config file. The corresponding {target_name}.xyz files should be present in the folder targets.
- Read the config script for the following parameters: database (qm7 for now), representation (FCHL), algorithm-specific parameters, learning curve parameters, ...
- Generate the representations if not present (with convention {rep}_{target}.npz and {rep}_{database}.npz) and save to folder data. The database must be in the {database} folder. The file scripts/generate.py is responsible for this step.
- Compute fragment subsets using different techniques (indices of database)
  - Subset selection by ILP (named algo in the code):
    - Generate model and write it to models folder OR read it and modify its parameters if possible (simple penalty change for example). The file scripts/algo_model.py is responsible for this step, and is based on the file scripts/fragments.py.
    - Solve model and output subset to folder rankings with prefix algo_, and solution of ILP to folder solutions. The file scripts/algo_subset.py is responsible for this step.
  - Subset selection by SML:
    - Output subset to rankings folder with prefix sml_. The file scripts/sml_subset.py is responsible for this step.
  - Subset selection by CUR:
    - Output subset to rankings with prefix cur_. The file scripts/cur_subset.py is responsible for this step.
  - Subset selection by FPS:
    - Output subset to rankings with prefix fps_. The file scripts/fps_subset.py is responsible for this step.
- Compute the learning curve of each subset and save to folder learning_curves. The file scripts/learning_curves.py is responsible for this step.
- Draw the learning curves and save to folder plots. The file scripts/plots.py is responsible for this step.
- The timings of each step are saved in a dump file in the run folder. An example file dump-template.csv can be found in the folder.
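The "generate if not present" step in the list above follows a common caching pattern, sketched below with numpy. This is not the code of scripts/generate.py: the function names, the generate callable, and the array key "rep" inside the .npz file are all assumptions made for illustration. Only the {rep}_{name}.npz naming convention and the data folder come from the description above.

```python
from pathlib import Path

import numpy as np


def representation_path(rep, name, data_dir="data"):
    """Build the {rep}_{name}.npz path, where name is a target or database."""
    return Path(data_dir) / f"{rep}_{name}.npz"


def load_or_generate(rep, name, generate, data_dir="data"):
    """Load a cached representation, generating and saving it if absent.

    `generate` is a hypothetical zero-argument callable returning an array.
    The key "rep" used inside the .npz file is an assumption as well.
    """
    path = representation_path(rep, name, data_dir)
    if path.is_file():
        # Cache hit: read the stored array and skip the expensive step.
        with np.load(path) as archive:
            return archive["rep"]
    # Cache miss: compute, store for the next run, and return.
    array = np.asarray(generate())
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, rep=array)
    return array
```

With this pattern, a second run with the same representation and database name reads the file from data instead of recomputing it.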
The folder run contains a main.run file which describes how the scripts were run on the JED cluster.
An example output file slurm.out is included.
Add a {target_name}.xyz file to the folder targets.
Add a corresponding entry with the associated energy in the energies.csv file in the same folder.
Create a {database} folder, which contains the energies described in an energies.csv file (with columns file and energy / Ha).
One may add a column atomization energy / Ha.
See the qm9/generate.py script and the cluster/scripts/generate_qm9.py file in the master branch for an example of a qm9 implementation from a master file.
Modify the file scripts/generate.py accordingly. Currently the get_representations function asserts that FCHL is used.
- Add list of class attributes of scripts.fragments.model.
- Implement qm9 database. The only thing currently missing is some pruning of the database because Gurobi uses too much memory (even on clusters).
- Implement other representations than FCHL. Not a priority but should not be too difficult (representation is already a parameter).