Clone the repository using git:
git clone https://github.com/HenryTeahan/CatSperML
Change directory to the CatSperML folder:
cd CatSperML
Install the required dependencies.
You can set up the environment using Conda.
Create the environment from the provided environment.yml file:
conda env create -f environment.yml
conda activate catsper-freerelease
cd src
python align.py # aligns training molecules and screening molecules
python optimization.py # performs optimization
python train.py --load_opt_params # trains on aligned training molecules using optimized parameters
python screen.py # automatically uses the default screening file, assuming the previous steps are complete
To use your own files, pass --sdf_t (YOUR ALIGNED TRAINING SDF FILE) and --sdf_s (YOUR ALIGNED SCREENING SDF FILE).
Then inspect your results in results/screening/processing_hits.ipynb.
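The four quick-start commands can also be chained from a small driver script. The following is a minimal sketch, not part of the repository: `STEPS` and `run_pipeline` are hypothetical names, the script names and the `--load_opt_params` flag are copied from the commands above, and the script is assumed to be run from inside the src directory with the Conda environment active.

```python
import subprocess
import sys

# Script names and flags copied from the quick-start commands above.
# Assumption: this file is executed from inside the src directory.
STEPS = [
    ["align.py"],
    ["optimization.py"],
    ["train.py", "--load_opt_params"],
    ["screen.py"],
]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each step with the current Python interpreter, in order.

    With check=True, subprocess.run raises CalledProcessError on a
    non-zero exit code, so later steps never run on stale output.
    """
    for args in steps:
        cmd = [sys.executable, *args]
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)


# Usage (from inside src/):
# run_pipeline()
```

Passing `runner` as a parameter keeps the function easy to dry-run: supplying a function that only records the commands shows what would be executed without launching anything.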
This guide explains how to run a demonstration of the model. Because the original dataset is currently proprietary, the demonstration uses a toy dataset instead.
The file data/toy_indoles.sdf contains randomly generated substituted indoles, produced in random_indoles.ipynb.
Ten randomly selected indoles are labeled active, and any other molecule in the dataset with a Tanimoto similarity > 0.7 (using ECFP4 fingerprints) to one of them is also labeled active. This gives the model something to work on :-).
The result is an .sdf file, toy_indoles.sdf, in which each molecule stores its activity as a property, accessible through mol.GetProp("IC50").
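The labeling rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code from random_indoles.ipynb: each fingerprint is represented as a set of on-bit indices (in practice these would come from a cheminformatics toolkit's ECFP4 implementation), similarity is taken against the randomly chosen seed molecules only, and the names `tanimoto` and `label_actives` are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def label_actives(fingerprints, n_seeds=10, threshold=0.7, seed=0):
    """Return one boolean per molecule; True means labeled active.

    n_seeds molecules are picked at random as actives, then every other
    molecule whose similarity to any seed exceeds the threshold is
    labeled active as well.
    """
    rng = random.Random(seed)
    seed_idx = rng.sample(range(len(fingerprints)), n_seeds)
    active = [False] * len(fingerprints)
    for i in seed_idx:
        active[i] = True
    for j, fp in enumerate(fingerprints):
        if active[j]:
            continue
        if any(tanimoto(fp, fingerprints[i]) > threshold for i in seed_idx):
            active[j] = True
    return active


# Toy fingerprints: the first two share 4 of 5 bits (similarity 0.8).
toy = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11}, {20}]
print(label_actives(toy, n_seeds=1))
```

Fixing the random seed makes the labeling reproducible, which matters here because the active set defines the training target for every later step.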
As mentioned in the paper, this model relies on the 2D alignment of the input molecules. For this, an alignment protocol has been developed.
- Align molecules
python align.py --sdf_t "name.sdf" --sdf_s "screen.sdf" # uses toy_indoles.sdf and HIT_locator.sdf by default
This creates a new file for each input: name_aligned.sdf and screen_aligned.sdf.
In these files, the molecules are aligned using the alignment protocol described in the paper (the reference molecules are in /data/processed/references.sdf). The reference molecules were not purpose-built for this simulated data, which may explain some discrepancies in the final results.
- Optimize hyperparameters
python parameter_optimization.py --sdf "name_aligned.sdf" # uses toy_indoles_aligned.sdf by default
You can modify the hyperparameter optimization ranges and the scale of the optimization (it uses a TPE sampler).
For more information, run:
python parameter_optimization.py -h
- Train the model
python train.py --sdf "path_to_toy_indoles_file_aligned.sdf" --load_opt_params
Note: You must include --load_opt_params if you want to use the optimized hyperparameters from the hyperparameter optimization step. The default .sdf file is toy_indoles_aligned.sdf.
If you want to see the training molecules with their explainable 2D maps and decision points, as well as the decision tree and centroid map, then add --save_img to the command. This saves the images in the results folder.
- Screen the library
python screen.py --sdf_t "path_to_toy_indoles_file_aligned.sdf" --sdf_s "path_to_screening_library"
Note: This automatically uses the same hyperparameters as used during training. Use the printout X_train CHECK to verify this; it acts as a checksum for descriptor generation, which must be consistent across training and screening.
Default paths: --sdf_t points to toy_indoles_aligned.sdf; --sdf_s points to HIT_locator_aligned.sdf (a subset of Enamine's HIT Locator library).
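To illustrate the idea of a descriptor checksum, here is a minimal sketch. How screen.py actually computes the X_train CHECK value is not described in this guide, so everything below is an assumption: `descriptor_checksum` is a hypothetical helper that hashes a descriptor matrix (a list of rows of floats) after rounding, so that the same descriptors produce the same short digest in both training and screening.

```python
import hashlib


def descriptor_checksum(rows, decimals=6):
    """Hash a descriptor matrix into a short hex digest.

    Values are formatted to a fixed number of decimals first, so tiny
    floating-point noise below that precision does not change the digest.
    """
    digest = hashlib.sha256()
    for row in rows:
        line = ",".join(f"{value:.{decimals}f}" for value in row)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # row separator keeps the matrix shape significant
    return digest.hexdigest()[:16]


# Hypothetical descriptor matrices for the same training molecules,
# as computed in the training run and again in the screening run.
x_train_at_training = [[0.12, 1.5, 3.0], [0.40, 2.25, 1.0]]
x_train_at_screening = [[0.12, 1.5, 3.0], [0.40, 2.25, 1.0]]

match = descriptor_checksum(x_train_at_training) == descriptor_checksum(x_train_at_screening)
print("descriptors consistent:", match)
```

If the two digests differ, the descriptors were generated with different settings, and the screening predictions should not be trusted until the mismatch is resolved.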
- Success! Once the screening completes, view and analyze your hits using: processing_hits.ipynb