This is a project to train Graph Neural Networks (GNNs) for Λ hyperon identification in CLAS12 data.
You can either use a container image (via Docker or Apptainer/Singularity) or install manually (not recommended).
Clone and build the project (this may take a while).
```bash
git clone https://github.com/mfmceneaney/lambdaml.git
cd lambdaml
docker build -f /path/to/lambdaml/docker/Dockerfile.cpu -t lambdaml-project /path/to/lambdaml #Note: There is also a cuda Dockerfile.
```

After successfully building, run the project with:
```bash
docker run --rm -it lambdaml-project
```

The `--rm` option tells docker to remove the container and its data once it is shut down.
To retain the container data though, you can mount a local directory (src) to a directory (dst)
inside the container with the following:
```bash
docker run --rm -it -v <src>:<dst> lambdaml-project
```

To link all available GPUs on your node, e.g., in a SLURM job, use the `--gpus all` option:
```bash
docker run --rm -it --gpus all lambdaml-project
```

If you really only need to run a single python script in the container and then exit, for example for a SLURM job, you can do that too:
```bash
docker run --rm lambdaml-project python3 </path/to/my/python/script.py>
```
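Such a single-script run can also be submitted as a SLURM batch job. The sketch below is only an illustration: the #SBATCH directives are placeholders to adapt to your cluster, and on clusters where Docker is unavailable you would substitute the equivalent singularity command shown further down.

```bash
#!/bin/bash
# Illustrative SLURM batch script: runs one script in the container and exits.
# All #SBATCH values are placeholders; adjust them for your cluster.
#SBATCH --job-name=lambdaml
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# --gpus all exposes the GPUs allocated to the job inside the container.
docker run --rm --gpus all lambdaml-project python3 </path/to/my/python/script.py>
```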
Once you start the container you should have the following environment variables:

- `LAMBDAML_HOME`
- `LAMBDAML_CONT_HOME`
- `LAMBDAML_REGISTRY`
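Later commands in this guide use these variables as the bind-mount source and target and as the model registry path. To check what they resolve to inside the container, you can for example run:

```bash
# Print the LAMBDAML_* variables defined in the container environment.
env | grep LAMBDAML
```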
If you have input data directories and output data directories for your preprocessing or training pipelines, you can mount several directories.
```bash
docker run --rm -it -v /path/to/lambdaml:/usr/src/lambdaml -v /path/for/input/files:/data -v /path/for/out/files:/out lambdaml-project-cu129
```

For use with CUDA, see the bit about installing PyTorch-Geometric on an HPC cluster below as well.
Accessing the different volumes of an HPC cluster from Docker is difficult, so use Apptainer or Singularity there instead.
Download the PyTorch-Geometric packages and copy them to /path/to/lambdaml/pyg_packages. Then build the container with:
```bash
singularity build lambdaml-cu129.sif singularity/lambdaml.def.cu129
```

Then run the container, binding to some volumes on your cluster, with:
```bash
singularity exec -B /volatile,/path/to/lambdaml:/usr/src/lambdaml lambdaml-cu129.sif bash
```

Or, if you just need to run a python script within the container:
```bash
singularity exec -B /volatile,/path/to/lambdaml:/usr/src/lambdaml lambdaml-cu129.sif python3 /usr/src/lambdaml/pyscripts/<SCRIPT>.py --help
```

🔴 Avoiding OpenBLAS Errors
When running the t-SNE latent space visualization on HPC, you may get the following error because your node can have many more cores available than the precompiled NUM_THREADS limit of the OpenBLAS libraries used in numpy and torch:
```
OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
To avoid this warning, please rebuild your copy of OpenBLAS with a larger NUM_THREADS setting or set the environment variable OPENBLAS_NUM_THREADS to 64 or lower
BLAS : Bad memory unallocation! : 640 0x7ef44e000000
BLAS : Bad memory unallocation! : 640 0x7ef450000000
BLAS : Bad memory unallocation! : 640 0x7ef3d2000000
BLAS : Bad memory unallocation! : 640 0x7ef3c0000000
Segmentation fault (core dumped)
```

To prevent this, you can either `export OPENBLAS_NUM_THREADS=64` or restrict the cores visible to the singularity image with `taskset -c 0-31`:
```bash
singularity exec -B /volatile,/path/to/lambdaml:/usr/src/lambdaml lambdaml-cu129.sif taskset -c 0-31 python3 /usr/src/lambdaml/pyscripts/<SCRIPT>.py --help
```
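The environment variable route can also be applied at launch time via Singularity/Apptainer's environment prefix. A minimal sketch (adjust the thread count to your node):

```bash
# SINGULARITYENV_* (APPTAINERENV_* for Apptainer) variables are injected into the
# container environment, capping the OpenBLAS thread count seen by numpy/torch.
export SINGULARITYENV_OPENBLAS_NUM_THREADS=64
singularity exec -B /volatile,/path/to/lambdaml:/usr/src/lambdaml lambdaml-cu129.sif \
    python3 /usr/src/lambdaml/pyscripts/<SCRIPT>.py --help
```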
To install manually instead (not recommended), begin by cloning the repository:

```bash
git clone https://github.com/mfmceneaney/lambdaml.git
```

Create and activate a virtual Python environment for your project:
```bash
python3 -m venv venv
source venv/bin/activate
```

Install the python modules. These are listed in pyproject.toml and are all available with pip:
```bash
pip install -e .
```

Install the extra PyTorch-Geometric extensions needed for some of the GNN models:
```bash
pip install -r requirements-pyg-pt28-cpu.txt # Adjust pytorch and cuda version as needed.
```

❌ Installing PyTorch-Geometric on an HPC Cluster
Follow the installation instructions on the PyTorch Geometric Documentation.
If you are on Ifarm or another HPC cluster with a firewall, you will probably get an error like this:
```
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /wheels/repo.html
ERROR: Could not find a version that satisfies the requirement pyg-lib (from versions: none)
ERROR: No matching distribution found for pyg-lib
```
In this case, try downloading whatever distributions you need locally from the wheel repository linked on the installation page for pip installs. The link will look like https://data.pyg.org/whl/torch-${TORCH_VERSION}+${CUDA_VERSION}.html.
Then transfer the downloaded distribution (e.g. with scp or rsync) to ifarm.
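A sketch of that download-and-transfer step, run on a machine with outbound internet access (the torch/CUDA tag in the wheel index URL and the destination host/path are placeholders to adjust):

```bash
# Download the wheels (no dependency resolution) from the PyG wheel index into a local folder.
# If the download machine's platform or Python version differ from the cluster's, also pass
# pip's --platform/--python-version/--only-binary=:all: options.
pip download pyg-lib torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric \
    --no-deps -d pyg_packages \
    -f https://data.pyg.org/whl/torch-2.8.0+cu129.html

# Copy them over to the cluster.
rsync -av pyg_packages/ ifarm:/path/to/packages/
```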
In your virtual environment you can now install from the local path:
```bash
pip install pyg-lib torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f /path/to/distribution/you/just/uploaded
```
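As an optional sanity check, you can confirm that the extensions import cleanly against your torch build, e.g.:

```bash
# Should import without errors and print matching torch / PyTorch-Geometric versions.
python3 -c "import torch, torch_geometric, torch_scatter, torch_sparse; print(torch.__version__, torch_geometric.__version__)"
```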
For convenience, put your packages in some directory /path/to/packages and use the CUDA Dockerfile to install from this path. You will need to mount the directory to /pyg_packages for the build to succeed:
```bash
docker build -v /path/to/packages:/pyg_packages -f /path/to/lambdaml/Dockerfile.cu129 -t lambdaml-project /path/to/lambdaml
```

Add the following to your startup script:
```bash
cd /path/to/lambdaml
source $PWD/env/env.sh # csh version also available
cd
```

Run the project pipelines for dataset creation, hyperparameter optimization, model selection, and model deployment in pyscripts/. From within the container, you can run:
```bash
python3 pyscripts/<some_script.py> --help
```

However, to run on actual REC::Kinematics banks produced with CLAS12-Trains, assume your output hipo files are in folders designated by the environment variables $C12TRAINS_OUTPUT_MC for MC simulation and $C12TRAINS_OUTPUT_DT for data.
Then you can run the python scripts for dataset creation from outside the container. You will want to mount $LAMBDAML_HOME and your output directory, e.g., `export VOLATILE_DIR=/volatile/clas12/users/$USER/`. For the MC simulation dataset, run:
```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
python3 $LAMBDAML_CONT_HOME/pyscripts/run_pipeline_preprocessing.py \
--file_list $C12TRAINS_OUTPUT_MC/*.hipo \
--banks REC::Particle REC::Kinematics MC::Lund \
--step 100000 \
--out_dataset_path $VOLATILE_DIR/src_dataset \
--lazy_ds_batch_size 100000 \
--num_workers 0 \
--log_level info
```

And similarly, for the real data (unlabelled) dataset, run:
```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
python3 $LAMBDAML_CONT_HOME/pyscripts/run_pipeline_preprocessing.py \
--file_list $C12TRAINS_OUTPUT_DT/*.hipo \
--banks REC::Particle REC::Kinematics \
--step 100000 \
--out_dataset_path $VOLATILE_DIR/tgt_dataset \
--lazy_ds_batch_size 100000 \
--num_workers 0 \
--log_level info
```

You can then run the TIToK training script like so:
```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
taskset -c 0-31 \
python3 $LAMBDAML_CONT_HOME/pyscripts/run_pipeline_titok.py \
--src_root $VOLATILE_DIR/src_dataset \
--tgt_root $VOLATILE_DIR/tgt_dataset \
--out_dir $VOLATILE_DIR/experiments \
--use_lazy_dataset \
--log_level info \
--batch_size 32 \
--nepochs 10
```

And you can run a hyperparameter optimization study like so:
```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
taskset -c 0-31 \
python3 $LAMBDAML_CONT_HOME/pyscripts/run_optimize_titok.py \
--src_root $VOLATILE_DIR/src_dataset \
--tgt_root $VOLATILE_DIR/tgt_dataset \
--out_dir $VOLATILE_DIR/experiments \
--use_lazy_dataset \
--log_level info \
--batch_size 32 \
--nepochs 10 \
--opt__storage_url "sqlite:///$VOLATILE_DIR/experiments/optuna_study.db" \
--opt__study_name 'study' \
--opt__suggestion_rules 'lr=float:0.0001:0.01:log' \
'num_layers_gnn=int:3:8' \
'alpha_fn=cat:0.1,0.01,sigmoid_growth,sigmoid_decay,linear_growth,linear_decay'
```

Of course, there are similar pipeline and optimization scripts for the domain-adversarial and contrastive loss methods as well.
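To see what is available, you can list the scripts directory, for example from within the container:

```bash
# The pipeline, optimization, and model-selection entry points all live in pyscripts/.
ls $LAMBDAML_CONT_HOME/pyscripts/
```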
After running a hyperparameter optimization with one of the scripts in pyscripts/, you will need to select the best model, or several of the best models, to serve. Copy these into the $LAMBDAML_REGISTRY directory using the pyscripts/select_best_models.py script. For example:
```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
taskset -c 0-31 \
python3 $LAMBDAML_CONT_HOME/pyscripts/select_best_models.py \
--n_best_trials 10 \
--optuna_storage_url "sqlite:///$VOLATILE_DIR/experiments/optuna_study.db" \
--optuna_study_name 'study' \
--registry $LAMBDAML_REGISTRY
```
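As a quick check, the selected models should now appear under the registry directory:

```bash
# List the registry contents (run inside the container, where $LAMBDAML_REGISTRY is defined).
ls $LAMBDAML_REGISTRY
```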
After these are copied, you can deploy a model of your choice from a given study for use by other processes. The model is run as an app with Flask.

```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
taskset -c 0-31 \
python3 $LAMBDAML_CONT_HOME/pyscripts/select_best_models.py \
--n_best_trials 5 \
--optuna_storage_url "sqlite:///$VOLATILE_DIR/experiments/optuna_study.db" \
--optuna_study_name 'study' \
--registry $LAMBDAML_REGISTRY
```

This will simply list the available trial IDs and code names for a given study. Once you have chosen a trial ID or the corresponding code name, you can specify the model:
```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
taskset -c 0-31 \
python3 $LAMBDAML_CONT_HOME/app/app.py \
--optuna_study_name 'study' \
--registry $LAMBDAML_REGISTRY \
--trial_id 'best-trial' \
--flask_host "0.0.0.0" \
--flask_port 5000 # NOTE: This will make your service visible on http://localhost:5000
```
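Other processes can then reach the model over HTTP. The route and payload in this sketch are hypothetical; check app.py (or its --help) for the actual request interface it serves:

```bash
# Hypothetical request; replace /predict and the JSON body with whatever app.py actually expects.
curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"example": "payload"}'
```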
Note, however, that this is just a development server. To run a production server, you can run this same script in production mode, which will internally call gunicorn:

```bash
singularity exec \
-B $VOLATILE_DIR,$LAMBDAML_HOME:$LAMBDAML_CONT_HOME lambdaml-cu129.sif \
taskset -c 0-31 \
python3 $LAMBDAML_CONT_HOME/app/app.py \
--optuna_study_name 'study' \
--registry $LAMBDAML_REGISTRY \
--trial_id 'best-trial' \
--flask_host "0.0.0.0" \
--flask_port 5000 \
--mode prod
```

Contact: matthew.mceneaney@duke.edu