
Running Experiments on SLURM clusters

Nathaniel Evans edited this page Jul 31, 2023 · 1 revision

To run an experiment, use the corresponding expX.sh script (for example, exp1.sh). Experiment parameters are specified within the script.

These scripts must be run on a SLURM system. You may need to modify the SBATCH parameters in expX.sh and batched_XXXX.sh to fit each cluster's configuration.
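As a rough illustration of the SBATCH parameters that typically need adjusting, the header of a job script might look like the sketch below. Every value here (job name, partition name, time limit, memory, GPU request) is an assumption chosen for illustration and is not taken from the actual expX.sh or batched_XXXX.sh files; check the limits and partition names of your own cluster.

```shell
#!/bin/bash
# Hypothetical SBATCH header; all values below are assumptions, not the repository's settings.
#SBATCH --job-name=exp1          # name shown in the queue
#SBATCH --partition=gpu          # partition names differ between clusters
#SBATCH --time=24:00:00          # wall-clock limit (HH:MM:SS)
#SBATCH --cpus-per-task=4        # CPU cores per task
#SBATCH --mem=32G                # memory per node
#SBATCH --gres=gpu:1             # request one GPU, if the cluster exposes GPUs this way
#SBATCH --output=%x_%j.out       # log file: <job name>_<job id>.out

# SLURM sets SLURM_JOB_ID inside a job; fall back to a marker when run outside SLURM.
JOB_ID="${SLURM_JOB_ID:-not-in-slurm}"
echo "job id: ${JOB_ID}"
```

Outside of sbatch, the `#SBATCH` lines are ordinary comments, so the script can still be run directly for a quick check.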

An experiment can be run by:

$ sbatch exp1.sh 

Each experiment first runs make_data.py to produce the processed data. It then calls the respective batched_XXXX.sh files, which perform a limited hyper-parameter grid search. Each of these calls submits a separate job, so the jobs can run in parallel.
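The submission pattern described above can be sketched as a nested loop that submits one job per grid point. This is a minimal sketch, not the contents of the actual scripts: the parameter names (LR, CHANNELS), their values, and the file name batched_example.sh are all hypothetical. It defaults to a dry run that only prints the commands.

```shell
#!/bin/bash
# Hypothetical grid-search submission sketch; names and values are assumptions.
# Dry run by default: prints each sbatch command. Set SUBMIT=sbatch to submit for real.
SUBMIT="${SUBMIT:-echo sbatch}"

n_submitted=0
for lr in 0.01 0.001; do
  for channels in 32 64; do
    # --export passes the grid point to the job as environment variables.
    $SUBMIT --export=ALL,LR="$lr",CHANNELS="$channels" batched_example.sh
    n_submitted=$((n_submitted + 1))
  done
done

echo "submitted ${n_submitted} jobs"
```

Because each sbatch call returns as soon as the job is queued, the scheduler is free to run the grid points concurrently, subject to available resources.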

To track training progress or results, you can use TensorBoard, which requires port forwarding:

$ ssh -L 6006:localhost:6006 evansna@acc.ohsu.edu
... credentials ... 
$ ssh -L 6006:localhost:6006 exahead1 
... credentials ... 
$ conda activate gsnn 
$ tensorboard --logdir=path/to/output/dir

Then open http://localhost:6006/ in a browser on your local machine.
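If your OpenSSH client supports the `-J` (ProxyJump) option, the two ssh hops can possibly be combined into a single command. The sketch below only builds and prints that command; it assumes the cluster permits jump-host connections and that the same user name works on both hosts, neither of which is confirmed here. The host names are the ones used in the steps above.

```shell
#!/bin/bash
# Sketch: build a single-command tunnel using OpenSSH ProxyJump (-J).
# Assumes the login host allows jumping to the head node; adjust names as needed.
PORT=6006
JUMP_HOST="evansna@acc.ohsu.edu"
HEAD_NODE="exahead1"

# Forward local PORT to PORT on the head node, hopping through the login host.
TUNNEL_CMD="ssh -L ${PORT}:localhost:${PORT} -J ${JUMP_HOST} ${HEAD_NODE}"
echo "$TUNNEL_CMD"
```

Run the printed command in a terminal, then activate the environment and start tensorboard on the head node as before.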
