IMPORTANT NOTE: All commands in this README assume that you are in the root of the repository (and need to be run from there)!
From the repository root, please run the following in the command line:
python run.py trainer.gpus=0 model=unet_convnext logger=none callbacks=default
This will train a UNet ConvNext on the CPU using some default callbacks and hyperparameters, but no logging.
It is recommended to define all the parameters of your experiment in a YAML file inside the configs/experiment folder.
For example, the config file oisst_full_unet_convnext defines the experiment to train a UNet ConvNext on the full OISSTv2 dataset with some particular (hyper-)parameters. You can then easily run such an experiment with the following command:
python run.py experiment=oisst_full_unet_convnext  # replace 'oisst_full_unet_convnext' with the name of your config file
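For orientation, an experiment config in configs/experiment/ might look roughly like the following. This is a hypothetical sketch: the group names (e.g. the datamodule override) and the concrete values are illustrative, not the actual contents of oisst_full_unet_convnext.

```yaml
# configs/experiment/oisst_full_unet_convnext.yaml  (illustrative sketch)
# @package _global_
defaults:
  - override /model: unet_convnext

# Any value set below overrides the corresponding default:
trainer:
  max_epochs: 50
datamodule:
  batch_size: 32
seed: 11
```

The `# @package _global_` directive is standard Hydra and makes the keys in this file override the top-level config rather than being nested under `experiment`.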
If you want to resume training from a previous run, you can use the following command:
python run.py logger.wandb.id=<run_id>
where <run_id> is the wandb ID of the run whose training you want to resume.
You can add any extra arguments, e.g. datamodule.num_workers=8, to change the values from the previous run.
Note that if you want to run for more epochs, you need to add trainer.max_epochs=<new_max_epochs>.
- To run on CPU use trainer.gpus=0; to use a single GPU, use trainer.gpus=1, etc.
- To override the data directory, use datamodule.data_dir=<data-dir>, or see the data README for more options.
- A random seed for reproducibility can be set with seed=<seed> (by default it is 11).
By default:
- the checkpoints (i.e. model weights) are saved in results/checkpoints/,
- any logs are saved in results/logs/.
To change the results/ parent directory of both subdirectories above, simply use the flag work_dir=YOUR-OUT-DIR.
To only change the name of the checkpoints directory, you may use the flag ckpt_dir=YOUR-CHECKPOINTS-DIR.
To only change the name of the logs directory, you may use the flag log_dir=YOUR-LOGS-DIR.
To run the code in debug mode, use the flag mode=debug.
To debug the OISSTv2 data/models, use mode=debug_oisst,
and have the subregion boxes 0, 1, 133 downloaded (since the debug run only uses these boxes for fast data loading).
Important data-specific parameters can all be found in the configs/datamodule/base_data_config file. In particular:
- datamodule.data_dir: the directory where the data must be stored (see the data README for more details).
- datamodule.batch_size: the batch size to use for training.
- datamodule.num_workers: the number of workers to use for loading the data.
You can override any of these parameters by adding datamodule.<parameter>=<value> to the command line.
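As a rough picture, the entries in the base data config might look like this (a hypothetical sketch with placeholder values; check configs/datamodule/base_data_config for the actual defaults):

```yaml
# configs/datamodule/base_data_config.yaml  (illustrative sketch)
data_dir: data/      # where the data must be stored
batch_size: 32       # training batch size
num_workers: 4       # dataloader worker processes
```

Any of these keys can be overridden from the command line, e.g. datamodule.batch_size=64.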
To train a pre-defined model do model=<model_name>, e.g. model=cnn, model=mlp, etc.,
where configs/model/<model_name>.yaml must be the configuration file for the respective model.
You can also override any model hyperparameter by adding model.<hyperparameter>=<value> to the command line.
E.g.:
- to change the number of layers and dimensions in an MLP you would use model=mlp 'model.hidden_dims=[128, 128, 128]' (note that the quotes are needed when the value is a list).
- to change the MLP ratio in the AFNO model you would use model=transformer 'model.mlp_ratio=0.5'.
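A model config file such as the one referenced above might look roughly like this. This is a hypothetical sketch: the exact fields and their defaults depend on the repository's actual configs/model/ files.

```yaml
# configs/model/mlp.yaml  (illustrative sketch)
defaults:
  - _base_model_config   # inherit shared defaults such as scheduler and monitor

hidden_dims: [128, 128, 128]   # overridable via model.hidden_dims=...
```

Any key defined here becomes overridable on the command line as model.<hyperparameter>=<value>.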
Important model-specific parameters can all be found in the configs/model/_base_model_config file. In particular:
- model.scheduler: the scheduler to use for the learning rate. Default: exponential decay with gamma=0.98.
- model.monitor: the logged metric to track for early stopping, model checkpointing, and LR scheduling. Default: val/mse.
Hyperparameter optimization is supported via the Optuna Sweeper. Please read the instructions for setting it up and running experiments with Optuna in the Optuna configs README.
Requirements & Logging in
The following requires you to have a wandb (team) account, and you need to log in with ``wandb login`` before you can use it. Alternatively, you can simply export the environment variable ``WANDB_API_KEY`` with your wandb API key, and the [run.py](../run.py) script will automatically log you in.
- To log metrics to wandb, use logger=wandb.
- To additionally use some nice wandb-specific callbacks (e.g. saving the best trained model to the wandb cloud), use callbacks=wandb.
hydra.errors.InstantiationException
The hydra.errors.InstantiationException itself is not very informative,
so you need to look at the preceding exception(s) (i.e. scroll up) to see what went wrong.
Overriding nested Hydra config groups
Nested config groups need to be overridden with a slash, not with a dot, since a dotted override would be interpreted as setting a (string) parameter value instead.
For example, if you want to change the filter in the AFNO transformer:
python run.py model=afno model/mixer=self_attention
And if you want to change the optimizer, you should run:
python run.py model=graphnet optimizer@model.optimizer=SGD
Local configurations
You can easily use a local config file (that, e.g., overrides data dirs, the working dir, etc.) by putting such a YAML config in the configs/local/ subdirectory. Hydra searches for and, by default, uses the file configs/local/default.yaml, if it exists.
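Such a local config could look like the following (an example with placeholder paths; adapt the keys to whatever you want to override on your machine):

```yaml
# configs/local/default.yaml  (example; paths are placeholders)
# @package _global_
work_dir: /scratch/<your-user>/results
datamodule:
  data_dir: /scratch/<your-user>/data
```

Since this file lives only on your machine, it is a convenient place for machine-specific paths without touching the shared configs.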
Wandb
If you use Wandb, make sure to select the "Group first prefix" option in the panel/workspace settings of the web app inside the project (in the top right corner). This will make it easier to browse through the logged metrics.