Enhancement: Make it possible to rollback to a specific checkpoint and resume with modified parameters #41
Description
At the moment, distrib-rl has no mechanism for resuming training from a checkpoint with config modifications.
It's often the case that practitioners wish to modify environment details and/or hyperparameters during learning. Sometimes these details are understood at the outset and scheduled in advance, and sometimes they're discovered/decided only after training has begun.
In order to not sacrifice reproducibility, it's critical that it be possible to produce a single configuration that, if run unattended from start to finish, would reproduce the full set of checkpoints, including the ones prior to a mid-stream config change.
As a result, I'd propose a change to the config format that allows for defining arbitrary config "scheduling," with any mid-stream config modifications being required to reference the original configuration over the range for which that config was valid.
For example, suppose you are training a model with a config we'll call config-v1. To resume training from checkpoint 5, which was produced by config-v1, a hypothetical config-v2 must reference (in some TBD way) config-v1 as having been valid for checkpoints 0-5, inclusive. This makes it relatively trivial to compose a series of modified configs into a single snapshotted config at each checkpoint, with that snapshotted config being executable as-is to reproduce the checkpoint that contains it.
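One hypothetical shape this could take (all field names below are placeholders for illustration, not an existing distrib-rl format):

```json
{
  "name": "config-v2",
  "schedule": [
    {
      "base": "config-v1",
      "valid_checkpoints": [0, 5]
    }
  ],
  "overrides": {
    "policy_lr": 1e-4
  }
}
```

Run from scratch, such a config would replay config-v1 for checkpoints 0-5 and then apply the overrides from checkpoint 6 onward, so the full checkpoint history remains reproducible from a single file.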
Having that composite config stored as part of each snapshot is a requirement for this feature.
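To illustrate the composition step, here is a minimal sketch (the field names and structure are assumptions for illustration, not the actual distrib-rl config layout) of folding a chain of configs into the single self-contained snapshot that would be stored with each checkpoint:

```python
# Hypothetical sketch: compose an ordered chain of configs into one
# snapshot that can reproduce a given checkpoint when run from scratch.
# The keys "valid_through" and "params" are illustrative placeholders.

def compose_config(config_chain, checkpoint):
    """Merge every config whose validity range covers `checkpoint`.

    `config_chain` is ordered oldest-first; each entry is a dict with
    "valid_through" (last checkpoint it applies to, inclusive) and
    "params" (the hyperparameters it sets).
    """
    snapshot = {"schedule": [], "params": {}}
    for cfg in config_chain:
        # Record the full history so earlier checkpoints stay reproducible.
        snapshot["schedule"].append(
            {"valid_through": cfg["valid_through"], "params": dict(cfg["params"])}
        )
        # Later configs override earlier ones for the currently-active params.
        snapshot["params"].update(cfg["params"])
        if cfg["valid_through"] >= checkpoint:
            break
    return snapshot
```

The key property is that the snapshot keeps the whole schedule, not just the live parameters, so it can be re-run unattended to regenerate every checkpoint up to and including the one it was stored with.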