To properly resume a terminated run, we need to capture all of the state required to "revive" it. For much of this we can probably rely on PyTorch's built-in checkpointing facilities (`torch.save`/`torch.load` of `state_dict`s), but we'll need to augment the checkpoint with any additional state PyTorch is not aware of, such as the learning rate controller's state.
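As a rough sketch of what that augmented checkpoint could look like (the `LRController` class and its field names are hypothetical stand-ins for our actual learning rate controller, and `model_state`/`optimizer_state` stand in for real `state_dict()` results), the extra state could ride alongside the PyTorch-managed state in a single checkpoint dict:

```python
# Sketch: bundle PyTorch-managed state with our own extra state in one
# checkpoint dict. `LRController` is a hypothetical stand-in for the real
# learning rate controller; in practice `model_state` and `optimizer_state`
# would come from model.state_dict() / optimizer.state_dict(), and the dict
# would be written with torch.save(checkpoint, path).

class LRController:
    """Hypothetical LR controller whose state PyTorch knows nothing about."""
    def __init__(self):
        self.lr = 0.1
        self.plateau_steps = 0

    def state_dict(self):
        return {"lr": self.lr, "plateau_steps": self.plateau_steps}

    def load_state_dict(self, state):
        self.lr = state["lr"]
        self.plateau_steps = state["plateau_steps"]

def build_checkpoint(model_state, optimizer_state, lr_controller, step):
    return {
        "model": model_state,          # model.state_dict() in practice
        "optimizer": optimizer_state,  # optimizer.state_dict() in practice
        "lr_controller": lr_controller.state_dict(),
        "step": step,
    }

def resume(checkpoint, lr_controller):
    # Restore the non-PyTorch state that torch.load alone can't revive.
    lr_controller.load_state_dict(checkpoint["lr_controller"])
    return checkpoint["step"]
```

The point is that resuming restores the LR controller from the same checkpoint that holds the model and optimizer state, so there is no separate file to keep in sync.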
Note: this almost certainly means moving away from the current .npy checkpoint format (assuming #42 doesn't change it first). While we're at it, we should make sure our checkpoints include a format version specifier, so that as the checkpoint data evolves we can either maintain backward compatibility or fail cleanly on checkpoint versions we no longer understand.
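A minimal sketch of the version check on load (the `format_version` field name, the supported-version set, and the v1-to-v2 migration are all illustrative assumptions, not a proposed final format):

```python
# Sketch of a format-version check at checkpoint load time. The field name
# "format_version", the supported versions, and the migration step are
# illustrative assumptions about how versioning could work.

CHECKPOINT_FORMAT_VERSION = 2
SUPPORTED_VERSIONS = {1, 2}

def load_checkpoint(checkpoint):
    version = checkpoint.get("format_version")
    if version not in SUPPORTED_VERSIONS:
        # Error out on versions we no longer (or don't yet) understand.
        raise ValueError(
            f"unsupported checkpoint format version {version!r}; "
            f"this build understands {sorted(SUPPORTED_VERSIONS)}"
        )
    if version == 1:
        # Backward compatibility: pretend v1 checkpoints predate the
        # lr_controller field, so fill in a default and upgrade in place.
        checkpoint = {**checkpoint,
                      "lr_controller": None,
                      "format_version": 2}
    return checkpoint
```

With this shape, old-but-supported versions get migrated on load, and anything outside `SUPPORTED_VERSIONS` produces an explicit error instead of a confusing failure deeper in the run.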