To properly resume a terminated run, we need to capture all of the state required to "revive" it. For much of this we can probably rely on PyTorch's built-in checkpointing facilities (`torch.save`/`torch.load` of `state_dict`s), but we'll need to augment the checkpoint with any additional state PyTorch is not aware of, such as the learning rate controller's state.
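As a rough sketch of what that augmented checkpoint could look like (the `LRController` class and its field names are hypothetical stand-ins for our actual learning rate controller, and `model_state`/`optimizer_state` stand in for real `state_dict()` results), the extra state could ride alongside the PyTorch-managed state in a single checkpoint dict:

```python
# Sketch: bundle PyTorch-managed state with our own extra state in one
# checkpoint dict. `LRController` is a hypothetical stand-in for the real
# learning rate controller; in practice `model_state` and `optimizer_state`
# would come from model.state_dict() / optimizer.state_dict(), and the dict
# would be written with torch.save(checkpoint, path).

class LRController:
    """Hypothetical LR controller whose state PyTorch knows nothing about."""
    def __init__(self):
        self.lr = 0.1
        self.plateau_steps = 0

    def state_dict(self):
        return {"lr": self.lr, "plateau_steps": self.plateau_steps}

    def load_state_dict(self, state):
        self.lr = state["lr"]
        self.plateau_steps = state["plateau_steps"]

def build_checkpoint(model_state, optimizer_state, lr_controller, step):
    return {
        "model": model_state,          # model.state_dict() in practice
        "optimizer": optimizer_state,  # optimizer.state_dict() in practice
        "lr_controller": lr_controller.state_dict(),
        "step": step,
    }

def resume(checkpoint, lr_controller):
    # Restore the non-PyTorch state that torch.load alone can't revive.
    lr_controller.load_state_dict(checkpoint["lr_controller"])
    return checkpoint["step"]
```

The point is that resuming restores the LR controller from the same checkpoint that holds the model and optimizer state, so there is no separate file to keep in sync.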
Note: this almost certainly means moving away from the current .npy checkpoint format (assuming #42 doesn't change it first). While we're at it, we should make sure our checkpoints include a format version specifier, so that as the checkpoint data evolves we can either maintain backward compatibility or fail cleanly on checkpoint versions we no longer understand.
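A minimal sketch of the version check on load (the `format_version` field name, the supported-version set, and the v1-to-v2 migration are all illustrative assumptions, not a proposed final format):

```python
# Sketch of a format-version check at checkpoint load time. The field name
# "format_version", the supported versions, and the migration step are
# illustrative assumptions about how versioning could work.

CHECKPOINT_FORMAT_VERSION = 2
SUPPORTED_VERSIONS = {1, 2}

def load_checkpoint(checkpoint):
    version = checkpoint.get("format_version")
    if version not in SUPPORTED_VERSIONS:
        # Error out on versions we no longer (or don't yet) understand.
        raise ValueError(
            f"unsupported checkpoint format version {version!r}; "
            f"this build understands {sorted(SUPPORTED_VERSIONS)}"
        )
    if version == 1:
        # Backward compatibility: pretend v1 checkpoints predate the
        # lr_controller field, so fill in a default and upgrade in place.
        checkpoint = {**checkpoint,
                      "lr_controller": None,
                      "format_version": 2}
    return checkpoint
```

With this shape, old-but-supported versions get migrated on load, and anything outside `SUPPORTED_VERSIONS` produces an explicit error instead of a confusing failure deeper in the run.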