Enhancement: Capture the full learner state details in the checkpoint artifact #40

@some-rando-rl

Description

To properly resume a terminated run, we need to capture all of the state necessary to "revive" it. For much of this we can probably rely on PyTorch's built-in checkpointing features, but we'll need to augment the checkpoint with any additional state that PyTorch is not aware of, such as the learning rate controller's state.

Note: this almost certainly requires changing the checkpoint format away from the current .npy format (assuming that #42 doesn't change it first). In doing so, we should also make sure our checkpoints include a format version specifier, so that as the checkpoint data changes over time we can either maintain backward compatibility or fail cleanly on checkpoint versions we no longer understand.
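A minimal sketch of what this could look like, using only the standard library (pickle stands in for `torch.save`/`torch.load`, which use the same serialization protocol under the hood). All field names here (`format_version`, `lr_controller_state`, etc.) are hypothetical placeholders, not a proposal for the final schema:

```python
import pickle

# Hypothetical current version of the checkpoint schema; bump this
# whenever the checkpoint contents change incompatibly.
CHECKPOINT_FORMAT_VERSION = 1


def save_checkpoint(path, model_state, optimizer_state, lr_controller_state):
    """Bundle the torch-managed state dicts together with any extra
    learner state (e.g. the LR controller) and a format version."""
    checkpoint = {
        "format_version": CHECKPOINT_FORMAT_VERSION,
        "model_state": model_state,
        "optimizer_state": optimizer_state,
        "lr_controller_state": lr_controller_state,
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_checkpoint(path):
    """Load a checkpoint, erroring on format versions we no longer
    understand instead of silently resuming from partial state."""
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    version = checkpoint.get("format_version")
    if version != CHECKPOINT_FORMAT_VERSION:
        raise ValueError(
            f"Unsupported checkpoint format version: {version!r} "
            f"(expected {CHECKPOINT_FORMAT_VERSION})"
        )
    return checkpoint
```

In a real implementation the `*_state` arguments would come from `model.state_dict()`, `optimizer.state_dict()`, and whatever serialization hook the LR controller exposes; a backward-compatibility path would dispatch on `format_version` rather than rejecting every mismatch outright.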
