Problem of resume training #8

@shengyenlin

Hi,

Thanks for your amazing work.

I have a question about loading the model from a checkpoint.

It seems that the code does not load the model correctly from some checkpoints, so training always starts from scratch.

For example, yesterday I trained the model for six hours (29,200 iterations completed), and today I restarted training with the same training config.

The output states that it correctly restores the model from step = 29,200, but the first logged iteration of this second run is still step: <<<<< 100/108650 >>>>>, and the validation and test PSNR are not at the level reached at 29,200 iterations either.
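To make the expected behavior concrete, here is a minimal sketch of the resume logic one would expect. It is not this repository's actual code. The checkpoint naming (`<step>_G.pth`), the helper names `find_last_step` and `first_logged_step`, and the logging interval of 100 are all assumptions made for illustration.

```python
import re
from pathlib import Path


def find_last_step(ckpt_dir, suffix="_G.pth"):
    """Return the largest step among files named '<step><suffix>', or 0 if none.

    The '<step>_G.pth' naming is a hypothetical convention, not taken from the repo.
    """
    pattern = re.compile(r"^(\d+)" + re.escape(suffix) + r"$")
    steps = []
    for path in Path(ckpt_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=0)


def first_logged_step(start_step, log_every=100):
    """First step at which a log line appears after resuming from start_step.

    Assumes (hypothetically) that the trainer logs every `log_every` steps.
    """
    return (start_step // log_every + 1) * log_every
```

Under these assumptions, resuming from step 29,200 would give a first logged step of 29,300, whereas a first logged step of 100 corresponds to a start step of 0, which would suggest the step counter was not restored.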

Is there anything I have to modify for checkpoint loading (e.g. in train_json?), or am I missing something here?

Thanks in advance.
