Failed to resume training #12

@lsjle

Description

Resuming training from an interrupted run fails. This happens when the model was trained with rllabplusplus's npo.py script (other algorithms may have the same issue).

Your environment

  • rllabplusplus@4d55f96
  • tensorflow 1.0.0

Steps to reproduce

Run the commands below in order.

python algo_gym_stub.py --algo_name=nuqfqprop --qprop_nu=0.1 --env_name=HalfCheetah-v1 --exp="halfcheetah" --max_episode=4000
Press Ctrl+C several times to interrupt training
cd to the script directory
python resume_training.py /path/to/params.pkl

Expected behaviour

The script should be able to resume training from the last session.

Actual behaviour

Running resume_training.py raises the following error:

tf.get_default_session().run(tf.variables_initializer(self.get_params()))
AttributeError: 'NoneType' object has no attribute 'run'

Attempted fix

Change the following code in run_experiment_lite.py:

data = joblib.load(args.resume_from)

to

sess = tf.Session()
with sess.as_default():
    data = joblib.load(args.resume_from)
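To illustrate why wrapping the load in a default session helps, here is a minimal stdlib-only sketch. It assumes, based on the traceback, that unpickling the saved object calls tf.get_default_session().run(...), and tf.get_default_session() returns None when no session is active. FakeSession, FakePolicy, and get_default_session below are hypothetical stand-ins, not TensorFlow or rllabplusplus code.

```python
import pickle
from contextlib import contextmanager

# Stand-in for TensorFlow's default-session stack (hypothetical, for illustration).
_default_sessions = []


def get_default_session():
    """Return the innermost active session, or None if there is none."""
    return _default_sessions[-1] if _default_sessions else None


class FakeSession:
    """Hypothetical stand-in for tf.Session that just records what it ran."""

    def __init__(self):
        self.ran = []

    def run(self, op):
        self.ran.append(op)

    @contextmanager
    def as_default(self):
        _default_sessions.append(self)
        try:
            yield self
        finally:
            _default_sessions.pop()


class FakePolicy:
    """Hypothetical object whose unpickling needs a default session."""

    def __init__(self):
        self.weights = [0.1, 0.2]

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Mirrors the failing line: the default session may be None here.
        get_default_session().run("init_variables")


blob = pickle.dumps(FakePolicy())


def load_without_session():
    """Unpickle with no default session and return the error message."""
    try:
        pickle.loads(blob)
    except AttributeError as exc:
        return str(exc)
    return None


def load_with_session():
    """Unpickle inside a default session, as in the attempted fix."""
    sess = FakeSession()
    with sess.as_default():
        policy = pickle.loads(blob)
    return policy, sess
```

In this sketch, load_without_session() fails the same way as the reported traceback, while load_with_session() succeeds because __setstate__ finds an active session.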

With this fix, the code can load the previously saved weights. However, the next line in the same script raises an assertion error: algo is not found in data. The original rllab stores it in batch_polopt.py (L127), but the corresponding code is missing from rllabplusplus. When we added that line back at the same relative position, pickling the params failed with the following error:

_pickle.PicklingError: Can't pickle <function compile_function.<locals>.run at 0x76341555f400>: it's not found as sandbox.rocky.tf.misc.tensor_utils.compile_function.<locals>.run 
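The error above is what pickle reports when an object holds a function defined inside another function, since such closures cannot be looked up by name on reload. A common general workaround is to drop those attributes in __getstate__ and rebuild them in __setstate__. The sketch below shows that pattern with stdlib code only. The compile_function here is a simplified stand-in, and NaiveAlgo, PicklableAlgo, and f_step are hypothetical names. Whether rllabplusplus's compiled functions can be rebuilt this way from saved state is an assumption, not something confirmed by the project.

```python
import pickle


def compile_function(scale):
    """Simplified stand-in: returns a locally defined closure."""

    def run(x):
        return scale * x

    return run


class NaiveAlgo:
    """Hypothetical algo that stores a closure directly as an attribute."""

    def __init__(self, scale):
        self.scale = scale
        self.f_step = compile_function(scale)


class PicklableAlgo(NaiveAlgo):
    """Same algo, but the closure is excluded from the pickled state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("f_step", None)  # closures cannot be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild the closure from the plain data that was saved.
        self.f_step = compile_function(self.scale)


def try_pickle(obj):
    """Return the pickled bytes, or the exception if pickling fails."""
    try:
        return pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError) as exc:
        return exc
```

Here try_pickle(NaiveAlgo(2)) returns an exception, while a PicklableAlgo instance round-trips through pickle and gets a freshly built f_step on load.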
