Open
Description
Failed to resume training
Resuming training from an unfinished run fails. This happens when the model was trained with rllabplusplus's npo.py script (other algorithms might have the same issue).
Your environment
- rllabplusplus@4d55f96
- tensorflow 1.0.0
Steps to reproduce
Run the following steps in order.
python algo_gym_stub.py --algo_name=nuqfqprop --qprop_nu=0.1 --env_name=HalfCheetah-v1 --exp="halfcheetah" --max_episode=4000
Press Ctrl+C several times to interrupt the training run.
cd to the script directory.
python resume_training.py /path/to/params.pkl
Expected behaviour
The script should be able to resume training from last session.
Actual behaviour
Running resume_training.py raises the following error:
tf.get_default_session().run(tf.variables_initializer(self.get_params()))
AttributeError: 'NoneType' object has no attribute 'run'
Attempted fix
Change the following code in run_experiment_lite.py
data = joblib.load(args.resume_from)
to
sess = tf.Session()
with sess.as_default():
    data = joblib.load(args.resume_from)
With this fix, the script can now load the previously saved weights. However, the next line in the same script raises an assertion error: algo was not found in data. The original rllab stores it in batch_polopt.py L127, but the same code is missing from rllabplusplus. When we added this line back at the same relative position, saving the params raised the following error:
_pickle.PicklingError: Can't pickle <function compile_function.<locals>.run at 0x76341555f400>: it's not found as sandbox.rocky.tf.misc.tensor_utils.compile_function.<locals>.run
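For context, this is the kind of error Python's pickle raises for any function defined inside another function, because functions are serialized by their importable name and a nested function has none. Below is a minimal standard-library sketch that reproduces this class of failure and shows one common workaround: snapshot only plain data and rebuild the compiled function on load. The `compile_function` here is a toy stand-in, and `Policy`, `to_state`, `from_state` and `try_pickle` are hypothetical names for illustration, not rllab or rllabplusplus APIs.

```python
import pickle


def compile_function(scale):
    # Toy stand-in for a helper that returns a nested function (a closure).
    def run(x):
        return scale * x
    return run


class Policy:
    """Hypothetical object that holds a compiled (unpicklable) function."""

    def __init__(self, scale):
        self.scale = scale
        self._f = compile_function(scale)  # nested function, cannot be pickled

    def predict(self, x):
        return self._f(x)

    def to_state(self):
        # Only plain data goes into the snapshot; the closure is left out.
        return {"scale": self.scale}

    @classmethod
    def from_state(cls, state):
        # Rebuilding the object recompiles the function from the saved data.
        return cls(state["scale"])


def try_pickle(obj):
    """Return the pickled bytes, or the exception if pickling fails."""
    try:
        return pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError) as exc:
        return exc


if __name__ == "__main__":
    # Pickling the nested function directly fails.
    print(type(try_pickle(compile_function(2.0))).__name__)

    # Pickling the plain state works, and the policy can be rebuilt from it.
    blob = pickle.dumps(Policy(3.0).to_state())
    restored = Policy.from_state(pickle.loads(blob))
    print(restored.predict(2.0))
```

Whether the same approach fits the algo object here depends on what it holds besides the compiled function, so treat this as a direction to investigate, not a confirmed fix.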