79 commits
e19ec21
add new PPO2, for now just split up actor and critic.
sheim Feb 16, 2024
1a7977e
PPO2: split optimizers for critic and actor
sheim Feb 16, 2024
3253a65
cleaning
sheim Feb 20, 2024
e3c2589
Basic internals of dict_storage in place, fills correctly
sheim Feb 20, 2024
7c1166b
WIP: change storage to be (n_steps, n_envs, var_dim), among other things
sheim Feb 20, 2024
fa05a31
implemented compute advantages (returns), validated against existing.
sheim Feb 21, 2024
5c5d38b
critic update runs, but is very slow. Not clear why
sheim Feb 21, 2024
afa0be2
Learning critic with dict storage now working.
sheim Feb 22, 2024
12661f0
add tensordict to reqs
sheim Feb 25, 2024
679a729
WIP: works now, but not with normalization turned on
sheim Feb 26, 2024
6dc86fd
fix normalization, make it work for (n, m, var_dim) shaped data
sheim Feb 26, 2024
9fc831e
some cleanup and fixes on normalization
sheim Feb 26, 2024
c491213
Pin pytest below 8.0 (tested with 7.4.4)
sheim Feb 26, 2024
9524c81
Merge branch 'dev' into tensorDict
sheim Feb 26, 2024
af0de68
training critic off of frozen policy finished + tweaks
jschneider03 Mar 1, 2024
0a2d28b
new NNs and graphing tweaks
jschneider03 Mar 4, 2024
29d3de2
update logger to allow flexibly adding arbitrary categories for per-i…
sheim Mar 4, 2024
0bd395a
log actor things in on_policy_runner as well.
sheim Mar 6, 2024
812c272
Merge pull request #7 from mit-biomimetics/trackStd
sheim Mar 6, 2024
8129e1d
Merge branch 'dev' into tensorDict
sheim Mar 6, 2024
9a04375
small fixes
jschneider03 Mar 8, 2024
6c5aab4
Merge pull request #5 from mit-biomimetics/tensorDict
sheim Mar 13, 2024
bd92e40
Merge branch 'dev' into sw/lqc_dev
sheim Mar 13, 2024
b5d1823
save logs under `logs/lqrc`, as the project folder
sheim Mar 13, 2024
a8f5c1c
WIP: Merge branch 'dev' into sw/lqc_dev
sheim Mar 13, 2024
49c7be7
WIP: splitting actor_critic
sheim Mar 13, 2024
10247be
WIP: more clean post actor/critic, exporting
sheim Mar 13, 2024
2cf94ab
Updating all envs and configs to have split actor and critic
sheim Mar 13, 2024
2a04147
fix OldPolicyRunner with PPO (original)
sheim Mar 14, 2024
9759a40
put critic loss_fn into critic module, remove value clipping
sheim Mar 14, 2024
164423b
removing last_obs from update_critic()
jschneider03 Mar 15, 2024
b52bfee
Merge pull request #10 from mit-biomimetics/sw/lqc_dev
jschneider03 Mar 15, 2024
5d6c34c
Updating lqc files to fit new PPO, runner interfaces
jschneider03 Mar 18, 2024
0fd1589
WIP custom critics with new critic interface
jschneider03 Mar 18, 2024
74a615c
WIP custom critics refactor
jschneider03 Mar 22, 2024
de64f6c
WIP: more clean post actor/critic, exporting
sheim Mar 25, 2024
8dc7b40
Updating all envs and configs to have split actor and critic
sheim Mar 25, 2024
53e92bc
put critic loss_fn into critic module, remove value clipping
sheim Mar 25, 2024
06e62e3
several missing small fixes
sheim Mar 25, 2024
a443470
custom critics handle varied batch size
jschneider03 Apr 4, 2024
7cb532d
update evaluate_critic.py and minor fixes
jschneider03 Apr 5, 2024
92cbbb0
WIP autoencoder
jschneider03 Apr 9, 2024
02d5bb4
WIP torch.vmap based LQR, new LQR pendulum environment
jschneider03 Apr 17, 2024
7448aa6
update setup info
sheim Apr 18, 2024
a948e2a
WIP LQR Stabilizer
jschneider03 Apr 18, 2024
8b95b5a
fixed
sheim Apr 18, 2024
b951fd7
small fixes
jschneider03 Apr 18, 2024
a9f0e88
Merge branch 'sh/lqr' into js/lqc
jschneider03 Apr 19, 2024
e10c836
burn in normalization before learning
sheim Apr 26, 2024
6eaf961
LQR swing-up with single linearization point
jschneider03 Apr 26, 2024
acedd97
fix configs wrt to normalization, and fixed_base
sheim Apr 26, 2024
d26f027
WIP LQR data generating runs
jschneider03 Apr 26, 2024
4f87735
Merge pull request #17 from mit-biomimetics/runner_burnin
sheim Apr 29, 2024
01de0fe
- save data each iteration in lqr_data_gen_runner
sheim Apr 29, 2024
8ad4eb0
save data every step for offline use
sheim May 1, 2024
28fda43
Merge branch 'dev' into js/lqc
sheim May 1, 2024
4568588
refactor GAE computation to be explicit
sheim May 1, 2024
e781f02
store next_obs explicitly, use for last_values
sheim May 1, 2024
4b8cabb
refactor critic loss function to consume obs and self-evaluate
sheim Mar 26, 2024
1d6ff39
normalize advantages only after computing returns
sheim May 1, 2024
2143670
WIP: set up datalogger, and offline training
sheim May 1, 2024
2ddc584
cherry-picked refactored data generator
sheim May 2, 2024
9603a7b
WIP: offline_critics runs
sheim May 2, 2024
037cc1c
trains new critic
sheim May 2, 2024
98731f1
WIP plotting func added for critics
jschneider03 May 3, 2024
9190f40
critic specific plotting no grad
jschneider03 May 3, 2024
fff4ade
fixed grid to ensure full coverage
jschneider03 May 3, 2024
f949901
remove normalization from MC returns
jschneider03 May 3, 2024
193d66e
plotting for single critic finished
jschneider03 May 3, 2024
80e44b1
hotfix: don't normalize MC returns
sheim May 2, 2024
af9aa2b
bugfixes, and unit test for MC_returns
sheim May 3, 2024
2b58fce
fix shapes, add test for GAE with lam=1.0
sheim May 3, 2024
7211c30
log terminated etc.
sheim May 3, 2024
124ca3e
WIP training multiple critics at once
jschneider03 May 4, 2024
6320c60
make time-outs also terminations for pendulum
sheim May 4, 2024
9882e42
all iterations single critic
jschneider03 May 12, 2024
2edb471
multiple critic graphing done
jschneider03 May 12, 2024
1c46275
added ground truth + vanilla critic and fixed indexing
jschneider03 May 13, 2024
643f550
quick fixes + clean up
jschneider03 May 15, 2024
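
Several of the commits above ("implemented compute advantages (returns), validated against existing", "refactor GAE computation to be explicit", "normalize advantages only after computing returns", "fix shapes, add test for GAE with lam=1.0") center on the advantage/return computation over (n_steps, n_envs)-shaped storage. The sketch below only illustrates that kind of explicit GAE pass under assumed tensor names (rewards, values, dones, last_values); it is not the repository's actual implementation.

import torch

def compute_gae(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    # rewards, values, dones: (n_steps, n_envs); last_values: (n_envs,).
    # With lam=1.0, advantages + values reduce to discounted returns,
    # which is the kind of identity a GAE unit test can check.
    n_steps, _ = rewards.shape
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(last_values)
    for t in reversed(range(n_steps)):
        next_values = last_values if t == n_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * not_done * next_values - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values
    # normalize advantages only after the returns are computed
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages
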
13 changes: 5 additions & 8 deletions gym/envs/__init__.py
@@ -19,8 +19,8 @@
"Anymal": ".anymal_c.anymal",
"A1": ".a1.a1",
"HumanoidRunning": ".mit_humanoid.humanoid_running",
"HumanoidBouncing": ".mit_humanoid.humanoid_bouncing",
"Pendulum": ".pendulum.pendulum",
"LQRPendulum": ".pendulum.lqr_pendulum",
}

config_dict = {
@@ -32,8 +32,8 @@
"A1Cfg": ".a1.a1_config",
"AnymalCFlatCfg": ".anymal_c.flat.anymal_c_flat_config",
"HumanoidRunningCfg": ".mit_humanoid.humanoid_running_config",
"HumanoidBouncingCfg": ".mit_humanoid.humanoid_bouncing_config",
"PendulumCfg": ".pendulum.pendulum_config",
"LQRPendulumCfg": ".pendulum.lqr_pendulum_config",
}

runner_config_dict = {
@@ -45,8 +45,8 @@
"A1RunnerCfg": ".a1.a1_config",
"AnymalCFlatRunnerCfg": ".anymal_c.flat.anymal_c_flat_config",
"HumanoidRunningRunnerCfg": ".mit_humanoid.humanoid_running_config",
"HumanoidBouncingRunnerCfg": ".mit_humanoid.humanoid_bouncing_config",
"PendulumRunnerCfg": ".pendulum.pendulum_config",
"LQRPendulumRunnerCfg": ".pendulum.lqr_pendulum_config",
}

task_dict = {
@@ -68,12 +68,9 @@
"HumanoidRunningCfg",
"HumanoidRunningRunnerCfg",
],
"humanoid_bouncing": ["HumanoidBouncing",
"HumanoidBouncingCfg",
"HumanoidBouncingRunnerCfg"],
"a1": ["A1", "A1Cfg", "A1RunnerCfg"],
"flat_anymal_c": ["Anymal", "AnymalCFlatCfg", "AnymalCFlatRunnerCfg"],
"pendulum": ["Pendulum", "PendulumCfg", "PendulumRunnerCfg"]
"pendulum": ["Pendulum", "PendulumCfg", "PendulumRunnerCfg"],
"lqr_pendulum": ["LQRPendulum", "LQRPendulumCfg", "LQRPendulumRunnerCfg"],
}

for class_name, class_location in class_dict.items():
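
The hunk ends at the registration loop; its body is outside the diff, so the following is only a guess at how such name-to-module registries are typically resolved (importlib-based dynamic import), not the code actually used here.

import importlib

def resolve_registry(registry, package="gym.envs"):
    # registry maps a class name to a relative module path, e.g.
    # "LQRPendulum" -> ".pendulum.lqr_pendulum" (paths taken from the diff above).
    resolved = {}
    for class_name, class_location in registry.items():
        module = importlib.import_module(class_location, package=package)
        resolved[class_name] = getattr(module, class_name)
    return resolved
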
34 changes: 19 additions & 15 deletions gym/envs/a1/a1_config.py
@@ -137,13 +137,13 @@ class scaling(LeggedRobotCfg.scaling):

class A1RunnerCfg(LeggedRobotRunnerCfg):
seed = -1
runner_class_name = "OldPolicyRunner"

class policy(LeggedRobotRunnerCfg.policy):
actor_hidden_dims = [256, 256, 256]
critic_hidden_dims = [256, 256, 256]
class actor(LeggedRobotRunnerCfg.actor):
hidden_dims = [256, 256, 256]
# activation can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"
actor_obs = [
obs = [
"base_height",
"base_lin_vel",
"base_ang_vel",
@@ -154,7 +154,19 @@ class policy(LeggedRobotRunnerCfg.policy):
"commands",
]

critic_obs = [
actions = ["dof_pos_target"]

class noise:
dof_pos_obs = 0.005
dof_vel = 0.005
base_ang_vel = 0.05
projected_gravity = 0.02

class critic:
hidden_dims = [256, 256, 256]
# activation can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"
obs = [
"base_height",
"base_lin_vel",
"base_ang_vel",
@@ -165,16 +177,8 @@ class policy(LeggedRobotRunnerCfg.policy):
"commands",
]

actions = ["dof_pos_target"]

class noise:
dof_pos_obs = 0.005 # can be made very low
dof_vel = 0.005
base_ang_vel = 0.05
projected_gravity = 0.02

class reward(LeggedRobotRunnerCfg.policy.reward):
class weights(LeggedRobotRunnerCfg.policy.reward.weights):
class reward:
class weights:
tracking_lin_vel = 1.0
tracking_ang_vel = 1.0
lin_vel_z = 0.0
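
The config change above replaces the single policy block (with actor_hidden_dims/critic_hidden_dims) by separate actor and critic classes, matching the "split up actor and critic" and "split optimizers for critic and actor" commits. A minimal sketch, with hypothetical dimensions, of what building the two networks with independent optimizers from such a config could look like:

import torch
import torch.nn as nn

def build_mlp(in_dim, hidden_dims, out_dim, activation=nn.ELU):
    # Simple MLP builder; hidden_dims/activation mirror the config fields above.
    layers, last = [], in_dim
    for h in hidden_dims:
        layers += [nn.Linear(last, h), activation()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# Hypothetical dimensions; in the repo they would be derived from the obs/actions lists.
actor = build_mlp(in_dim=48, hidden_dims=[256, 256, 256], out_dim=12)
critic = build_mlp(in_dim=48, hidden_dims=[256, 256, 256], out_dim=1)

# Separate optimizers, so actor and critic can use different learning rates
# and be updated on different schedules.
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
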
35 changes: 20 additions & 15 deletions gym/envs/anymal_c/flat/anymal_c_flat_config.py
@@ -85,6 +85,7 @@ class push_robots:
toggle = True
interval_s = 1
max_push_vel_xy = 0.5
push_box_dims = [0.2, 0.2, 0.2]

class domain_rand(LeggedRobotCfg.domain_rand):
randomize_base_mass = True
@@ -120,14 +121,14 @@ class scaling(LeggedRobotCfg.scaling):

class AnymalCFlatRunnerCfg(LeggedRobotRunnerCfg):
seed = -1
runner_class_name = "OldPolicyRunner"

class policy(LeggedRobotRunnerCfg.policy):
actor_hidden_dims = [256, 256, 256]
critic_hidden_dims = [256, 256, 256]
class actor(LeggedRobotRunnerCfg.actor):
hidden_dims = [256, 256, 256]
# can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"

actor_obs = [
obs = [
"base_height",
"base_lin_vel",
"base_ang_vel",
@@ -138,7 +139,19 @@ class policy(LeggedRobotRunnerCfg.policy):
"dof_pos_history",
]

critic_obs = [
actions = ["dof_pos_target"]

class noise:
dof_pos_obs = 0.005
dof_vel = 0.005
base_ang_vel = 0.05 # 0.027, 0.14, 0.37
projected_gravity = 0.02

class critic(LeggedRobotRunnerCfg.critic):
hidden_dims = [256, 256, 256]
# can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"
obs = [
"base_height",
"base_lin_vel",
"base_ang_vel",
@@ -149,16 +162,8 @@ class policy(LeggedRobotRunnerCfg.policy):
"dof_pos_history",
]

actions = ["dof_pos_target"]

class noise:
dof_pos_obs = 0.005 # can be made very low
dof_vel = 0.005
base_ang_vel = 0.05 # 0.027, 0.14, 0.37
projected_gravity = 0.02

class reward(LeggedRobotRunnerCfg.policy.reward):
class weights(LeggedRobotRunnerCfg.policy.reward.weights):
class reward:
class weights:
tracking_lin_vel = 3.0
tracking_ang_vel = 1.0
lin_vel_z = 0.0
7 changes: 5 additions & 2 deletions gym/envs/anymal_c/mixed_terrains/anymal_c_rough_config.py
@@ -56,12 +56,15 @@ class domain_rand(AnymalCFlatCfg.domain_rand):


class AnymalCRoughCCfgPPO(AnymalCFlatCfgPPO):
class policy(AnymalCFlatCfgPPO.policy):
actor_hidden_dims = [128, 64, 32]
class actor(AnymalCFlatCfgPPO.actor):
hidden_dims = [128, 64, 32]
critic_hidden_dims = [128, 64, 32]
# can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"

class critic(AnymalCFlatCfgPPO.critic):
pass

class algorithm(AnymalCFlatCfgPPO.algorithm):
entropy_coef = 0.01

4 changes: 2 additions & 2 deletions gym/envs/base/fixed_robot.py
@@ -417,8 +417,8 @@ def _init_buffers(self):
self.act_idx = to_torch(actuated_idx, dtype=torch.long, device=self.device)
# * check that init range highs and lows are consistent
# * and repopulate to match
if self.cfg.init_state.reset_mode == "reset_to_range":
self.initialize_ranges_for_initial_conditions()
# if self.cfg.init_state.reset_mode == "reset_to_range":
self.initialize_ranges_for_initial_conditions()

def initialize_ranges_for_initial_conditions(self):
self.dof_pos_range = torch.zeros(
33 changes: 16 additions & 17 deletions gym/envs/base/fixed_robot_config.py
@@ -123,34 +123,33 @@ class FixedRobotCfgPPO(BaseConfig):
class logging:
enable_local_saving = True

class policy:
class actor:
init_noise_std = 1.0
actor_hidden_dims = [512, 256, 128]
critic_hidden_dims = [512, 256, 128]
hidden_dims = [512, 256, 128]
# * can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"
# only for 'ActorCriticRecurrent':
# rnn_type = 'lstm'
# rnn_hidden_size = 512
# rnn_num_layers = 1

actor_obs = [
obs = [
"observation_a",
"observation_b",
"these_need_to_be_atributes_(states)_of_the_robot_env",
]

critic_obs = [
"observation_x",
"observation_y",
"critic_obs_can_be_the_same_or_different_than_actor_obs",
]
normalize_obs = True

actions = ["tau_ff"]
disable_actions = False

class noise:
noise = 0.1 # implement as needed, also in your robot class
observation_a = 0.1 # implement as needed, also in your robot class

class critic:
hidden_dims = [512, 256, 128]
activation = "elu"
normalize_obs = True
obs = [
"observation_x",
"observation_y",
"critic_obs_can_be_the_same_or_different_than_actor_obs",
]

class rewards:
class weights:
@@ -182,7 +181,7 @@ class algorithm:

class runner:
policy_class_name = "ActorCritic"
algorithm_class_name = "PPO"
algorithm_class_name = "PPO2"
num_steps_per_env = 24 # per iteration
max_iterations = 500 # number of policy updates

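
The normalize_obs flags in the new actor and critic blocks tie into the commits above about making normalization work for (n, m, var_dim)-shaped data and burning in the normalizer before learning. A minimal sketch of a running normalizer of that sort (assumed class name, not the repository's implementation) that accepts either (n_envs, d) or (n_steps, n_envs, d) batches:

import torch

class RunningObsNormalizer:
    # Tracks a running mean/var and normalizes observations; flattening the
    # leading batch dims lets (n_envs, d) and (n_steps, n_envs, d) both work.
    def __init__(self, dim, eps=1e-8):
        self.mean = torch.zeros(dim)
        self.var = torch.ones(dim)
        self.count = eps
        self.eps = eps

    def update(self, obs):
        flat = obs.reshape(-1, obs.shape[-1])
        batch_mean = flat.mean(dim=0)
        batch_var = flat.var(dim=0, unbiased=False)
        batch_count = flat.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # parallel (Chan et al.) combination of the two mean/var estimates
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta.pow(2) * self.count * batch_count / total) / total
        self.mean = self.mean + delta * batch_count / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / torch.sqrt(self.var + self.eps)
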
45 changes: 30 additions & 15 deletions gym/envs/base/legged_robot_config.py
@@ -233,10 +233,9 @@ class LeggedRobotRunnerCfg(BaseConfig):
class logging:
enable_local_saving = True

class policy:
class actor:
init_noise_std = 1.0
actor_hidden_dims = [512, 256, 128]
critic_hidden_dims = [512, 256, 128]
hidden_dims = [512, 256, 128]
# can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"
normalize_obs = True
@@ -263,6 +262,17 @@ class noise:
projected_gravity = 0.05
height_measurements = 0.1

class critic:
hidden_dims = [512, 256, 128]
# can be elu, relu, selu, crelu, lrelu, tanh, sigmoid
activation = "elu"
normalize_obs = True
obs = [
"observation_x",
"observation_y",
"critic_obs_can_be_the_same_or_different_than_actor_obs",
]

class reward:
class weights:
tracking_lin_vel = 0.0
@@ -283,25 +293,30 @@ class termination_weight:
termination = 0.01

class algorithm:
# * training params
value_loss_coef = 1.0
use_clipped_value_loss = True
# both
gamma = 0.99
lam = 0.95
# shared
batch_size = 2**15
max_grad_steps = 10
# new
storage_size = 2**17 # new
mini_batch_size = 2**15 # new

clip_param = 0.2
entropy_coef = 0.01
num_learning_epochs = 5
# * mini batch size = num_envs*nsteps / nminibatches
num_mini_batches = 4
learning_rate = 1.0e-3
max_grad_norm = 1.0
# Critic
use_clipped_value_loss = True
# Actor
entropy_coef = 0.01
schedule = "adaptive" # could be adaptive, fixed
gamma = 0.99
lam = 0.95
desired_kl = 0.01
max_grad_norm = 1.0

class runner:
policy_class_name = "ActorCritic"
algorithm_class_name = "PPO"
num_steps_per_env = 24
algorithm_class_name = "PPO2"
num_steps_per_env = 24 # deprecate
max_iterations = 1500
save_interval = 50
run_name = ""
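
The algorithm block now exposes storage_size, mini_batch_size, and max_grad_steps in place of num_learning_epochs/num_mini_batches. As a hedged sketch only (the PPO2 update itself is not part of this diff), a critic update driven by those knobs might sample fixed-size minibatches from flattened storage like this:

import torch

def critic_update(critic, critic_opt, obs, returns,
                  mini_batch_size=2**15, max_grad_steps=10, max_grad_norm=1.0):
    # obs: (storage_size, obs_dim); returns: (storage_size, 1).
    # Take max_grad_steps gradient steps on randomly sampled minibatches,
    # regressing the critic's value prediction onto the stored returns.
    n = obs.shape[0]
    for _ in range(max_grad_steps):
        idx = torch.randint(0, n, (min(mini_batch_size, n),), device=obs.device)
        value_loss = (critic(obs[idx]) - returns[idx]).pow(2).mean()
        critic_opt.zero_grad()
        value_loss.backward()
        torch.nn.utils.clip_grad_norm_(critic.parameters(), max_grad_norm)
        critic_opt.step()
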
1 change: 1 addition & 0 deletions gym/envs/base/task_skeleton.py
@@ -45,6 +45,7 @@ def reset(self):
"""Reset all robots"""
self._reset_idx(torch.arange(self.num_envs, device=self.device))
self.step()
self.episode_length_buf[:] = 0

def _reset_buffers(self):
self.to_be_reset[:] = False
25 changes: 18 additions & 7 deletions gym/envs/cartpole/cartpole_config.py
@@ -67,24 +67,21 @@ class CartpoleRunnerCfg(FixedRobotCfgPPO):
seed = -1
runner_class_name = "OnPolicyRunner"

class policy(FixedRobotCfgPPO.policy):
class actor(FixedRobotCfgPPO.actor):
init_noise_std = 1.0
num_layers = 2
num_units = 32
actor_hidden_dims = [num_units] * num_layers
critic_hidden_dims = [num_units] * num_layers
hidden_dims = [num_units] * num_layers
activation = "elu"

actor_obs = [
obs = [
"cart_obs",
"pole_trig_obs",
"dof_vel",
"cart_vel_square",
"pole_vel_square",
]

critic_obs = actor_obs

actions = ["tau_ff"]

class noise:
@@ -94,6 +91,20 @@ class noise:
pole_vel = 0.010
actuation = 0.00

class critic:
num_layers = 2
num_units = 32
hidden_dims = [num_units] * num_layers
activation = "elu"

obs = [
"cart_obs",
"pole_trig_obs",
"dof_vel",
"cart_vel_square",
"pole_vel_square",
]

class reward:
class weights:
pole_pos = 5
@@ -125,7 +136,7 @@ class algorithm(FixedRobotCfgPPO.algorithm):

class runner(FixedRobotCfgPPO.runner):
policy_class_name = "ActorCritic"
algorithm_class_name = "PPO"
algorithm_class_name = "PPO2"
num_steps_per_env = 96 # per iteration
max_iterations = 500 # number of policy updates
