Merge humanoid #37
Conversation
…d history length. Implemented for dof_pos_target, dof_pos, dof_vel.
in preparation for domain randomization of gains
…f_pos_history` (used for smoothness reward). Some tuning. Change stance/swing reward to use the smoothed square wave.
diff --git a/gym/envs/mit_humanoid/mit_humanoid_config.py b/gym/envs/mit_humanoid/mit_humanoid_config.py
index 54813f7..368f2c3 100644
--- a/gym/envs/mit_humanoid/mit_humanoid_config.py
+++ b/gym/envs/mit_humanoid/mit_humanoid_config.py
@@ -142,7 +142,7 @@ class MITHumanoidCfg(LeggedRobotCfg):
     class asset(LeggedRobotCfg.asset):
         file = (
             "{LEGGED_GYM_ROOT_DIR}/resources/robots/"
-            + "mit_humanoid/urdf/humanoid_R_sf.urdf"
+            + "mit_humanoid/urdf/humanoid_F_sf_learnt.urdf"
         )
         # foot_collisionbox_names = ["foot"]
         foot_name = "foot"
@@ -289,12 +289,12 @@ class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg):
         use_clipped_value_loss = True
         clip_param = 0.2
         entropy_coef = 0.01
-        num_learning_epochs = 5
+        num_learning_epochs = 4
         # * mini batch size = num_envs*nsteps / nminibatches
         num_mini_batches = 4
-        learning_rate = 5.0e-5
+        learning_rate = 1.0e-6
         schedule = "adaptive"  # could be adaptive, fixed
-        gamma = 0.999
+        gamma = 0.99
         lam = 0.95
         desired_kl = 0.01
         max_grad_norm = 1.0
diff --git a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
index 5e76f34..98a0773 100644
--- a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
+++ b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
@@ -570,7 +570,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="15_left_shoulder_pitch"
-    type="fixed">
+    type="revolute">
     <origin xyz="0.01346 0.17608 0.24657"
             rpy="0 0 0" />
@@ -615,7 +615,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="16_left_shoulder_abad"
-    type="fixed">
+    type="revolute">
     <origin xyz="0 .05760 0"
             rpy="0.0 0 0" />
@@ -668,7 +668,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="17_left_shoulder_yaw"
-    type="fixed">
+    type="revolute">
     <origin xyz="0 0 -.10250"
             rpy="0.0 0 0" />
@@ -719,7 +719,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="18_left_elbow"
-    type="fixed">
+    type="revolute">
     <origin xyz="0 0 -.15750"
             rpy="0 0 0.0" />
@@ -798,7 +798,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="11_right_shoulder_pitch"
-    type="fixed">
+    type="revolute">
     <origin xyz="0.01346 -0.17608 0.24657"
             rpy="0 0 0" />
@@ -843,7 +843,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="12_right_shoulder_abad"
-    type="fixed">
+    type="revolute">
     <origin xyz="0 -.05760 0"
             rpy="0.0 0 0" />
@@ -896,7 +896,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="13_right_shoulder_yaw"
-    type="fixed">
+    type="revolute">
     <origin xyz="0 0 -.10250"
             rpy="0.0 0 0" />
@@ -947,7 +947,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="14_right_elbow"
-    type="fixed">
+    type="revolute">
     <origin xyz="0 0 -.15750"
             rpy="0 0 0.0" />
…erically unstable for humanoid
put jacobian into MIT_humanoid base environment, rather than import; importing breaks when running --original_config
only for lander: needs to overload _check_terminations_and_timeouts to not actually reset, so that reset can be handled only by the runner (without breaking backwards compatibility). Also keeps a list of rewards for finer reward integration, and can also compute trajectory stats.
Adjust some reward weights of action_rate (mini-cheetah) to work decently.
…ffer. Doesn't seem to make a difference in speed.
…n the runner. Seems to make a tiny difference in speed. More importantly, it removes the `eval` hack, which means we can profile individual reward functions. THIS BREAKS PBRS
…cluding minor speedups
Pull request overview
This PR merges humanoid robot support into the codebase, adding new MIT humanoid robot configurations and a "Lander" task variant. The changes include significant refactoring of reward computation in runner classes to support sub-step reward integration, new URDF files and friction models for the humanoid robot, and various configuration updates.
Key changes:
- Refactored runner reward computation to integrate rewards over multiple sub-steps between policy actions
- Added new MIT Humanoid and Lander environment configurations with sampled history buffers
- Updated URDF files and added learned friction models for the humanoid robot
- Reformatted assert statements across test files for code style consistency
Reviewed changes
Copilot reviewed 31 out of 33 changed files in this pull request and generated 12 comments.
Summary per file:

| File | Description |
|---|---|
| learning/runners/on_policy_runner.py | Refactored reward computation to integrate over sub-steps with TensorDict |
| learning/runners/off_policy_runner.py | Same reward refactoring as on_policy_runner |
| learning/runners/my_runner.py | Extended reward refactoring plus state logging functionality |
| learning/runners/BaseRunner.py | Changed reward computation to use direct function calls instead of env.compute_reward |
| gym/envs/mit_humanoid/mit_humanoid.py | Added sampled history buffers and Jacobian coupling for humanoid robot |
| gym/envs/mit_humanoid/lander.py | New Lander environment implementation extending MIT_Humanoid |
| gym/envs/mit_humanoid/lander_config.py | Configuration for Lander task |
| gym/envs/mit_humanoid/mit_humanoid_config.py | Updated MIT humanoid configuration with new scaling and rewards |
| resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf | New URDF file with full arm degrees of freedom |
| resources/robots/mit_humanoid/urdf/humanoid_F_sf.urdf | Changed arm joints from revolute to fixed |
| resources/robots/mit_humanoid/friction_model_L.pt | Learned friction model (left side) |
| resources/robots/mit_humanoid/friction_model_R.pt | Learned friction model (right side) |
| gym/envs/base/task_skeleton.py | Moved buffer initialization to init and removed compute_reward method |
| gym/envs/base/legged_robot.py | Updated history buffer rolling and p/d gains to be per-environment |
| gym/utils/task_registry.py | Added default actor frequency handling |
| tests/regression_tests/test_generated_torch_files.py | Reformatted assert statement |
| tests/integration_tests/test_runner_integration.py | Reformatted multiple assert statements |
| learning/utils/logger/*.py | Reformatted assert statements |
| gym/envs/mini_cheetah/*.py | Updated _switch to use instance variable instead of method calls |
| .pre-commit-config.yaml | Updated ruff version from v0.1.3 to v0.14.2 |
Comments suppressed due to low confidence (5)
gym/envs/mit_humanoid/lander_config.py:303
- This assignment to 'batch_size' is unnecessary as it is redefined before this value is used.
batch_size = 2**15
learning/runners/my_runner.py:96
- This 'for' statement has a redundant 'else' as no 'break' is present in the body.
for step in range(n_policy_steps):
learning/runners/off_policy_runner.py:93
- This 'for' statement has a redundant 'else' as no 'break' is present in the body.
for step in range(n_policy_steps):
learning/runners/off_policy_runner.py:150
- This 'for' statement has a redundant 'else' as no 'break' is present in the body.
for step in range(n_policy_steps):
learning/runners/on_policy_runner.py:85
- This 'for' statement has a redundant 'else' as no 'break' is present in the body.
for step in range(n_policy_steps):
 def update_rewards_dict(self, rewards_dict, step):
     # sum existing rewards with new rewards
     rewards_dict[step].update(
         self.get_rewards(
-            self.critic_cfg["reward"]["termination_weight"], mask=terminated
-        )
+            self.critic_cfg["reward"]["termination_weight"],
+            modifier=self.env.dt,
+            mask=self.env.terminated,
+        ),
+        inplace=True,
     )
-    rewards_dict.update(
+    rewards_dict[step].update(
         self.get_rewards(
             self.critic_cfg["reward"]["weights"],
+            modifier=self.env.dt,
-            mask=~terminated,
+            mask=~self.env.terminated,
         ),
+        inplace=True,
     )
Copilot AI (Dec 10, 2025)
In the update_rewards_dict method, rewards are being multiplied by modifier=self.env.dt, but then summed across steps. Since n_policy_steps = int((1 / self.env.dt) / self.actor_cfg["frequency"]), this could lead to incorrect reward scaling. The reward at each sub-step is scaled by dt, then summed over n_policy_steps, which may not produce the intended reward integration.
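As a quick sanity check of that scaling, here is a minimal numeric sketch; the dt, frequency, and reward values below are placeholders, not the repository's actual settings:

```python
# Each sub-step reward is weighted by the simulation dt, then the weighted
# values are summed over the sub-steps that make up one policy step.
dt = 0.005                   # simulation step in seconds (assumed value)
policy_frequency = 50        # actor frequency in Hz (assumed value)
n_policy_steps = int((1 / dt) / policy_frequency)   # 4 sub-steps per action here

sub_step_rewards = [1.0, 1.0, 1.0, 1.0]             # placeholder per-sub-step rewards
integrated = sum(r * dt for r in sub_step_rewards)  # approximates the time integral of r
print(integrated)  # ~0.02 = dt * n_policy_steps = 1 / policy_frequency
```

The summed reward therefore scales with 1 / policy_frequency rather than with dt alone; whether that time integral or a per-step average is the intended normalization is the question raised above.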
self.sampled_history_dof_pos[ids] = torch.roll(
    self.sampled_history_dof_pos[ids], self.num_dof, dims=1
)  # check
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos_target[ids]
Copilot AI (Dec 10, 2025)
Line 124 resets sampled_history_dof_pos using self.dof_pos_target[ids], but line 87 shows the buffer should store actual dof_pos, not targets. This inconsistency means the history buffer is initialized with targets but meant to track actual positions.
Suggested change:
-self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos_target[ids]
+self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos[ids]
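For reference, the roll-and-fill pattern this buffer uses looks roughly like the standalone sketch below; the shapes and tensor names are illustrative, not the environment's actual members:

```python
import torch

num_envs, num_dof, history_len = 2, 3, 4
history = torch.zeros(num_envs, num_dof * history_len)  # flattened history buffer
dof_pos = torch.randn(num_envs, num_dof)                # newest measured joint positions

# Shift older samples toward the back, then write the newest sample into the
# first num_dof slots; a reset should seed that slot from the same quantity
# the buffer is meant to track (positions, not position targets).
history = torch.roll(history, shifts=num_dof, dims=1)
history[:, :num_dof] = dof_pos
```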
def _switch(self, mode=None):
    c_vel = torch.linalg.norm(self.commands, dim=1)
    switch = torch.exp(
        -torch.square(
            torch.max(
                torch.zeros_like(c_vel),
                c_vel - self.cfg.reward_settings.switch_scale,
            )
        )
        / self.cfg.reward_settings.switch_scale
    )
    if mode is None or mode == "stand":
        return switch
    elif mode == "move":
        return 1 - switch
Copilot AI (Dec 10, 2025)
The method _switch() is defined on lines 134-148 but returns different values based on the mode parameter. However, the actual instance variable self._switch is updated in _update_cmd_switch() in mini_cheetah_ref.py. This creates confusion between a method and an instance variable with similar names. The MIT humanoid version uses a method, while mini_cheetah uses an instance variable.
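One way to sidestep the collision, sketched with hypothetical names rather than code from this PR, is to keep the computation and the cached tensor under distinct identifiers:

```python
import torch

class SwitchMixin:
    # Illustrative only: `_compute_switch` is a method and `_switch_value` a tensor,
    # so nothing named `_switch` is ever both at once.
    def _compute_switch(self, mode=None):
        c_vel = torch.linalg.norm(self.commands, dim=1)
        scale = self.cfg.reward_settings.switch_scale
        switch = torch.exp(-torch.square(torch.clamp(c_vel - scale, min=0.0)) / scale)
        self._switch_value = switch  # cached result, separate from the method name
        if mode is None or mode == "stand":
            return switch
        return 1 - switch
```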
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)
Copilot AI (Dec 10, 2025)
The same for-else bug exists here. The else clause on lines 154-158 will only execute after the entire loop completes, not after each step.
This is correct
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)
Copilot AI (Dec 10, 2025)
The same for-else bug exists here. The else clause on lines 100-104 will only execute after the entire loop completes, not after each step.
max_gradient_steps = 24
# new
storage_size = 2**17  # new
batch_size = 2**15  # new
Copilot AI (Dec 10, 2025)
The batch_size configuration parameter is defined twice, on lines 303 and 307 (with the same value, 2**15, both times), so the first definition is redundant. This duplication is confusing and could lead to maintenance issues.
Suggested change:
-batch_size = 2**15  # new
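A small illustration of why the duplicate assignment is harmless but dead code; the config class below is a hypothetical stand-in, not the Lander config itself:

```python
class AlgorithmCfg:
    batch_size = 2**15    # first definition
    storage_size = 2**17
    batch_size = 2**15    # a later assignment in a class body silently replaces the earlier one

print(AlgorithmCfg.batch_size)  # 32768; only the last assignment is ever read
```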
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)
Copilot AI (Dec 10, 2025)
The else clause after a for loop on lines 89-93 is unconventional and misleading. In Python, a for-else block means the else executes when the loop completes normally (without a break), not after each iteration. This code will execute the reset logic only once after all n_policy_steps iterations complete, not after each step. This likely causes environments to not be reset when they terminate mid-loop, leading to incorrect simulation behavior.
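For readers unfamiliar with the construct, the else body of a Python for loop runs exactly once, after the loop finishes without hitting a break; a tiny standalone illustration:

```python
for step in range(3):
    print("sub-step", step)  # per-step rollout work
else:
    # runs a single time after the final iteration (no break occurred);
    # in the runner this is where the batched environment reset happens
    print("loop done: reset once per policy step")
```

Per the earlier "This is correct" reply and the Lander commit message, deferring resets to the runner once per policy step appears to be intentional rather than an oversight.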
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)
Copilot AI (Dec 10, 2025)
The same for-else bug exists here as in on_policy_runner.py. The else clause on lines 97-101 will only execute after the entire loop completes, not after each step, preventing proper environment resets during the loop.
# class oscillator:
# base_frequency = 3.0  # [Hz]
Copilot AI (Dec 10, 2025)
This comment appears to contain commented-out code.
Suggested change:
-# class oscillator:
-# base_frequency = 3.0  # [Hz]
@torch.no_grad
def burn_in_normalization(self, n_iterations=100):
    actor_obs = self.get_obs(self.actor_cfg["obs"])
    critic_obs = self.get_obs(self.critic_cfg["obs"])
Copilot AI (Dec 10, 2025)
This assignment to 'critic_obs' is unnecessary as it is redefined before this value is used.
Suggested change:
-critic_obs = self.get_obs(self.critic_cfg["obs"])
Issue ticket number and link
Fixes # (issue)
Describe your changes
Please include a summary of the change, including why you did this, and the desired effect.
Instructions for reviewers
Indicate anything in particular that you would like a code-reviewer to pay particular attention to.
Indicate steps to actually test code, including CLI instructions if different than usual.
Point out the desired behavior, not just "check that this appears" (otherwise the code reviewer will be lazy and just verify what you have already verified).
Checklist before requesting a review
Run `ruff format .` manually