
Conversation


@sheim sheim (Owner) commented Dec 10, 2025

Issue ticket number and link

Fixes # (issue)

Describe your changes

Please include a summary of the change, including why you did this, and the desired effect.

Instructions for reviewers

Point out anything in particular that you would like the reviewer to focus on.
List the steps needed to actually test the code, including CLI instructions if they differ from the usual ones.
Describe the desired behavior, not just "check that this appears" (otherwise the reviewer will simply verify what you have already verified).

Checklist before requesting a review

  • This is expected to break regression tests.
  • I have assigned a reviewer
  • I have added the PR to the project and tagged it with a priority
  • If it is a core feature, I have added tests.
  • I have set up pre-commit hooks with ruff, or run ruff format . manually

sheim added 29 commits March 16, 2024 12:04
…d history length.

Implemented for dof_pos_target, dof_pos, dof_vel.
in preparation for domain randomization of gains
…f_pos_history` (used for smoothness reward).

Some tuning.
Change stance/swing reward to use the smoothed square wave.
diff --git a/gym/envs/mit_humanoid/mit_humanoid_config.py b/gym/envs/mit_humanoid/mit_humanoid_config.py
index 54813f7..368f2c3 100644
--- a/gym/envs/mit_humanoid/mit_humanoid_config.py
+++ b/gym/envs/mit_humanoid/mit_humanoid_config.py
@@ -142,7 +142,7 @@ class MITHumanoidCfg(LeggedRobotCfg):
     class asset(LeggedRobotCfg.asset):
         file = (
             "{LEGGED_GYM_ROOT_DIR}/resources/robots/"
-            + "mit_humanoid/urdf/humanoid_R_sf.urdf"
+            + "mit_humanoid/urdf/humanoid_F_sf_learnt.urdf"
         )
         # foot_collisionbox_names = ["foot"]
         foot_name = "foot"
@@ -289,12 +289,12 @@ class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg):
         use_clipped_value_loss = True
         clip_param = 0.2
         entropy_coef = 0.01
-        num_learning_epochs = 5
+        num_learning_epochs = 4
         # * mini batch size = num_envs*nsteps / nminibatches
         num_mini_batches = 4
-        learning_rate = 5.0e-5
+        learning_rate = 1.0e-6
         schedule = "adaptive"  # could be adaptive, fixed
-        gamma = 0.999
+        gamma = 0.99
         lam = 0.95
         desired_kl = 0.01
         max_grad_norm = 1.0
diff --git a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
index 5e76f34..98a0773 100644
--- a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
+++ b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
@@ -570,7 +570,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="15_left_shoulder_pitch"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0.01346 0.17608 0.24657"
       rpy="0 0 0" />
@@ -615,7 +615,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="16_left_shoulder_abad"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 .05760 0"
       rpy="0.0 0 0" />
@@ -668,7 +668,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="17_left_shoulder_yaw"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.10250"
       rpy="0.0 0 0" />
@@ -719,7 +719,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="18_left_elbow"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.15750"
       rpy="0 0 0.0" />
@@ -798,7 +798,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="11_right_shoulder_pitch"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0.01346 -0.17608 0.24657"
       rpy="0 0 0" />
@@ -843,7 +843,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="12_right_shoulder_abad"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 -.05760 0"
       rpy="0.0 0 0" />
@@ -896,7 +896,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="13_right_shoulder_yaw"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.10250"
       rpy="0.0 0 0" />
@@ -947,7 +947,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="14_right_elbow"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.15750"
       rpy="0 0 0.0" />
put jacobian into MIT_humanoid base environment, rather than import; importing breaks when running --original_config
only for lander: needs to overload _check_terminations_and_timeouts to not actually reset
so that reset can be handled only by runner (without breaking backwards compatibility)

Also keeps a list of rewards, for finer reward integration, and can also do traj stats
Adjust some reward weights of action_rate (mini-cheetah) to work decently.
…ffer. Doesn't seem to make a difference in speed.
…n the runner.

Seems to make a tiny difference in speed.
More importantly, it removes the `eval` hack, which means we can profile individual reward functions.
THIS BREAKS PBRS
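
A hedged sketch of what replacing the eval hack with direct function calls can look like, so individual reward terms become visible to a profiler. The getattr lookup, the "_reward_" prefix, and the helper names are assumptions for illustration, not necessarily this repo's API:

def collect_reward_fns(env, weights):
    # resolve each reward term once by name, instead of eval-ing a string per call
    return {name: getattr(env, "_reward_" + name) for name in weights}

def get_rewards(env, weights, modifier=1.0, mask=None):
    # weighted, optionally masked per-term rewards; each fn() is a plain call,
    # so profilers can attribute time to the individual reward functions
    fns = collect_reward_fns(env, weights)
    out = {}
    for name, fn in fns.items():
        r = weights[name] * modifier * fn()
        out[name] = r if mask is None else r * mask
    return out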
@sheim sheim requested a review from Copilot December 10, 2025 22:18
@sheim sheim self-assigned this Dec 10, 2025
@sheim sheim requested review from Copilot and removed request for Copilot December 10, 2025 22:24

Copilot AI left a comment


Pull request overview

This PR merges humanoid robot support into the codebase, adding new MIT humanoid robot configurations and a "Lander" task variant. The changes include significant refactoring of reward computation in runner classes to support sub-step reward integration, new URDF files and friction models for the humanoid robot, and various configuration updates.

Key changes:

  • Refactored runner reward computation to integrate rewards over multiple sub-steps between policy actions (see the sketch after this list)
  • Added new MIT Humanoid and Lander environment configurations with sampled history buffers
  • Updated URDF files and added learned friction models for the humanoid robot
  • Reformatted assert statements across test files for code style consistency
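
For orientation, here is a minimal sketch of the sub-step reward integration idea, assuming the names that appear in the excerpts and comments below (get_rewards, n_policy_steps, env.dt, env.terminated); it illustrates the pattern, not the runner's actual code:

import torch

def integrate_substep_rewards(env, get_rewards, critic_cfg, actor_cfg, actions):
    # one policy action spans several simulation sub-steps
    n_policy_steps = int((1 / env.dt) / actor_cfg["frequency"])
    total = torch.zeros(env.num_envs, device=env.device)
    for _ in range(n_policy_steps):
        env.step(actions)
        # dt-weighted sub-step rewards; summing them approximates an integral
        # over the policy interval of length n_policy_steps * env.dt
        step_rewards = get_rewards(
            critic_cfg["reward"]["weights"],
            modifier=env.dt,
            mask=~env.terminated,
        )
        total += sum(step_rewards.values())  # assumes a dict of per-env tensors
    return total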

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 12 comments.

Summary per file:

  • learning/runners/on_policy_runner.py: Refactored reward computation to integrate over sub-steps with TensorDict
  • learning/runners/off_policy_runner.py: Same reward refactoring as on_policy_runner
  • learning/runners/my_runner.py: Extended reward refactoring plus state logging functionality
  • learning/runners/BaseRunner.py: Changed reward computation to use direct function calls instead of env.compute_reward
  • gym/envs/mit_humanoid/mit_humanoid.py: Added sampled history buffers and Jacobian coupling for humanoid robot
  • gym/envs/mit_humanoid/lander.py: New Lander environment implementation extending MIT_Humanoid
  • gym/envs/mit_humanoid/lander_config.py: Configuration for Lander task
  • gym/envs/mit_humanoid/mit_humanoid_config.py: Updated MIT humanoid configuration with new scaling and rewards
  • resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf: New URDF file with full arm degrees of freedom
  • resources/robots/mit_humanoid/urdf/humanoid_F_sf.urdf: Changed arm joints from revolute to fixed
  • resources/robots/mit_humanoid/friction_model_L.pt: Learned friction model (left side)
  • resources/robots/mit_humanoid/friction_model_R.pt: Learned friction model (right side)
  • gym/envs/base/task_skeleton.py: Moved buffer initialization to init and removed compute_reward method
  • gym/envs/base/legged_robot.py: Updated history buffer rolling and p/d gains to be per-environment
  • gym/utils/task_registry.py: Added default actor frequency handling
  • tests/regression_tests/test_generated_torch_files.py: Reformatted assert statement
  • tests/integration_tests/test_runner_integration.py: Reformatted multiple assert statements
  • learning/utils/logger/*.py: Reformatted assert statements
  • gym/envs/mini_cheetah/*.py: Updated _switch to use instance variable instead of method calls
  • .pre-commit-config.yaml: Updated ruff version from v0.1.3 to v0.14.2
Comments suppressed due to low confidence (5)

gym/envs/mit_humanoid/lander_config.py:303

  • This assignment to 'batch_size' is unnecessary as it is redefined before this value is used.
        batch_size = 2**15

learning/runners/my_runner.py:96

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                    for step in range(n_policy_steps):

learning/runners/off_policy_runner.py:93

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                for step in range(n_policy_steps):

learning/runners/off_policy_runner.py:150

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                    for step in range(n_policy_steps):

learning/runners/on_policy_runner.py:85

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                    for step in range(n_policy_steps):


Comment on lines +150 to +167
 def update_rewards_dict(self, rewards_dict, step):
     # sum existing rewards with new rewards
     rewards_dict[step].update(
         self.get_rewards(
-            self.critic_cfg["reward"]["termination_weight"], mask=terminated
-        )
+            self.critic_cfg["reward"]["termination_weight"],
+            modifier=self.env.dt,
+            mask=self.env.terminated,
+        ),
+        inplace=True,
     )
-    rewards_dict.update(
+    rewards_dict[step].update(
         self.get_rewards(
             self.critic_cfg["reward"]["weights"],
             modifier=self.env.dt,
-            mask=~terminated,
+            mask=~self.env.terminated,
         ),
         inplace=True,
     )

Copilot AI Dec 10, 2025


In the update_rewards_dict method, rewards are being multiplied by modifier=self.env.dt, but then summed across steps. Since n_policy_steps = int((1 / self.env.dt) / self.actor_cfg["frequency"]), this could lead to incorrect reward scaling. The reward at each sub-step is scaled by dt, then summed over n_policy_steps, which may not produce the intended reward integration.
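
To make the scaling concern concrete, a small numeric illustration with made-up values (the dt and frequency below are illustrative, not taken from this PR):

dt = 0.005           # simulation sub-step length [s], illustrative value
frequency = 100      # policy/actor frequency [Hz], illustrative value
n_policy_steps = int((1 / dt) / frequency)                # = 2 sub-steps per policy action
r = 1.0                                                   # constant per-sub-step reward
integrated = sum(r * dt for _ in range(n_policy_steps))   # = 0.01, i.e. r / frequency

So the accumulated reward scales with the length of the policy interval (n_policy_steps * dt) rather than with the raw number of sub-steps; whether that is the intended normalization is the question the comment raises.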

self.sampled_history_dof_pos[ids] = torch.roll(
    self.sampled_history_dof_pos[ids], self.num_dof, dims=1
)  # check
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos_target[ids]

Copilot AI Dec 10, 2025


Line 124 resets sampled_history_dof_pos using self.dof_pos_target[ids], but line 87 shows the buffer should store actual dof_pos, not targets. This inconsistency means the history buffer is initialized with targets but meant to track actual positions.

Suggested change
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos_target[ids]
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos[ids]

Comment on lines +134 to +148
def _switch(self, mode=None):
    c_vel = torch.linalg.norm(self.commands, dim=1)
    switch = torch.exp(
        -torch.square(
            torch.max(
                torch.zeros_like(c_vel),
                c_vel - self.cfg.reward_settings.switch_scale,
            )
        )
        / self.cfg.reward_settings.switch_scale
    )
    if mode is None or mode == "stand":
        return switch
    elif mode == "move":
        return 1 - switch

Copilot AI Dec 10, 2025


The method _switch() is defined on lines 134-148 but returns different values based on the mode parameter. However, the actual instance variable self._switch is updated in _update_cmd_switch() in mini_cheetah_ref.py. This creates confusion between a method and an instance variable with similar names. The MIT humanoid version uses a method, while mini_cheetah uses an instance variable.
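
For contrast, a sketch of the attribute-caching convention the comment refers to: compute the switch once per step and store it, so downstream code reads a cached tensor instead of calling a method. The formula is reused from the _switch() excerpt above; the free-function form and its name are assumptions for illustration, not code from this PR:

import torch

def compute_cmd_switch(commands: torch.Tensor, switch_scale: float) -> torch.Tensor:
    # same expression as the _switch() method quoted above
    c_vel = torch.linalg.norm(commands, dim=1)
    return torch.exp(
        -torch.square(torch.max(torch.zeros_like(c_vel), c_vel - switch_scale))
        / switch_scale
    )

# attribute-caching style (mini_cheetah): self._switch = compute_cmd_switch(...)
# method style (MIT humanoid): reward terms call self._switch(mode=...) directly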

Comment on lines +154 to +158
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The same for-else bug exists here. The else clause on lines 154-158 will only execute after the entire loop completes, not after each step.

sheim (Owner, Author) replied:


This is correct

Comment on lines +100 to +104
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The same for-else bug exists here. The else clause on lines 100-104 will only execute after the entire loop completes, not after each step.

max_gradient_steps = 24
# new
storage_size = 2**17 # new
batch_size = 2**15 # new

Copilot AI Dec 10, 2025


The batch_size configuration parameter is defined twice, on lines 303 and 307 (with the same value, 2**15, both times, but still redundant). This duplication is confusing and could lead to maintenance issues.

Suggested change
batch_size = 2**15 # new

Comment on lines +89 to +93
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The else clause after a for loop on lines 89-93 is unconventional and misleading. In Python, a for-else block means the else executes when the loop completes normally (without a break), not after each iteration. This code will execute the reset logic only once after all n_policy_steps iterations complete, not after each step. This likely causes environments to not be reset when they terminate mid-loop, leading to incorrect simulation behavior.
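
For readers unfamiliar with the construct, a generic illustration of the for/else semantics the comment describes (whether or not the once-after-the-loop behavior is intended here); this is a standalone example, not the runner code:

events = []
for step in range(3):
    events.append(f"step {step}")
else:
    events.append("else body")  # executes exactly once, after the loop finishes without a break
print(events)  # ['step 0', 'step 1', 'step 2', 'else body']

A reset that should happen after every sub-step would therefore have to live inside the loop body; note the owner's reply to the same pattern above ("This is correct"), suggesting the once-after-the-loop reset is intentional in this PR.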

Comment on lines +97 to +101
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The same for-else bug exists here as in on_policy_runner.py. The else clause on lines 97-101 will only execute after the entire loop completes, not after each step, preventing proper environment resets during the loop.

Comment on lines +117 to +118
# class oscillator:
# base_frequency = 3.0 # [Hz]

Copilot AI Dec 10, 2025


This comment appears to contain commented-out code.

Suggested change
# class oscillator:
# base_frequency = 3.0 # [Hz]

@torch.no_grad
def burn_in_normalization(self, n_iterations=100):
    actor_obs = self.get_obs(self.actor_cfg["obs"])
    critic_obs = self.get_obs(self.critic_cfg["obs"])

Copilot AI Dec 10, 2025


This assignment to 'critic_obs' is unnecessary as it is redefined before this value is used.

Suggested change
critic_obs = self.get_obs(self.critic_cfg["obs"])

@sheim sheim merged commit 738b5d9 into dev Dec 10, 2025
2 checks passed
@sheim sheim deleted the merge_humanoid branch December 10, 2025 23:08