
Conversation


@sheim sheim (Owner) commented Dec 10, 2025

Issue ticket number and link

Fixes # (issue)

Describe your changes

Please include a summary of the change, including why you did this, and the desired effect.

Instructions for reviewers

Point out anything in particular that you would like the reviewer to focus on.
List the steps needed to actually test the code, including CLI instructions if they differ from the usual ones.
Describe the desired behavior, not just "check that this appears" (otherwise the reviewer will simply verify what you have already verified).

Checklist before requesting a review

  • This is expected to break regression tests.
  • I have assigned a reviewer
  • I have added the PR to the project and tagged it with a priority
  • If it is a core feature, I have added tests.
  • I have set up pre-commit hooks with ruff, or run ruff format . manually

sheim added 29 commits March 16, 2024 12:04
…d history length.

Implemented for dof_pos_target, dof_pos, dof_vel.
in preparation for domain randomization of gains
…f_pos_history` (used for smoothness reward).

Some tuning.
Change stance/swing reward to use the smoothed square wave.
diff --git a/gym/envs/mit_humanoid/mit_humanoid_config.py b/gym/envs/mit_humanoid/mit_humanoid_config.py
index 54813f7..368f2c3 100644
--- a/gym/envs/mit_humanoid/mit_humanoid_config.py
+++ b/gym/envs/mit_humanoid/mit_humanoid_config.py
@@ -142,7 +142,7 @@ class MITHumanoidCfg(LeggedRobotCfg):
     class asset(LeggedRobotCfg.asset):
         file = (
             "{LEGGED_GYM_ROOT_DIR}/resources/robots/"
-            + "mit_humanoid/urdf/humanoid_R_sf.urdf"
+            + "mit_humanoid/urdf/humanoid_F_sf_learnt.urdf"
         )
         # foot_collisionbox_names = ["foot"]
         foot_name = "foot"
@@ -289,12 +289,12 @@ class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg):
         use_clipped_value_loss = True
         clip_param = 0.2
         entropy_coef = 0.01
-        num_learning_epochs = 5
+        num_learning_epochs = 4
         # * mini batch size = num_envs*nsteps / nminibatches
         num_mini_batches = 4
-        learning_rate = 5.0e-5
+        learning_rate = 1.0e-6
         schedule = "adaptive"  # could be adaptive, fixed
-        gamma = 0.999
+        gamma = 0.99
         lam = 0.95
         desired_kl = 0.01
         max_grad_norm = 1.0
diff --git a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
index 5e76f34..98a0773 100644
--- a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
+++ b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
@@ -570,7 +570,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="15_left_shoulder_pitch"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0.01346 0.17608 0.24657"
       rpy="0 0 0" />
@@ -615,7 +615,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="16_left_shoulder_abad"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 .05760 0"
       rpy="0.0 0 0" />
@@ -668,7 +668,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="17_left_shoulder_yaw"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.10250"
       rpy="0.0 0 0" />
@@ -719,7 +719,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="18_left_elbow"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.15750"
       rpy="0 0 0.0" />
@@ -798,7 +798,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="11_right_shoulder_pitch"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0.01346 -0.17608 0.24657"
       rpy="0 0 0" />
@@ -843,7 +843,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="12_right_shoulder_abad"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 -.05760 0"
       rpy="0.0 0 0" />
@@ -896,7 +896,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="13_right_shoulder_yaw"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.10250"
       rpy="0.0 0 0" />
@@ -947,7 +947,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint
     name="14_right_elbow"
-    type="fixed">
+    type="revolute">
     <origin
       xyz="0 0 -.15750"
       rpy="0 0 0.0" />
put jacobian into MIT_humanoid base environment, rather than import; importing breaks when running --original_config
only for lander: needs to overload _check_terminations_and_timeouts to not actually reset
so that reset can be handled only by runner (without breaking backwards compatibility)

Also keeps a list of rewards, for finer reward integration, and can also do traj stats
Adjust some reward weights of action_rate (mini-cheetah) to work decently.
…ffer. Doesn't seem to make a difference in speed.
…n the runner.

Seems to make a tiny difference in speed.
More importantly, it removes the `eval` hack, which means we can profile individual reward functions.
THIS BREAKS PBRS
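
A hedged sketch of what replacing the eval hack with direct function calls can look like, so individual reward terms become visible to a profiler. The getattr lookup, the "_reward_" prefix, and the helper names are assumptions for illustration, not necessarily this repo's API:

def collect_reward_fns(env, weights):
    # resolve each reward term once by name, instead of eval-ing a string per call
    return {name: getattr(env, "_reward_" + name) for name in weights}

def get_rewards(env, weights, modifier=1.0, mask=None):
    # weighted, optionally masked per-term rewards; each fn() is a plain call,
    # so profilers can attribute time to the individual reward functions
    fns = collect_reward_fns(env, weights)
    out = {}
    for name, fn in fns.items():
        r = weights[name] * modifier * fn()
        out[name] = r if mask is None else r * mask
    return out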
@sheim sheim requested a review from Copilot December 10, 2025 22:18
@sheim sheim self-assigned this Dec 10, 2025
@sheim sheim requested review from Copilot and removed request for Copilot December 10, 2025 22:24

Copilot AI left a comment


Pull request overview

This PR merges humanoid robot support into the codebase, adding new MIT humanoid robot configurations and a "Lander" task variant. The changes include significant refactoring of reward computation in runner classes to support sub-step reward integration, new URDF files and friction models for the humanoid robot, and various configuration updates.

Key changes:

  • Refactored runner reward computation to integrate rewards over multiple sub-steps between policy actions (see the sketch after this list)
  • Added new MIT Humanoid and Lander environment configurations with sampled history buffers
  • Updated URDF files and added learned friction models for the humanoid robot
  • Reformatted assert statements across test files for code style consistency
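
For orientation, here is a minimal sketch of the sub-step reward integration idea, assuming the names that appear in the excerpts and comments below (get_rewards, n_policy_steps, env.dt, env.terminated); it illustrates the pattern, not the runner's actual code:

import torch

def integrate_substep_rewards(env, get_rewards, critic_cfg, actor_cfg, actions):
    # one policy action spans several simulation sub-steps
    n_policy_steps = int((1 / env.dt) / actor_cfg["frequency"])
    total = torch.zeros(env.num_envs, device=env.device)
    for _ in range(n_policy_steps):
        env.step(actions)
        # dt-weighted sub-step rewards; summing them approximates an integral
        # over the policy interval of length n_policy_steps * env.dt
        step_rewards = get_rewards(
            critic_cfg["reward"]["weights"],
            modifier=env.dt,
            mask=~env.terminated,
        )
        total += sum(step_rewards.values())  # assumes a dict of per-env tensors
    return total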

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 12 comments.

Summary per file:

  • learning/runners/on_policy_runner.py: Refactored reward computation to integrate over sub-steps with TensorDict
  • learning/runners/off_policy_runner.py: Same reward refactoring as on_policy_runner
  • learning/runners/my_runner.py: Extended reward refactoring plus state logging functionality
  • learning/runners/BaseRunner.py: Changed reward computation to use direct function calls instead of env.compute_reward
  • gym/envs/mit_humanoid/mit_humanoid.py: Added sampled history buffers and Jacobian coupling for humanoid robot
  • gym/envs/mit_humanoid/lander.py: New Lander environment implementation extending MIT_Humanoid
  • gym/envs/mit_humanoid/lander_config.py: Configuration for Lander task
  • gym/envs/mit_humanoid/mit_humanoid_config.py: Updated MIT humanoid configuration with new scaling and rewards
  • resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf: New URDF file with full arm degrees of freedom
  • resources/robots/mit_humanoid/urdf/humanoid_F_sf.urdf: Changed arm joints from revolute to fixed
  • resources/robots/mit_humanoid/friction_model_L.pt: Learned friction model (left side)
  • resources/robots/mit_humanoid/friction_model_R.pt: Learned friction model (right side)
  • gym/envs/base/task_skeleton.py: Moved buffer initialization to init and removed compute_reward method
  • gym/envs/base/legged_robot.py: Updated history buffer rolling and p/d gains to be per-environment
  • gym/utils/task_registry.py: Added default actor frequency handling
  • tests/regression_tests/test_generated_torch_files.py: Reformatted assert statement
  • tests/integration_tests/test_runner_integration.py: Reformatted multiple assert statements
  • learning/utils/logger/*.py: Reformatted assert statements
  • gym/envs/mini_cheetah/*.py: Updated _switch to use instance variable instead of method calls
  • .pre-commit-config.yaml: Updated ruff version from v0.1.3 to v0.14.2
Comments suppressed due to low confidence (5)

gym/envs/mit_humanoid/lander_config.py:303

  • This assignment to 'batch_size' is unnecessary as it is redefined before this value is used.
        batch_size = 2**15

learning/runners/my_runner.py:96

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                    for step in range(n_policy_steps):

learning/runners/off_policy_runner.py:93

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                for step in range(n_policy_steps):

learning/runners/off_policy_runner.py:150

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                    for step in range(n_policy_steps):

learning/runners/on_policy_runner.py:85

  • This 'for' statement has a redundant 'else' as no 'break' is present in the body.
                    for step in range(n_policy_steps):


Comment on lines +150 to +167
 def update_rewards_dict(self, rewards_dict, step):
     # sum existing rewards with new rewards
     rewards_dict[step].update(
         self.get_rewards(
-            self.critic_cfg["reward"]["termination_weight"], mask=terminated
-        )
+            self.critic_cfg["reward"]["termination_weight"],
+            modifier=self.env.dt,
+            mask=self.env.terminated,
+        ),
+        inplace=True,
     )
-    rewards_dict.update(
+    rewards_dict[step].update(
         self.get_rewards(
             self.critic_cfg["reward"]["weights"],
             modifier=self.env.dt,
-            mask=~terminated,
+            mask=~self.env.terminated,
         ),
         inplace=True,
     )

Copilot AI Dec 10, 2025


In the update_rewards_dict method, rewards are being multiplied by modifier=self.env.dt, but then summed across steps. Since n_policy_steps = int((1 / self.env.dt) / self.actor_cfg["frequency"]), this could lead to incorrect reward scaling. The reward at each sub-step is scaled by dt, then summed over n_policy_steps, which may not produce the intended reward integration.
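
To make the scaling concern concrete, a small numeric illustration with made-up values (the dt and frequency below are illustrative, not taken from this PR):

dt = 0.005           # simulation sub-step length [s], illustrative value
frequency = 100      # policy/actor frequency [Hz], illustrative value
n_policy_steps = int((1 / dt) / frequency)                # = 2 sub-steps per policy action
r = 1.0                                                   # constant per-sub-step reward
integrated = sum(r * dt for _ in range(n_policy_steps))   # = 0.01, i.e. r / frequency

So the accumulated reward scales with the length of the policy interval (n_policy_steps * dt) rather than with the raw number of sub-steps; whether that is the intended normalization is the question the comment raises.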

self.sampled_history_dof_pos[ids] = torch.roll(
    self.sampled_history_dof_pos[ids], self.num_dof, dims=1
)  # check
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos_target[ids]

Copilot AI Dec 10, 2025


Line 124 resets sampled_history_dof_pos using self.dof_pos_target[ids], but line 87 shows the buffer should store actual dof_pos, not targets. This inconsistency means the history buffer is initialized with targets but meant to track actual positions.

Suggested change
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos_target[ids]
self.sampled_history_dof_pos[ids, : self.num_dof] = self.dof_pos[ids]

Comment on lines +134 to +148
def _switch(self, mode=None):
    c_vel = torch.linalg.norm(self.commands, dim=1)
    switch = torch.exp(
        -torch.square(
            torch.max(
                torch.zeros_like(c_vel),
                c_vel - self.cfg.reward_settings.switch_scale,
            )
        )
        / self.cfg.reward_settings.switch_scale
    )
    if mode is None or mode == "stand":
        return switch
    elif mode == "move":
        return 1 - switch

Copilot AI Dec 10, 2025


The method _switch() is defined on lines 134-148 but returns different values based on the mode parameter. However, the actual instance variable self._switch is updated in _update_cmd_switch() in mini_cheetah_ref.py. This creates confusion between a method and an instance variable with similar names. The MIT humanoid version uses a method, while mini_cheetah uses an instance variable.
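
For contrast, a sketch of the attribute-caching convention the comment refers to: compute the switch once per step and store it, so downstream code reads a cached tensor instead of calling a method. The formula is reused from the _switch() excerpt above; the free-function form and its name are assumptions for illustration, not code from this PR:

import torch

def compute_cmd_switch(commands: torch.Tensor, switch_scale: float) -> torch.Tensor:
    # same expression as the _switch() method quoted above
    c_vel = torch.linalg.norm(commands, dim=1)
    return torch.exp(
        -torch.square(torch.max(torch.zeros_like(c_vel), c_vel - switch_scale))
        / switch_scale
    )

# attribute-caching style (mini_cheetah): self._switch = compute_cmd_switch(...)
# method style (MIT humanoid): reward terms call self._switch(mode=...) directly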

Comment on lines +154 to +158
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The same for-else bug exists here. The else clause on lines 154-158 will only execute after the entire loop completes, not after each step.

sheim (Owner, Author) replied:


This is correct

Comment on lines +100 to +104
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The same for-else bug exists here. The else clause on lines 100-104 will only execute after the entire loop completes, not after each step.

max_gradient_steps = 24
# new
storage_size = 2**17 # new
batch_size = 2**15 # new

Copilot AI Dec 10, 2025


The batch_size configuration parameter is defined twice, on lines 303 and 307 (with the same value, 2**15, both times, but still redundant). This duplication is confusing and could lead to maintenance issues.

Suggested change
batch_size = 2**15 # new

Comment on lines +89 to +93
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The else clause after a for loop on lines 89-93 is unconventional and misleading. In Python, a for-else block means the else executes when the loop completes normally (without a break), not after each iteration. This code will execute the reset logic only once after all n_policy_steps iterations complete, not after each step. This likely causes environments to not be reset when they terminate mid-loop, leading to incorrect simulation behavior.
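
For readers unfamiliar with the construct, a generic illustration of the for/else semantics the comment describes (whether or not the once-after-the-loop behavior is intended here); this is a standalone example, not the runner code:

events = []
for step in range(3):
    events.append(f"step {step}")
else:
    events.append("else body")  # executes exactly once, after the loop finishes without a break
print(events)  # ['step 0', 'step 1', 'step 2', 'else body']

A reset that should happen after every sub-step would therefore have to live inside the loop body; note the owner's reply to the same pattern above ("This is correct"), suggesting the once-after-the-loop reset is intentional in this PR.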

Comment on lines +97 to +101
else:
    # catch and reset failed envs
    to_be_reset = self.env.timed_out | self.env.terminated
    env_ids = (to_be_reset).nonzero(as_tuple=False).flatten()
    self.env._reset_idx(env_ids)

Copilot AI Dec 10, 2025


The same for-else bug exists here as in on_policy_runner.py. The else clause on lines 97-101 will only execute after the entire loop completes, not after each step, preventing proper environment resets during the loop.

Comment on lines +117 to +118
# class oscillator:
# base_frequency = 3.0 # [Hz]

Copilot AI Dec 10, 2025


This comment appears to contain commented-out code.

Suggested change
# class oscillator:
# base_frequency = 3.0 # [Hz]

@torch.no_grad
def burn_in_normalization(self, n_iterations=100):
    actor_obs = self.get_obs(self.actor_cfg["obs"])
    critic_obs = self.get_obs(self.critic_cfg["obs"])

Copilot AI Dec 10, 2025


This assignment to 'critic_obs' is unnecessary as it is redefined before this value is used.

Suggested change
critic_obs = self.get_obs(self.critic_cfg["obs"])

@sheim sheim merged commit 738b5d9 into dev Dec 10, 2025
2 checks passed
@sheim sheim deleted the merge_humanoid branch December 10, 2025 23:08