Commits
35 commits
701bf78
add ack
Ericonaldo Aug 15, 2024
daa5c04
Merge branch 'main' of https://github.com/muzi2018/visual_wholebody
muzi2018 Aug 13, 2025
c131d97
impletement low policy
muzi2018 Aug 13, 2025
8b57db5
impletement low policy
muzi2018 Aug 13, 2025
1abcfbf
Merge branch 'main' of https://github.com/muzi2018/visual_wholebody
muzi2018 Aug 13, 2025
f1d6f41
update
muzi2018 Aug 17, 2025
cd07834
update
muzi2018 Aug 17, 2025
479a759
update
muzi2018 Aug 18, 2025
9e92554
contact_offset: 0.04
muzi2018 Aug 18, 2025
aac0bdd
1024 envs
muzi2018 Aug 18, 2025
04f4ece
update
muzi2018 Aug 20, 2025
bdfe35f
update
muzi2018 Aug 29, 2025
05ac450
update
muzi2018 Aug 30, 2025
0c15d12
b1z1_pickmulti.yaml
muzi2018 Aug 30, 2025
f51af48
update
muzi2018 Aug 31, 2025
5a6e2c8
update
muzi2018 Aug 31, 2025
b039920
update
muzi2018 Sep 3, 2025
b9759af
update reward
muzi2018 Sep 3, 2025
7b38f16
update
muzi2018 Sep 3, 2025
1bf299d
update reward eeorn
muzi2018 Sep 3, 2025
716a2e1
update
muzi2018 Sep 3, 2025
5a78463
Merge branch 'train/parameter' of https://github.com/muzi2018/visual_…
muzi2018 Sep 3, 2025
c6892ab
policy 5: scales->approaching = 10.0
muzi2018 Sep 4, 2025
24553bf
syn
muzi2018 Sep 4, 2025
17dbf56
update parameter ee_orn: 0.1
muzi2018 Sep 4, 2025
d1a7b9e
update parameter ee_orn: 0.1 policy 5 again
muzi2018 Sep 4, 2025
fb7f1b8
Merge branch 'train/scales' of https://github.com/muzi2018/visual_who…
muzi2018 Sep 4, 2025
fe5b78e
update push
muzi2018 Sep 5, 2025
8f840a7
add reward watch
muzi2018 Sep 5, 2025
853dffc
Create high_level_reward.txt
muzi2018 Sep 5, 2025
6ee8fb8
modify _reward_base_dir
muzi2018 Sep 6, 2025
8cf4d4f
modify
muzi2018 Sep 6, 2025
e1bed5d
success
muzi2018 Sep 8, 2025
3800500
Merge branch 'debug/pipeline' of https://github.com/muzi2018/visual_w…
muzi2018 Sep 8, 2025
bd3159f
success for teacher policy
muzi2018 Sep 8, 2025
30 changes: 23 additions & 7 deletions high-level/README.md
@@ -3,7 +3,10 @@
This code base only includes the task of picking multiple objects.

## Code structure

`data` contains assets and config files:
- `assets`: robot models
- `config`: RL environment configs

`envs` contains environment related codes.

@@ -15,24 +18,25 @@ This code base only includes the task of picking multiple objects.

## Train

1. Environments:
**Picking multiple objects**: [b1_z1pickmulti.py](./envs/b1_z1pickmulti.py) implements the walking-and-picking environment, supporting a floating base (fixed and varied body heights).

2. To train the state-based teacher, use [train_multistate.py](./train_multistate.py) with the config [b1z1_pickmulti.yaml](./data/cfg/b1z1_pickmulti.yaml) (remember to set the pre-trained low-level policy path in the config file):

```bash
python train_multistate.py --rl_device "cuda:0" --sim_device "cuda:0" --timesteps 60000 --headless --task B1Z1PickMulti --experiment_dir b1-pick-multi-teacher --wandb --wandb_project "b1-pick-multi-teacher" --wandb_name "some descriptions" --roboinfo --observe_gait_commands --small_value_set_zero --rand_control --stop_pick
```

Arguments explanation:

`--task` should be the full name of the environment, with the first letter of each word capitalized.

`--timesteps` sets the total number of training steps.

`--experiment_dir` is the name of the directory where the running is saved.

`--wandb` enables logging to wandb. You can omit this argument if you don't want to use wandb or when debugging.

`--wandb_project` is the name of the project in wandb.

`--wandb_name` is the name of the run in wandb, which is also the name of this run in experiment_dir.

@@ -47,26 +51,38 @@ Arguments explanation:
`--stop_pick` enforces the robot to stop when the gripper is closing.
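The flag set documented above can be sketched with `argparse` (a hypothetical parser for illustration; the repo's actual argument handling may differ):

```python
import argparse

# Hypothetical declaration of the training flags documented above.
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True)       # env name, CamelCase
parser.add_argument("--timesteps", type=int, default=60000)  # total training steps
parser.add_argument("--experiment_dir", type=str, default="runs")
parser.add_argument("--wandb", action="store_true")          # enable wandb logging
parser.add_argument("--wandb_project", type=str, default=None)
parser.add_argument("--wandb_name", type=str, default=None)
parser.add_argument("--roboinfo", action="store_true")
parser.add_argument("--observe_gait_commands", action="store_true")
parser.add_argument("--small_value_set_zero", action="store_true")
parser.add_argument("--rand_control", action="store_true")
parser.add_argument("--stop_pick", action="store_true")

args = parser.parse_args(
    ["--task", "B1Z1PickMulti", "--timesteps", "60000", "--wandb", "--stop_pick"])
print(args.task, args.timesteps, args.wandb)
```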

For playing the teacher policy, use [play_multistate.py](./play_multistate.py):

```bash
python play_multistate.py --task B1Z1PickMulti --checkpoint "(specify the path)" # --(same arguments as training)
```

Successful teacher policy training should take at most 60000 timesteps.

3. To train the vision-based student policy, use [train_multi_bc_deter.py](./train_multi_bc_deter.py) with the config [b1z1_pickmulti.yaml](./data/cfg/b1z1_pickmulti.yaml):

```bash
python train_multi_bc_deter.py --headless --task B1Z1PickMulti --rl_device "cuda:0" --sim_device "cuda:0" --timesteps 60000 --experiment_dir "b1-pick-multi-stu" --wandb --wandb_project "b1-pick-multi-stu" --wandb_name "checkpoint dir path" --teacher_ckpt_path "teacher checkpoint path" --roboinfo --observe_gait_commands --small_value_set_zero --rand_control --stop_pick
```

Arguments are similar to those above.

For playing the trained student policy, use [play_multi_bc_deter.py](./play_multi_bc_deter.py):

```bash
python play_multi_bc_deter.py --task B1Z1PickMulti --checkpoint "(specify the path)" # --(same arguments as training)
```

If you don't specify `--num_envs`, it will use 34 by default (only for this script).
Successful student policy training should take at most 60000 timesteps.

## Others

[test_pointcloud.py](./test_pointcloud.py) can be used to check the point clouds of the objects.

[train_multistate_asym.py](./train_multistate_asym.py) is an attempt at using asymmetric PPO to train the high-level policy (i.e., a vision-based policy with a privileged value function). It trains inefficiently and is not competitive with the teacher-student approach, since the memory consumed by depth images prevents parallelizing many environments.
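Asymmetric PPO pairs a policy that only sees deployable observations with a value function that also sees privileged simulator state. A minimal schematic (not the repo's implementation; all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Actor uses only the deployable (vision-derived) observation; the
    critic additionally consumes privileged state available in simulation."""
    def __init__(self, obs_dim: int, priv_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
        # The value head sees obs + privileged state, so advantage estimates
        # use information the deployed policy never has access to.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim + priv_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))

    def forward(self, obs: torch.Tensor, priv: torch.Tensor):
        value = self.critic(torch.cat([obs, priv], dim=-1))
        return self.actor(obs), value

model = AsymmetricActorCritic(obs_dim=38, priv_dim=16, act_dim=9)
actions, values = model(torch.zeros(4, 38), torch.zeros(4, 16))
print(actions.shape, values.shape)
```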

Example commands:

```bash
python train_multistate.py --timesteps 60000 --headless --task B1Z1PickMulti --experiment_dir b1-pick-multi-teacher --wandb --wandb_project "b1-pick-multi-teacher" --wandb_name "policy2" --roboinfo --observe_gait_commands --small_value_set_zero --rand_control --stop_pick

python play_multistate.py --task B1Z1PickMulti --checkpoint /home/wang/Desktop/visual_wholebody/high-level/b1-pick-multi-teacher/policy_2/checkpoints/agent_30001.pt --roboinfo --observe_gait_commands --small_value_set_zero --rand_control --stop_pick
```
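The success criterion configured in [b1z1_pickmulti.yaml](./data/cfg/b1z1_pickmulti.yaml) (`liftedSuccessThreshold: 0.35`, `holdSteps: 25`) can be sketched in plain Python under assumed semantics (the object must stay lifted for enough consecutive steps; an illustration, not the repo's code):

```python
def lifted_success(obj_heights, init_height, lift_thresh=0.35, hold_steps=25):
    """Return True once the object stays more than lift_thresh meters above
    its initial height for hold_steps consecutive steps (assumed semantics)."""
    held = 0
    for h in obj_heights:
        held = held + 1 if h - init_height > lift_thresh else 0
        if held >= hold_steps:
            return True
    return False

# Object raised 0.5 m above its start height and held there for 30 steps.
print(lifted_success([0.0] * 5 + [0.5] * 30, init_height=0.0))  # → True
```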
19 changes: 10 additions & 9 deletions high-level/data/cfg/b1z1_pickmulti.yaml
@@ -1,5 +1,5 @@
env:
numEnvs: 10240
numEnvs: 5120
numAgents: 1
envSpacing: 5.0
enableDebugVis: False
@@ -11,14 +11,15 @@ env:
controlFrequencyLow: 8
imgDelayFrame: 4

liftedSuccessThreshold: 0.35
liftedSuccessThreshold: 0.35 # how many meters the object must be lifted to count as a success
liftedInitThreshold: 0.05
baseObjectDisThreshold: 0.6
holdSteps: 25
lastCommands: False

low_policy_path: "/data/mhliu/visual_wholebody/high-level/data/low_policy/publiccheckrollrew_42000.pt" # "data/low_policy/jan17-frontfeet_terrainflat_linrew2_freq2_gaitstopzero_penelizeang_roll_fixbug_35600.pt"

low_policy_path: "/home/wang/Desktop/visual_wholebody/low-level/logs/b1z1-low/gait_reference/model_37000.pt"
# "data/low_policy/jan17-frontfeet_terrainflat_linrew2_freq2_gaitstopzero_penelizeang_roll_fixbug_35600.pt"
# "/home/wang/Desktop/visual_wholebody/low-level/logs/b1z1-low/gait_tuning_v2/model_13000.pt"
asset:
assetRoot: "data/asset"
assetFileRobot: "b1z1-col/urdf/b1z1.urdf"
@@ -216,7 +217,7 @@ sim:
solver_type: 1
num_position_iterations: 8
num_velocity_iterations: 0
contact_offset: 0.02
contact_offset: 0.04
rest_offset: 0.0
bounce_threshold_velocity: 0.2
max_depenetration_velocity: 50.0
@@ -233,15 +234,15 @@ reward:
acc_penalty: -0.001
command_penalty: -1.0
command_reward: 0.25
standpick: 0.25
standpick: 0.25 # not found in reward_vec_task.py
action_rate: -0.001
ee_orn: 0.01
ee_orn: 0.01 # 0.01
base_dir: 0.25
rad_penalty: 0.0
base_ang_pen: 0.0
base_approaching: 0.01 # 0.05
grasp_base_height: 0.5
gripper_rate: 0.0 # -0.1
grasp_base_height: 0.5 # not found in reward_vec_task.py
gripper_rate: -0.0 # -0.1

sensor:
depth_clip_lower: 0.15
72 changes: 61 additions & 11 deletions high-level/envs/b1z1_pickmulti.py
@@ -18,19 +18,19 @@

class B1Z1PickMulti(B1Z1Base):
def __init__(self, table_height=None, *args, **kwargs):
self.num_actors = 3
self.num_actors = 3 # the environment contains 3 actors: the robot (B1 with the Z1 arm), the table, and the object to pick
super().__init__(*args, **kwargs)
self.near_goal_stop = self.cfg["env"].get("near_goal_stop", False)
self.obj_move_prob = self.cfg["env"].get("obj_move_prob", 0.0)
self.near_goal_stop = self.cfg["env"].get("near_goal_stop", False) # stop moving the base when the robot is near the goal, for a more reliable grasp
self.obj_move_prob = self.cfg["env"].get("obj_move_prob", 0.0) # probability of randomly perturbing the object's position, to add randomness and improve policy robustness
self.table_heights_fix = table_height

def update_roboinfo(self):
super().update_roboinfo()
base_obj_dis = self._cube_root_states[:, :2] - self.arm_base[:, :2]
self.base_obj_dis = torch.norm(base_obj_dis, dim=-1)
self.base_obj_dis = torch.norm(base_obj_dis, dim=-1) # 2D planar distance between the arm base and the object

def _setup_obs_and_action_info(self):
super()._setup_obs_and_action_info(removed_dim=9, num_action=9, num_obs=38+self.num_features-1)
super()._setup_obs_and_action_info(removed_dim=9, num_action=9, num_obs=38+self.num_features-1) # removed_dim=9: 9 dimensions dropped from the observation (likely redundant states, e.g. some joint states); num_action=9: 9 control signals (arm joints plus gripper); num_obs=38+self.num_features-1: 38 base dims plus the object feature dims, minus one removed feature

def _extra_env_settings(self):
self.multi_obj = self.cfg["env"]["asset"]["asset_multi"]
@@ -39,7 +39,7 @@ def _extra_env_settings(self):
self.obj_orn = [self.multi_obj[obj]["orientation"] for obj in self.obj_list]
self.obj_scale = [self.multi_obj[obj]["scale"] for obj in self.obj_list]
obj_dir = os.path.join(self.cfg["env"]["asset"]["assetRoot"], self.cfg["env"]["asset"]["assetFileObj"])
if not self.no_feature:
if not self.no_feature: # feature is used
features = []
for obj_name in self.obj_list:
feature_path = os.path.join(obj_dir, obj_name, "features.npy")
@@ -210,8 +210,46 @@ def _reset_envs(self, env_ids):
"SuccessRate / Bottle": bottle_success_rate,
"SuccessRate / Cup": cup_success_rate,
"SuccessRate / Drill": drill_success_rate,
}
},
}

base_height_reward, _ = self._reward_base_height()
reward_approaching, _ = self._reward_approaching()
reward_lifting, _ = self._reward_lifting()
reward_pick_up, _ = self._reward_pick_up()
reward_acc_penalty, _ = self._reward_action_penalty()
reward_command_reward, _ = self._reward_command_reward()
reward_command_penalty, _ = self._reward_command_penalty()
reward_action_rate, _ = self._reward_action_rate()
reward_ee_orn, _ = self._reward_ee_orn()
reward_base_dir, _ = self._reward_base_dir()
reward_rad_penalty, _ = self._reward_rad_penalty()
reward_base_ang_pen, _ = self._reward_base_ang_pen()
reward_base_approaching, _ = self._reward_base_approaching()
reward_standpick, _ = self._reward_standpick()
reward_grasp_base_height, _ = self._reward_grasp_base_height()

if self.cfg["env"]["wandb"]:
wandb.log({
"Reward / BaseHeight_mean": base_height_reward.mean().item(),
"Reward / Approaching_mean": reward_approaching.mean().item(),
"Reward / Lifting_mean": reward_lifting.mean().item(),
"Reward / PickUp_mean": reward_pick_up.mean().item(),
"Reward / ActionPenalty_mean": reward_acc_penalty.mean().item(),
"Reward / CommandReward_mean": reward_command_reward.mean().item(),
"Reward / CommandPenalty_mean": reward_command_penalty.mean().item(),
"Reward / ActionRate_mean": reward_action_rate.mean().item(),
"Reward / EEOrientation_mean": reward_ee_orn.mean().item(),
"Reward / BaseDir_mean": reward_base_dir.mean().item(),
"Reward / RadiusPenalty_mean": reward_rad_penalty.mean().item(),
"Reward / BaseAngPenalty_mean": reward_base_ang_pen.mean().item(),
"Reward / BaseApproaching_mean": reward_base_approaching.mean().item(),
"Reward / StandPick_mean": reward_standpick.mean().item(),
"Reward / GraspBaseHeight_mean": reward_grasp_base_height.mean().item(),
}, step=self.global_step_counter)
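Some entries in the log dict above are reduced with `.mean().item()` while others are passed through raw; a small helper like the following (a sketch, not repo code) keeps every logged value a plain float:

```python
import numpy as np

def to_scalars(metrics):
    """Coerce each metric to a plain Python float: array/tensor-like values
    (anything with .mean()) are averaged first; numbers pass straight through."""
    out = {}
    for key, value in metrics.items():
        mean = value.mean() if hasattr(value, "mean") else value
        out[key] = mean.item() if hasattr(mean, "item") else float(mean)
    return out

logged = to_scalars({
    "Reward / BaseHeight_mean": np.array([0.2, 0.4]),  # array-valued term
    "Reward / CommandReward_mean": 0.25,               # already a scalar
})
print(logged)
```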
if self.pred_success:
predlift_success_rate = 0 if self.global_step_counter==0 else (self.predlift_success_counter / self.local_step_counter).mean().item()
wandb_dict["success_rate"]["SuccessRate / PredLifted"] = predlift_success_rate
@@ -257,7 +295,7 @@ def _reset_table(self, env_ids):
else:
rand_heights = torch.ones((len(env_ids), 1), device=self.device, dtype=torch.float)*self.table_heights_fix - self.table_dimz / 2

self._table_root_states[env_ids, 2] = rand_heights.squeeze(1) - self.table_dimz / 2.0
self._table_root_states[env_ids, 2] = self.table_dimz / 2.0
self.table_heights[env_ids] = self._table_root_states[env_ids, 2] + self.table_dimz / 2.0

def _reset_actors(self, env_ids):
@@ -314,6 +352,10 @@ def _compute_observations(self, env_ids=None):
self.obs_buf[env_ids] = torch.cat([obs, self.command_history_buf[env_ids, -1]], dim=-1)
else:
self.obs_buf[env_ids] = torch.cat([obs, self.action_history_buf[env_ids, -1]], dim=-1)
def _compute_robot_obs(self, env_ids=None):
if env_ids is None:
@@ -420,7 +462,7 @@ def check_termination(self):
self.reset_buf = self.reset_buf | deviate_much | too_close_table

# --------------------------------- reward functions ---------------------------------
def _reward_standpick(self):
def _reward_standpick(self): # guide the agent to come to a stable stop near the object before grasping
reward = torch.zeros(self.num_envs, device=self.device, dtype=torch.float)
reward[(self.base_obj_dis < self.base_object_distace_threshold) & (self.commands[:, 0] < LIN_VEL_X_CLIP)] = 1.0

@@ -429,7 +471,7 @@ def _reward_standpick(self):

return reward, reward

def _reward_grasp_base_height(self):
def _reward_grasp_base_height(self): # encourage the agent to lift the object to a target height
cube_height = self._cube_root_states[:, 2]
box_pos = self._cube_root_states[:, :3]
d1 = torch.norm(box_pos - self.ee_pos, dim=-1)
@@ -439,7 +481,7 @@ def _reward_grasp_base_height(self):

return reward, reward

def _reward_approaching(self):
def _reward_approaching(self): # reward approaching the object only before it is lifted; no repeated reward afterwards
"""Change the reward function to be effective only when the object is lifted
"""
reward, _ = super()._reward_approaching()
@@ -470,6 +512,14 @@ def _reward_base_dir(self):

# --------------------------------- jit functions ---------------------------------

# Final observation layout:
# Cube pos + orn (3+3)
# EE pos + orn (3+3)
# Arm DOF pos + vel
# commands
# EE goal pos + orn
# Robot base velocity

@torch.jit.script
def compute_robot_observations(robot_root_state, table_root_state, cube_root_state, body_pos,
body_rot, body_vel, body_ang_vel, dof_pos, dof_vel, base_quat_yaw, spherical_center, commands, gripper_idx, table_dim,
7 changes: 3 additions & 4 deletions high-level/envs/reward_vec_task.py
@@ -3,7 +3,7 @@
import torch.nn.functional as F
import numpy as np
from isaacgym.torch_utils import *

import wandb
class RewardVecTask(VecTask):
# --------------------------------- reward functions ---------------------------------
def _reward_approaching(self):
Expand Down Expand Up @@ -126,11 +126,10 @@ def _reward_base_approaching(self, obj_pos):
base_obj_dis = torch.norm(base_obj_dis, dim=-1)
delta_dis = torch.abs(base_obj_dis - self.base_object_distace_threshold)
reward = torch.tanh(-10*delta_dis) + 1

return reward, reward
return reward, base_obj_dis
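The shaping above, `tanh(-10 * |d - threshold|) + 1`, peaks at 1 when the base-object distance `d` equals the threshold (`baseObjectDisThreshold: 0.6` in the config) and decays smoothly toward 0 on either side; a scalar sketch:

```python
import math

def base_approaching_reward(dis, threshold=0.6):
    # Scalar version of the shaping term: maximal (== 1.0) at dis == threshold,
    # symmetric in the deviation, approaching 0 as |dis - threshold| grows.
    delta = abs(dis - threshold)
    return math.tanh(-10.0 * delta) + 1.0

print(base_approaching_reward(0.6))  # → 1.0 (exactly at the threshold)
print(base_approaching_reward(1.0))  # ≈ 0.0007, far from the threshold
```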

def _reward_base_dir(self, obj_pos):
base_x_dir = torch.tensor([1., 0., 0.], device=self.device).repeat(self.num_envs, 1)
base_x_dir = torch.tensor([0., 0., 1.], device=self.device).repeat(self.num_envs, 1)
base_x_dir_world = quat_apply(self.base_yaw_quat, base_x_dir)
obj_dir = obj_pos - self._robot_root_states[:, :3]
obj_dir[:,:2] = 0.