-
Notifications
You must be signed in to change notification settings - Fork 87
Update to New RSL-RL - Enable multi gpu #31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Update to New RSL-RL - Enable multi gpu #31
Conversation
fix missing attribute and log path add player for new rsl_rl remove checkpoint abspath. Updated doc running formatter adding student add check for multi gpu and set right device delete dependency remove install_deps.sh file remove custom rsl_rl fix deprecated attribute save log in robot subfolder configured training for student with new rsl_rl change observations teacher_policy to teacher for rsl_rl compatibility get right checkpoints in the player update code for eval run formatter fix eval script. updated doc add my isaaclab path to intellisense specify python package
|
Thanks for the MR! We really appreciate the contribution. It might take us a bit of time to review the changes thoroughly and test them out before we can merge it in. Thanks for your patience! |
huihuaNvidia2023
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First path done, need to verify locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice but assumes a path of IsaacLab. A better way would be using the environmental variable ISAACLAB_PATH. VSCODE does provide utils to use env variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was using this for debugging, I think it would be best to just remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't host this dataset due to licence issue.
| num_steps_per_env: int = field( | ||
| default=24, metadata={"description": "Number of steps per environment per iteration."} | ||
| num_steps_per_env = 24 | ||
| max_iterations = 10000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this number is too big. Normally 50k or 80k can give a good result, maybe 100k is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the comprehensive changes, your contribution is much appreciated!
Actually, the latest IsaacLab 2.1 comes with the version of RSL RL that has distillation and a bunch other new features.
If we upgrade to v2.1.0, we could potentially make it even nicer. To jump from v2.0.0 to v2.1.0, the changes are minimum, only a few data names need to be updated.
It may worth to give it a shot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this I kept the same iteration number as the original version, but I can reduce it no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah so this PR already uses the version of RL_RL that has distillation. The only reason why I did not use it and I was installing rsl separately is because in the distillation algorithm rsl was not using a gradient cap, which was instead used in the original implementation shipped with HOVER. Now rsl has it, but as far as i know these changes have not been reflected in isaaclab which still uses rsl-rl-lib 2.3.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this pr was mostly to include the new library so that we can use the multi gpu support. I know it's not super clean, but before spending too much time on it I wanted to know if you have any advice on how to proceed.
I'm very open to suggestions and I can work on it, so if you have ideas please share✌️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for mentioning the detail. Yeah, this works for me. Your PR is still the main piece. We can worry about the Lab conversion later.
| num_steps_per_env: int = field( | ||
| default=24, metadata={"description": "Number of steps per environment per iteration."} | ||
| num_steps_per_env = 24 | ||
| max_iterations = 10000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for mentioning the detail. Yeah, this works for me. Your PR is still the main piece. We can worry about the Lab conversion later.
| It is possible to train on a node with multiple gpus or multiple nodes with multiple gpues. | ||
| In order to train the teacher on a machine with 4 gpus: | ||
| ```bash | ||
| python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 scripts/rsl_rl/train_teacher.py --num_envs=1024 --headless --distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
train_teacher.py -> train_teacher_policy.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please also update to use joint_friction_coeff here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On another note, there are several unit tests are failing in the core, inference_env and isaac_lab_wrapper folders. Could you please take a look?
| # Add max_grad_norm to the configuration | ||
| # As it has not been added to isaaclab yet | ||
| max_grad_norm: float = MISSING | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove
|
Hey there! I checked the tests. |
Update to New RSL-RL - Enable multi-gpu
Description
This pull request introduces significant improvements to the project by fully integrating the latest configuration of the RSL-RL library. The update removes the need for a custom version, enabling a more streamlined and modular workflow.
Summary of Changes
🆕 New Files
scripts/rsl_rl/cli_args.pyandscripts/rsl_rl/student_policy_cfg.pyto support the new configuration workflow.✏️ Modified Files
README.md: Updated to reflect the latest usage instructions and improvements.neural_wbc/: Modified multiple components, including environment wrappers, inference environments, and policy trainers, to align with the updated RSL-RL interface.scripts/rsl_rl/: Revised training and evaluation scripts to support the latest features of the library.pyproject.toml&requirements.txt: Updated dependencies to match the new setup.🗑️ Removed Files
third_party/rsl_rl/, as the project now uses the upstream RSL-RL library directly.Purpose
The main goals of this update are:
Testing
Let me know if there are any concerns or suggestions for further improvement.