Algorithm #5 (Merged)

fei-yang-wu merged 42 commits into main from algorithm on Feb 14, 2026
Conversation

@fei-yang-wu (Owner) commented on Feb 14, 2026

Note

Medium Risk
Substantial new RL/imitation algorithms and a major IPMD training-loop refactor could change training behavior and introduce subtle correctness/performance issues. Tooling/doc changes are low risk, but the core learning code paths warrant careful review and testing.

Overview
Adds new imitation-learning algorithms: GAIL (SAC-based), InfoGAIL (skill-conditioned discriminator + mutual-information bonus), and ASE (multi-discriminator + style/diversity rewards), along with their config dataclasses and supporting discriminator/encoder/posterior modules.

Refactors IPMD from a standalone Hydra/SAC-style script into a library PPO-derived algorithm with an explicit reward-estimator network, expert replay-buffer integration, optional behavior cloning, improved logging/diagnostics (including EPIC distance), and a new diffusion-feature variant IPMDDiffSR.

Cleans up the repo and tooling: removes checked-in IDE configs and old Hydra YAML configs, adds ignore rules for logs and expert outputs, bumps the package version, updates test/ruff settings (slow tests skipped by default, plus markers and logging config), updates the LICENSE year and README blurb, and adds a notebooks/test_sac.ipynb SAC smoke-test notebook.

Written by Cursor Bugbot for commit 13cc371. This will update automatically on new commits.

…configuration and testing setup. Remove unused files and consolidate common neural network utilities in BaseAlgorithm.
…in SAC, L2T, and PPO agents using reusable base methods. Remove unused imports and consolidate logging utilities.
…ne L2T and PPO implementations, and enhance code readability with improved docstrings and assertions.
… and scheduler configurations, add NaN checks, and streamline parameter validation.
…etwork configuration details into code comments and improving overall documentation structure.
… adjust optimizer configurations, and enhance SAC training parameters. Add long-run SAC HalfCheetah-v5 test with wandb logging support.
…s, enhance environment setup, and improve profiling tests. Adjusted network architectures, removed unused features, and streamlined logging for better performance tracking.
…lts directories, refactor BaseAlgorithm to support cudagraph policy, and improve logging configurations. Clean up unused code in buffer and agent files, and streamline environment setup in env_utils. Remove deprecated PPO and SAC implementations, ensuring compatibility with current configurations.
…rations and training logic. Remove unused parameters and streamline update method to accommodate expert data handling. Adjust tests to reflect changes in baseline performance comparisons with PPO.
…Update `NetworkConfig` to support `None` as a default for `input_keys`, introducing a method to retrieve effective keys. Refactor GAIL, IPMD, PPO, and SAC classes to utilize this new method, ensuring compatibility with multi-key TensorDict inputs. Add utility functions for observation key management and tensor operations in `config_utils`.
@fei-yang-wu merged commit a998b82 into main on Feb 14, 2026
1 of 2 checks passed
@cursor (bot) left a comment:

Cursor Bugbot has reviewed your changes and found 3 potential issues.


sampled_tensordict_for_sac = sampled_tensordict.clone()
sampled_tensordict_for_sac["reward"] = (
gail_rewards * self.config.gail.gail_reward_coeff
).unsqueeze(-1)

GAIL reward written to wrong TensorDict key

High Severity

The GAIL reward is written to the flat key "reward" on the cloned TensorDict, but TorchRL's SACLoss reads rewards from the nested key ("next", "reward"). As a result, the discriminator-computed GAIL rewards are never actually used for the SAC policy update — the original environment rewards remain at ("next", "reward") and are what the loss module uses. The key assignment needs to target ("next", "reward") instead of "reward", consistent with how InfoGAIL and ASE handle it.
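The fix the review describes can be sketched with plain nested dicts standing in for a TensorDict, so the snippet stays dependency-free; the key layout mirrors TorchRL's convention, and the coefficient value is an arbitrary assumption.

```python
# Stand-in for a TensorDict using nested dicts (tensordict itself not
# required for this sketch). SACLoss reads rewards from the nested
# ("next", "reward") entry -- here batch["next"]["reward"] -- not from
# the flat "reward" key.
gail_reward_coeff = 0.1  # assumed config value for illustration
gail_reward = 2.0        # discriminator-derived reward for this transition

batch = {
    "reward": [1.0],             # flat key: NOT read by the SAC loss
    "next": {"reward": [1.0]},   # nested key: what the SAC loss actually uses
}

# Buggy version wrote to the flat key, so the loss never saw it:
# batch["reward"] = [gail_reward * gail_reward_coeff]

# Corrected: overwrite the nested entry the loss module reads,
# consistent with how InfoGAIL and ASE handle it.
batch["next"]["reward"] = [gail_reward * gail_reward_coeff]

print(batch["next"]["reward"])  # -> [0.2]
```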


cfg = self.config
frames_per_batch = cfg.collector.frames_per_batch
total_frames = cfg.collector.total_frames
utd_ratio = float(cfg.ipmd.utd_ratio)

Missing utd_ratio attribute on IPMDConfig dataclass

High Severity

IPMDDiffSR.train() accesses cfg.ipmd.utd_ratio, but IPMDConfig (which extends PPOConfig) has no utd_ratio field. This will raise an AttributeError at runtime. The field exists on SACConfig (accessed as sac.utd_ratio) but was never added to IPMDConfig. It likely needs to be declared in IPMDConfig or accessed from a different config section.
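One way to resolve this, sketched with minimal stand-in dataclasses, is to declare the missing field on IPMDConfig. The class bodies and the 1.0 default here are assumptions for illustration only; the repo's real PPOConfig/IPMDConfig contain many more fields.

```python
from dataclasses import dataclass

# Minimal stand-ins for the real config classes; field names and
# defaults here are assumptions, not the project's actual definitions.
@dataclass
class PPOConfig:
    lr: float = 3e-4

@dataclass
class IPMDConfig(PPOConfig):
    # Declare the field the IPMDDiffSR training loop reads, so
    # cfg.ipmd.utd_ratio no longer raises AttributeError.
    # 1.0 is an assumed default (one update per collected frame).
    utd_ratio: float = 1.0

cfg = IPMDConfig()
print(cfg.utd_ratio)  # -> 1.0
```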

Additional Locations (1)


loss_alpha = loss_td["loss_alpha"]
self.optim[2].zero_grad()
loss_alpha.backward()
self.optim[2].step()

Indexing grouped optimizer as list crashes update

High Severity

Both InfoGAIL and ASE update methods index self.optim with self.optim[0], self.optim[1], self.optim[2] as if it were a list. However, BaseAlgorithm._configure_optimizers() passes the list returned by _set_optimizers through group_optimizers(...), producing a single grouped torch.optim.Optimizer. This grouped optimizer doesn't support __getitem__, so these calls will crash at runtime. Every other algorithm (PPO, SAC, GAIL, IPMD) correctly uses self.optim.zero_grad() and self.optim.step() on the single object.
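Why the indexing crashes can be shown with a plain-Python sketch. The class below is a stand-in for the single grouped optimizer that TorchRL's group_optimizers(...) returns, not the real implementation; the point is that the grouped object exposes zero_grad()/step() but no __getitem__.

```python
# Stand-in for the single grouped optimizer returned by
# group_optimizers(...); illustrative only, not TorchRL's implementation.
class GroupedOptimizer:
    def __init__(self, *optims):
        self._optims = optims
        self.steps = 0

    def zero_grad(self):
        pass  # would clear grads across every wrapped param group

    def step(self):
        self.steps += 1  # would step every wrapped param group at once

optim = GroupedOptimizer("actor_opt", "critic_opt", "alpha_opt")

# Buggy pattern from InfoGAIL/ASE: treats the grouped object like a list.
crashed = False
try:
    optim[2].step()
except TypeError:  # no __getitem__ on the grouped optimizer
    crashed = True

# Correct pattern (as in PPO/SAC/GAIL/IPMD): combine the losses, backprop
# once, then drive the single grouped optimizer, e.g.
#   total_loss = loss_actor + loss_qvalue + loss_alpha
#   total_loss.backward()
optim.zero_grad()
optim.step()

print(crashed, optim.steps)  # -> True 1
```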

Additional Locations (1)

