Algorithm #5 (Merged)

fei-yang-wu merged 42 commits into main from algorithm on Feb 14, 2026
Conversation

@fei-yang-wu (Owner) commented on Feb 14, 2026

Note

Medium Risk
Substantial new RL/imitation algorithms and a major IPMD training-loop refactor could change training behavior and introduce subtle correctness/performance issues. Tooling/doc changes are low risk, but the core learning code paths warrant careful review and testing.

Overview
Adds new imitation-learning algorithms: GAIL (SAC-based), InfoGAIL (skill-conditioned discriminator + mutual-information bonus), and ASE (multi-discriminator + style/diversity rewards), along with their config dataclasses and supporting discriminator/encoder/posterior modules.

Refactors IPMD from a standalone Hydra/SAC-style script into a library PPO-derived algorithm with an explicit reward-estimator network, expert replay-buffer integration, optional behavior cloning, improved logging/diagnostics (including EPIC distance), and a new diffusion-feature variant IPMDDiffSR.

Cleans up the repo and tooling: removes checked-in IDE configs and old Hydra YAML configs, adds ignore rules for logs and expert outputs, bumps the package version, updates test/ruff settings (slow tests skipped by default, plus markers and logging config), updates the LICENSE year and README blurb, and adds a notebooks/test_sac.ipynb SAC smoke-test notebook.

Written by Cursor Bugbot for commit 13cc371. This will update automatically on new commits.

…configuration and testing setup. Remove unused files and consolidate common neural network utilities in BaseAlgorithm.
…in SAC, L2T, and PPO agents using reusable base methods. Remove unused imports and consolidate logging utilities.
…ne L2T and PPO implementations, and enhance code readability with improved docstrings and assertions.
… and scheduler configurations, add NaN checks, and streamline parameter validation.
…etwork configuration details into code comments and improving overall documentation structure.
… adjust optimizer configurations, and enhance SAC training parameters. Add long-run SAC HalfCheetah-v5 test with wandb logging support.
…s, enhance environment setup, and improve profiling tests. Adjusted network architectures, removed unused features, and streamlined logging for better performance tracking.
…lts directories, refactor BaseAlgorithm to support cudagraph policy, and improve logging configurations. Clean up unused code in buffer and agent files, and streamline environment setup in env_utils. Remove deprecated PPO and SAC implementations, ensuring compatibility with current configurations.
…rations and training logic. Remove unused parameters and streamline update method to accommodate expert data handling. Adjust tests to reflect changes in baseline performance comparisons with PPO.
…Update `NetworkConfig` to support `None` as a default for `input_keys`, introducing a method to retrieve effective keys. Refactor GAIL, IPMD, PPO, and SAC classes to utilize this new method, ensuring compatibility with multi-key TensorDict inputs. Add utility functions for observation key management and tensor operations in `config_utils`.
@fei-yang-wu merged commit a998b82 into main on Feb 14, 2026
1 of 2 checks passed
@cursor (bot) left a comment:

Cursor Bugbot has reviewed your changes and found 3 potential issues.


sampled_tensordict_for_sac = sampled_tensordict.clone()
sampled_tensordict_for_sac["reward"] = (
gail_rewards * self.config.gail.gail_reward_coeff
).unsqueeze(-1)

GAIL reward written to wrong TensorDict key

High Severity

The GAIL reward is written to the flat key "reward" on the cloned TensorDict, but TorchRL's SACLoss reads rewards from the nested key ("next", "reward"). As a result, the discriminator-computed GAIL rewards are never actually used for the SAC policy update — the original environment rewards remain at ("next", "reward") and are what the loss module uses. The key assignment needs to target ("next", "reward") instead of "reward", consistent with how InfoGAIL and ASE handle it.
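The fix the review describes can be sketched with plain nested dicts standing in for a TensorDict, so the snippet stays dependency-free; the key layout mirrors TorchRL's convention, and the coefficient value is an arbitrary assumption.

```python
# Stand-in for a TensorDict using nested dicts (tensordict itself not
# required for this sketch). SACLoss reads rewards from the nested
# ("next", "reward") entry -- here batch["next"]["reward"] -- not from
# the flat "reward" key.
gail_reward_coeff = 0.1  # assumed config value for illustration
gail_reward = 2.0        # discriminator-derived reward for this transition

batch = {
    "reward": [1.0],             # flat key: NOT read by the SAC loss
    "next": {"reward": [1.0]},   # nested key: what the SAC loss actually uses
}

# Buggy version wrote to the flat key, so the loss never saw it:
# batch["reward"] = [gail_reward * gail_reward_coeff]

# Corrected: overwrite the nested entry the loss module reads,
# consistent with how InfoGAIL and ASE handle it.
batch["next"]["reward"] = [gail_reward * gail_reward_coeff]

print(batch["next"]["reward"])  # -> [0.2]
```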


cfg = self.config
frames_per_batch = cfg.collector.frames_per_batch
total_frames = cfg.collector.total_frames
utd_ratio = float(cfg.ipmd.utd_ratio)

Missing utd_ratio attribute on IPMDConfig dataclass

High Severity

IPMDDiffSR.train() accesses cfg.ipmd.utd_ratio, but IPMDConfig (which extends PPOConfig) has no utd_ratio field. This will raise an AttributeError at runtime. The field exists on SACConfig (accessed as sac.utd_ratio) but was never added to IPMDConfig. It likely needs to be declared in IPMDConfig or accessed from a different config section.
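One way to resolve this, sketched with minimal stand-in dataclasses, is to declare the missing field on IPMDConfig. The class bodies and the 1.0 default here are assumptions for illustration only; the repo's real PPOConfig/IPMDConfig contain many more fields.

```python
from dataclasses import dataclass

# Minimal stand-ins for the real config classes; field names and
# defaults here are assumptions, not the project's actual definitions.
@dataclass
class PPOConfig:
    lr: float = 3e-4

@dataclass
class IPMDConfig(PPOConfig):
    # Declare the field the IPMDDiffSR training loop reads, so
    # cfg.ipmd.utd_ratio no longer raises AttributeError.
    # 1.0 is an assumed default (one update per collected frame).
    utd_ratio: float = 1.0

cfg = IPMDConfig()
print(cfg.utd_ratio)  # -> 1.0
```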

Additional Locations (1)


loss_alpha = loss_td["loss_alpha"]
self.optim[2].zero_grad()
loss_alpha.backward()
self.optim[2].step()

Indexing grouped optimizer as list crashes update

High Severity

Both InfoGAIL and ASE update methods index self.optim with self.optim[0], self.optim[1], self.optim[2] as if it were a list. However, BaseAlgorithm._configure_optimizers() passes the list returned by _set_optimizers through group_optimizers(...), producing a single grouped torch.optim.Optimizer. This grouped optimizer doesn't support __getitem__, so these calls will crash at runtime. Every other algorithm (PPO, SAC, GAIL, IPMD) correctly uses self.optim.zero_grad() and self.optim.step() on the single object.
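Why the indexing crashes can be shown with a plain-Python sketch. The class below is a stand-in for the single grouped optimizer that TorchRL's group_optimizers(...) returns, not the real implementation; the point is that the grouped object exposes zero_grad()/step() but no __getitem__.

```python
# Stand-in for the single grouped optimizer returned by
# group_optimizers(...); illustrative only, not TorchRL's implementation.
class GroupedOptimizer:
    def __init__(self, *optims):
        self._optims = optims
        self.steps = 0

    def zero_grad(self):
        pass  # would clear grads across every wrapped param group

    def step(self):
        self.steps += 1  # would step every wrapped param group at once

optim = GroupedOptimizer("actor_opt", "critic_opt", "alpha_opt")

# Buggy pattern from InfoGAIL/ASE: treats the grouped object like a list.
crashed = False
try:
    optim[2].step()
except TypeError:  # no __getitem__ on the grouped optimizer
    crashed = True

# Correct pattern (as in PPO/SAC/GAIL/IPMD): combine the losses, backprop
# once, then drive the single grouped optimizer, e.g.
#   total_loss = loss_actor + loss_qvalue + loss_alpha
#   total_loss.backward()
optim.zero_grad()
optim.step()

print(crashed, optim.steps)  # -> True 1
```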

Additional Locations (1)

