Conversation
…configuration and testing setup. Remove unused files and consolidate common neural network utilities in BaseAlgorithm.
…in SAC, L2T, and PPO agents using reusable base methods. Remove unused imports and consolidate logging utilities.
…ne L2T and PPO implementations, and enhance code readability with improved docstrings and assertions.
… and scheduler configurations, add NaN checks, and streamline parameter validation.
…etwork configuration details into code comments and improving overall documentation structure.
… adjust optimizer configurations, and enhance SAC training parameters. Add long-run SAC HalfCheetah-v5 test with wandb logging support.
…s, enhance environment setup, and improve profiling tests. Adjusted network architectures, removed unused features, and streamlined logging for better performance tracking.
…lts directories, refactor BaseAlgorithm to support cudagraph policy, and improve logging configurations. Clean up unused code in buffer and agent files, and streamline environment setup in env_utils. Remove deprecated PPO and SAC implementations, ensuring compatibility with current configurations.
…rations and training logic. Remove unused parameters and streamline update method to accommodate expert data handling. Adjust tests to reflect changes in baseline performance comparisons with PPO.
…Update `NetworkConfig` to support `None` as a default for `input_keys`, introducing a method to retrieve effective keys. Refactor GAIL, IPMD, PPO, and SAC classes to utilize this new method, ensuring compatibility with multi-key TensorDict inputs. Add utility functions for observation key management and tensor operations in `config_utils`.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
sampled_tensordict_for_sac = sampled_tensordict.clone()
sampled_tensordict_for_sac["reward"] = (
    gail_rewards * self.config.gail.gail_reward_coeff
).unsqueeze(-1)
```
GAIL reward written to wrong TensorDict key
High Severity
The GAIL reward is written to the flat key "reward" on the cloned TensorDict, but TorchRL's SACLoss reads rewards from the nested key ("next", "reward"). As a result, the discriminator-computed GAIL rewards are never actually used for the SAC policy update — the original environment rewards remain at ("next", "reward") and are what the loss module uses. The key assignment needs to target ("next", "reward") instead of "reward", consistent with how InfoGAIL and ASE handle it.
```python
cfg = self.config
frames_per_batch = cfg.collector.frames_per_batch
total_frames = cfg.collector.total_frames
utd_ratio = float(cfg.ipmd.utd_ratio)
```
Missing utd_ratio attribute on IPMDConfig dataclass
High Severity
IPMDDiffSR.train() accesses cfg.ipmd.utd_ratio, but IPMDConfig (which extends PPOConfig) has no utd_ratio field. This will raise an AttributeError at runtime. The field exists on SACConfig (accessed as sac.utd_ratio) but was never added to IPMDConfig. It likely needs to be declared in IPMDConfig or accessed from a different config section.
Additional Locations (1)
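A hedged sketch of one possible fix, declaring the missing field on the IPMD config section. The class names, base fields, and the default value of `1.0` are assumptions for illustration, not the repo's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class PPOConfig:
    # Placeholder base field; the real PPOConfig has its own members.
    lr: float = 3e-4


@dataclass
class IPMDConfig(PPOConfig):
    # Declaring utd_ratio here makes cfg.ipmd.utd_ratio resolve at runtime
    # instead of raising AttributeError in IPMDDiffSR.train().
    utd_ratio: float = 1.0


cfg_ipmd = IPMDConfig()
```

Alternatively, the training loop could read the ratio from the SAC section (`cfg.sac.utd_ratio`), where the field already exists.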
```python
loss_alpha = loss_td["loss_alpha"]
self.optim[2].zero_grad()
loss_alpha.backward()
self.optim[2].step()
```
Indexing grouped optimizer as list crashes update
High Severity
Both InfoGAIL and ASE update methods index self.optim with self.optim[0], self.optim[1], self.optim[2] as if it were a list. However, BaseAlgorithm._configure_optimizers() passes the list returned by _set_optimizers through group_optimizers(...), producing a single grouped torch.optim.Optimizer. This grouped optimizer doesn't support __getitem__, so these calls will crash at runtime. Every other algorithm (PPO, SAC, GAIL, IPMD) correctly uses self.optim.zero_grad() and self.optim.step() on the single object.
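A self-contained sketch of the failure mode and the correct pattern. The three parameters stand in for actor/critic/alpha modules; a single `torch.optim.Adam` over multiple param groups plays the role of the grouped optimizer, since `torch.optim.Optimizer` does not implement `__getitem__`.

```python
import torch

# Stand-ins for the actor, critic, and entropy-temperature parameters.
actor_p = torch.nn.Parameter(torch.zeros(2))
critic_p = torch.nn.Parameter(torch.zeros(2))
alpha_p = torch.nn.Parameter(torch.zeros(1))

# One optimizer over several param groups, mimicking the single grouped
# optimizer that group_optimizers(...) returns.
optim = torch.optim.Adam([
    {"params": [actor_p]},
    {"params": [critic_p]},
    {"params": [alpha_p]},
])

# optim[2] would raise TypeError: the Optimizer object is not subscriptable.
# Correct pattern (as PPO/SAC/GAIL/IPMD use): step the whole object once.
loss = actor_p.sum() + critic_p.sum() + alpha_p.sum()
optim.zero_grad()
loss.backward()
optim.step()
```

If the losses genuinely need separate steps, `_set_optimizers` would have to return optimizers that are kept as a list rather than passed through `group_optimizers`.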


Note
Medium Risk
Substantial new RL/imitation algorithms and a major IPMD training-loop refactor could change training behavior and introduce subtle correctness/performance issues. Tooling/doc changes are low risk, but the core learning code paths warrant careful review and testing.

Overview

Adds new imitation-learning algorithms: `GAIL` (SAC-based), `InfoGAIL` (skill-conditioned discriminator + mutual-information bonus), and `ASE` (multi-discriminator + style/diversity rewards), along with their config dataclasses and supporting discriminator/encoder/posterior modules.

Refactors `IPMD` from a standalone Hydra/SAC-style script into a library PPO-derived algorithm with an explicit reward-estimator network, expert replay-buffer integration, optional behavior cloning, improved logging/diagnostics (including EPIC distance), and a new diffusion-feature variant `IPMDDiffSR`.

Cleans up repo/tooling: removes checked-in IDE configs and old Hydra YAML configs, adds ignore rules for logs/expert outputs, bumps the package version and test/ruff settings (skips `slow` by default, plus markers/logging), updates the LICENSE year and README blurb, and adds a `notebooks/test_sac.ipynb` SAC smoke-test notebook.

Written by Cursor Bugbot for commit 13cc371. This will update automatically on new commits. Configure here.