
Differences Between the Policy Correction Implementation and the Paper, and a Potential Lack of Policy Update in 'projection()' Due to a Broken Gradient #2

@haruki-abe

Description

Hi, thank you very much for the excellent work on this project and for making the code available!

While reading through the implementation, I noticed two points I would like to clarify (I may be misunderstanding something, so please feel free to correct me):

  • The update rule used in this repository’s Policy Correction phase differs from the paper’s practical implementation, which uses KL(new‖old) plus a lower-bound constraint (hinge + λ); a sketch of my reading of that objective follows this list.

  • In addition, the current implementation in projection() seems to break the computation graph, which can prevent gradients from reaching the actor. As a result, the policy may fail to update.
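
For the first point, here is a minimal, self-contained sketch of how I read the paper’s practical objective, just to make the comparison concrete. This is not code from this repository, and pi_old / pi_new, lower_bound, c, and lam are placeholder names of my own for the pre- and post-correction policies, the constrained quantity, and the paper’s threshold and multiplier:

import torch
from torch.distributions import Normal, kl_divergence

# Placeholder distributions only, so the sketch runs on its own; in practice
# these would be the pre- and post-correction action distributions.
pi_old = Normal(torch.zeros(3), torch.ones(3))
pi_new = Normal(torch.zeros(3, requires_grad=True), torch.ones(3))

c, lam = 0.1, 1.0                    # stand-ins for the paper's c and λ
lower_bound = torch.tensor([0.05])   # stand-in for the constrained lower-bound quantity

kl_term = kl_divergence(pi_new, pi_old).mean()        # KL(new‖old)
hinge = torch.clamp(c - lower_bound, min=0.0).mean()  # active only when the bound is violated
correction_loss = kl_term + lam * hinge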

For reference (picor.py, l.253):

project_loss = (
    torch.tensor(all_loss, device=self.device, requires_grad=True) * weights
).sum()
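
On the second point, torch.tensor(all_loss, ...) copies the values of all_loss into a new leaf tensor, so nothing upstream of it can receive gradients. Assuming all_loss is a Python list of scalar loss tensors that are still attached to the actor's graph (and weights has a matching shape), a possible one-line sketch of a fix would be:

# Hypothetical fix: torch.stack keeps each element's autograd history,
# whereas torch.tensor(...) creates a detached leaf tensor.
project_loss = (torch.stack(all_loss).to(self.device) * weights).sum()

The exact form depends on how all_loss is built earlier in projection(); if its elements are detached or stored as plain floats, they would need to be kept as tensors with grad_fn for this to help.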

Questions

  1. Is it intentional that the Policy Correction loss differs from the paper (KL + lower-bound constraint) and uses the current loss instead? If so, could you share the design rationale (e.g., stability, ease of implementation) and where the paper’s constraint parameters c and λ are enforced or approximated?
  2. The use of torch.tensor(all_loss, ...) appears to sever the gradient path, so gradients cannot flow back to the actor. Could you confirm whether this is indeed an issue, and if so, is there a fix or planned change? (A possible one-line adjustment is sketched above.)

Thank you again for your time and for maintaining this repository. I appreciate any guidance you can provide!
