Hi, thank you very much for the excellent work on this project and for making the code available!
While reading through the implementation, I noticed two points I'd like to clarify (I may be misunderstanding something, so please correct me if I'm wrong):
- The update rule used in this repository's Policy Correction phase differs from the paper's practical implementation, which uses KL(new‖old) plus a lower-bound constraint (hinge + λ); I sketch my reading of that objective right after this list.
- The current implementation of projection() appears to break the computation graph, which can prevent gradients from reaching the actor; as a result, the policy may fail to update.
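Regarding the first point, for concreteness, this is roughly the objective I understand the paper's practical policy-correction step to use (my own paraphrase of the KL + hinge form, so the exact placement of the margin c and the definition of J may differ from the paper):

$$
\min_{\theta}\ \mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\theta_{\mathrm{old}}}\right)
\;+\;\lambda\,\max\!\left(0,\ c - \big(J(\pi_{\theta}) - J(\pi_{\theta_{\mathrm{old}}})\big)\right)
$$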
For reference on the second point (picor.py, l.253):
```python
project_loss = (
    torch.tensor(all_loss, device=self.device, requires_grad=True) * weights
).sum()
```
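To illustrate why I believe this severs the graph, here is a minimal standalone snippet (not from the repo; the names are illustrative): rebuilding a tensor from the loss values creates a new leaf tensor, so the weighted sum has no path back to the actor parameters.

```python
import torch

# Toy stand-in for the actor: one parameter and two per-task losses that
# depend on it (names are illustrative, not the repo's).
param = torch.nn.Parameter(torch.tensor(2.0))
task_losses = [(param - 1.0) ** 2, (param + 3.0) ** 2]
weights = torch.tensor([0.3, 0.7])

# Pattern similar to the quoted line: wrapping the numeric loss values in
# torch.tensor() creates a fresh leaf, discarding the graph back to `param`.
all_loss = [loss.item() for loss in task_losses]          # plain floats
project_loss = (torch.tensor(all_loss, requires_grad=True) * weights).sum()
project_loss.backward()
print(param.grad)   # None: no gradient ever reaches the actor parameter
```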
Questions
- Is it intentional that the Policy Correction loss differs from the paper's (KL plus the lower-bound constraint) and uses the current loss instead? If so, could you share the design rationale (e.g., stability, ease of implementation) and point to where the paper's constraint parameters c and λ are enforced or approximated?
- The use of torch.tensor(all_loss, ...) appears to sever the gradient path, so no gradients flow back to the actor. Could you confirm whether this is indeed an issue, and whether a fix is planned? One possible direction is sketched below.
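For the second question, one possible direction (just a sketch, under the assumption that all_loss can be kept as a list of per-task scalar loss tensors that still carry grad_fn, rather than plain numbers) would be to combine the losses with torch.stack instead of rebuilding a tensor:

```python
import torch

def weighted_projection_loss(all_loss, weights):
    """Hypothetical helper, not the repo's API: combine per-task actor losses
    without breaking the computation graph.

    Assumes `all_loss` is a list of scalar loss tensors attached to the actor's
    graph and `weights` is a 1-D tensor of matching length on the same device.
    """
    losses = torch.stack(all_loss)   # preserves grad_fn, unlike torch.tensor(...)
    return (losses * weights).sum()
```

If all_loss already holds plain Python floats by the time projection() runs, the fix would instead need to keep (or recompute) the tensor-valued losses inside the graph before this point.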
Thank you again for your time and for maintaining this repository. I appreciate any guidance you can provide!