Hi, thank you very much for the excellent work on this project and for making the code available!
While reading through the implementation, I noticed two points I'd like to clarify (I may be misunderstanding something, so please correct me if I'm wrong):
- The update rule used in this repository's Policy Correction phase differs from the paper's practical implementation, which uses KL(new‖old) plus a lower-bound constraint (hinge + λ); I sketch my reading of that objective right after this list.
- The current implementation of projection() appears to break the computation graph, which can prevent gradients from reaching the actor; as a result, the policy may fail to update.
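Regarding the first point, for concreteness, this is roughly the objective I understand the paper's practical policy-correction step to use (my own paraphrase of the KL + hinge form, so the exact placement of the margin c and the definition of J may differ from the paper):

$$
\min_{\theta}\ \mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\theta_{\mathrm{old}}}\right)
\;+\;\lambda\,\max\!\left(0,\ c - \big(J(\pi_{\theta}) - J(\pi_{\theta_{\mathrm{old}}})\big)\right)
$$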
For reference on the second point (picor.py, l.253):
```python
project_loss = (
    torch.tensor(all_loss, device=self.device, requires_grad=True) * weights
).sum()
```
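To illustrate why I believe this severs the graph, here is a minimal standalone snippet (not from the repo; the names are illustrative): rebuilding a tensor from the loss values creates a new leaf tensor, so the weighted sum has no path back to the actor parameters.

```python
import torch

# Toy stand-in for the actor: one parameter and two per-task losses that
# depend on it (names are illustrative, not the repo's).
param = torch.nn.Parameter(torch.tensor(2.0))
task_losses = [(param - 1.0) ** 2, (param + 3.0) ** 2]
weights = torch.tensor([0.3, 0.7])

# Pattern similar to the quoted line: wrapping the numeric loss values in
# torch.tensor() creates a fresh leaf, discarding the graph back to `param`.
all_loss = [loss.item() for loss in task_losses]          # plain floats
project_loss = (torch.tensor(all_loss, requires_grad=True) * weights).sum()
project_loss.backward()
print(param.grad)   # None: no gradient ever reaches the actor parameter
```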
Questions
- Is it intentional that the Policy Correction loss differs from the paper's (KL plus the lower-bound constraint) and uses the current loss instead? If so, could you share the design rationale (e.g., stability, ease of implementation) and point to where the paper's constraint parameters c and λ are enforced or approximated?
- The use of torch.tensor(all_loss, ...) appears to sever the gradient path, so no gradients flow back to the actor. Could you confirm whether this is indeed an issue, and whether a fix is planned? One possible direction is sketched below.
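For the second question, one possible direction (just a sketch, under the assumption that all_loss can be kept as a list of per-task scalar loss tensors that still carry grad_fn, rather than plain numbers) would be to combine the losses with torch.stack instead of rebuilding a tensor:

```python
import torch

def weighted_projection_loss(all_loss, weights):
    """Hypothetical helper, not the repo's API: combine per-task actor losses
    without breaking the computation graph.

    Assumes `all_loss` is a list of scalar loss tensors attached to the actor's
    graph and `weights` is a 1-D tensor of matching length on the same device.
    """
    losses = torch.stack(all_loss)   # preserves grad_fn, unlike torch.tensor(...)
    return (losses * weights).sum()
```

If all_loss already holds plain Python floats by the time projection() runs, the fix would instead need to keep (or recompute) the tensor-valued losses inside the graph before this point.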
Thank you again for your time and for maintaining this repository. I appreciate any guidance you can provide!