Open
Description
For the special reward setting in this work, a better policy will select the sentences in the bag that have a higher log P(r|x_i). The best result would then be to find the maximum, which means picking the single highest-scoring sentence in each bag and feeding it to the classifier for training. Is that correct?
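To make the question concrete, here is a minimal sketch of the selection I am describing. It is only my reading of the reward setting, not the repository's code: the function names `select_max_sentence` and `select_per_bag` are hypothetical, and the probabilities stand in for whatever P(r|x_i) the classifier assigns to each sentence in a bag.

```python
import math


def select_max_sentence(bag_probs):
    """Return the index of the sentence with the highest log P(r|x_i) in one bag.

    bag_probs: list of assumed classifier probabilities P(r|x_i), one per sentence.
    """
    log_probs = [math.log(p) for p in bag_probs]
    # argmax over the per-sentence log-probabilities
    return max(range(len(log_probs)), key=log_probs.__getitem__)


def select_per_bag(bags):
    """Pick one sentence index per bag (the hypothesised 'best' policy)."""
    return [select_max_sentence(bag) for bag in bags]


# Illustrative, made-up probabilities for two bags.
bags = [[0.2, 0.7, 0.1], [0.05, 0.6]]
selected = select_per_bag(bags)
```

Under this reading, only the sentence at each selected index would be passed on to train the classifier.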