Description
The optimal policy takes the minimum number of steps to reach the exit, and POMCP should converge precisely to this optimal policy in the limit. A constant negative reward for each step before reaching the exit, and zero reward afterward, has this property. Adding a large positive reward for finding the exit may actually break the property (even though it perhaps seems intuitively appealing), depending, I suppose, on how the horizon is handled. It almost certainly cannot help: for instance, assuming the horizon were infinite, a rollout would eventually find the exit with probability 1, and without discounting the exit reward would only translate all values by a constant. This would affect behavior only by possibly messing with the tradeoff against the exploration bonus.
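A minimal sketch of the translation argument (my own illustration, not from this repo): in the undiscounted, infinite-horizon case every rollout eventually reaches the exit, so a constant exit bonus shifts every rollout's return by exactly that constant and cannot change which action looks best.

```python
# Hypothetical helper, not from the codebase: undiscounted return of a
# rollout that reaches the exit after `steps_to_exit` steps, with a
# per-step reward and an optional one-time bonus at the exit.
def rollout_return(steps_to_exit, step_reward=-1.0, exit_bonus=0.0):
    return step_reward * steps_to_exit + exit_bonus

# Two candidate actions whose rollouts reach the exit in 5 vs. 8 steps.
base = [rollout_return(n) for n in (5, 8)]
shaped = [rollout_return(n, exit_bonus=100.0) for n in (5, 8)]

# The bonus translates every return by the same constant...
assert all(abs((s - b) - 100.0) < 1e-9 for b, s in zip(base, shaped))
# ...so the greedy preference between actions is unchanged.
assert base.index(max(base)) == shaped.index(max(shaped))
```

The only place the shift can matter is UCB-style action selection, where the exploration constant is typically tuned relative to the scale and range of the returns, so inflating all values can implicitly retune that tradeoff.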