Description
The optimal policy takes the minimum number of steps to reach the exit, and POMCP should converge precisely to this optimal policy in the limit. A constant negative reward for each step before reaching the exit, and zero reward afterward, has this property. Adding a large positive reward for finding the exit may actually break the property (even though it perhaps seems intuitively appealing), depending, I suppose, on how the horizon is handled. It almost certainly cannot help: for instance, assuming the horizon were infinite, a rollout would eventually find the exit with probability 1, and without discounting the exit reward would only translate all values by a constant. This would affect behavior only by possibly messing with the tradeoff against the exploration bonus.
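A minimal sketch of the translation argument (my own illustration, not from this repo): in the undiscounted, infinite-horizon case every rollout eventually reaches the exit, so a constant exit bonus shifts every rollout's return by exactly that constant and cannot change which action looks best.

```python
# Hypothetical helper, not from the codebase: undiscounted return of a
# rollout that reaches the exit after `steps_to_exit` steps, with a
# per-step reward and an optional one-time bonus at the exit.
def rollout_return(steps_to_exit, step_reward=-1.0, exit_bonus=0.0):
    return step_reward * steps_to_exit + exit_bonus

# Two candidate actions whose rollouts reach the exit in 5 vs. 8 steps.
base = [rollout_return(n) for n in (5, 8)]
shaped = [rollout_return(n, exit_bonus=100.0) for n in (5, 8)]

# The bonus translates every return by the same constant...
assert all(abs((s - b) - 100.0) < 1e-9 for b, s in zip(base, shaped))
# ...so the greedy preference between actions is unchanged.
assert base.index(max(base)) == shaped.index(max(shaped))
```

The only place the shift can matter is UCB-style action selection, where the exploration constant is typically tuned relative to the scale and range of the returns, so inflating all values can implicitly retune that tradeoff.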