Reward after finding the exit should be zero #3

@ColeWyeth

Description

The optimal policy takes the minimum number of steps to reach the exit, and POMCP should converge precisely to this optimal policy in the limit. A constant negative reward for each step before reaching the exit, and zero reward afterward, has this property. Adding a large positive reward for finding the exit may actually break the property (even though it perhaps seems intuitively appealing), depending, I suppose, on how the horizon is handled. It almost certainly cannot help: for instance, assuming the horizon were infinite, a rollout would eventually find the exit with probability 1, and without discounting, the bonus would only translate all values by a constant. This would affect behavior only by possibly distorting the tradeoff against the exploration bonus.

https://github.com/JeffreyQin/Fragment-based-POMCP/blob/640ae24ed4094553327a4ec6359b62117427c4f3/generator.py#L114
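The value-translation argument above can be sketched numerically. This is a minimal illustration, not code from `generator.py`; `undiscounted_return` and the trajectory lengths are hypothetical:

```python
def undiscounted_return(steps_to_exit: int, exit_bonus: float = 0.0) -> float:
    """Return of a trajectory under the proposed scheme: -1 per step before
    the exit, an optional one-time bonus on reaching it, zero afterward."""
    return -1.0 * steps_to_exit + exit_bonus

# Three hypothetical rollouts that reach the exit in 3, 7, and 12 steps.
lengths = [3, 7, 12]
plain = [undiscounted_return(n) for n in lengths]
bonused = [undiscounted_return(n, exit_bonus=100.0) for n in lengths]

# The bonus shifts every return by exactly 100, so the shortest trajectory
# remains the best either way; only the absolute scale of the values changes,
# which is what can interact with POMCP's exploration-bonus tradeoff.
assert all(b - p == 100.0 for p, b in zip(plain, bonused))
assert plain.index(max(plain)) == bonused.index(max(bonused))
```

With the constant -1 per step and zero reward after the exit, a trajectory's undiscounted return is exactly minus its length, so maximizing expected return is the same as minimizing expected steps to the exit.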
