Description
I have been thinking about the learning design implemented here, and there are two questions I cannot resolve for myself. The core function for learning is the environment step function. The chain of learning is [OBS_UNIT1 -> ACTION1 -> REWARD -> OBS_UNIT2 -> ACTION2 -> OBS_UNIT3 -> ACTION3 ... -> ALL TURN ACTIONS ARE ACTUALLY TAKEN] -> [THE SAME FOR THE NEXT TURN ...]. The questions are:
- Less important. Only the first action gets a reward. Doesn't this create significant problems, especially when the number of units per turn is large? Especially if the discount factor `gamma` is small, but also in general: even this intermediate reward is delayed for most actions. I wonder how much harder this makes life for the model. One consequence is that the order in which units act can matter; I can imagine the model handling that. But is there an example of multi-unit problems designed like this? (The toy sketch below shows the step chain I mean.)
- More important. Algorithms like TD(0) and Q-learning, and more involved ones like PPO, base their model updates not only on the current state (or state-action pair) but also on the next one. But here the next step belongs to a different unit: its observation is unit-dependent, and its value function is completely different and barely related. The process is basically non-Markovian; the states carry heavily incomplete information, and each time it is different incomplete information. Isn't that a no-go? Or do I misunderstand something major? (The `run_td0` loop in the sketch below shows the bootstrap I am worried about.)
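
To make both points concrete, here is a minimal toy sketch of how I understand the design. None of the names below (`PerUnitEnv`, `pending_reward`, `run_td0`, ...) come from this repo; they are made up for illustration. The env returns the turn reward only on the first unit's step, and the tabular TD(0) loop at the bottom shows that the bootstrap target `V[s_next]` is evaluated on a different unit's observation:

```python
# Toy illustration of how I understand the design, not this repo's actual code:
# PerUnitEnv, pending_reward, featurize, run_td0 are all made-up names.
import random
from collections import defaultdict


class PerUnitEnv:
    """One env.step() call per unit; the turn only resolves after the last unit acts."""

    def __init__(self, n_units=3, max_turns=10):
        self.n_units = n_units
        self.max_turns = max_turns
        self.reset()

    def reset(self):
        self.turn = 0
        self.unit_idx = 0
        self.pending_actions = []
        self.pending_reward = 0.0  # reward from resolving the previous turn
        return self._obs_for(self.unit_idx)

    def _obs_for(self, unit_idx):
        # Each unit sees its own, unit-dependent observation.
        return {"turn": self.turn, "unit": unit_idx}

    def _resolve_turn(self, actions):
        # All queued actions are actually applied here; one turn-level reward comes out.
        return float(sum(actions))  # placeholder reward

    def step(self, action):
        self.pending_actions.append(action)
        # Only the FIRST unit's step in a turn carries a reward;
        # every other unit in the same turn gets 0.
        reward = self.pending_reward if self.unit_idx == 0 else 0.0
        self.pending_reward = 0.0

        self.unit_idx += 1
        if self.unit_idx == self.n_units:
            # Last unit of the turn: apply everything and hold the resulting
            # reward until the first step of the next turn.
            self.pending_reward = self._resolve_turn(self.pending_actions)
            self.pending_actions = []
            self.unit_idx = 0
            self.turn += 1

        next_obs = self._obs_for(self.unit_idx)  # belongs to a DIFFERENT unit
        done = self.turn >= self.max_turns
        return next_obs, reward, done, {}


def featurize(obs):
    return (obs["turn"], obs["unit"])


def run_td0(env, gamma=0.99, alpha=0.1, episodes=5):
    """Tabular TD(0) over the per-unit chain, just to show where the bootstrap lands."""
    V = defaultdict(float)
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = random.choice([0, 1])  # dummy behaviour policy
            next_obs, reward, done, _ = env.step(action)
            s, s_next = featurize(obs), featurize(next_obs)
            # The TD target bootstraps on V[s_next], but s_next is the observation
            # of the NEXT unit, not a later observation of the same unit.
            target = reward + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            obs = next_obs
    return V


V = run_td0(PerUnitEnv(n_units=3))
```

The sketch is of course simplified (one shared reward per turn, random actions); it is the structure of the chain, not the numbers, that my questions are about.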
Please share your thoughts!