To use the environments you will want to install the package as a module. To do so, run `pip install -e implementations` from the root of the repository.
Several algorithms were compared under the same conditions in `benchmark.py`:
| Method | Total reward |
|---|---|
| greedy | 225±10 |
|  | 242±10 |
| gradient | 247±10 |
| gradient_biased | 256±10 |
| UCB | 262±10 |
| optimal_gradient | 308±10 |
| optimal | 350±10 |
The greedy algorithm tries each arm once and then selects the one with the maximal mean value observed so far.
Its modification, listed second in the table, performs slightly better.
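
As a rough illustration of the idea (not the repository's implementation), a greedy agent might look like the sketch below; the `pull(arm)` callback standing in for the bandit environment is a hypothetical helper.

```python
import numpy as np

def run_greedy(pull, n_arms, n_steps):
    """Try each arm once, then always pick the arm with the best observed mean."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    total = 0.0
    for t in range(n_steps):
        # First n_arms steps: try each arm once; afterwards act greedily.
        arm = t if t < n_arms else int(np.argmax(means))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward
    return total
```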
Gradient ascent learns the probabilities of taking each action instead of action values.
The gradient_biased version uses a biased estimator of the gradient, but on this testbed it performs even better.
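
A minimal sketch of the gradient-bandit idea, again not taken from the repository: action preferences are updated by stochastic gradient ascent on the expected reward, with a running mean reward as a baseline and the same hypothetical `pull(arm)` helper.

```python
import numpy as np

def run_gradient_bandit(pull, n_arms, n_steps, alpha=0.1):
    """Learn action preferences H; actions are sampled from softmax(H)."""
    H = np.zeros(n_arms)   # action preferences
    baseline = 0.0         # running mean reward, reduces gradient variance
    rng = np.random.default_rng()
    for t in range(1, n_steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()
        arm = rng.choice(n_arms, p=pi)
        reward = pull(arm)
        baseline += (reward - baseline) / t
        # Stochastic gradient ascent step: H[a] += alpha*(R - baseline)*(1{a=arm} - pi[a])
        grad = -pi * (reward - baseline)
        grad[arm] += reward - baseline
        H += alpha * grad
    return H
```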
The Upper Confidence Bound method calculates, for each arm, an optimistic estimate of its value that gets closer to the real value with more tries.
It then selects the action with the highest estimate.
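
For reference, a common UCB1-style selection rule is sketched below; the exploration constant `c` and the exact form of the bonus term are illustrative assumptions, not necessarily what `benchmark.py` uses.

```python
import numpy as np

def ucb_action(means, counts, t, c=2.0):
    """Pick the arm with the highest optimistic value estimate.

    means  - observed mean reward per arm
    counts - number of times each arm was pulled
    t      - current time step (starting from 1)
    c      - exploration strength (illustrative default)
    """
    # Untried arms get an infinite bonus so each arm is tried at least once.
    bonus = np.where(counts > 0,
                     c * np.sqrt(np.log(t) / np.maximum(counts, 1)),
                     np.inf)
    return int(np.argmax(means + bonus))
```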
optimal_gradient has access to the true action values and uses them to calculate exact gradients, so it is meant to be an upper bound for all gradient methods.
optimal is an upper bound for all algorithms, because it always takes the action with the highest true value.
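
If the true action values `q_true` are available, the exact gradient of the expected reward with respect to softmax preferences can be written in closed form. The sketch below is illustrative and only assumes access to those true values, as optimal_gradient requires; optimal would simply pick `argmax(q_true)` at every step.

```python
import numpy as np

def exact_gradient_step(H, q_true, alpha=0.1):
    """One exact gradient ascent step on J(H) = sum_a pi(a) * q_true[a]."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()
    expected_reward = pi @ q_true
    # dJ/dH_a = pi(a) * (q_true[a] - expected_reward)
    grad = pi * (q_true - expected_reward)
    return H + alpha * grad
```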
In `envs.pullup` there is a simple engine for simulating a system of point-like particles with soft constraints on the distances and angles between them, along with a demonstration of how it can be used to simulate something resembling a person swinging on a pull-up bar.
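
To give a rough sense of what a soft constraint means here (an illustrative sketch, not the engine in `envs.pullup`), a distance constraint can be treated as a spring-like penalty force between two point particles; the `stiffness` parameter is a hypothetical knob.

```python
import numpy as np

def soft_distance_force(p1, p2, target, stiffness=50.0):
    """Spring-like penalty force pushing two point particles toward a target distance."""
    delta = p2 - p1
    dist = np.linalg.norm(delta)
    if dist < 1e-9:
        return np.zeros_like(p1), np.zeros_like(p2)
    direction = delta / dist
    # Force magnitude grows with the constraint violation (dist - target).
    f = stiffness * (dist - target) * direction
    return f, -f   # force on p1, equal and opposite force on p2
```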
In `algorithms.online` there are three versions of Q-learning, and in `algorithms.approximation` there are some simple function approximators that can be used with them. `blackjack.py` and `cliffwalking.py` contain examples of using these algorithms with simple environments.
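
As a reminder of the update these modules build on (a generic sketch, not code from `algorithms.online`), tabular Q-learning with an epsilon-greedy behaviour policy looks roughly like this; a Gymnasium-style environment API is assumed here and may differ from the repository's environments.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (Gymnasium-style env assumed)."""
    Q = defaultdict(lambda: [0.0] * env.action_space.n)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(range(env.action_space.n), key=lambda a: Q[state][a])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Q-learning update: bootstrap from the greedy value of the next state.
            target = reward + (0.0 if terminated else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```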