# WideLearning/RL

A repository with reinforcement learning experiments, mostly for educational purposes. Based on the Sutton & Barto book.

To use the environments, install them as a package: run `pip install -e implementations` from the root of the repository.

## Multi-armed bandit problem

Several algorithms were compared under the same conditions: $5$ arms, $300$ steps, the action value of each arm sampled from a standard normal distribution, and on each step the observed reward sampled from $\mathcal{N}(\text{action value}, 3)$. For other details see `benchmark.py`.
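The testbed above can be sketched as follows (a minimal illustration, not the repository's `benchmark.py`; the names `q_true` and `pull` are made up here):

```python
import numpy as np

rng = np.random.default_rng(0)

K, STEPS, NOISE = 5, 300, 3.0  # 5 arms, 300 steps, reward std 3

# True action values, one per arm, drawn from a standard normal distribution.
q_true = rng.standard_normal(K)

def pull(arm):
    """Observed reward for one pull: sampled from N(q_true[arm], NOISE)."""
    return rng.normal(q_true[arm], NOISE)
```

Each algorithm below interacts with the testbed only through such a `pull` function, accumulating reward over the $300$ steps.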

| Method | Total reward |
| --- | --- |
| `greedy` | 225±10 |
| $\varepsilon$-greedy | 242±10 |
| `gradient` | 247±10 |
| `gradient_biased` | 256±10 |
| `UCB` | 262±10 |
| `optimal_gradient` | 308±10 |
| `optimal` | 350±10 |

The greedy algorithm tries each arm once, then always selects the one with the maximal mean value observed so far.
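A minimal sketch of this scheme (not the repository's code; `pull(arm)` stands in for a reward sample from the testbed):

```python
import numpy as np

def greedy(pull, k=5, steps=300):
    """Try each arm once, then always pick the arm with the best sample mean."""
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(steps):
        if t < k:
            arm = t  # initial pass: try each arm once
        else:
            arm = int(np.argmax(sums / counts))  # exploit the best observed mean
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total
```

After the initial pass this strategy never explores again, which is why a single unlucky early sample can lock it onto a suboptimal arm.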

Its modification, $\varepsilon$-greedy, does the same, but with probability $\varepsilon$ it selects an arm at random.
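The only change from greedy is in action selection; a sketch (again hypothetical, not the repository's implementation):

```python
import numpy as np

def eps_greedy_arm(sums, counts, eps, rng):
    """With probability eps explore a uniformly random arm, else exploit the best mean."""
    if rng.random() < eps:
        return int(rng.integers(len(sums)))
    return int(np.argmax(sums / np.maximum(counts, 1)))
```

The occasional random pull keeps updating the estimates of every arm, so a bad early sample cannot permanently hide the best arm.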

The gradient method learns a probability of taking each action instead of estimating action values, performing gradient ascent on expected reward.
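A sketch of the standard gradient bandit update from Sutton & Barto (numeric preferences turned into a policy via softmax, with a running reward baseline); the repository's version may differ in details:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())  # shift for numerical stability
    return e / e.sum()

def gradient_bandit(pull, k=5, steps=300, alpha=0.1):
    """Learn preferences h; the policy is softmax(h).

    Update: h += alpha * (r - baseline) * (onehot(arm) - pi),
    where baseline is the running mean of observed rewards.
    """
    rng = np.random.default_rng(0)
    h = np.zeros(k)
    baseline, total = 0.0, 0.0
    for t in range(1, steps + 1):
        pi = softmax(h)
        arm = rng.choice(k, p=pi)
        r = pull(arm)
        baseline += (r - baseline) / t  # incremental mean of rewards
        onehot = np.eye(k)[arm]
        h += alpha * (r - baseline) * (onehot - pi)
        total += r
    return total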

This version uses a biased estimator of the gradient, yet on this testbed it performs even better.

The Upper Confidence Bound (UCB) method computes, for each arm, an optimistic estimate of its value that approaches the true value as the arm is tried more often, then selects the action with the highest estimate.
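The selection rule can be sketched with the textbook UCB1 bonus $c\sqrt{\ln t / n_a}$ (a hypothetical version; the repository may use different constants):

```python
import numpy as np

def ucb_arm(sums, counts, t, c=2.0):
    """Pick argmax of mean + c*sqrt(ln t / n): the optimistic bonus shrinks as an arm is tried more."""
    means = sums / np.maximum(counts, 1)
    bonus = c * np.sqrt(np.log(t) / np.maximum(counts, 1))
    bonus[counts == 0] = np.inf  # untried arms get priority
    return int(np.argmax(means + bonus))
```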

This algorithm has access to the true action values and uses them to compute exact gradients, so it serves as an upper bound for all gradient methods.

And this is an upper bound for all algorithms, because it always takes the action with the highest true value.

## Pullup environment

`envs.pullup` contains a simple engine for simulating a system of point-like particles with soft constraints on the distances and angles between them, along with a demonstration of how it can be used to simulate something resembling a person swinging on a pull-up bar.

## Q-learning implementations

`algorithms.online` contains three versions of Q-learning, and `algorithms.approximation` provides some simple function approximators that can be used with them. `blackjack.py` and `cliffwalking.py` show examples of applying these algorithms to simple environments.
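The tabular update at the core of Q-learning can be sketched as follows (an illustrative version; the three implementations in `algorithms.online` are not shown here and may differ):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One off-policy TD step: move Q(s,a) toward r + gamma * max_a' Q(s',a').

    At terminal states the bootstrap term is zero, so the target is just r.
    """
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

Function approximation replaces the table `Q` with a parametric model and turns the same TD error into a gradient step on the model's parameters.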
