To use the environments you will want to install the package as a module. To do so, run `pip install -e implementations` from the root of the repository.
Several algorithms were compared under the same conditions in `benchmark.py`:
| Method | Total reward |
|---|---|
| greedy | 225±10 |
|  | 242±10 |
| gradient | 247±10 |
| gradient_biased | 256±10 |
| UCB | 262±10 |
| optimal_gradient | 308±10 |
| optimal | 350±10 |
The greedy algorithm tries each arm once and then selects the one with the maximal mean value observed so far.
Its modification, listed second in the table, performs slightly better.
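
As a rough illustration of the idea (not the repository's implementation), a greedy agent might look like the sketch below; the `pull(arm)` callback standing in for the bandit environment is a hypothetical helper.

```python
import numpy as np

def run_greedy(pull, n_arms, n_steps):
    """Try each arm once, then always pick the arm with the best observed mean."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    total = 0.0
    for t in range(n_steps):
        # First n_arms steps: try each arm once; afterwards act greedily.
        arm = t if t < n_arms else int(np.argmax(means))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward
    return total
```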
Gradient ascent learns the probabilities of taking each action instead of action values.
The gradient_biased version uses a biased estimator of the gradient, but on this testbed it performs even better.
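
A minimal sketch of the gradient-bandit idea, again not taken from the repository: action preferences are updated by stochastic gradient ascent on the expected reward, with a running mean reward as a baseline and the same hypothetical `pull(arm)` helper.

```python
import numpy as np

def run_gradient_bandit(pull, n_arms, n_steps, alpha=0.1):
    """Learn action preferences H; actions are sampled from softmax(H)."""
    H = np.zeros(n_arms)   # action preferences
    baseline = 0.0         # running mean reward, reduces gradient variance
    rng = np.random.default_rng()
    for t in range(1, n_steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()
        arm = rng.choice(n_arms, p=pi)
        reward = pull(arm)
        baseline += (reward - baseline) / t
        # Stochastic gradient ascent step: H[a] += alpha*(R - baseline)*(1{a=arm} - pi[a])
        grad = -pi * (reward - baseline)
        grad[arm] += reward - baseline
        H += alpha * grad
    return H
```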
The Upper Confidence Bound method calculates, for each arm, an optimistic estimate of its value that gets closer to the real value with more tries.
It then selects the action with the highest estimate.
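
For reference, a common UCB1-style selection rule is sketched below; the exploration constant `c` and the exact form of the bonus term are illustrative assumptions, not necessarily what `benchmark.py` uses.

```python
import numpy as np

def ucb_action(means, counts, t, c=2.0):
    """Pick the arm with the highest optimistic value estimate.

    means  - observed mean reward per arm
    counts - number of times each arm was pulled
    t      - current time step (starting from 1)
    c      - exploration strength (illustrative default)
    """
    # Untried arms get an infinite bonus so each arm is tried at least once.
    bonus = np.where(counts > 0,
                     c * np.sqrt(np.log(t) / np.maximum(counts, 1)),
                     np.inf)
    return int(np.argmax(means + bonus))
```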
optimal_gradient has access to the true action values and uses them to calculate exact gradients, so it is meant to be an upper bound for all gradient methods.
optimal is an upper bound for all algorithms, because it always takes the action with the highest true value.
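
If the true action values `q_true` are available, the exact gradient of the expected reward with respect to softmax preferences can be written in closed form. The sketch below is illustrative and only assumes access to those true values, as optimal_gradient requires; optimal would simply pick `argmax(q_true)` at every step.

```python
import numpy as np

def exact_gradient_step(H, q_true, alpha=0.1):
    """One exact gradient ascent step on J(H) = sum_a pi(a) * q_true[a]."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()
    expected_reward = pi @ q_true
    # dJ/dH_a = pi(a) * (q_true[a] - expected_reward)
    grad = pi * (q_true - expected_reward)
    return H + alpha * grad
```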
In `envs.pullup` there is a simple engine for simulating a system of point-like particles with soft constraints on the distances and angles between them, along with a demonstration of how it can be used to simulate something resembling a person swinging on a pull-up bar.
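
To give a rough sense of what a soft constraint means here (an illustrative sketch, not the engine in `envs.pullup`), a distance constraint can be treated as a spring-like penalty force between two point particles; the `stiffness` parameter is a hypothetical knob.

```python
import numpy as np

def soft_distance_force(p1, p2, target, stiffness=50.0):
    """Spring-like penalty force pushing two point particles toward a target distance."""
    delta = p2 - p1
    dist = np.linalg.norm(delta)
    if dist < 1e-9:
        return np.zeros_like(p1), np.zeros_like(p2)
    direction = delta / dist
    # Force magnitude grows with the constraint violation (dist - target).
    f = stiffness * (dist - target) * direction
    return f, -f   # force on p1, equal and opposite force on p2
```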
In `algorithms.online` there are three versions of Q-learning, and in `algorithms.approximation` there are some simple function approximators that can be used with them. `blackjack.py` and `cliffwalking.py` contain examples of using these algorithms with simple environments.
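
As a reminder of the update these modules build on (a generic sketch, not code from `algorithms.online`), tabular Q-learning with an epsilon-greedy behaviour policy looks roughly like this; a Gymnasium-style environment API is assumed here and may differ from the repository's environments.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (Gymnasium-style env assumed)."""
    Q = defaultdict(lambda: [0.0] * env.action_space.n)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(range(env.action_space.n), key=lambda a: Q[state][a])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Q-learning update: bootstrap from the greedy value of the next state.
            target = reward + (0.0 if terminated else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```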