Picus303/CleanRL

Structured RL on Custom CartPole

The goal of this repository is to implement a simple agent while following as many best practices as possible, keeping the code stable and mathematically accurate.

The goal is not to provide the shortest way to solve CartPole, but to keep a staged RL pipeline readable:

  1. fit a reward model in feature space,
  2. pretrain successor features with stable Monte Carlo targets,
  3. refine them with an n-step bootstrap tail,
  4. distill an action planner into a deterministic actor,
  5. fine-tune online with mixed replay.

The environment is a custom continuous-action CartPole with friction, a shaped reward, and worst-case reset support.

ToDo: I broke something while refactoring the code for online training, resulting in sub-optimal performance. The notebooks in ./old still work properly but are not very readable.

Core model idea

Let z = f(o) be an encoded observation.

  • phi(z, a) is an immediate feature vector.
  • w is a linear reward head so that r_hat(z, a) = <phi(z, a), w>.
  • M_k(z, a) is the k-th successor-feature head.
  • Q_k(z, a) = <M_k(z, a), w>.
  • The ensemble mean and standard deviation define:
    • exploitation score: mu_Q(z, a)
    • uncertainty score: sigma_Q(z, a)
    • conservative planner objective: LCB(z, a) = mu_Q(z, a) - beta * sigma_Q(z, a)
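The ensemble scores above can be sketched in a few lines of numpy. This is a minimal illustration, not the repo's code: it assumes the K successor-feature heads have already been evaluated into a (K, d) array for one (z, a) pair.

```python
import numpy as np

def lcb_score(M, w, beta=1.0):
    """Conservative planner objective from an ensemble of SF heads.

    M : (K, d) array, M[k] = M_k(z, a) for head k
    w : (d,) linear reward weights, so Q_k(z, a) = <M_k(z, a), w>
    """
    Q = M @ w                 # (K,) per-head value estimates
    mu = Q.mean()             # exploitation score mu_Q(z, a)
    sigma = Q.std()           # uncertainty score sigma_Q(z, a)
    return mu - beta * sigma  # LCB(z, a)
```

Larger beta makes the planner more conservative: actions the heads disagree on are penalized harder.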

The planner is optimized in action space and later distilled into a small actor network.

Offline phases

Phase A — Reward fit

Train the shared encoder, phi, and w on supervised reward prediction.
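In the repo the encoder, phi, and w are trained jointly by gradient descent; as a toy illustration of just the linear reward head, w can be recovered by least squares once features are fixed. Everything below (shapes, data) is made up for the example.

```python
import numpy as np

# Hypothetical setup: N transitions with precomputed features Phi = phi(z, a)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(256, 4))          # (N, d) feature matrix
w_true = np.array([1.0, -2.0, 0.5, 0.0]) # ground-truth reward weights
r = Phi @ w_true                         # observed rewards r = <phi, w>

# Fit the reward head: minimize ||Phi @ w - r||^2
w_hat, *_ = np.linalg.lstsq(Phi, r, rcond=None)
```

With a learned encoder this becomes an MSE loss on r_hat(z, a) = <phi(z, a), w>, backpropagated through phi and the encoder as well.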

Phase B1 — Monte Carlo SF pretraining

Freeze the representation and train each M_k on full-window Monte Carlo sums of phi.
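The Monte Carlo target for each head is the discounted sum of future phi vectors over the window, computed cheaply with a backward recursion. A sketch under the assumption that one episode's features are stacked into a (T, d) array:

```python
import numpy as np

def mc_sf_targets(phis, gamma=0.99):
    """Discounted Monte Carlo successor-feature targets for one episode.

    phis : (T, d) array of per-step features phi(z_t, a_t)
    Returns (T, d) with targets[t] = sum_i gamma**i * phis[t + i].
    """
    targets = np.zeros_like(phis)
    running = np.zeros(phis.shape[1])
    for t in range(len(phis) - 1, -1, -1):  # backward pass
        running = phis[t] + gamma * running
        targets[t] = running
    return targets
```

Each M_k is then regressed onto these targets, which are low-bias and stable because no bootstrapping is involved yet.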

Phase B2 — n-step + tail bootstrap

Keep the same representation and move toward an infinite-horizon target with one bootstrap tail from a target network.
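The n-step target keeps the first n feature vectors as observed rewards-in-feature-space and closes the sum with one bootstrap from the target network. A numpy sketch, assuming the target network's SF estimate at step t+n is given:

```python
import numpy as np

def nstep_tail_target(phis, M_tail, gamma=0.99):
    """n-step successor-feature target with one bootstrap tail.

    phis   : (n, d) features phi(z_t, a_t), ..., phi(z_{t+n-1}, a_{t+n-1})
    M_tail : (d,) target-network estimate M_bar(z_{t+n}, a_{t+n})
    Returns sum_i gamma**i * phis[i] + gamma**n * M_tail.
    """
    n = len(phis)
    discounts = gamma ** np.arange(n)  # 1, gamma, ..., gamma**(n-1)
    return discounts @ phis + gamma**n * M_tail
```

This moves the heads from the finite-window MC objective of Phase B1 toward the infinite-horizon fixed point, with the target network keeping the bootstrap stable.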

Phase C — Planner distillation

Use the LCB planner to produce target actions and regress the actor onto them.
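One simple way to realize the planner step is random search over the continuous action range: score sampled actions with the LCB objective and keep the best. This is a hypothetical stand-in (the repo may optimize actions differently, e.g. by gradient ascent):

```python
import numpy as np

def plan_action(score_fn, low=-1.0, high=1.0, n_samples=256, seed=0):
    """Random-search planner: pick the sampled action maximizing score_fn,
    where score_fn(a) would be LCB(z, a) for the current observation."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(low, high, size=n_samples)
    scores = np.array([score_fn(a) for a in candidates])
    return candidates[np.argmax(scores)]

# Distillation: the planner's action becomes a regression target for the
# actor, e.g. minimize (actor(z) - plan_action(...))**2 over the dataset.
```

Distilling into a deterministic actor removes the per-step search cost at deployment time.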

Phase D — Online finetune

The online notebook starts from the offline checkpoints and:

  • freezes the encoder, phi, and w,
  • updates the M ensemble with one-step targets,
  • updates the actor with the LCB objective,
  • mixes offline and online replay to avoid drift.
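The mixed-replay step can be sketched as sampling a fixed fraction of each batch from the online buffer and the rest from the offline dataset. Names and the 50/50 ratio below are assumptions for illustration, not the repo's API:

```python
import numpy as np

def sample_mixed(offline, online, batch_size=8, online_frac=0.5, seed=0):
    """Draw a batch mixing offline and online replay to avoid drift.

    offline, online : sequences of transitions
    online_frac     : target fraction of the batch drawn from online data
    """
    rng = np.random.default_rng(seed)
    n_on = min(int(batch_size * online_frac), len(online))
    n_off = batch_size - n_on
    idx_on = rng.integers(0, len(online), size=n_on)
    idx_off = rng.integers(0, len(offline), size=n_off)
    return [online[i] for i in idx_on] + [offline[i] for i in idx_off]
```

Keeping offline data in every batch anchors the frozen representation's value estimates while the M ensemble and actor adapt to fresh experience.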

Quick start

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

About

Stable and Mathematically Accurate Reinforcement Learning
