The goal of this repository is to implement a simple agent while following as many best practices as possible, so that the code stays stable and mathematically accurate.
The goal is not to provide the shortest way to solve CartPole, but to keep a staged RL pipeline readable:
- fit a reward model in feature space,
- pretrain successor features with stable Monte Carlo targets,
- refine them with an n-step bootstrap tail,
- distill an action planner into a deterministic actor,
- fine-tune online with mixed replay.
The environment is a custom continuous-action CartPole with friction, a shaped reward, and worst-case reset support.
ToDo: I broke something while refactoring the code for online training, resulting in sub-optimal performance. The notebooks in ./old already work properly but are not very readable.
Let z = f(o) be an encoded observation.
- phi(z, a) is an immediate feature vector.
- w is a linear reward head so that r_hat(z, a) = <phi(z, a), w>.
- M_k(z, a) is the k-th successor-feature head.
- Q_k(z, a) = <M_k(z, a), w>.
- The ensemble mean and standard deviation define:
  - exploitation score: mu_Q(z, a)
  - uncertainty score: sigma_Q(z, a)
  - conservative planner objective: LCB(z, a) = mu_Q(z, a) - beta * sigma_Q(z, a)
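The definitions above can be sketched in a few lines of numpy. This is an illustrative stand-in, not the repo's implementation: `M_heads` is assumed to hold the K ensemble outputs `M_k(z, a)` for one state-action pair.

```python
import numpy as np

def lcb_score(M_heads, w, beta=1.0):
    """Conservative score from an ensemble of successor-feature heads.

    M_heads: array of shape (K, d) -- M_k(z, a) for each head k
    w:       array of shape (d,)   -- linear reward weights
    """
    q = M_heads @ w            # Q_k(z, a) = <M_k(z, a), w>, shape (K,)
    mu = q.mean()              # exploitation score mu_Q(z, a)
    sigma = q.std()            # uncertainty score sigma_Q(z, a)
    return mu - beta * sigma   # LCB(z, a)
```

Larger `beta` penalizes actions the ensemble disagrees on, which is what makes the planner objective conservative.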
The planner is optimized in action space and later distilled into a small actor network.
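One simple way to optimize in action space is random shooting: sample candidate actions, score each with the LCB objective, and keep the best. This is a hedged sketch; the repo may use a gradient-based optimizer instead, and `score_fn`, `action_low`, and `action_high` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_action(score_fn, action_low, action_high, n_candidates=256):
    """Random-shooting planner over a 1-D continuous action space.

    score_fn maps an action -> scalar LCB(z, a) for a fixed encoded state z.
    """
    candidates = rng.uniform(action_low, action_high, size=n_candidates)
    scores = np.array([score_fn(a) for a in candidates])
    return candidates[np.argmax(scores)]
```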
Train the shared encoder, phi, and w on supervised reward prediction.
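For the linear reward head alone, the objective reduces to a regression of rewards onto features. In the repo the encoder and phi are trained jointly by SGD; this closed-form ridge solve only illustrates the reward-fitting target `r_hat(z, a) = <phi(z, a), w>` (the `l2` regularizer is an assumption).

```python
import numpy as np

def fit_reward_head(Phi, r, l2=1e-3):
    """Fit linear reward weights w so that Phi @ w approximates r.

    Phi: (N, d) matrix of features phi(z, a); r: (N,) observed rewards.
    Solves the ridge-regularized normal equations.
    """
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + l2 * np.eye(d), Phi.T @ r)
```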
Freeze the representation and train each M_k on full-window Monte Carlo sums of phi.
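The full-window Monte Carlo target for step t is the discounted sum of phi over the remainder of the window, which can be computed in one backward pass. A minimal sketch, assuming `phis` stacks `phi(z_t, a_t)` along one trajectory:

```python
import numpy as np

def mc_sf_targets(phis, gamma=0.99):
    """Full-window Monte Carlo successor-feature targets.

    phis: (T, d) features phi(z_t, a_t) along one trajectory.
    Returns (T, d) targets: M_t = sum_{k>=0} gamma^k * phi_{t+k},
    computed backwards so each step reuses the next target.
    """
    targets = np.zeros_like(phis)
    running = np.zeros(phis.shape[1])
    for t in range(len(phis) - 1, -1, -1):
        running = phis[t] + gamma * running
        targets[t] = running
    return targets
```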
Keep the same representation and move toward an infinite-horizon target with one bootstrap tail from a target network.
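The n-step target replaces the tail of the Monte Carlo sum with one bootstrap from the target network. A sketch under the same notation, where `M_tail` stands for the target-network estimate at step t+n:

```python
import numpy as np

def nstep_sf_target(phis, M_tail, gamma=0.99):
    """n-step successor-feature target with one bootstrap tail.

    phis:   (n, d) features for steps t .. t+n-1
    M_tail: (d,)   target-network estimate M_bar(z_{t+n}, a_{t+n})
    Returns sum_{k<n} gamma^k phi_{t+k} + gamma^n M_tail.
    """
    n = len(phis)
    discounts = gamma ** np.arange(n)
    return discounts @ phis + gamma ** n * M_tail
```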
Use the LCB planner to produce target actions and regress the actor onto them.
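Distillation is then plain mean-squared regression of the actor's output onto the planner's actions. A minimal sketch with a hypothetical linear actor standing in for the repo's small actor network; the loss is the same distillation objective:

```python
import numpy as np

def distill_step(actor_w, z_batch, planner_actions, lr=1e-2):
    """One gradient step of a linear actor a = z @ actor_w onto planner actions.

    z_batch:         (N, d) encoded observations
    planner_actions: (N, m) target actions from the LCB planner
    Minimizes mean squared error between actor output and planner actions.
    """
    pred = z_batch @ actor_w
    grad = z_batch.T @ (pred - planner_actions) / len(z_batch)
    return actor_w - lr * grad
```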
The online notebook starts from the offline checkpoints and:
- freezes the encoder, phi, and w,
- updates the M ensemble with one-step targets,
- updates the actor with the LCB objective,
- mixes offline and online replay to avoid drift.
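The mixed-replay step can be sketched as sampling each batch from both buffers at a fixed ratio. This is illustrative only: the buffers here are plain lists of transitions, and `online_frac` is an assumed mixing ratio, not a value taken from the repo.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_batch(offline_buf, online_buf, batch_size=32, online_frac=0.5):
    """Draw one batch mixing offline and online replay to limit drift.

    Falls back to more offline samples when the online buffer is still small.
    """
    n_online = min(int(batch_size * online_frac), len(online_buf))
    n_offline = batch_size - n_online
    idx_off = rng.integers(0, len(offline_buf), size=n_offline)
    idx_on = rng.integers(0, len(online_buf), size=n_online)
    return [offline_buf[i] for i in idx_off] + [online_buf[i] for i in idx_on]
```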
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```