The goal of this repository is to implement a simple agent while following as many best practices as possible, so that the code stays stable and mathematically accurate.
The goal is not to provide the shortest way to solve CartPole, but to keep a staged RL pipeline readable:
- fit a reward model in feature space,
- pretrain successor features with stable Monte Carlo targets,
- refine them with an n-step bootstrap tail,
- distill an action planner into a deterministic actor,
- fine-tune online with mixed replay.
The environment is a custom continuous-action CartPole with friction, a shaped reward, and worst-case reset support.
ToDo: I broke something while refactoring the code for online training, resulting in sub-optimal performance. The notebooks in ./old already work properly but are not very readable.
Let z = f(o) be an encoded observation.
- phi(z, a) is an immediate feature vector.
- w is a linear reward head so that r_hat(z, a) = <phi(z, a), w>.
- M_k(z, a) is the k-th successor-feature head.
- Q_k(z, a) = <M_k(z, a), w>.
- The ensemble mean and standard deviation define:
  - exploitation score: mu_Q(z, a)
  - uncertainty score: sigma_Q(z, a)
  - conservative planner objective: LCB(z, a) = mu_Q(z, a) - beta * sigma_Q(z, a)
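The definitions above can be sketched in a few lines of numpy. This is an illustrative stand-in, not the repo's implementation: `M_heads` is assumed to hold the K ensemble outputs `M_k(z, a)` for one state-action pair.

```python
import numpy as np

def lcb_score(M_heads, w, beta=1.0):
    """Conservative score from an ensemble of successor-feature heads.

    M_heads: array of shape (K, d) -- M_k(z, a) for each head k
    w:       array of shape (d,)   -- linear reward weights
    """
    q = M_heads @ w            # Q_k(z, a) = <M_k(z, a), w>, shape (K,)
    mu = q.mean()              # exploitation score mu_Q(z, a)
    sigma = q.std()            # uncertainty score sigma_Q(z, a)
    return mu - beta * sigma   # LCB(z, a)
```

Larger `beta` penalizes actions the ensemble disagrees on, which is what makes the planner objective conservative.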
The planner is optimized in action space and later distilled into a small actor network.
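One simple way to optimize in action space is random shooting: sample candidate actions, score each with the LCB objective, and keep the best. This is a hedged sketch; the repo may use a gradient-based optimizer instead, and `score_fn`, `action_low`, and `action_high` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_action(score_fn, action_low, action_high, n_candidates=256):
    """Random-shooting planner over a 1-D continuous action space.

    score_fn maps an action -> scalar LCB(z, a) for a fixed encoded state z.
    """
    candidates = rng.uniform(action_low, action_high, size=n_candidates)
    scores = np.array([score_fn(a) for a in candidates])
    return candidates[np.argmax(scores)]
```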
Train the shared encoder, phi, and w on supervised reward prediction.
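For the linear reward head alone, the objective reduces to a regression of rewards onto features. In the repo the encoder and phi are trained jointly by SGD; this closed-form ridge solve only illustrates the reward-fitting target `r_hat(z, a) = <phi(z, a), w>` (the `l2` regularizer is an assumption).

```python
import numpy as np

def fit_reward_head(Phi, r, l2=1e-3):
    """Fit linear reward weights w so that Phi @ w approximates r.

    Phi: (N, d) matrix of features phi(z, a); r: (N,) observed rewards.
    Solves the ridge-regularized normal equations.
    """
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + l2 * np.eye(d), Phi.T @ r)
```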
Freeze the representation and train each M_k on full-window Monte Carlo sums of phi.
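The full-window Monte Carlo target for step t is the discounted sum of phi over the remainder of the window, which can be computed in one backward pass. A minimal sketch, assuming `phis` stacks `phi(z_t, a_t)` along one trajectory:

```python
import numpy as np

def mc_sf_targets(phis, gamma=0.99):
    """Full-window Monte Carlo successor-feature targets.

    phis: (T, d) features phi(z_t, a_t) along one trajectory.
    Returns (T, d) targets: M_t = sum_{k>=0} gamma^k * phi_{t+k},
    computed backwards so each step reuses the next target.
    """
    targets = np.zeros_like(phis)
    running = np.zeros(phis.shape[1])
    for t in range(len(phis) - 1, -1, -1):
        running = phis[t] + gamma * running
        targets[t] = running
    return targets
```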
Keep the same representation and move toward an infinite-horizon target with one bootstrap tail from a target network.
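The n-step target replaces the tail of the Monte Carlo sum with one bootstrap from the target network. A sketch under the same notation, where `M_tail` stands for the target-network estimate at step t+n:

```python
import numpy as np

def nstep_sf_target(phis, M_tail, gamma=0.99):
    """n-step successor-feature target with one bootstrap tail.

    phis:   (n, d) features for steps t .. t+n-1
    M_tail: (d,)   target-network estimate M_bar(z_{t+n}, a_{t+n})
    Returns sum_{k<n} gamma^k phi_{t+k} + gamma^n M_tail.
    """
    n = len(phis)
    discounts = gamma ** np.arange(n)
    return discounts @ phis + gamma ** n * M_tail
```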
Use the LCB planner to produce target actions and regress the actor onto them.
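Distillation is then plain mean-squared regression of the actor's output onto the planner's actions. A minimal sketch with a hypothetical linear actor standing in for the repo's small actor network; the loss is the same distillation objective:

```python
import numpy as np

def distill_step(actor_w, z_batch, planner_actions, lr=1e-2):
    """One gradient step of a linear actor a = z @ actor_w onto planner actions.

    z_batch:         (N, d) encoded observations
    planner_actions: (N, m) target actions from the LCB planner
    Minimizes mean squared error between actor output and planner actions.
    """
    pred = z_batch @ actor_w
    grad = z_batch.T @ (pred - planner_actions) / len(z_batch)
    return actor_w - lr * grad
```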
The online notebook starts from the offline checkpoints and:
- freezes the encoder, phi, and w,
- updates the M ensemble with one-step targets,
- updates the actor with the LCB objective,
- mixes offline and online replay to avoid drift.
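The mixed-replay step can be sketched as sampling each batch from both buffers at a fixed ratio. This is illustrative only: the buffers here are plain lists of transitions, and `online_frac` is an assumed mixing ratio, not a value taken from the repo.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_batch(offline_buf, online_buf, batch_size=32, online_frac=0.5):
    """Draw one batch mixing offline and online replay to limit drift.

    Falls back to more offline samples when the online buffer is still small.
    """
    n_online = min(int(batch_size * online_frac), len(online_buf))
    n_offline = batch_size - n_online
    idx_off = rng.integers(0, len(offline_buf), size=n_offline)
    idx_on = rng.integers(0, len(online_buf), size=n_online)
    return [offline_buf[i] for i in idx_off] + [online_buf[i] for i in idx_on]
```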
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```