This project implements a custom Gymnasium environment for a drone delivery task with wind‑induced action noise and a battery constraint. It includes tabular Dynamic Programming (Value Iteration, Policy Iteration), tabular Monte Carlo Control (on‑policy and off‑policy with importance sampling), and SARSA(λ) with linear function approximation (basic/one‑hot/engineered/tile coding).
- Grid world (discrete): width × height.
- Obstacles (impassable cells).
- Pickup location (start) and Drop-off location (goal).
- Battery levels with charge action at charging stations.
- Wind slip: with probability `p_slip` a movement action deviates 90° to the left, and with probability `p_slip` it deviates 90° to the right; otherwise it moves as intended.
- Rewards:
  - -1 per time step.
  - -20 on collision (into wall/obstacle; agent stays in place).
  - +50 on successful delivery (episode terminates).
  - -10 if battery depletes before delivery (episode terminates).
Default settings keep the state space small and suitable for tabular methods.
- State space S:
  - Tuples `(x, y, b, p)` where `x ∈ {0..width-1}`, `y ∈ {0..height-1}`, `b ∈ {0..max_battery}`, and `p ∈ {0,1}` indicates whether the drone still carries the package.
  - Obstacles are impassable cells; the agent position never enters them.
- Action space A:
  - `{UP, DOWN, LEFT, RIGHT, STAY, CHARGE}`.
- Transition dynamics P(s'|s,a):
  - Movement actions are affected by wind slip: with probability `1 - 2·p_slip` the intended direction is taken; with probability `p_slip` the action is rotated 90° left; with probability `p_slip` it is rotated 90° right (see the sketch after this list).
  - Collisions (moving out of bounds or into obstacles) keep the agent in place and add a collision penalty.
  - Battery rules: attempting movement decrements battery by 1 if `b > 0`; if `b == 0` and a movement is attempted, the episode terminates with an additional penalty. `STAY` and `CHARGE` do not consume battery; `CHARGE` increases battery by `charge_rate` at charging stations (clamped to `max_battery`).
  - Termination: delivery (reaching the drop-off while carrying the package), battery depletion without delivery, or reaching `max_steps` (truncation).
- Reward function R(s,a,s'):
  - Base step cost of −1 per step; −20 on collision; +50 on successful delivery; −10 on termination by battery depletion.
- Discount factor γ:
  - Typically in `[0.95, 0.99]` for these tasks; configurable in experiments.
- Start state s₀:
  - At the pickup location, with full battery and `p=1` (carrying the package).
- Terminal states:
  - Delivery completed (`p` becomes 0) or battery depletion before delivery.
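The slip model above can be expressed as a small helper that returns the distribution over actual movement directions for an intended action. This is a minimal sketch with illustrative names, not the environment's internal code:

```python
# Minimal sketch of the wind-slip model (names are illustrative, not the
# environment's API): a movement action resolves to the intended direction
# with probability 1 - 2*p_slip, or rotates 90° left/right with probability
# p_slip each.
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
LEFT_OF = {UP: LEFT, LEFT: DOWN, DOWN: RIGHT, RIGHT: UP}
RIGHT_OF = {UP: RIGHT, RIGHT: DOWN, DOWN: LEFT, LEFT: UP}

def slip_distribution(action: int, p_slip: float) -> dict:
    """Return a {direction: probability} mapping for a movement action."""
    return {
        action: 1.0 - 2.0 * p_slip,
        LEFT_OF[action]: p_slip,
        RIGHT_OF[action]: p_slip,
    }
```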
- Dynamic Programming (requires model P and R):
  - Value Iteration (see the sketch after this list)
  - Policy Iteration (iterative policy evaluation + greedy improvement)
- Monte Carlo (model-free):
  - On-policy MC Control, epsilon-soft, first-visit
  - Off-policy Monte Carlo Control using importance sampling
- SARSA(λ) (model-free):
  - SARSA(λ) with linear function approximation
- Stable-Baselines3 (model-free, deep RL):
  - PPO, A2C, DQN (via Stable-Baselines3 with observation wrappers)
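As a reference for how the DP methods consume the model, below is a minimal Value Iteration sketch. It assumes `enumerate_transitions(s, a)` yields `(prob, next_state, reward, done)` tuples and that the environment exposes `states` and `n_actions`; these are assumptions about the interface, not the exact API of `algorithms/dp.py`.

```python
def value_iteration(env, gamma=0.99, theta=1e-6):
    """Minimal Value Iteration sketch (assumed interface, see note above)."""
    V = {s: 0.0 for s in env.states}

    def q_value(s, a):
        # Expected one-step return using the enumerated model.
        return sum(
            prob * (reward + (0.0 if done else gamma * V[s_next]))
            for prob, s_next, reward, done in env.enumerate_transitions(s, a)
        )

    while True:
        delta = 0.0
        for s in env.states:
            best = max(q_value(s, a) for a in range(env.n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy extraction from the converged values.
    policy = {
        s: max(range(env.n_actions), key=lambda a: q_value(s, a))
        for s in env.states
    }
    return V, policy
```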
- `envs/drone_delivery.py` - Gymnasium environment with `enumerate_transitions(s, a)` for DP.
- `algorithms/dp.py` - Value Iteration and Policy Iteration implementations.
- `algorithms/mc.py` - On-policy and off-policy MC Control with importance sampling.
- `algorithms/sarsa.py` - SARSA(λ) with linear FA (feature options incl. tile coding, decays, replacing traces, optimistic init).
- `examples/replay_vi_policy.py` - Visualize and evaluate Value Iteration policies.
- `examples/replay_mc_policy.py` - Visualize and evaluate Monte Carlo policies.
- `examples/replay_sarsa_policy.py` - Train/evaluate/replay SARSA(λ) policies.
- `examples/replay_sb3_policy.py` - Replay SB3 (PPO/A2C/DQN) policies with visualization and optional CSV logging.
- `examples/train_ppo.py` - Train PPO with SB3 (EvalCallback, Monitor, TensorBoard).
- `examples/train_a2c.py` - Train A2C with SB3 (EvalCallback, Monitor, TensorBoard).
- `examples/train_dqn.py` - Train DQN with SB3 (EvalCallback, Monitor, TensorBoard).
- `wrappers/sb3_wrappers.py` - Observation wrappers for SB3 (one-hot, decoded); see the wrapper sketch after this list.
- `create_plot.py` - Consolidated plotting of VI/MC/SARSA and SB3 CSVs.
- `docs/examples_guide.md` - Comprehensive guide for using the example scripts.
- `docs/ALGORITHMS.md` - Detailed algorithm documentation and implementation notes.
- `run_experiments.sh` - Automated experiment suite for VI, MC, SARSA, and SB3 replays.
- `requirements.txt` - Lightweight dependencies for the tabular project.
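The observation wrappers listed above convert the environment's state into vectors that SB3 models can consume. A minimal one-hot wrapper might look like the following; it assumes a `Discrete` observation space and is only an illustration, not the implementation in `wrappers/sb3_wrappers.py`:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class OneHotObservation(gym.ObservationWrapper):
    """Illustrative sketch: expand a Discrete observation into a one-hot vector.

    Assumes the wrapped environment uses a Discrete observation space (one
    index per (x, y, b, p) state); the real wrapper may encode states
    differently.
    """

    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.observation_space, spaces.Discrete)
        self.n = int(env.observation_space.n)
        self.observation_space = spaces.Box(0.0, 1.0, shape=(self.n,), dtype=np.float32)

    def observation(self, obs):
        one_hot = np.zeros(self.n, dtype=np.float32)
        one_hot[int(obs)] = 1.0
        return one_hot
```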
Install dependencies (optionally in a virtual environment):
```bash
pip install -r requirements.txt
```

Note: The repository may already have gymnasium and numpy installed via the top-level requirements.txt. The requirements.txt here adds optional packages used in these examples (e.g., matplotlib, pandas, tqdm, tensorboard). The SB3 training scripts use Stable-Baselines3 and can log to TensorBoard.
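To sanity-check the installation, a random-agent episode might look like the snippet below; the class name `DroneDeliveryEnv` and its default constructor are assumptions for illustration, so check `envs/drone_delivery.py` for the actual interface:

```python
# Smoke test: one episode with random actions.
# DroneDeliveryEnv is an assumed class name; see envs/drone_delivery.py.
from envs.drone_delivery import DroneDeliveryEnv

env = DroneDeliveryEnv()
obs, info = env.reset(seed=0)
done, total_return = False, 0.0
while not done:
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward
    done = terminated or truncated
print(f"Episode return: {total_return:.1f}")
```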
Replay a Value Iteration policy:

```bash
python examples/replay_vi_policy.py \
--max-battery 30 \
--obstacles default \
--sleep 0.5
```

Replay a Monte Carlo policy:

```bash
python examples/replay_mc_policy.py \
--max-battery 30 \
--episodes 5000 \
--obstacles default \
--sleep 0.5
```

Train the SB3 agents:

```bash
# PPO (~3M steps)
python examples/train_ppo.py
# A2C (~2M steps)
python examples/train_a2c.py
# DQN (~1M steps)
python examples/train_dqn.py
```

TensorBoard (optional):
```bash
python -m tensorboard.main --logdir ./tensorboard
```

Replay a trained SB3 policy:

```bash
python examples/replay_sb3_policy.py \
--algo ppo \
--episodes 5 \
--deterministic \
--max-battery 30 \
--wind-slip 0.05 \
--obstacles default --charging-stations default
```

Run the full experiment suite:

```bash
./run_experiments.sh
```

Results are saved to vi_experiments.csv, mc_experiments.csv, and sarsa_experiments.csv.
SB3 replay results are saved to sb3_experiments.csv when using --to-csv.
Generate the comparison plot:

```bash
python create_plot.py
```

Outputs performance_plot.png comparing VI, MC, SARSA, and SB3.
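For orientation, consolidating the per-method CSVs into one comparison figure could look roughly like the sketch below. The column names `episode` and `return` are assumptions about the CSV schema, not what `create_plot.py` actually expects:

```python
# Illustrative sketch only; create_plot.py is the real plotting script.
# The "episode" and "return" column names are assumed, not the actual schema.
import pandas as pd
import matplotlib.pyplot as plt

csv_files = {
    "VI": "vi_experiments.csv",
    "MC": "mc_experiments.csv",
    "SARSA": "sarsa_experiments.csv",
    "SB3": "sb3_experiments.csv",
}

fig, ax = plt.subplots()
for label, path in csv_files.items():
    df = pd.read_csv(path)
    # Smooth episodic returns with a rolling mean before plotting.
    ax.plot(df["episode"], df["return"].rolling(50, min_periods=1).mean(), label=label)
ax.set_xlabel("Episode")
ax.set_ylabel("Return (rolling mean)")
ax.legend()
fig.savefig("performance_plot.png", dpi=150)
```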
Train and evaluate SARSA(λ) with one-hot features:

```bash
python examples/replay_sarsa_policy.py \
--no-render \
--episodes 10000 \
--features one_hot \
--gamma 0.995 \
--epsilon 0.3 --epsilon-final 0.05 \
--alpha 0.05 --alpha-final 0.01 \
--lam 0.95 --replacing-traces \
--optimistic-init 10.0 \
--max-battery 30 \
--obstacles default \
--eval-episodes 200
```

Tile coding variant (stable and efficient); a minimal sketch of tile-coded features appears after the feature list below:

```bash
python examples/replay_sarsa_policy.py \
--no-render \
--episodes 20000 \
--features tile \
--tile-tilings 8 --tile-bins-x 6 --tile-bins-y 6 --tile-bins-b 6 \
--gamma 0.995 \
--epsilon 0.3 --epsilon-final 0.02 \
--alpha 0.007 --alpha-final 0.002 \
--lam 0.90 --replacing-traces \
--optimistic-init 5.0 \
--max-battery 30 \
--obstacles default \
--eval-episodes 200
```

- Live matplotlib rendering with drone, obstacles, charging stations, and package
- Trajectory tracking showing step-by-step decisions
- Policy summary with action distribution and Q-values
- Progress monitoring with tqdm progress bars for MC/SARSA training
- Graceful interruption (Ctrl+C) with partial result saving
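As mentioned next to the tile coding command above, here is a minimal sketch of how tile-coded features over `(x, y, b)` could be built with offset tilings; the scheme and names are illustrative assumptions, not the feature code in `algorithms/sarsa.py`:

```python
import numpy as np

def tile_features(x, y, b, width, height, max_battery,
                  tilings=8, bins_x=6, bins_y=6, bins_b=6):
    """Illustrative tile coding over (x, y, b); the package flag and the
    action would typically select a separate block of weights.

    Each tiling partitions the (x, y, b) box into bins_x * bins_y * bins_b
    cells, offset by a fraction of one cell per tiling so tilings overlap.
    Returns a sparse binary feature vector with one active tile per tiling.
    """
    tiles_per_tiling = bins_x * bins_y * bins_b
    features = np.zeros(tilings * tiles_per_tiling)
    for t in range(tilings):
        # Shift this tiling by t/tilings of one cell along every dimension.
        ox = (t / tilings) * (width / bins_x)
        oy = (t / tilings) * (height / bins_y)
        ob = (t / tilings) * ((max_battery + 1) / bins_b)
        ix = min(int((x + ox) / width * bins_x), bins_x - 1)
        iy = min(int((y + oy) / height * bins_y), bins_y - 1)
        ib = min(int((b + ob) / (max_battery + 1) * bins_b), bins_b - 1)
        idx = t * tiles_per_tiling + (ix * bins_y + iy) * bins_b + ib
        features[idx] = 1.0
    return features
```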
The environment supports strategic placement of charging stations:
- Default: Store location (0,0) plus mid-grid stations at (3,6) and (6,3)
- Customizable via the `charging_stations` parameter
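For example, a custom placement might be passed as shown below; the class name `DroneDeliveryEnv` is an assumption, while the `charging_stations` keyword mirrors the parameter above:

```python
# Illustrative only; DroneDeliveryEnv is an assumed class name.
from envs.drone_delivery import DroneDeliveryEnv

# Replicates the default layout: store at (0,0) plus mid-grid stations.
env = DroneDeliveryEnv(charging_stations=[(0, 0), (3, 6), (6, 3)])
```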
- 🏪 S = Store (pickup location)
- 🏠 H = House (delivery location)
- ⬛ X = Obstacles
- ⭐ ★ = Charging stations
- 🚁 Drone = Red triangle with package indicator
- Battery indicator with color coding (green/orange/red)
- Examples Guide - Comprehensive guide for using visualization scripts
- Algorithms Documentation - Detailed algorithm specifications and implementation notes
Sample output from a policy replay:

```text
============================================================
LEARNED POLICY SUMMARY
============================================================
Total states: 2352
Action distribution across all states:
UP : 450 states ( 19.1%)
DOWN : 520 states ( 22.1%)
RIGHT : 890 states ( 37.8%)
...
============================================================
TRAJECTORY USING OPTIMAL POLICY
============================================================
Step 0: pos=(0,0) bat=30 📦 → RIGHT → reward= -1.0
Step 1: pos=(1,0) bat=29 📦 → RIGHT → reward= -1.0
...
Step 16: pos=(6,6) bat=14 ✓ → TERMINAL
============================================================
RESULT: SUCCESS ✓
Total steps: 16, Total return: 34.00
============================================================
```
