For src/agents/, extend existing DQN/PPO with new algos (Objective 1).
-
Extend agents/base_agent.py with an A2C agent in PyTorch. Policy and value nets share the GAT encoder from encoders/. Actor-critic: parallel rollouts across multiple envs, synchronous updates with advantage A = R + gamma * V(s') - V(s). EVRP action mask: set logits[mask] = -inf before sampling. Train loop: sample actions, step envs, compute returns/advantages, apply the A2C loss (or PPO-clip if configured). Config from YAML: algo='a2c', lr=3e-4.
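The masked-logits trick and the A2C loss above can be sketched as follows; function names and hyperparameter defaults here are illustrative assumptions, not the repo's actual API:

```python
# Sketch of a masked A2C update, assuming logits come from the shared
# GAT-encoder policy head. Not the repo's real code.
import torch
import torch.nn.functional as F

def masked_policy_logits(logits, mask):
    """mask: bool tensor, True = infeasible action (e.g. unreachable node)."""
    return logits.masked_fill(mask, float('-inf'))

def a2c_loss(logits, mask, actions, returns, values,
             value_coef=0.5, entropy_coef=0.01):
    """One synchronous A2C update over a batch of parallel-env transitions.
    `returns` holds R + gamma * V(s'), so advantage A = returns - values."""
    logits = masked_policy_logits(logits, mask)
    dist = torch.distributions.Categorical(logits=logits)
    advantages = (returns - values).detach()  # no critic gradient via advantage
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy = dist.entropy().mean()           # encourage exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

Masked entries get zero probability under the Categorical, so infeasible EVRP moves are never sampled and contribute no gradient.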
-
Implement a SAC agent in agents/sac_agent.py for better exploration in the discrete EVRP action space (possibly via a Gumbel-Softmax relaxation?). Off-policy: actor and Q-nets share the encoder. Maximum-entropy objective: actor minimizes E[alpha * log pi(a|s) - Q(s, a)]. Replay buffer stores EVRP transitions. Reference: the SAC section of the proposal. YAML config: tau=0.005, alpha='auto' (learned temperature).
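For a discrete action space, the actor loss can be evaluated exactly over all actions instead of by sampling, and Gumbel-Softmax gives a differentiable sample when one is needed. A minimal sketch, assuming these function names (they are not in the codebase yet):

```python
# Sketch of a discrete-SAC actor loss and a Gumbel-Softmax sampler.
# Names and the fixed alpha=0.2 default are illustrative assumptions.
import torch
import torch.nn.functional as F

def sac_discrete_actor_loss(logits, q_values, alpha=0.2):
    """Actor loss E_pi[alpha * log pi - Q], summed exactly over the
    discrete action set (no sampling needed for the expectation)."""
    log_pi = F.log_softmax(logits, dim=-1)
    pi = log_pi.exp()
    return (pi * (alpha * log_pi - q_values)).sum(dim=-1).mean()

def gumbel_softmax_action(logits, tau=1.0, hard=True):
    """Differentiable relaxed sample; hard=True gives a straight-through
    one-hot, so the env sees a discrete action but gradients still flow."""
    return F.gumbel_softmax(logits, tau=tau, hard=hard)
```

With alpha='auto', alpha would additionally be learned by minimizing alpha * (log pi + target_entropy), as in standard SAC temperature tuning.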
-
Refactor the existing DQN/PPO agents behind an AgentFactory: read YAML config['agent'], instantiate the chosen agent with the encoder and reward_fn. Train script train.py: load config, env=EVREnv(), agent=factory(config), then loop over episodes/train steps.
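A registry-based factory keeps new algos (A2C, SAC) pluggable by config key alone. A sketch of the proposed shape; the registry decorator and the (cfg, encoder, reward_fn) constructor signature are assumptions about the refactor:

```python
# Sketch of the proposed AgentFactory; in train.py the dict would come
# from yaml.safe_load(open(path)) rather than a literal.
class AgentFactory:
    registry = {}

    @classmethod
    def register(cls, name):
        """Decorator: map an algo name from config['agent']['algo'] to a class."""
        def deco(agent_cls):
            cls.registry[name] = agent_cls
            return agent_cls
        return deco

    @classmethod
    def create(cls, config, encoder=None, reward_fn=None):
        agent_cfg = config['agent']
        return cls.registry[agent_cfg['algo']](agent_cfg, encoder=encoder,
                                               reward_fn=reward_fn)

@AgentFactory.register('a2c')
class A2CAgent:  # stand-in for the real agent class
    def __init__(self, cfg, encoder=None, reward_fn=None):
        self.lr = cfg.get('lr', 3e-4)

config = {'agent': {'algo': 'a2c', 'lr': 3e-4}}
agent = AgentFactory.create(config)
```

train.py then reduces to: load config, build env and encoder, `agent = AgentFactory.create(config, encoder, reward_fn)`, run the episode/train-step loop.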