Training Failure: PPO Agent Fails to Learn in BVRDog Environment
Problem Description
The PPO agent is unable to learn effectively in the BVRDog (Beyond Visual Range Dogfight) environment. After training for 75+ episodes, the agent shows no improvement and consistently fails to achieve the objective (defeating the red aircraft).
Training Configuration
- Algorithm: PPO (Proximal Policy Optimization)
- State Dimension: 8
- Action Dimension: 2 (heading, altitude)
- Learning Rate: 1e-05
- Gamma: 1.0
- Epsilon Clip: 0.2
- K Epochs: 80
- Update Timestep: 600
- Action Std: 0.3
- Normalize Rewards: True
Observed Issues
1. Extremely Sparse Rewards (Critical)
The reward function only provides non-zero rewards at episode termination:
- During the episode: every intermediate step returns `reward = 0`
- Episode end: returns `+1` (blue wins), `-1` (red wins or timeout), or `-1` (other termination)
Code Location: `jsb_gym/environmets/bvrdog.py:676-692`
```python
def get_reward(self, is_done):
    if is_done:
        if not self.f16r_alive:
            return 1
        elif not self.f16_alive:
            return -1
        else:
            return -1
    else:
        return 0
```
Impact:
- No learning signal during the episode
- Agent cannot differentiate between good and bad intermediate actions
- Makes training extremely difficult, especially for long episodes
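The effect of the sparse signal can be reproduced with a short sketch (illustrative code, not from the repo): with `gamma = 1.0` and a purely terminal reward, every step in the episode receives exactly the same return, so the critic sees no variation with which to distinguish early actions from late ones.

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute per-step discounted returns for one episode (backward pass)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Episode 1 pattern: 11 zero-reward steps followed by the terminal -1.
episode = [0.0] * 11 + [-1.0]
print(discounted_returns(episode, gamma=1.0))  # every step's return is -1.0
```

Every action in the trajectory is credited identically, which is why the agent cannot differentiate good intermediate actions from bad ones.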
2. Abnormal Action Execution
Action mapping from normalized values [-1, 1] to actual control commands shows inconsistent behavior:
Example from Episode 1 (Step 1):
- Raw action: heading=0.3656, altitude=0.3547
- Scaled action: heading=245.80°, altitude=9096.2m
- Before execution: heading=0.05°, alt=8741.3m
- After execution: heading=306.38°, alt=7118.4m
- Heading change: 306.33° (should not jump this much in one step)
- Altitude change: -1622.9m (very large)
Problem: The agent commands absolute heading/altitude values, but the actual state changes are erratic and don't match the intended actions.
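One way to keep single-step changes bounded is to interpret the policy output as a relative command rather than an absolute target. The following is a minimal sketch; the function name, turn/climb limits, and altitude bounds are illustrative assumptions, not the repo's API.

```python
def apply_delta_action(current_heading_deg, current_alt_m, action,
                       max_turn_deg=30.0, max_climb_m=500.0):
    """Map a normalized action in [-1, 1] to a bounded *relative* command.
    The limits here are hypothetical; tune them to the aircraft dynamics."""
    d_heading, d_alt = action
    # Heading wraps around at 360°; altitude is clamped to the [3 km, 12 km] envelope.
    new_heading = (current_heading_deg + d_heading * max_turn_deg) % 360.0
    new_alt = min(12000.0, max(3000.0, current_alt_m + d_alt * max_climb_m))
    return new_heading, new_alt
```

With this scheme a full-deflection action can change the commanded heading by at most `max_turn_deg` per step, which directly rules out the 306° single-step jump seen in the logs.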
3. Short or Unproductive Episodes
- Early episodes (e.g., Episode 1): Only 12 steps before termination (blue aircraft shot down)
- Later episodes (e.g., Episode 75-76): Run to timeout (960 seconds, 96 steps) with no progress
- All episodes end with `reward = -1` (no wins observed)
- Distance may increase instead of decrease (e.g., from 77.97km to 151.69km in Episode 2)
4. No Intermediate Learning Signal
From Episode 1 reward breakdown:
```
Step rewards: [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
               0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.0000]
```
All steps except the final one provide zero reward, making it impossible for the agent to learn which actions are beneficial during the episode.
Expected Behavior
The agent should:
- Learn to approach and engage the red aircraft
- Maintain advantageous position (distance, altitude, angle)
- Successfully defeat the red aircraft in some episodes
- Show improving performance over training episodes
Actual Behavior
- Agent consistently fails (all episodes end with `reward = -1`)
- No improvement after 75+ episodes
- Episodes either end too quickly (agent shot down) or timeout without engagement
- Distance often increases instead of decreases
Log Evidence
From result/training.log:
```
[Episode 1] Reward: -1.0000 (positive rewards: 0, negative rewards: 1, zero rewards: 11),
Red wins (blue F16 shot down), Steps: 12,
Initial distance: 77.97km, Final distance: 42.94km
[Episode 75] Reward: -1.0000 (positive rewards: 0, negative rewards: 1, zero rewards: 95),
Timeout (reached maximum time of 960.01 seconds), Steps: 96,
Initial distance: 77.97km, Final distance: 16.05km
```
Suggested Fixes
1. Implement Dense Reward Function (High Priority)
Add intermediate rewards based on:
- Distance reduction (reward for closing distance)
- Maintaining advantageous position (angle, altitude)
- Avoiding missile threats (negative reward for being locked)
- Approaching engagement envelope (positive reward for getting in range)
Example structure:
```python
def get_reward(self, is_done):
    if is_done:
        # Terminal rewards (existing)
        ...
    else:
        # Intermediate rewards
        distance_reward = ...  # reward for closing distance
        angle_reward = ...     # reward for a good aspect angle
        altitude_reward = ...  # reward for maintaining altitude
        return distance_reward + angle_reward + altitude_reward
```
2. Fix Action Mapping
Investigate why action commands don't translate correctly to state changes:
- Check PID controller parameters
- Verify the action scaling function (`scale_between_inv`)
- Consider using relative actions (delta heading/altitude) instead of absolute values
- Add action limits to prevent extreme changes
3. Adjust Training Hyperparameters
- Reduce `update_timestep` for more frequent updates with sparse rewards
- Consider curriculum learning (start with easier scenarios)
- Adjust learning rate if needed
- Add reward shaping to guide early learning
4. Add Reward Normalization and Scaling
Ensure rewards are properly scaled for PPO:
- Current rewards (-1, 0, +1) may be too sparse for normalization
- Consider scaling intermediate rewards appropriately
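A common PPO-style approach is to divide rewards by a running estimate of their standard deviation (scaling without mean subtraction, so the sign of the terminal ±1 is preserved). The sketch below is a generic implementation of that idea, not the repo's existing `Normalize Rewards` logic.

```python
class RunningRewardScaler:
    """Scale rewards by a running std estimate (Welford's online algorithm).
    The mean is tracked but never subtracted, so reward signs are preserved."""

    def __init__(self, eps=1e-8):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def __call__(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.n - 1, 1)) ** 0.5
        # Before any variance is observed, pass the reward through unchanged.
        return r / (std + self.eps) if std > 0 else r
```

Note the interaction with issue 1: on the current mostly-zero reward stream the running std is small, so this scaler sharply inflates the rare terminal reward; the statistic only becomes well-behaved once dense shaping rewards are in place.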
Environment Details
- Environment: BVRDog (Beyond Visual Range Dogfight)
- Blue Aircraft: F-16 (controlled by agent)
- Red Aircraft: F-16r (controlled by behavior tree)
- Initial Distance: ~77.97km
- Max Simulation Time: 960 seconds
- Action Space: Heading [0-360°], Altitude [3-12km]
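The logged "scaled action" values from Episode 1 are consistent with a plain linear map from [-1, 1] onto these ranges; the sketch below reproduces them (it is an illustrative counterpart to the repo's `scale_between_inv`, whose exact signature is unknown). That the numbers match suggests the scaling itself is correct and the erratic behavior lies downstream, in how absolute targets are executed.

```python
def scale_between(x, lo, hi):
    # Linear map from x in [-1, 1] to [lo, hi].
    return lo + (x + 1.0) * 0.5 * (hi - lo)

# Raw actions from Episode 1, Step 1:
heading = scale_between(0.3656, 0.0, 360.0)        # ≈ 245.81°, matches the log
altitude = scale_between(0.3547, 3000.0, 12000.0)  # ≈ 9096.2 m, matches the log
```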
Files Involved
- `jsb_gym/environmets/bvrdog.py` - Environment and reward function
- `jsb_gym/TAU/aircraft.py` - Action execution (`cmd_BVR`)
- `jsb_gym/utils/utils.py` - Action scaling (`scale_between_inv`)
- `mainBVRGym.py` - Training script
- `result/training.log` - Training logs
Additional Context
The commented-out code in the reward function suggests previous attempts at intermediate rewards were disabled:
```python
# if abs(self.angle_to_f16r) < 35:
#     return 1
# else:
#     return 0
```
This indicates that reward shaping was considered but may need to be re-implemented with a more comprehensive approach.
