
Training Failure: PPO Agent Fails to Learn in BVRDog Environment #7

@YellowThree-HS

Description


Problem Description

The PPO agent is unable to learn effectively in the BVRDog (Beyond Visual Range Dogfight) environment. After training for 75+ episodes, the agent shows no improvement and consistently fails to achieve the objective (defeating the red aircraft).

Training Configuration

  • Algorithm: PPO (Proximal Policy Optimization)
  • State Dimension: 8
  • Action Dimension: 2 (heading, altitude)
  • Learning Rate: 1e-05
  • Gamma: 1.0
  • Epsilon Clip: 0.2
  • K Epochs: 80
  • Update Timestep: 600
  • Action Std: 0.3
  • Normalize Rewards: True

Observed Issues

1. Extremely Sparse Rewards (Critical)

The reward function only provides non-zero rewards at episode termination:

  • During episode: All intermediate steps return reward = 0
  • Episode end: Returns +1 if blue wins; -1 in every other case (red wins, timeout, or any other termination)

Code Location: jsb_gym/environmets/bvrdog.py:676-692

def get_reward(self, is_done):
    if is_done:
        if not self.f16r_alive:
            return 1
        elif not self.f16_alive:
            return -1
        else:
            return -1
    else:
        return 0

Impact:

  • No learning signal during the episode
  • Agent cannot differentiate between good and bad intermediate actions
  • Makes training extremely difficult, especially for long episodes

2. Abnormal Action Execution

Action mapping from normalized values [-1, 1] to actual control commands shows inconsistent behavior:

Example from Episode 1 (Step 1):

  • Raw action: heading=0.3656, altitude=0.3547
  • Scaled action: heading=245.80°, altitude=9096.2m
  • Before execution: heading=0.05°, alt=8741.3m
  • After execution: heading=306.38°, alt=7118.4m
  • Heading change: 306.33° as logged (note this delta ignores 360° wrap-around; 0.05° → 306.38° is a 53.67° turn the short way, which is still large for a single step)
  • Altitude change: -1622.9m (very large)

Problem: The agent commands absolute heading/altitude values, but the actual state changes are erratic and don't match the intended actions.
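
The logged numbers are consistent with a plain linear map from [-1, 1] onto the action bounds. A sketch (the repo's actual scale_between_inv implementation may differ; scale_action is a stand-in name):

```python
def scale_action(a, lo, hi):
    """Map a normalized action a in [-1, 1] linearly onto [lo, hi]."""
    return (a + 1.0) / 2.0 * (hi - lo) + lo

# Reproduces the Episode 1 values from the report:
heading_cmd = scale_action(0.3656, 0.0, 360.0)        # ≈ 245.8 deg
altitude_cmd = scale_action(0.3547, 3000.0, 12000.0)  # ≈ 9096.2 m
```

Because these are absolute setpoints, even a small raw action commands a large instantaneous change whenever the current state is far from the setpoint, which matches the erratic transitions above.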

3. Short or Unproductive Episodes

  • Early episodes (e.g., Episode 1): Only 12 steps before termination (blue aircraft shot down)
  • Later episodes (e.g., Episode 75-76): Run to timeout (960 seconds, 96 steps) with no progress
  • All episodes end with reward = -1 (no wins observed)
  • Distance sometimes increases instead of decreasing (e.g., from 77.97km to 151.69km in Episode 2)

4. No Intermediate Learning Signal

From Episode 1 reward breakdown:

Step rewards: [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 
               0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.0000]

All steps except the final one provide zero reward, making it impossible for the agent to learn which actions are beneficial during the episode.
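
This interacts badly with the configured gamma = 1.0 and normalize_rewards = True. A sketch, assuming the training script computes standard Monte-Carlo returns and then normalizes them (shown per episode here for simplicity; the same collapse happens per 600-step update batch whenever every episode in the batch ends in -1):

```python
import statistics

# Episode 1's reward sequence with the configured gamma = 1.0
rewards = [0.0] * 11 + [-1.0]
gamma = 1.0

returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)

# With gamma = 1.0, every timestep's return is exactly -1.0, so after
# mean/std normalization every target collapses to ~0: the episode
# contributes essentially no gradient signal.
mean = statistics.mean(returns)
std = statistics.pstdev(returns)
normalized = [(G - mean) / (std + 1e-8) for G in returns]
```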

Expected Behavior

The agent should:

  1. Learn to approach and engage the red aircraft
  2. Maintain advantageous position (distance, altitude, angle)
  3. Successfully defeat the red aircraft in some episodes
  4. Show improving performance over training episodes

Actual Behavior

  • Agent consistently fails (all episodes end with reward = -1)
  • No improvement after 75+ episodes
  • Episodes either end too quickly (agent shot down) or timeout without engagement
  • Distance often increases instead of decreasing

Log Evidence

From result/training.log:

[Episode 1] Reward: -1.0000 (positive rewards: 0, negative: 1, zero: 11),
Red wins (blue F16 shot down), Steps: 12,
Initial distance: 77.97km, Final distance: 42.94km

[Episode 75] Reward: -1.0000 (positive rewards: 0, negative: 1, zero: 95),
Timeout (reached max time 960.01s), Steps: 96,
Initial distance: 77.97km, Final distance: 16.05km

Suggested Fixes

1. Implement Dense Reward Function (High Priority)

Add intermediate rewards based on:

  • Distance reduction (reward for closing distance)
  • Maintaining advantageous position (angle, altitude)
  • Avoiding missile threats (negative reward for being locked)
  • Approaching engagement envelope (positive reward for getting in range)

Example structure:

def get_reward(self, is_done):
    if is_done:
        # Terminal rewards (existing)
        ...
    else:
        # Intermediate rewards
        distance_reward = ...  # Reward for closing distance
        angle_reward = ...     # Reward for good angle
        altitude_reward = ...  # Reward for maintaining altitude
        return distance_reward + angle_reward + altitude_reward
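
One way to fill this in without changing the optimal policy is potential-based shaping, F = γ·φ(s′) − φ(s). A minimal distance-only sketch; the 0.01/km scale and the standalone function form are illustrative assumptions, not tuned values:

```python
def potential(distance_km):
    """Closer is better: potential rises as distance to the red aircraft shrinks."""
    return -0.01 * distance_km

def shaping_reward(prev_distance_km, distance_km, gamma=1.0):
    """Potential-based shaping F = gamma * phi(s') - phi(s); this form
    provably preserves the optimal policy of the original reward."""
    return gamma * potential(distance_km) - potential(prev_distance_km)

# Closing from 50 km to 48 km in one step yields +0.02; opening the
# range yields a symmetric negative reward, so every step carries signal.
r = shaping_reward(50.0, 48.0)
```

Angle and altitude terms can be added the same way, each as its own potential, so the terminal ±1 still dominates the episode outcome.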

2. Fix Action Mapping

Investigate why action commands don't translate correctly to state changes:

  • Check PID controller parameters
  • Verify action scaling function (scale_between_inv)
  • Consider using relative actions (delta heading/altitude) instead of absolute values
  • Add action limits to prevent extreme changes
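
A minimal sketch of the relative-action alternative with per-step limits; the limit values (15° and 300 m per step) are illustrative assumptions, not tuned numbers:

```python
# Illustrative per-step limits (assumptions, not values from the repo).
MAX_HEADING_DELTA_DEG = 15.0
MAX_ALT_DELTA_M = 300.0

def apply_relative_action(heading_deg, alt_m, a_heading, a_alt):
    """a_heading and a_alt are the policy's raw outputs in [-1, 1]."""
    heading_deg = (heading_deg + a_heading * MAX_HEADING_DELTA_DEG) % 360.0
    # Clamp altitude to the environment's stated 3-12 km envelope.
    alt_m = min(12000.0, max(3000.0, alt_m + a_alt * MAX_ALT_DELTA_M))
    return heading_deg, alt_m

# A full-scale action now changes heading by at most 15 deg per step,
# instead of commanding an arbitrary absolute setpoint.
new_heading, new_alt = apply_relative_action(0.05, 8741.3, 1.0, -1.0)
```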

3. Adjust Training Hyperparameters

  • Reduce update_timestep for more frequent updates with sparse rewards
  • Consider curriculum learning (start with easier scenarios)
  • Adjust learning rate if needed
  • Add reward shaping to guide early learning

4. Add Reward Normalization and Scaling

Ensure rewards are properly scaled for PPO:

  • Current rewards (-1, 0, +1) may be too sparse for normalization
  • Consider scaling intermediate rewards appropriately
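
One common option is a running scaler that divides rewards by a running standard deviation without centering them, so the sign of the terminal ±1 is preserved. A sketch (not the repo's existing code; the Welford-style accumulator is a standard technique):

```python
import math

class RunningRewardNorm:
    """Scale rewards by a running std (Welford's online variance)."""
    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, r):
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    def normalize(self, r):
        self.update(r)
        if self.count > 1:
            std = math.sqrt(self.m2 / (self.count - 1))
        else:
            std = 1.0
        # Divide only; keeping the sign intact avoids flipping the
        # meaning of the terminal +1/-1 rewards.
        return r / (std + self.eps)
```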

Environment Details

  • Environment: BVRDog (Beyond Visual Range Dogfight)
  • Blue Aircraft: F-16 (controlled by agent)
  • Red Aircraft: F-16r (controlled by behavior tree)
  • Initial Distance: ~77.97km
  • Max Simulation Time: 960 seconds
  • Action Space: Heading [0-360°], Altitude [3-12km]

Files Involved

  • jsb_gym/environmets/bvrdog.py - Environment and reward function
  • jsb_gym/TAU/aircraft.py - Action execution (cmd_BVR)
  • jsb_gym/utils/utils.py - Action scaling (scale_between_inv)
  • mainBVRGym.py - Training script
  • result/training.log - Training logs

Additional Context

The commented code in the reward function suggests previous attempts at intermediate rewards were disabled:

# if abs(self.angle_to_f16r) < 35:
#     return 1
# else:
#     return 0

This indicates that reward shaping was considered but may need to be re-implemented with a more comprehensive approach.
