Training Failure: PPO Agent Fails to Learn in BVRDog Environment
Problem Description
The PPO agent is unable to learn effectively in the BVRDog (Beyond Visual Range Dogfight) environment. After training for 75+ episodes, the agent shows no improvement and consistently fails to achieve the objective (defeating the red aircraft).
Training Configuration
- Algorithm: PPO (Proximal Policy Optimization)
- State Dimension: 8
- Action Dimension: 2 (heading, altitude)
- Learning Rate: 1e-05
- Gamma: 1.0
- Epsilon Clip: 0.2
- K Epochs: 80
- Update Timestep: 600
- Action Std: 0.3
- Normalize Rewards: True
Observed Issues
1. Extremely Sparse Rewards (Critical)
The reward function only provides non-zero rewards at episode termination:
- During the episode: every intermediate step returns `reward = 0`
- Episode end: returns `+1` (blue wins), `-1` (red wins or timeout), or `-1` (other termination)
Code Location: `jsb_gym/environmets/bvrdog.py:676-692`
```python
def get_reward(self, is_done):
    if is_done:
        if not self.f16r_alive:
            return 1
        elif not self.f16_alive:
            return -1
        else:
            return -1
    else:
        return 0
```
Impact:
- No learning signal during the episode
- Agent cannot differentiate between good and bad intermediate actions
- Makes training extremely difficult, especially for long episodes
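The effect of the sparse signal can be reproduced with a short sketch (illustrative code, not from the repo): with `gamma = 1.0` and a purely terminal reward, every step in the episode receives exactly the same return, so the critic sees no variation with which to distinguish early actions from late ones.

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute per-step discounted returns for one episode (backward pass)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Episode 1 pattern: 11 zero-reward steps followed by the terminal -1.
episode = [0.0] * 11 + [-1.0]
print(discounted_returns(episode, gamma=1.0))  # every step's return is -1.0
```

Every action in the trajectory is credited identically, which is why the agent cannot differentiate good intermediate actions from bad ones.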
2. Abnormal Action Execution
Action mapping from normalized values [-1, 1] to actual control commands shows inconsistent behavior:
Example from Episode 1 (Step 1):
- Raw action: heading=0.3656, altitude=0.3547
- Scaled action: heading=245.80°, altitude=9096.2m
- Before execution: heading=0.05°, alt=8741.3m
- After execution: heading=306.38°, alt=7118.4m
- Heading change: 306.33° (should not jump this much in one step)
- Altitude change: -1622.9m (very large)
Problem: The agent commands absolute heading/altitude values, but the actual state changes are erratic and don't match the intended actions.
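One way to keep single-step changes bounded is to interpret the policy output as a relative command rather than an absolute target. The following is a minimal sketch; the function name, turn/climb limits, and altitude bounds are illustrative assumptions, not the repo's API.

```python
def apply_delta_action(current_heading_deg, current_alt_m, action,
                       max_turn_deg=30.0, max_climb_m=500.0):
    """Map a normalized action in [-1, 1] to a bounded *relative* command.
    The limits here are hypothetical; tune them to the aircraft dynamics."""
    d_heading, d_alt = action
    # Heading wraps around at 360°; altitude is clamped to the [3 km, 12 km] envelope.
    new_heading = (current_heading_deg + d_heading * max_turn_deg) % 360.0
    new_alt = min(12000.0, max(3000.0, current_alt_m + d_alt * max_climb_m))
    return new_heading, new_alt
```

With this scheme a full-deflection action can change the commanded heading by at most `max_turn_deg` per step, which directly rules out the 306° single-step jump seen in the logs.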
3. Short or Unproductive Episodes
- Early episodes (e.g., Episode 1): Only 12 steps before termination (blue aircraft shot down)
- Later episodes (e.g., Episode 75-76): Run to timeout (960 seconds, 96 steps) with no progress
- All episodes end with `reward = -1` (no wins observed)
- Distance may increase instead of decrease (e.g., from 77.97km to 151.69km in Episode 2)
4. No Intermediate Learning Signal
From Episode 1 reward breakdown:
```
Step rewards: [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
               0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.0000]
```
All steps except the final one provide zero reward, making it impossible for the agent to learn which actions are beneficial during the episode.
Expected Behavior
The agent should:
- Learn to approach and engage the red aircraft
- Maintain advantageous position (distance, altitude, angle)
- Successfully defeat the red aircraft in some episodes
- Show improving performance over training episodes
Actual Behavior
- Agent consistently fails (all episodes end with `reward = -1`)
- No improvement after 75+ episodes
- Episodes either end too quickly (agent shot down) or timeout without engagement
- Distance often increases instead of decreases
Log Evidence
From result/training.log:
```
[Episode 1] Reward: -1.0000 (positive rewards: 0, negative rewards: 1, zero rewards: 11),
Red wins (blue F16 shot down), Steps: 12,
Initial distance: 77.97km, Final distance: 42.94km
[Episode 75] Reward: -1.0000 (positive rewards: 0, negative rewards: 1, zero rewards: 95),
Timeout (reached maximum time of 960.01 seconds), Steps: 96,
Initial distance: 77.97km, Final distance: 16.05km
```
Suggested Fixes
1. Implement Dense Reward Function (High Priority)
Add intermediate rewards based on:
- Distance reduction (reward for closing distance)
- Maintaining advantageous position (angle, altitude)
- Avoiding missile threats (negative reward for being locked)
- Approaching engagement envelope (positive reward for getting in range)
Example structure:
```python
def get_reward(self, is_done):
    if is_done:
        # Terminal rewards (existing)
        ...
    else:
        # Intermediate rewards
        distance_reward = ...  # reward for closing distance
        angle_reward = ...     # reward for a good aspect angle
        altitude_reward = ...  # reward for maintaining altitude
        return distance_reward + angle_reward + altitude_reward
```
2. Fix Action Mapping
Investigate why action commands don't translate correctly to state changes:
- Check PID controller parameters
- Verify the action scaling function (`scale_between_inv`)
- Consider using relative actions (delta heading/altitude) instead of absolute values
- Add action limits to prevent extreme changes
3. Adjust Training Hyperparameters
- Reduce `update_timestep` for more frequent updates with sparse rewards
- Consider curriculum learning (start with easier scenarios)
- Adjust learning rate if needed
- Add reward shaping to guide early learning
4. Add Reward Normalization and Scaling
Ensure rewards are properly scaled for PPO:
- Current rewards (-1, 0, +1) may be too sparse for normalization
- Consider scaling intermediate rewards appropriately
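A common PPO-style approach is to divide rewards by a running estimate of their standard deviation (scaling without mean subtraction, so the sign of the terminal ±1 is preserved). The sketch below is a generic implementation of that idea, not the repo's existing `Normalize Rewards` logic.

```python
class RunningRewardScaler:
    """Scale rewards by a running std estimate (Welford's online algorithm).
    The mean is tracked but never subtracted, so reward signs are preserved."""

    def __init__(self, eps=1e-8):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def __call__(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.n - 1, 1)) ** 0.5
        # Before any variance is observed, pass the reward through unchanged.
        return r / (std + self.eps) if std > 0 else r
```

Note the interaction with issue 1: on the current mostly-zero reward stream the running std is small, so this scaler sharply inflates the rare terminal reward; the statistic only becomes well-behaved once dense shaping rewards are in place.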
Environment Details
- Environment: BVRDog (Beyond Visual Range Dogfight)
- Blue Aircraft: F-16 (controlled by agent)
- Red Aircraft: F-16r (controlled by behavior tree)
- Initial Distance: ~77.97km
- Max Simulation Time: 960 seconds
- Action Space: Heading [0-360°], Altitude [3-12km]
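The logged "scaled action" values from Episode 1 are consistent with a plain linear map from [-1, 1] onto these ranges; the sketch below reproduces them (it is an illustrative counterpart to the repo's `scale_between_inv`, whose exact signature is unknown). That the numbers match suggests the scaling itself is correct and the erratic behavior lies downstream, in how absolute targets are executed.

```python
def scale_between(x, lo, hi):
    # Linear map from x in [-1, 1] to [lo, hi].
    return lo + (x + 1.0) * 0.5 * (hi - lo)

# Raw actions from Episode 1, Step 1:
heading = scale_between(0.3656, 0.0, 360.0)        # ≈ 245.81°, matches the log
altitude = scale_between(0.3547, 3000.0, 12000.0)  # ≈ 9096.2 m, matches the log
```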
Files Involved
- `jsb_gym/environmets/bvrdog.py` - Environment and reward function
- `jsb_gym/TAU/aircraft.py` - Action execution (`cmd_BVR`)
- `jsb_gym/utils/utils.py` - Action scaling (`scale_between_inv`)
- `mainBVRGym.py` - Training script
- `result/training.log` - Training logs
Additional Context
The commented-out code in the reward function suggests previous attempts at intermediate rewards were disabled:
```python
# if abs(self.angle_to_f16r) < 35:
#     return 1
# else:
#     return 0
```
This indicates that reward shaping was considered but may need to be re-implemented with a more comprehensive approach.
