[P1][Phase-3][experimental] Beijing Baseline Validation & 3×3 Model Grid Completion #55

## Metadata
- **Priority**: P1-Major
- **Phase**: Phase 3 (Before Publication)
- **Feasibility**: experimental
- **Category**: validation
- **Effort**: ~75 hours of training (3 models × 25 epochs × ~1 hour per epoch)
- **Dependencies**: None (can run immediately)

## Problem Statement

The Beijing dataset currently has **incomplete model coverage** compared to Porto, and its vanilla baseline has been trained with only one random seed, so its reproducibility cannot yet be verified.

### Current Model Coverage

**Beijing** (3 models - incomplete):
- vanilla: 1 seed (seed42) ⚠️  **Only 1 seed - needs validation**
- distilled: 2 seeds (seed42, seed44) - Missing seed43

**Porto** (9 models - complete):
- vanilla: 3 seeds (42, 43, 44) ✅
- distill_phase1: 3 seeds (42, 43, 44) ✅
- distill_phase2: 3 seeds (42, 43, 44) ✅
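
The gaps in the coverage listed above can be derived mechanically. The following is a minimal sketch, assuming the coverage is written out by hand as a dictionary; the layout and the `missing_seeds` helper are illustrative and not part of the repository.

```python
# Seeds expected for every model type (Porto uses 42, 43, 44).
EXPECTED_SEEDS = {42, 43, 44}

# Coverage as listed in this issue; the dictionary layout is an assumption.
coverage = {
    "Beijing": {"vanilla": {42}, "distilled": {42, 44}},
    "Porto": {
        "vanilla": {42, 43, 44},
        "distill_phase1": {42, 43, 44},
        "distill_phase2": {42, 43, 44},
    },
}


def missing_seeds(grid):
    """Return {model_type: sorted missing seeds} for model types with gaps."""
    gaps = {}
    for model_type, seeds in grid.items():
        missing = sorted(EXPECTED_SEEDS - seeds)
        if missing:
            gaps[model_type] = missing
    return gaps


for dataset, grid in coverage.items():
    print(dataset, missing_seeds(grid) or "complete")
```

For Beijing this reports seeds 43 and 44 missing for vanilla and seed 43 missing for distilled, which is exactly the set of models proposed in Option B.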

## Evidence of Problem

**Concern**: Beijing vanilla baseline (seed42) may have anomalies
- Only 1 vanilla seed makes it impossible to verify consistency
- Cannot compute cross-seed statistics (mean, std, confidence intervals)
- Porto has 3 seeds for robustness - Beijing should match

**Asymmetry**: Beijing has 2 distilled but only 1 vanilla
- Unbalanced comparison (2-seed distilled vs 1-seed vanilla)
- Cannot apply cross-seed analysis tools (designed for N=3)
- Porto's 3×3 grid enables robust statistical analysis

## Required Changes

### Option A: Minimal Validation (2 models, ~50 hours)
**Add 2 vanilla models** to verify baseline:
- vanilla_25epoch_seed43.pth
- vanilla_25epoch_seed44.pth

**Result**: 3 vanilla + 2 distilled (sufficient for validation)

**Runtime**: 2 models × 25 epochs × ~1 hour per epoch = **50 hours (~2.1 days)**

### Option B: Complete 3×3 Grid (3 models, ~75 hours) ✅ **Recommended**
**Add 2 vanilla + 1 distilled** to match Porto:
- vanilla_25epoch_seed43.pth
- vanilla_25epoch_seed44.pth
- distilled_25epoch_seed43.pth

**Result**: 3 vanilla + 3 distilled (matches Porto's structure)

**Runtime**: 3 models × 25 epochs × ~1 hour per epoch = **75 hours (~3.1 days)**

### Option C: Just One More Vanilla (1 model, ~25 hours)
**Minimal risk mitigation**:
- vanilla_25epoch_seed43.pth

**Result**: 2 vanilla + 2 distilled (at least one validation point)

**Runtime**: 1 model × 25 epochs × ~1 hour per epoch = **25 hours (~1 day)**
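
The runtime figures for the three options follow from the same arithmetic. A small sketch, assuming the models are trained one after another on a single GPU and that the ~1 hour figure is per epoch; `training_hours` is a hypothetical helper:

```python
EPOCHS = 25            # same as the existing models
HOURS_PER_EPOCH = 1.0  # approximate figure quoted in this issue


def training_hours(n_models, epochs=EPOCHS, hours_per_epoch=HOURS_PER_EPOCH):
    """Sequential training time in hours for n_models (single-GPU assumption)."""
    return n_models * epochs * hours_per_epoch


for option, n_models in {"A": 2, "B": 3, "C": 1}.items():
    hours = training_hours(n_models)
    print(f"Option {option}: {hours:.0f} h (~{hours / 24:.1f} days)")
```

If the per-epoch time turns out to differ on the training machine, changing `HOURS_PER_EPOCH` rescales all three estimates at once.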

## Recommended Approach: Option B (Complete 3×3 Grid)

### Why This Matters

**Statistical Rigor**:
- Cross-seed analysis requires N≥3 for meaningful statistics
- Confidence intervals need multiple seeds
- Sample standard deviation is undefined with N=1 and highly unreliable with N=2

**Publication Quality**:
- Reviewers expect multiple random seeds for reproducibility
- Porto has 3 seeds - Beijing should match for consistency
- Asymmetric coverage raises questions

**Risk Mitigation**:
- If vanilla seed42 has issues, we have seeds 43, 44 to validate
- 3 seeds reveal if distillation benefits are seed-dependent
- Enables robust paired statistical tests (Issue #51)
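
To make the cross-seed statistics concrete, here is a minimal sketch of mean, sample standard deviation, and a 95% confidence interval based on Student's t. It is not the logic of `scripts/analysis/cross_seed_analysis.py`, which this issue does not show; the function name and the metric values are placeholders.

```python
import math
import statistics

# Two-sided 95% Student-t critical values by degrees of freedom.
# Only N=2 (df=1) and N=3 (df=2) are tabulated in this sketch.
T_CRIT_95 = {1: 12.706, 2: 4.303}


def cross_seed_summary(values):
    """Return (mean, sample std, 95% CI half-width) for per-seed values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 seeds for a sample standard deviation")
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only tabulates critical values for N=2 and N=3")
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, half_width


# Hypothetical metric values for seeds 42, 43, 44 (not real results).
placeholder = [0.80, 0.82, 0.81]
mean, std, half_width = cross_seed_summary(placeholder)
print(f"mean={mean:.3f} std={std:.3f} 95% CI=±{half_width:.3f}")
```

The jump in the critical value from 4.303 (N=3) to 12.706 (N=2) illustrates why two seeds give very wide intervals, and with one seed the interval cannot be computed at all.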

### Benefits of 3×3 Grid

**Enables existing analysis tools**:
- `scripts/analysis/cross_seed_analysis.py` designed for N=3
- Paired tests more robust with matched 3×3 coverage
- Consistent methodology across datasets

**Cross-dataset comparison**:
- Beijing 3×3 vs Porto 3×3 (apples-to-apples)
- Same seed robustness analysis for both
- Fair comparison of distillation benefits
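
With matched seeds on both sides, a paired comparison works on per-seed differences. The sketch below computes a paired t statistic from scratch; it is an illustration under assumed inputs, not the test used in Issue #51, and the metric values are hypothetical.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """Return (t, degrees of freedom) for per-seed treatment - baseline differences."""
    if len(baseline) != len(treatment):
        raise ValueError("paired test needs the same seeds on both sides")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least 2 paired seeds")
    sd_diff = statistics.stdev(diffs)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical per-seed metric values for seeds 42, 43, 44 (not real results).
vanilla = [0.80, 0.82, 0.81]
distilled = [0.84, 0.85, 0.86]
t_stat, dof = paired_t_statistic(vanilla, distilled)
print(f"t={t_stat:.3f}, df={dof}")
```

With the current Beijing coverage only seed 42 exists on both sides, so no paired difference can be tested; a complete grid gives three pairs and two degrees of freedom.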

## Implementation

### Training Configuration

Use existing Beijing config with different seeds:

```bash
# Vanilla seed43
uv run python train_with_distill.py \
  --dataset Beijing \
  --config config/Beijing.yaml \
  --seed 43 \
  --cuda 0 \
  --data_dir /home/matt/Dev/HOSER-dataset

# Vanilla seed44
uv run python train_with_distill.py \
  --dataset Beijing \
  --config config/Beijing.yaml \
  --seed 44 \
  --cuda 0 \
  --data_dir /home/matt/Dev/HOSER-dataset

# Distilled seed43 (if Option B)
# NOTE: this command line is identical to the vanilla seed43 run above;
# distillation must be enabled in config/Beijing.yaml before launching it.
uv run python train_with_distill.py \
  --dataset Beijing \
  --config config/Beijing.yaml \
  --seed 43 \
  --cuda 0 \
  --data_dir /home/matt/Dev/HOSER-dataset
```

**Parameters**:
- Epochs: 25 (same as existing models)
- Config: Use existing `config/Beijing.yaml`
- Seeds: 43, 44 (to complement existing seed42)
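
Since the runs differ only in the seed, the command lines can be generated rather than copied. A minimal sketch that reuses only the flags shown above; `train_command` is a hypothetical helper, and the commands are printed rather than executed:

```python
import shlex


def train_command(seed, cuda=0):
    """Return the argv list for one Beijing training run with the given seed."""
    return [
        "uv", "run", "python", "train_with_distill.py",
        "--dataset", "Beijing",
        "--config", "config/Beijing.yaml",
        "--seed", str(seed),
        "--cuda", str(cuda),
        "--data_dir", "/home/matt/Dev/HOSER-dataset",
    ]


for seed in (43, 44):
    # Print the command for review; nothing is launched here.
    print(shlex.join(train_command(seed)))
```

To launch a run, the same list can be passed to `subprocess.run(cmd, check=True)`. Whether a run is vanilla or distilled is still decided by the config, as noted above.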

### Output Files

**New model files** (in `hoser-distill-optuna-6/models/`):
- `vanilla_25epoch_seed43.pth`
- `vanilla_25epoch_seed44.pth`
- `distilled_25epoch_seed43.pth` (if Option B)
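
Before starting evaluation, it may help to confirm that the expected checkpoints were actually written. A small sketch, assuming it is run from the directory that contains `hoser-distill-optuna-6/`; `missing_checkpoints` is a hypothetical helper:

```python
from pathlib import Path

EXPECTED = [
    "vanilla_25epoch_seed43.pth",
    "vanilla_25epoch_seed44.pth",
    "distilled_25epoch_seed43.pth",  # Option B only
]


def missing_checkpoints(models_dir, expected=EXPECTED):
    """Return the expected checkpoint names not yet present in models_dir."""
    models_dir = Path(models_dir)
    return [name for name in expected if not (models_dir / name).is_file()]


print(missing_checkpoints("hoser-distill-optuna-6/models"))
```

An empty list means all new models are in place; for Option A or C, pass a shorter `expected` list.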

### Evaluation After Training

After training new models, run **full evaluation** (can use tonight's A* eval scripts):

```bash
cd hoser-distill-optuna-6
uv run python ../python_pipeline.py --use-astar
```

This will:
- Generate trajectories for new models
- Evaluate all metrics (including trajectory-level)
- Run paired statistical analysis
- Update cross-seed analysis

## Timeline

### Option A (2 vanilla models)
- Training: 50 hours (~2.1 days)
- Evaluation: 8-10 hours
- Analysis: 2-3 hours
- **Total**: ~2.5 days

### Option B (3 models - 3×3 grid) ✅ **Recommended**
- Training: 75 hours (~3.1 days)
- Evaluation: 12-15 hours
- Analysis: 3-4 hours
- **Total**: ~4 days

### Option C (1 vanilla model)
- Training: 25 hours (~1 day)
- Evaluation: 4-5 hours
- Analysis: 1-2 hours
- **Total**: ~1.5 days
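
The totals can be checked by summing the per-stage ranges. A short sketch using the hour ranges listed above, assuming the stages run back to back with no overlap:

```python
# (training, evaluation, analysis) as (low, high) hour ranges from this issue.
TIMELINE = {
    "A": [(50, 50), (8, 10), (2, 3)],
    "B": [(75, 75), (12, 15), (3, 4)],
    "C": [(25, 25), (4, 5), (1, 2)],
}


def total_hours(stages):
    """Sum per-stage (low, high) ranges into a single (low, high) total."""
    return sum(lo for lo, _ in stages), sum(hi for _, hi in stages)


for option, stages in TIMELINE.items():
    lo, hi = total_hours(stages)
    print(f"Option {option}: {lo}-{hi} h ({lo / 24:.1f}-{hi / 24:.1f} days)")
```

The sums come to 60-63, 90-94, and 30-32 hours for Options A, B, and C, in line with the rounded totals of ~2.5, ~4, and ~1.5 days.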

## Validation Steps

- [ ] Train new vanilla models (seed43, seed44)
- [ ] Train new distilled model (seed43) if Option B
- [ ] Run full evaluation pipeline with A* search
- [ ] Verify vanilla baseline consistency across seeds
- [ ] Update cross-seed analysis
- [ ] Validate paired statistical tests work with 3×3 grid
- [ ] Document any seed-dependent effects

## Expected Outcomes

### If Vanilla Baseline is Consistent
- Seeds 42, 43, 44 show similar performance (metrics within roughly ±5% of each other)
- Confirms original vanilla seed42 is valid
- Strengthens publication claims

### If Vanilla Baseline Has Issues
- Discover seed-dependent behavior
- Identify which seed is anomalous
- Report mean ± std across 3 seeds (more robust)
- Avoids publishing potentially flawed baseline
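
One simple way to identify an anomalous seed is to compare each seed against the cross-seed median. This is a sketch under assumptions: the 5% relative tolerance mirrors the consistency band mentioned above, the median is chosen because a single outlier does not shift it with three seeds, and the metric values are hypothetical.

```python
import statistics


def flag_outlier_seeds(metric_by_seed, tolerance=0.05):
    """Return seeds whose metric deviates from the cross-seed median
    by more than `tolerance` (relative to the median)."""
    median = statistics.median(metric_by_seed.values())
    return [
        seed
        for seed, value in metric_by_seed.items()
        if abs(value - median) > tolerance * abs(median)
    ]


# Hypothetical values for one metric (not real results).
example = {42: 0.70, 43: 0.81, 44: 0.82}
print(flag_outlier_seeds(example))
```

With these placeholder numbers only seed 42 is flagged. With a single seed, as in the current vanilla baseline, nothing can be flagged because there is nothing to compare against.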

### Cross-Dataset Comparison
- Beijing 3×3 vs Porto 3×3 (symmetric comparison)
- Consistent statistical analysis methodology
- Robust conclusions about distillation benefits

## Success Criteria

- [ ] Beijing has 3 vanilla models (seeds 42, 43, 44)
- [ ] Beijing has 3 distilled models (seeds 42, 43, 44) if Option B
- [ ] Cross-seed variance within acceptable range (<10%)
- [ ] Vanilla baseline validated as consistent
- [ ] Cross-seed analysis updated for Beijing
- [ ] Paired tests run successfully with 3×3 grid
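
The variance criterion can be checked automatically once a measure is fixed. The sketch below assumes "variance <10%" is read as a coefficient of variation (sample standard deviation divided by the mean) below 0.10; that interpretation, the helper names, and the values are assumptions, not something this issue specifies.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)


def within_threshold(values, threshold=0.10):
    """True if the cross-seed coefficient of variation is below threshold."""
    return coefficient_of_variation(values) < threshold


# Hypothetical per-seed values for one metric (not real results).
placeholder = [0.80, 0.82, 0.81]
print(f"CV={coefficient_of_variation(placeholder):.4f}",
      "OK" if within_threshold(placeholder) else "exceeds threshold")
```

The check would need to be applied per metric, since a threshold that suits one metric may be too loose or too strict for another.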

## Files to Generate

**Models**:
- `models/vanilla_25epoch_seed43.pth`
- `models/vanilla_25epoch_seed44.pth`
- `models/distilled_25epoch_seed43.pth` (Option B)

**Evaluations** (auto-generated by pipeline):
- `eval/*/results.json` for new models
- `eval/*/trajectory_metrics.json` for new models
- `paired_analysis/*/` updated with new comparisons

**Documentation**:
- Update cross-seed analysis results
- Document seed consistency findings
- Note any seed-dependent effects discovered

## Priority Justification

**Why P1-Major**:
1. **Validation critical**: Only 1 vanilla seed is insufficient
2. **Pre-publication**: Must verify before submission
3. **Blocks confidence**: Cannot trust results without seed validation
4. **Relatively quick**: 3-4 days vs weeks for other tasks
5. **Prevents embarrassment**: Better to discover issues now than in reviews

## Coordination with Other Issues

**Related**:
- Issue #19: Sample size justification (benefits from 3×3 grid)
- Issue #44: Porto ablation (enables symmetric Beijing-Porto comparison)
- Issue #51: Paired statistical tests (more robust with 3×3 grid)

**Timeline**:
- Can run **immediately** (no dependencies)
- Or run **after tonight's A* evals** complete (~2 days from now)
- Should complete **before paper submission**

## Recommendation

**Pursue Option B** (Complete 3×3 Grid):
- 75 hours of training (~3.1 days; ~4 days including evaluation and analysis) is manageable
- Matches Porto's structure exactly
- Enables all cross-seed analysis tools
- Provides maximum confidence in results
- Publication-ready model coverage

**Alternative**: If time-constrained, Option A (2 vanilla) validates the baseline with ~2.1 days of training (~2.5 days including evaluation and analysis)

---

**Status**: Pending decision  
**Next Steps**: Approve option and schedule training runs
