[P1][Phase-3][experimental] Beijing Baseline Validation & 3×3 Model Grid Completion #55

@matercomus

Metadata

  • Priority: P1-Major
  • Phase: Phase 3 (Before Publication)
  • Feasibility: experimental
  • Category: validation
  • Effort: ~75 hours of training for Option B (3 models × 25 epochs × ~1 hour per epoch)
  • Dependencies: None (can run immediately)

Problem Statement

The Beijing dataset has incomplete model coverage compared to Porto, and its vanilla baseline needs validation with additional random seeds to ensure reproducibility.

Current Model Coverage

Beijing (3 of 6 models - incomplete):

  • vanilla: 1 seed (seed42) ⚠️ Only 1 seed - needs validation
  • distilled: 2 seeds (seed42, seed44) - Missing seed43

Porto (9 models - complete):

  • vanilla: 3 seeds (42, 43, 44) ✅
  • distill_phase1: 3 seeds (42, 43, 44) ✅
  • distill_phase2: 3 seeds (42, 43, 44) ✅

Evidence of Problem

Concern: Beijing vanilla baseline (seed42) may have anomalies

  • Only 1 vanilla seed makes it impossible to verify consistency
  • Cannot compute cross-seed statistics (mean, std, confidence intervals)
  • Porto has 3 seeds for robustness - Beijing should match

Asymmetry: Beijing has 2 distilled but only 1 vanilla

  • Unbalanced comparison (2-seed distilled vs 1-seed vanilla)
  • Cannot apply cross-seed analysis tools (designed for N=3)
  • Porto's 3×3 grid enables robust statistical analysis

Required Changes

Option A: Minimal Validation (2 models, ~50 hours)

Add 2 vanilla models to verify baseline:

  • vanilla_25epoch_seed43.pth
  • vanilla_25epoch_seed44.pth

Result: 3 vanilla + 2 distilled (sufficient for validation)

Runtime: 2 × 25 epochs × ~1 hour = 50 hours (~2.1 days)

Option B: Complete 3×3 Grid (3 models, ~75 hours) ✅ Recommended

Add 2 vanilla + 1 distilled to match Porto:

  • vanilla_25epoch_seed43.pth
  • vanilla_25epoch_seed44.pth
  • distilled_25epoch_seed43.pth

Result: 3 vanilla + 3 distilled (matches Porto's structure)

Runtime: 3 × 25 epochs × ~1 hour = 75 hours (~3.1 days)

Option C: Just One More Vanilla (1 model, ~25 hours)

Minimal risk mitigation:

  • vanilla_25epoch_seed43.pth

Result: 2 vanilla + 2 distilled (at least one validation point)

Runtime: 1 × 25 epochs × ~1 hour = 25 hours (~1 day)

Recommended Approach: Option B (Complete 3×3 Grid)

Why This Matters

Statistical Rigor:

  • Cross-seed analysis requires N≥3 for meaningful statistics
  • Confidence intervals need multiple seeds
  • Sample standard deviation is undefined for N=1 and highly unreliable for N=2
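To make the seed-count argument concrete, here is a minimal sketch of how the cross-seed summary changes with N. It is not taken from scripts/analysis/cross_seed_analysis.py; the function name `summarize` and the metric values are placeholders chosen for illustration, and the interval uses standard two-sided 95% Student-t critical values.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (N - 1).
# The table only covers N up to 5, which is enough for a 3-seed grid.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def summarize(scores):
    """Return (mean, sample std, 95% CI half-width) for one metric across seeds."""
    n = len(scores)
    mean = statistics.fmean(scores)
    if n < 2:
        # The sample standard deviation needs at least two observations.
        return mean, None, None
    std = statistics.stdev(scores)  # n - 1 denominator
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, half_width


# Placeholder metric values, NOT real results from any model.
for scores in ([0.80], [0.80, 0.82], [0.80, 0.82, 0.81]):
    print(len(scores), summarize(scores))
```

With one seed there is no spread estimate at all. With two seeds the critical value of 12.706 makes the interval very wide even when the two runs agree closely, and the third seed brings the critical value down to 4.303.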

Publication Quality:

  • Reviewers expect multiple random seeds for reproducibility
  • Porto has 3 seeds - Beijing should match for consistency
  • Asymmetric coverage raises questions

Risk Mitigation:

Benefits of 3×3 Grid

Enables existing analysis tools:

  • scripts/analysis/cross_seed_analysis.py designed for N=3
  • Paired tests more robust with matched 3×3 coverage
  • Consistent methodology across datasets

Cross-dataset comparison:

  • Beijing 3×3 vs Porto 3×3 (apples-to-apples)
  • Same seed robustness analysis for both
  • Fair comparison of distillation benefits

Implementation

Training Configuration

Use existing Beijing config with different seeds:

# Vanilla seed43
uv run python train_with_distill.py \
  --dataset Beijing \
  --config config/Beijing.yaml \
  --seed 43 \
  --cuda 0 \
  --data_dir /home/matt/Dev/HOSER-dataset

# Vanilla seed44
uv run python train_with_distill.py \
  --dataset Beijing \
  --config config/Beijing.yaml \
  --seed 44 \
  --cuda 0 \
  --data_dir /home/matt/Dev/HOSER-dataset

# Distilled seed43 (Option B only)
# The command line matches the vanilla seed43 run; distillation must be enabled in the config.
uv run python train_with_distill.py \
  --dataset Beijing \
  --config config/Beijing.yaml \
  --seed 43 \
  --cuda 0 \
  --data_dir /home/matt/Dev/HOSER-dataset

Parameters:

  • Epochs: 25 (same as existing models)
  • Config: Use existing config/Beijing.yaml
  • Seeds: 43, 44 (to complement existing seed42)
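If the runs are queued from a script rather than typed by hand, the commands can be generated per seed. This is a sketch only: the helper name `training_command` is hypothetical, it uses just the flags that appear in the commands in this issue, and it does not switch between vanilla and distilled training, since that is controlled by the config.

```python
import shlex


def training_command(seed, cuda=0, data_dir="/home/matt/Dev/HOSER-dataset"):
    """Build the argv for one Beijing training run, using only the flags shown in this issue."""
    return [
        "uv", "run", "python", "train_with_distill.py",
        "--dataset", "Beijing",
        "--config", "config/Beijing.yaml",
        "--seed", str(seed),
        "--cuda", str(cuda),
        "--data_dir", data_dir,
    ]


# Print the commands for the two missing seeds so they can be reviewed before launching.
for seed in (43, 44):
    print(shlex.join(training_command(seed)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another on the same GPU.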

Output Files

New model files (in hoser-distill-optuna-6/models/):

  • vanilla_25epoch_seed43.pth
  • vanilla_25epoch_seed44.pth
  • distilled_25epoch_seed43.pth (if Option B)
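A quick way to confirm the grid is complete after training is to compare the checkpoints on disk against the expected names. The sketch below assumes the existing seed42 and seed44 checkpoints follow the same `<type>_25epoch_seed<N>.pth` pattern as the new files; `missing_models` and `list_checkpoints` are hypothetical helper names.

```python
from pathlib import Path

# Expected Beijing checkpoints for Option B: 2 model types x 3 seeds.
EXPECTED = [
    f"{kind}_25epoch_seed{seed}.pth"
    for kind in ("vanilla", "distilled")
    for seed in (42, 43, 44)
]


def missing_models(present):
    """Return the expected checkpoint names that are not in `present`."""
    present = set(present)
    return [name for name in EXPECTED if name not in present]


def list_checkpoints(models_dir):
    """List the .pth file names found directly inside `models_dir`."""
    return sorted(p.name for p in Path(models_dir).glob("*.pth"))


# Current coverage as described in this issue (file names are assumed, see above).
current = [
    "vanilla_25epoch_seed42.pth",
    "distilled_25epoch_seed42.pth",
    "distilled_25epoch_seed44.pth",
]
print(missing_models(current))
```

Against the real directory, the call would be `missing_models(list_checkpoints("hoser-distill-optuna-6/models"))`.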

Evaluation After Training

After training the new models, run the full evaluation (tonight's A* eval scripts can be reused):

cd hoser-distill-optuna-6
uv run python ../python_pipeline.py --use-astar

This will:

  • Generate trajectories for new models
  • Evaluate all metrics (including trajectory-level)
  • Run paired statistical analysis
  • Update cross-seed analysis

Timeline

Option A (2 vanilla models)

  • Training: 50 hours (~2.1 days)
  • Evaluation: 8-10 hours
  • Analysis: 2-3 hours
  • Total: ~2.5 days

Option B (3 models - 3×3 grid) ✅ Recommended

  • Training: 75 hours (~3.1 days)
  • Evaluation: 12-15 hours
  • Analysis: 3-4 hours
  • Total: ~4 days

Option C (1 vanilla model)

  • Training: 25 hours (~1 day)
  • Evaluation: 4-5 hours
  • Analysis: 1-2 hours
  • Total: ~1.5 days
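The totals above can be reproduced from the per-stage estimates. This sketch assumes the runs execute sequentially on a single GPU (all commands use --cuda 0) and takes ~1 hour per epoch at face value; the names `OPTIONS` and `total_hours` are illustrative.

```python
HOURS_PER_EPOCH = 1.0
EPOCHS = 25

# Option -> (models to train, evaluation hours range, analysis hours range)
OPTIONS = {
    "A": (2, (8, 10), (2, 3)),
    "B": (3, (12, 15), (3, 4)),
    "C": (1, (4, 5), (1, 2)),
}


def total_hours(models, eval_range, analysis_range):
    """Return the (low, high) total hours for training + evaluation + analysis."""
    training = models * EPOCHS * HOURS_PER_EPOCH
    low = training + eval_range[0] + analysis_range[0]
    high = training + eval_range[1] + analysis_range[1]
    return low, high


for name, spec in OPTIONS.items():
    low, high = total_hours(*spec)
    print(f"Option {name}: {low:.0f}-{high:.0f} h ({low / 24:.1f}-{high / 24:.1f} days)")
```

Training dominates every option, so the per-epoch estimate is the number worth measuring first: a 20% error there shifts Option B by about 15 hours.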

Validation Steps

  • Train new vanilla models (seed43, seed44)
  • Train new distilled model (seed43) if Option B
  • Run full evaluation pipeline with A* search
  • Verify vanilla baseline consistency across seeds
  • Update cross-seed analysis
  • Validate paired statistical tests work with 3×3 grid
  • Document any seed-dependent effects

Expected Outcomes

If Vanilla Baseline is Consistent

  • Seeds 42, 43, 44 show similar performance (within ±5% of each other)
  • Confirms original vanilla seed42 is valid
  • Strengthens publication claims

If Vanilla Baseline Has Issues

  • Discover seed-dependent behavior
  • Identify which seed is anomalous
  • Report mean ± std across 3 seeds (more robust)
  • Avoids publishing potentially flawed baseline

Cross-Dataset Comparison

  • Beijing 3×3 vs Porto 3×3 (symmetric comparison)
  • Consistent statistical analysis methodology
  • Robust conclusions about distillation benefits

Success Criteria

  • Beijing has 3 vanilla models (seeds 42, 43, 44)
  • Beijing has 3 distilled models (seeds 42, 43, 44) if Option B
  • Cross-seed variance within acceptable range (<10%)
  • Vanilla baseline validated as consistent
  • Cross-seed analysis updated for Beijing
  • Paired tests run successfully with 3×3 grid
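The variance criterion can be checked mechanically once per-seed metrics exist. The sketch below interprets "cross-seed variance <10%" as a coefficient of variation (sample std divided by mean) below 0.10; that interpretation is an assumption, as are the function names and the placeholder scores.

```python
import statistics


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(scores) / abs(mean)


def seeds_consistent(scores, threshold=0.10):
    """True if the spread across seeds is below the threshold (assumed: CV < 10%)."""
    return coefficient_of_variation(scores) < threshold


# Placeholder per-seed values for one metric, NOT real results.
vanilla_scores = [0.80, 0.82, 0.81]
print(coefficient_of_variation(vanilla_scores), seeds_consistent(vanilla_scores))
```

The check would need to run per metric, since a model can be stable on one metric and seed-sensitive on another.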

Files to Generate

Models:

  • models/vanilla_25epoch_seed43.pth
  • models/vanilla_25epoch_seed44.pth
  • models/distilled_25epoch_seed43.pth (Option B)

Evaluations (auto-generated by pipeline):

  • eval/*/results.json for new models
  • eval/*/trajectory_metrics.json for new models
  • paired_analysis/*/ updated with new comparisons

Documentation:

  • Update cross-seed analysis results
  • Document seed consistency findings
  • Note any seed-dependent effects discovered

Priority Justification

Why P1-Major:

  1. Validation critical: Only 1 vanilla seed is insufficient
  2. Pre-publication: Must verify before submission
  3. Blocks confidence: Cannot trust results without seed validation
  4. Relatively quick: 3-4 days vs weeks for other tasks
  5. Prevents embarrassment: Better to discover issues now than in reviews

Coordination with Other Issues

Related:

Timeline:

  • Can run immediately (no dependencies)
  • Or run after tonight's A* evals complete (~2 days from now)
  • Should complete before paper submission

Recommendation

Pursue Option B (Complete 3×3 Grid):

  • 75 hours (~3 days) is manageable
  • Matches Porto's structure exactly
  • Enables all cross-seed analysis tools
  • Provides maximum confidence in results
  • Publication-ready model coverage

Alternative: If time-constrained, Option A (2 vanilla models) validates the baseline in ~2.5 days


Status: Pending decision
Next Steps: Approve option and schedule training runs
