Metadata
- Priority: P1-Major
- Phase: Phase 3 (Before Publication)
- Feasibility: experimental
- Category: validation
- Effort: ~75 hours (3 models × 25 epochs × ~1 hour)
- Dependencies: None (can run immediately)
Problem Statement
Beijing dataset currently has incomplete model coverage compared to Porto, and the vanilla baseline needs validation with additional random seeds to ensure reproducibility.
Current Model Coverage
Beijing (3 of 6 planned models - incomplete):
- vanilla: 1 seed (seed42) ⚠️ Only 1 seed - needs validation
- distilled: 2 seeds (seed42, seed44) - missing seed43
Porto (9 models - complete):
- vanilla: 3 seeds (42, 43, 44) ✅
- distill_phase1: 3 seeds (42, 43, 44) ✅
- distill_phase2: 3 seeds (42, 43, 44) ✅
Evidence of Problem
Concern: Beijing vanilla baseline (seed42) may have anomalies
- Only 1 vanilla seed makes it impossible to verify consistency
- Cannot compute cross-seed statistics (mean, std, confidence intervals)
- Porto has 3 seeds for robustness - Beijing should match
Asymmetry: Beijing has 2 distilled but only 1 vanilla
- Unbalanced comparison (2-seed distilled vs 1-seed vanilla)
- Cannot apply cross-seed analysis tools (designed for N=3)
- Porto's 3×3 grid enables robust statistical analysis
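The cross-seed statistics named above become computable once N=3. A minimal dependency-free sketch (the metric values are hypothetical placeholders, not measured results):

```python
import math
import statistics

def cross_seed_summary(scores):
    """Mean, sample std, and 95% CI half-width for per-seed metric values."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std; requires n >= 2
    # t critical value for a 95% CI with n-1 = 2 degrees of freedom
    t_crit = 4.303 if n == 3 else float("nan")
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, half_width

# Hypothetical per-seed values for one metric across seeds 42, 43, 44
mean, std, hw = cross_seed_summary([0.812, 0.798, 0.805])
print(f"{mean:.3f} ± {hw:.3f} (std {std:.3f})")
```

With N=1 neither the std nor the interval exists, and with N=2 the t critical value (12.71) makes the interval nearly uninformative, which is the quantitative version of the argument above.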
Required Changes
Option A: Minimal Validation (2 models, ~50 hours)
Add 2 vanilla models to verify baseline:
- vanilla_25epoch_seed43.pth
- vanilla_25epoch_seed44.pth
Result: 3 vanilla + 2 distilled (sufficient for validation)
Runtime: 2 × 25 epochs × ~1 hour = 50 hours (~2.1 days)
Option B: Complete 3×3 Grid (3 models, ~75 hours) ✅ Recommended
Add 2 vanilla + 1 distilled to match Porto:
- vanilla_25epoch_seed43.pth
- vanilla_25epoch_seed44.pth
- distilled_25epoch_seed43.pth
Result: 3 vanilla + 3 distilled (matches Porto's structure)
Runtime: 3 × 25 epochs × ~1 hour = 75 hours (~3.1 days)
Option C: Just One More Vanilla (1 model, ~25 hours)
Minimal risk mitigation:
- vanilla_25epoch_seed43.pth
Result: 2 vanilla + 2 distilled (at least one validation point)
Runtime: 1 × 25 epochs × ~1 hour = 25 hours (~1 day)
Recommended Approach: Option B (Complete 3×3 Grid)
Why This Matters
Statistical Rigor:
- Cross-seed analysis requires N≥3 for meaningful statistics
- Confidence intervals need multiple seeds
- Standard deviation meaningless with N=1 or N=2
Publication Quality:
- Reviewers expect multiple random seeds for reproducibility
- Porto has 3 seeds - Beijing should match for consistency
- Asymmetric coverage raises questions
Risk Mitigation:
- If vanilla seed42 has issues, we have seeds 43, 44 to validate
- 3 seeds reveal if distillation benefits are seed-dependent
- Enables robust paired statistical tests (Issue #51)
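A seed-matched paired test over matched coverage can be sketched without external dependencies (`scipy.stats.ttest_rel` is the usual tool; the scores below are hypothetical):

```python
import math
import statistics

def paired_t(vanilla, distilled):
    """Paired t statistic over seed-matched metric values (same seed order)."""
    diffs = [d - v for v, d in zip(vanilla, distilled)]
    n = len(diffs)
    # Compare the result against the t critical value with n-1 dof
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical seed-matched scores for seeds 42, 43, 44
t = paired_t(vanilla=[0.80, 0.79, 0.81], distilled=[0.84, 0.82, 0.85])
print(f"t = {t:.2f}")
```

The pairing is only valid when each vanilla/distilled pair shares a seed, which is exactly what the 3×3 grid guarantees and the current 1-vs-2 coverage does not.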
Benefits of 3×3 Grid
Enables existing analysis tools:
- scripts/analysis/cross_seed_analysis.py (designed for N=3)
- Paired tests more robust with matched 3×3 coverage
- Consistent methodology across datasets
Cross-dataset comparison:
- Beijing 3×3 vs Porto 3×3 (apples-to-apples)
- Same seed robustness analysis for both
- Fair comparison of distillation benefits
Implementation
Training Configuration
Use existing Beijing config with different seeds:
```
# Vanilla seed43
uv run python train_with_distill.py \
    --dataset Beijing \
    --config config/Beijing.yaml \
    --seed 43 \
    --cuda 0 \
    --data_dir /home/matt/Dev/HOSER-dataset

# Vanilla seed44
uv run python train_with_distill.py \
    --dataset Beijing \
    --config config/Beijing.yaml \
    --seed 44 \
    --cuda 0 \
    --data_dir /home/matt/Dev/HOSER-dataset

# Distilled seed43 (if Option B, with distillation enabled in config)
uv run python train_with_distill.py \
    --dataset Beijing \
    --config config/Beijing.yaml \
    --seed 43 \
    --cuda 0 \
    --data_dir /home/matt/Dev/HOSER-dataset
```
Parameters:
- Epochs: 25 (same as existing models)
- Config: use existing config/Beijing.yaml
- Seeds: 43, 44 (to complement existing seed42)
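The two vanilla runs differ only in the seed, so they can be queued from a small loop. A sketch that echoes each command for review instead of launching it (drop the `echo` to actually train; sequential because both runs target `--cuda 0`):

```shell
# Dry-run launcher for the missing vanilla seeds (echo only; remove to run)
for seed in 43 44; do
  echo uv run python train_with_distill.py \
    --dataset Beijing \
    --config config/Beijing.yaml \
    --seed "$seed" \
    --cuda 0 \
    --data_dir /home/matt/Dev/HOSER-dataset
done
```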
Output Files
New model files (in hoser-distill-optuna-6/models/):
- vanilla_25epoch_seed43.pth
- vanilla_25epoch_seed44.pth
- distilled_25epoch_seed43.pth (if Option B)
Evaluation After Training
After training new models, run full evaluation (can use tonight's A* eval scripts):
```
cd hoser-distill-optuna-6
uv run python ../python_pipeline.py --use-astar
```
This will:
- Generate trajectories for new models
- Evaluate all metrics (including trajectory-level)
- Run paired statistical analysis
- Update cross-seed analysis
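Once the pipeline has written per-model results, the cross-seed update amounts to collecting one metric across seed directories. A sketch of that aggregation (the `eval/<run>/results.json` layout and the metric key are assumptions, not the pipeline's confirmed schema):

```python
import json
import statistics
from pathlib import Path

def collect_metric(eval_root, metric):
    """Map each eval run directory name to one metric from its results.json."""
    values = {}
    for f in sorted(Path(eval_root).glob("*/results.json")):
        values[f.parent.name] = json.load(f.open())[metric]
    return values

def summarize(values):
    """Cross-seed mean and sample std (std undefined below two seeds)."""
    xs = list(values.values())
    return statistics.mean(xs), statistics.stdev(xs) if len(xs) > 1 else float("nan")
```

Usage would be something like `summarize(collect_metric("eval", "distance_jsd"))`, where `"distance_jsd"` stands in for whichever metric key the pipeline actually emits.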
Timeline
Option A (2 vanilla models)
- Training: 50 hours (~2.1 days)
- Evaluation: 8-10 hours
- Analysis: 2-3 hours
- Total: ~2.5 days
Option B (3 models - 3×3 grid) ✅ Recommended
- Training: 75 hours (~3.1 days)
- Evaluation: 12-15 hours
- Analysis: 3-4 hours
- Total: ~4 days
Option C (1 vanilla model)
- Training: 25 hours (~1 day)
- Evaluation: 4-5 hours
- Analysis: 1-2 hours
- Total: ~1.5 days
Validation Steps
- Train new vanilla models (seed43, seed44)
- Train new distilled model (seed43) if Option B
- Run full evaluation pipeline with A* search
- Verify vanilla baseline consistency across seeds
- Update cross-seed analysis
- Validate paired statistical tests work with 3×3 grid
- Document any seed-dependent effects
Expected Outcomes
If Vanilla Baseline is Consistent
- Seeds 42, 43, 44 show similar performance (±5% variance)
- Confirms original vanilla seed42 is valid
- Strengthens publication claims
If Vanilla Baseline Has Issues
- Discover seed-dependent behavior
- Identify which seed is anomalous
- Report mean ± std across 3 seeds (more robust)
- Avoids publishing potentially flawed baseline
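The consistency check in both outcomes reduces to flagging seeds outside the ±5% band around the cross-seed mean. A sketch with hypothetical per-seed values (the tolerance mirrors the variance criterion stated above):

```python
def check_consistency(scores, tol=0.05):
    """Return seeds whose metric deviates more than tol (fractional) from the mean."""
    mean = sum(scores.values()) / len(scores)
    return {seed: abs(v - mean) / mean
            for seed, v in scores.items()
            if abs(v - mean) / mean > tol}

# Hypothetical per-seed values: seed44 sits well below the other two
flags = check_consistency({"seed42": 0.80, "seed43": 0.81, "seed44": 0.72})
print(flags)
```

An empty result supports the "consistent baseline" outcome; a non-empty one identifies the anomalous seed and argues for reporting mean ± std across all three.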
Cross-Dataset Comparison
- Beijing 3×3 vs Porto 3×3 (symmetric comparison)
- Consistent statistical analysis methodology
- Robust conclusions about distillation benefits
Success Criteria
- Beijing has 3 vanilla models (seeds 42, 43, 44)
- Beijing has 3 distilled models (seeds 42, 43, 44) if Option B
- Cross-seed variance within acceptable range (<10%)
- Vanilla baseline validated as consistent
- Cross-seed analysis updated for Beijing
- Paired tests run successfully with 3×3 grid
Files to Generate
Models:
- models/vanilla_25epoch_seed43.pth
- models/vanilla_25epoch_seed44.pth
- models/distilled_25epoch_seed43.pth (Option B)
Evaluations (auto-generated by pipeline):
- eval/*/results.json for new models
- eval/*/trajectory_metrics.json for new models
- paired_analysis/*/ updated with new comparisons
Documentation:
- Update cross-seed analysis results
- Document seed consistency findings
- Note any seed-dependent effects discovered
Priority Justification
Why P1-Major:
- Validation critical: Only 1 vanilla seed is insufficient
- Pre-publication: Must verify before submission
- Blocks confidence: Cannot trust results without seed validation
- Relatively quick: 3-4 days vs weeks for other tasks
- Prevents embarrassment: Better to discover issues now than in reviews
Coordination with Other Issues
Related:
- Issue #19: Sample size justification (benefits from 3×3 grid)
- Issue #44: Porto beam search ablation study (enables symmetric Beijing-Porto comparison)
- Issue #51: Paired statistical tests (more robust with 3×3 grid)
Timeline:
- Can run immediately (no dependencies)
- Or run after tonight's A* evals complete (~2 days from now)
- Should complete before paper submission
Recommendation
Pursue Option B (Complete 3×3 Grid):
- 75 hours (~3 days) is manageable
- Matches Porto's structure exactly
- Enables all cross-seed analysis tools
- Provides maximum confidence in results
- Publication-ready model coverage
Alternative: If time-constrained, Option A (2 vanilla) validates baseline in ~2 days
Status: Pending decision
Next Steps: Approve option and schedule training runs