Add early structural triage to kill degenerate experiments at 60s #204

Open
a-nom-ali wants to merge 1 commit into karpathy:master from a-nom-ali:structural-triage

Summary

  • Adds a one-shot structural health check at the 1-minute mark (configurable via TRIAGE_TIME)
  • Computes effective rank (from the spectral entropy of each weight matrix's singular values) and gradient coherence (cosine similarity between the gradients of consecutive layers)
  • Kills the experiment early (exit(1)) if effective rank has collapsed below 50% of its initial value (TRIAGE_KILL threshold)
  • Reports eff_rank_init, eff_rank_final, and rank_retention in the final summary alongside val_bpb
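
The kill rule in the summary above can be sketched as a small pure-Python predicate. This is an illustration, not the PR's code: the function name `should_kill` and its arguments are hypothetical, and only `TRIAGE_TIME`, `TRIAGE_KILL`, and the 60s / 50% defaults come from the description.

```python
# Values taken from the PR description; everything else here is illustrative.
TRIAGE_TIME = 60    # seconds into the run; 0 disables triage
TRIAGE_KILL = 0.5   # kill if effective rank falls below this fraction of its initial value


def should_kill(eff_rank_init: float, eff_rank_now: float, elapsed_s: float,
                triage_time: float = TRIAGE_TIME,
                triage_kill: float = TRIAGE_KILL) -> bool:
    """Hypothetical helper: True if the run should be stopped at the checkpoint."""
    if triage_time <= 0:           # triage disabled
        return False
    if elapsed_s < triage_time:    # checkpoint not reached yet
        return False
    if eff_rank_init <= 0:         # no usable baseline, so do not kill
        return False
    rank_retention = eff_rank_now / eff_rank_init
    return rank_retention < triage_kill
```

Because the check is described as one-shot, a training loop using something like this would also need a flag so the predicate is evaluated only once after `TRIAGE_TIME` has elapsed, and would call `exit(1)` when it returns true.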

Motivation

The existing fast-fail check (loss > 100) only catches catastrophic divergence. Effective rank collapse, where the model's weight matrices lose expressivity, is a subtler failure mode that predicts bad final val_bpb but doesn't necessarily spike the loss. Catching it at 60s saves the remaining 4 minutes of the 5-minute budget for each degenerate hyperparameter configuration.

Implementation

  • structural_triage(model): iterates over all 2D parameters whose smaller dimension is at least 64, computes their singular values, and returns the mean effective rank and gradient coherence
  • ~50ms one-shot cost at the checkpoint (SVD on ~48 matrices at 512×512)
  • Zero new dependencies — uses torch.linalg.svdvals and F.cosine_similarity
  • 44 lines added, 0 deleted
  • Set TRIAGE_TIME = 0 to disable entirely
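
A minimal sketch of what such a function might look like, under stated assumptions. The PR does not spell out the formulas, so this assumes effective rank is the exponential of the Shannon entropy of the normalized singular values, and that gradient coherence is the mean cosine similarity between flattened gradients of consecutive qualifying parameters that share a shape. The names `effective_rank`, `structural_triage_sketch`, and `MIN_DIM` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MIN_DIM = 64  # from the PR: only 2D parameters with min dimension >= 64


def effective_rank(weight: torch.Tensor, eps: float = 1e-12) -> float:
    """Assumed definition: exp(entropy) of the normalized singular values."""
    s = torch.linalg.svdvals(weight.detach().float())
    p = s / (s.sum() + eps)                     # normalize to a distribution
    entropy = -(p * torch.log(p + eps)).sum()   # Shannon entropy in nats
    return float(torch.exp(entropy))


def structural_triage_sketch(model: nn.Module) -> tuple[float, float]:
    """Return (mean effective rank, mean gradient coherence) for a model."""
    ranks, cosines, prev_grad = [], [], None
    for param in model.parameters():
        if param.ndim != 2 or min(param.shape) < MIN_DIM:
            continue
        ranks.append(effective_rank(param))
        grad = param.grad
        if grad is None:
            continue
        # Assumption: only same-shaped neighbours are compared.
        if prev_grad is not None and prev_grad.shape == grad.shape:
            cos = F.cosine_similarity(prev_grad.flatten().float(),
                                      grad.flatten().float(), dim=0)
            cosines.append(float(cos))
        prev_grad = grad
    mean_rank = sum(ranks) / len(ranks) if ranks else 0.0
    mean_cos = sum(cosines) / len(cosines) if cosines else 0.0
    return mean_rank, mean_cos
```

Under this definition a 64×64 identity matrix has effective rank close to 64 and a rank-1 matrix has effective rank close to 1, which is the kind of collapse the triage threshold is meant to detect.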

Test plan

  • Verify Initial effective rank: X.X prints at startup
  • Verify [triage@60s] checkpoint fires at ~60s with rank ratio and coherence
  • Verify eff_rank_init, eff_rank_final, rank_retention appear in final summary
  • Confirm no performance regression (triage runs once, not per-step)
  • Test early kill by setting TRIAGE_KILL = 0.99 (should kill at the triage checkpoint unless rank retention is at least 99%)
  • Test disable by setting TRIAGE_TIME = 0 (no triage output)

🤖 Generated with Claude Code

Computes effective rank (spectral entropy of weight matrix SVDs) and
gradient coherence (cosine similarity of consecutive layer gradients)
at the 1-minute mark. If effective rank has collapsed below 50% of its
initial value, the experiment is killed early instead of running the
full 5-minute budget.

Two configurable hyperparameters: TRIAGE_TIME (seconds, 0 to disable)
and TRIAGE_KILL (fraction threshold). Rank retention is reported in
the final summary alongside val_bpb.

Zero new dependencies — pure PyTorch (torch.linalg.svdvals,
F.cosine_similarity). ~50ms one-shot cost at the checkpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>