
Eval: loss-only path + compute receipt #27

Open
StanByriukov02 wants to merge 1 commit into 1x-technologies:main from StanByriukov02:perf/loss-only-receipt

Conversation

@StanByriukov02

I was evaluating GENIE-style checkpoints and kept hitting the same issue: sometimes I only need the CE loss (from teacher-forced logits), but the current eval loop still runs MaskGIT refinement plus the decode/LPIPS/accuracy work.

This PR adds a clean fast path for that case.

What changed

  • --loss_only: compute CE loss from temporally teacher-forced logits and skip MaskGIT sampling (no refinement loop), decoding/LPIPS, and the accuracy computation.
  • --skip_decode: keep sampling/logits, but skip decoding + LPIPS (useful if you still want generation metrics without the heavy decode pipeline).
  • --device {cpu|cuda}: makes it easier to reproduce runs on CPU-only machines too.
  • Receipt: prints compute_logits_calls and compute_logits_calls_per_frame, so you can see how many full forward-logits passes actually happened.
  • xformers becomes optional: if it’s not installed (or XFORMERS_DISABLED=true), the code falls back to the basic attention path.
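In case it helps review, here's a rough sketch of the control flow. The `model` call, shapes, and `eval_step` name are illustrative placeholders, not the actual `evaluate.py` code; only the flag semantics come from this PR:

```python
import torch
import torch.nn.functional as F

def eval_step(model, tokens, loss_only=False, skip_decode=False, maskgit_steps=2):
    """Illustrative sketch of the fast path, not the actual repo code."""
    stats = {"compute_logits_calls": 0}

    # Teacher-forced logits: a single full forward pass is enough for CE loss.
    logits = model(tokens[:, :-1])                  # (B, T-1, vocab), hypothetical
    stats["compute_logits_calls"] += 1
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    if loss_only:
        # --loss_only: skip MaskGIT refinement, decoding, LPIPS, and accuracy.
        return loss, None, stats

    # Otherwise run the remaining MaskGIT refinement passes (~maskgit_steps total).
    for _ in range(maskgit_steps - 1):
        logits = model(tokens[:, :-1])              # placeholder for re-masked sampling
        stats["compute_logits_calls"] += 1

    frames = None if skip_decode else "decode + LPIPS would run here"
    return loss, frames, stats
```

With `loss_only=True` the receipt counter increments once per step; otherwise it increments ~`maskgit_steps` times, which is exactly the gap the receipt is meant to surface.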

Guarantees / scaling / what I’m not claiming

  • Guaranteed GPU-compute reduction for --loss_only: per generated frame, full forward-logits passes drop from ~maskgit_steps to 1. For maskgit_steps=2 that’s ~2× on the forward-pass-heavy part.
  • Scales with quality mode: if someone runs maskgit_steps=K, the compute ceiling is ~K× (so ~5–10× at K = 5–10).
  • CPU side matters too: skipping decode/LPIPS/acc avoids a big chunk of CPU + pipeline overhead when you only care about loss, and the new receipt makes it obvious you’re not doing extra work.
  • Not promised: end-to-end Joules/frame for the full eval script (depends on dataset IO, decode, etc). I did run a small forward-pass microbench on an H100 to sanity check direction.

H100 sanity check (forward-pass microbench)

  • maskgit_steps=2, baseline does 2 compute_logits calls per timestep, loss_only does 1
  • measured ratio_J ≈ 1.83, i.e. ~1.83× lower energy (Joules) for the forward-pass segment (same model config, loop repeated to smooth power sampling)
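For reference, the microbench shape was roughly: run the forward pass in a loop, sample power during it, and multiply mean power by elapsed time. The sketch below is not code from this PR; `read_power_watts` is a placeholder for a real power reader (e.g. `pynvml.nvmlDeviceGetPowerUsage(handle) / 1000` on NVIDIA GPUs):

```python
import time

def measure_energy_joules(fn, n_iters, read_power_watts):
    """Energy ≈ mean sampled power (W) × elapsed time (s).

    fn: one forward pass; read_power_watts: callable returning instantaneous
    power. Repeating the loop smooths out power-sampling noise."""
    samples = []
    t0 = time.perf_counter()
    for _ in range(n_iters):
        fn()
        samples.append(read_power_watts())
    elapsed = time.perf_counter() - t0
    return (sum(samples) / len(samples)) * elapsed
```

ratio_J above is then the baseline energy divided by the loss-only energy over the same forward-pass segment.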

Repro (receipt)

  • baseline:
    python genie/evaluate.py --checkpoint_dir --maskgit_steps 2
  • loss-only:
    python genie/evaluate.py --checkpoint_dir --maskgit_steps 2 --loss_only

You should see compute_logits_calls_per_frame drop from ~2.00 to ~1.00 with maskgit_steps=2.
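The per-frame receipt is simple division (illustrative helper, matching the numbers above; the frame count is hypothetical):

```python
def calls_per_frame(compute_logits_calls, generated_frames):
    # Receipt metric: total full forward-logits passes / frames generated.
    return compute_logits_calls / generated_frames

frames = 16                                       # hypothetical frame count
baseline = calls_per_frame(2 * frames, frames)    # maskgit_steps=2 -> 2.0
loss_only = calls_per_frame(1 * frames, frames)   # teacher-forced only -> 1.0
```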

This adds a fast path for CE loss evaluation that skips MaskGIT refinement sampling and the decode/LPIPS pipeline when you only care about loss.

Also prints compute_logits_calls (and per-frame) so you can see exactly how many full forward logits passes happened.

Notes:

- Guaranteed: with --loss_only, compute_logits calls per generated frame drop from ~maskgit_steps to 1 (so maskgit_steps=2 is ~2× on the GPU-compute-heavy part).

- Scales: if someone runs maskgit_steps=K for quality/sampling, the compute ceiling is ~K×.

- Not promised: end-to-end Joules/frame depends on your full eval setup; I only measured a small forward-pass microbench on H100 as a sanity check.
