Summary
The non-greedy rejection sampling path does not compute the draft distribution $q$, so it cannot apply the lossless acceptance/replacement rule from speculative decoding. It currently uses the target distribution $p$ only (with some candidate masking), which deviates from the referenced algorithms.
Expected (lossless) behavior
From the EAGLE-3 paper (end of Sec. 2.1), and from Fast Inference from Transformers via Speculative Decoding (2022) (Sec. 2.3, Algorithm 1), the lossless rule is:
- Accept the drafted token with probability $\min(1, p/q)$.
- If rejected, sample the replacement from $\mathrm{norm}(\max(0, p - q))$.
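The rule above can be checked numerically. The following is a minimal NumPy sketch (not the repo's code, and with made-up example distributions), plus a Monte Carlo check that the output marginal matches the target distribution $p$ exactly, regardless of the draft distribution $q$:

```python
import numpy as np

def speculative_step(p, q, rng):
    """One lossless rejection-sampling step (Leviathan et al., 2022, Alg. 1).

    p: target distribution over the vocabulary
    q: draft distribution the proposed token was sampled from
    Returns a token whose marginal distribution is exactly p.
    """
    x = rng.choice(len(q), p=q)                   # token proposed by the draft model
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with probability min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)             # norm(max(0, p - q))
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

# Monte Carlo check: the output marginal matches p, not q.
rng = np.random.default_rng(0)
p = np.array([0.1, 0.6, 0.3])
q = np.array([0.5, 0.25, 0.25])
counts = np.bincount([speculative_step(p, q, rng) for _ in range(100_000)], minlength=3)
print(counts / counts.sum())  # ≈ [0.1, 0.6, 0.3]
```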
Current behavior in this repo
In `eagle/model/utils.py`, the non-greedy branch uses target logits only. `evaluate_posterior` computes:

```python
gt_logits = logits[fi, i - 1][None]
gt_logits = logits_processor(None, gt_logits)[0]
gtp = torch.softmax(gt_logits, dim=0)
...
qx = 1.0
acp = px / qx
```

Since `qx` is hard-coded to `1.0`, the acceptance probability is just `px`; this is not $\min(1, p/q)$.
- Replacement sampling uses `sample_p = gtp` (target-only), and `update_inference_inputs` samples from `sample_p`, not from $\mathrm{norm}(\max(0, p - q))$.
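To see why this is lossy, here is a small numerical illustration of the described behavior in a simplified form (accept the drafted token with probability `px` since `qx = 1`, resample from the target-only distribution on rejection); the distributions `p` and `q` are made-up examples, and candidate masking is ignored:

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])    # target distribution (gtp)
q = np.array([0.5, 0.25, 0.25])  # draft distribution the tokens were actually drawn from

# Current rule: draft proposes i with prob q[i], accepted with prob p[i] (qx = 1);
# on rejection, the replacement is resampled from p itself.
accept = q * p                             # P(draft proposes i and it is accepted)
marginal = accept + (1 - accept.sum()) * p # overall output distribution
print(marginal)                            # differs from p, so the step is biased
```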
Code references
- `eagle/model/utils.py` (non-greedy branch): `evaluate_posterior(...)`, around lines 360–416
- Replacement sampling: `update_inference_inputs(...)`, around lines 460–466
This makes the non-greedy path lossy, deviating from the algorithmic guarantees described in both papers.
Suggested fix
At rejection time:
- Compute the draft logits $q$ for the same position (run the draft model on the accepted prefix, or use cached KV if available).
- Use acceptance probability $\min(1, p/q)$ for the proposed token.
- On rejection, sample the replacement from $\mathrm{norm}(\max(0, p - q))$.
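A minimal sketch of this fix at a single position. The function and argument names (`accept_or_resample`, `draft_logits`) are hypothetical illustrations, not the repo's actual API:

```python
import torch

def accept_or_resample(gt_logits, draft_logits, token, generator=None):
    """Lossless accept/replace rule at one position (sketch, not repo code).

    gt_logits:    target-model logits (after logits_processor), shape [V]
    draft_logits: draft-model logits for the same position, shape [V]
    token:        the candidate token drafted at this position
    Returns (accepted, replacement_token_or_None).
    """
    p = torch.softmax(gt_logits, dim=-1)     # target distribution p
    q = torch.softmax(draft_logits, dim=-1)  # draft distribution q
    px, qx = p[token], q[token]              # qx > 0 since token was drawn from q
    # Accept with probability min(1, p/q) instead of px / 1.0.
    if torch.rand((), generator=generator) < torch.clamp(px / qx, max=1.0):
        return True, None
    # On rejection, sample from norm(max(0, p - q)), not from p.
    # (If p == q the acceptance probability is 1, so this branch is unreachable.)
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return False, torch.multinomial(residual, 1).item()
```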