Technical Questions on Comparison Between This Project and vLLM #4

@EanWang211123

Hi,

After carefully studying your code and the related paper (https://arxiv.org/html/2510.22876v1), I have two key technical questions regarding the comparison between this project and vLLM:

1. About Padding Requirements and Variable-Length Sequence Processing

In src/baselines/inference.py, the code requires all input sequences to be padded to the same length:

# Left-pad every sequence in the batch to the length of the longest one.
batch_inputs = tokenizer.pad(batch_inputs, padding_side='left', padding=True, return_tensors="pt").to(device)
# Rebuild the attention mask so padded positions are ignored.
batch_inputs.attention_mask = (batch_inputs.input_ids != tokenizer.pad_token_id).long().to(device)
# Every sequence now shares this single, padded input length.
input_length = batch_inputs.input_ids.shape[1]

However, practical serving systems such as vLLM use PagedAttention and FlashAttention kernels to run batched forward passes over variable-length sequences, without padding them to a uniform length.
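For concreteness, here is a minimal sketch of what I mean (my own illustration using vLLM's offline LLM API, not code from your repo; the model name is only a placeholder):

from vllm import LLM, SamplingParams

# Two prompts of very different lengths; no client-side padding is applied.
prompts = [
    "Summarize the following paragraph:",
    "Write a detailed, step-by-step explanation of how continuous batching "
    "works in modern LLM serving systems, including how requests are scheduled.",
]

# Placeholder checkpoint; any supported model would do here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=64)

# PagedAttention keeps each sequence's KV cache in fixed-size blocks,
# so the prompts are batched together without being padded to a common length.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)

In this setup there is no single input_length for the whole batch, which is what prompted my question about how the accuracy comparison was set up.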

My Question: Would vLLM or similar systems with variable-length input support have the same accuracy issues you mentioned in your paper? Since vLLM doesn't need to pad sequences to the same length, how does this affect the accuracy comparison presented in your evaluation?

2. About Realignment Overhead in Cross-Batch Speculative Decoding

From the code, I noticed that in cross-batch speculative decoding a realignment step is performed whenever the sequences in a batch fall out of alignment.
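To make sure I'm reading this correctly, here is a rough sketch of what I imagine the realignment step does, re-left-padding the batch after sequences have accepted different numbers of draft tokens (my own illustration with assumed names, not your implementation):

import torch

def realign_left_padded(seqs, pad_token_id):
    # seqs: list of 1-D LongTensors whose lengths may now differ.
    max_len = max(s.size(0) for s in seqs)
    # Start from a batch filled with pad tokens, then copy each sequence
    # into the rightmost slots so the batch is left-padded again.
    batch = torch.full((len(seqs), max_len), pad_token_id, dtype=torch.long)
    for i, s in enumerate(seqs):
        batch[i, max_len - s.size(0):] = s
    attention_mask = (batch != pad_token_id).long()
    return batch, attention_mask

If the actual realignment also has to re-gather or recompute parts of the KV cache, I would expect the cost to be noticeably higher than this tensor copy alone.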

My Questions:

  1. How significant is the time overhead of this realignment process? Does it potentially negate the performance gains from speculative decoding?
  2. Your paper focuses primarily on correctness; I could not find detailed latency/throughput comparisons between the standard approach and your method once the realignment overhead is included.

Specifically:

  • Could you provide latency/throughput measurements that include the realignment overhead?
  • How does the performance scale with different batch sizes and sequence length variations?
  • Are there any optimizations in place to minimize realignment costs?

Additional Context

For a fair comparison with systems like vLLM that natively support variable-length sequences, it would be helpful to understand:

  1. Whether the padding requirement is a fundamental limitation of your approach or an implementation choice
  2. How the realignment overhead compares to the gains from speculative decoding in realistic workloads
  3. Whether there are plans to integrate with variable-length attention mechanisms (PagedAttention, FlashAttention) in future versions

Thank you for your excellent work and for considering these questions. I'm very interested in understanding the practical performance implications of your method compared to state-of-the-art systems like vLLM.
