Technical Questions on Comparison Between This Project and vLLM #4

@EanWang211123

Hi,

After carefully studying your code and the related paper (https://arxiv.org/html/2510.22876v1), I have two key technical questions regarding the comparison between this project and vLLM:

1. About Padding Requirements and Variable-Length Sequence Processing

In src/baselines/inference.py, the code requires all input sequences to be padded to the same length:

# Left-pad every sequence in the batch to the length of the longest one.
batch_inputs = tokenizer.pad(batch_inputs, padding_side='left', padding=True, return_tensors="pt").to(device)
# Rebuild the attention mask so padded positions are ignored.
batch_inputs.attention_mask = (batch_inputs.input_ids != tokenizer.pad_token_id).long().to(device)
# Every sequence now shares this single, padded input length.
input_length = batch_inputs.input_ids.shape[1]

However, practical serving systems such as vLLM use PagedAttention and FlashAttention kernels to run batched forward passes over variable-length sequences, without padding them to a uniform length.
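For concreteness, here is a minimal sketch of what I mean (my own illustration using vLLM's offline LLM API, not code from your repo; the model name is only a placeholder):

from vllm import LLM, SamplingParams

# Two prompts of very different lengths; no client-side padding is applied.
prompts = [
    "Summarize the following paragraph:",
    "Write a detailed, step-by-step explanation of how continuous batching "
    "works in modern LLM serving systems, including how requests are scheduled.",
]

# Placeholder checkpoint; any supported model would do here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=64)

# PagedAttention keeps each sequence's KV cache in fixed-size blocks,
# so the prompts are batched together without being padded to a common length.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)

In this setup there is no single input_length for the whole batch, which is what prompted my question about how the accuracy comparison was set up.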

My Question: Would vLLM or similar systems with variable-length input support have the same accuracy issues you mentioned in your paper? Since vLLM doesn't need to pad sequences to the same length, how does this affect the accuracy comparison presented in your evaluation?

2. About Realignment Overhead in Cross-Batch Speculative Decoding

From the code, I noticed that in cross-batch speculative decoding a realignment step is performed whenever the sequences in a batch fall out of alignment.
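To make sure I'm reading this correctly, here is a rough sketch of what I imagine the realignment step does, re-left-padding the batch after sequences have accepted different numbers of draft tokens (my own illustration with assumed names, not your implementation):

import torch

def realign_left_padded(seqs, pad_token_id):
    # seqs: list of 1-D LongTensors whose lengths may now differ.
    max_len = max(s.size(0) for s in seqs)
    # Start from a batch filled with pad tokens, then copy each sequence
    # into the rightmost slots so the batch is left-padded again.
    batch = torch.full((len(seqs), max_len), pad_token_id, dtype=torch.long)
    for i, s in enumerate(seqs):
        batch[i, max_len - s.size(0):] = s
    attention_mask = (batch != pad_token_id).long()
    return batch, attention_mask

If the actual realignment also has to re-gather or recompute parts of the KV cache, I would expect the cost to be noticeably higher than this tensor copy alone.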

My Questions:

  1. How significant is the time overhead of this realignment process? Does it potentially negate the performance gains from speculative decoding?
  2. Your paper focuses primarily on correctness; I could not find detailed latency/throughput comparisons between the standard approach and your method once the realignment overhead is included.

Specifically:

  • Could you provide latency/throughput measurements that include the realignment overhead?
  • How does the performance scale with different batch sizes and sequence length variations?
  • Are there any optimizations in place to minimize realignment costs?

Additional Context

For a fair comparison with systems like vLLM that natively support variable-length sequences, it would be helpful to understand:

  1. Whether the padding requirement is a fundamental limitation of your approach or an implementation choice
  2. How the realignment overhead compares to the gains from speculative decoding in realistic workloads
  3. Whether there are plans to integrate with variable-length attention mechanisms (PagedAttention, FlashAttention) in future versions

Thank you for your excellent work and for considering these questions. I'm very interested in understanding the practical performance implications of your method compared to state-of-the-art systems like vLLM.
