[Optimize] PageAttention Decode Kernel in BLOCKM Processing with TokenBlock Parallelization #64

@zip95297

Description

I have optimized the PageAttention decode kernel (located in src/layers/attention/paged_attention_decode_kernel), specifically targeting the BLOCKM processing logic.

Background

The original BLOCKM processing in the PageAttention decode kernel handled tokens sequentially, one at a time, which made it a performance bottleneck for decode-stage attention. In addition, the upstream MinivLLM repository has a known bug when force_eager is disabled, so testing directly in the original codebase is unreliable.

Optimization Details

To address the performance issue:

  1. Replaced the sequential token-by-token loop in BLOCKM with token-block parallel processing, so that a whole block of tokens is handled in each step instead of a single token.
  2. Integrated the modified PageAttention code into NanoVLLM for validation, removing one extra scaling factor from the attention logic to fit NanoVLLM's architecture.
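
For reviewers, the difference between the two processing styles can be illustrated with a minimal NumPy sketch. This is not the kernel code from this change: it ignores paging and block tables, runs on the CPU, and the names `decode_attention_per_token`, `decode_attention_per_block`, and `block_m` are hypothetical. It only assumes that the kernel accumulates a numerically stable (online) softmax over the cached tokens, and shows that doing so one block at a time gives the same result as doing it one token at a time.

```python
import numpy as np


def decode_attention_per_token(q, K, V, scale):
    """Online-softmax decode attention, visiting one cached token per step."""
    m = -np.inf                     # running maximum of the logits
    l = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[1])      # running weighted sum of value rows
    for k_row, v_row in zip(K, V):
        s = scale * float(q @ k_row)
        m_new = max(m, s)
        alpha = np.exp(m - m_new)   # rescales what was accumulated so far
        p = np.exp(s - m_new)
        l = l * alpha + p
        acc = acc * alpha + p * v_row
        m = m_new
    return acc / l


def decode_attention_per_block(q, K, V, scale, block_m=16):
    """Same accumulation, but each step consumes a block of `block_m` tokens.

    `block_m` is an illustrative stand-in for the kernel's BLOCKM setting.
    """
    m = -np.inf
    l = 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block_m):
        k_blk = K[start:start + block_m]    # (block, d); last block may be short
        v_blk = V[start:start + block_m]
        s = scale * (k_blk @ q)             # all logits of the block at once
        m_new = max(m, float(s.max()))
        alpha = np.exp(m - m_new)
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v_blk
        m = m_new
    return acc / l


def reference_attention(q, K, V, scale):
    """Plain softmax attention for a single query, used as ground truth."""
    s = scale * (K @ q)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V


rng = np.random.default_rng(0)
n_tokens, head_dim = 1000, 64
q = rng.standard_normal(head_dim)
K = rng.standard_normal((n_tokens, head_dim))
V = rng.standard_normal((n_tokens, head_dim))
scale = head_dim ** -0.5

out_token = decode_attention_per_token(q, K, V, scale)
out_block = decode_attention_per_block(q, K, V, scale, block_m=16)
out_ref = reference_attention(q, K, V, scale)
```

In this sketch the per-block loop runs far fewer iterations and replaces scalar work with small matrix-vector products, which is the kind of work a GPU kernel can spread across threads. The speedup reported in this issue comes from the actual kernel, not from this illustration.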

Performance Results

In testing, the optimization reduced the average latency of decode-stage attention operations by roughly 2.5x, with no correctness regressions observed in the outputs.

Request

  1. Review the tokenblock parallelization approach for BLOCKM processing in the PageAttention decode kernel to ensure alignment with MinivLLM's design principles.
  2. Consider merging the parallelized BLOCKM processing logic into the main branch to improve decode performance for MinivLLM.

Additional Context

  • The only code change to adapt to NanoVLLM was removing an extra scaling factor in the attention computation (no core logic modifications to the PageAttention kernel itself).
  • All performance tests were conducted on identical hardware/input configurations (batch size: 1, sequence length: 1024, model: LLaMA-7B) to isolate the impact of the BLOCKM processing optimization.
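
To show why an extra scaling factor matters when moving attention code between codebases, here is a small NumPy sketch. It is an assumption for illustration only: this issue does not say where the extra factor was applied, so the sketch uses a hypothetical case in which the caller pre-scales the query by `1/sqrt(d)` and the attention routine then scales the logits again.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_weights(q, K, scale):
    """Softmax weights of one query over the cached keys (illustrative helper)."""
    return softmax(scale * (K @ q))


rng = np.random.default_rng(0)
head_dim = 64
q = rng.standard_normal(head_dim)
K = rng.standard_normal((8, head_dim))
scale = head_dim ** -0.5

# Intended behaviour: the logits are scaled exactly once.
expected = attention_weights(q, K, scale)

# Hypothetical integration mistake: the query is already scaled by the
# caller, and the attention routine applies the same factor a second time.
double_scaled = attention_weights(q * scale, K, scale)

max_abs_diff = float(np.abs(expected - double_scaled).max())
```

Scaling twice shrinks the logits and flattens the softmax distribution, so the outputs differ even though the code still runs. This is why a factor like that has to be removed on one side when integrating.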
