Description
I have optimized the PagedAttention decode kernel (located in `src/layers/attention/paged_attention_decode_kernel`), specifically targeting the BLOCKM processing logic.
Background
The original implementation of BLOCKM processing in the PagedAttention decode kernel processed tokens linearly, one at a time, which created a performance bottleneck for decode-stage attention. Additionally, the upstream MinivLLM repository has a known bug when `force_eager` is disabled, making direct testing in the original codebase unreliable.
Optimization Details
To address the performance issue:
- Replaced the linear token-by-token processing in BLOCKM with token-block parallel processing to leverage parallel computation.
- Integrated the modified PagedAttention code into NanoVLLM for validation (adjusting the attention logic to remove an extra scaling factor to fit NanoVLLM's architecture).
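The contrast between the two strategies can be sketched in NumPy. This is an illustrative model of an online-softmax decode loop, not the actual kernel; the function names and the `block_m` parameter are placeholders:

```python
import numpy as np

def decode_attention_tokenwise(q, k_cache, v_cache):
    """Baseline: visit one cached token per iteration (online softmax)."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for t in range(k_cache.shape[0]):
        s = float(q @ k_cache[t]) * scale
        m_new = max(m, s)
        l = l * np.exp(m - m_new) + np.exp(s - m_new)
        acc = acc * np.exp(m - m_new) + np.exp(s - m_new) * v_cache[t]
        m = m_new
    return acc / l

def decode_attention_blocked(q, k_cache, v_cache, block_m=64):
    """Optimized: score block_m cached tokens per step, so the inner
    dot products and exponentials become vectorized (parallelizable) ops."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, k_cache.shape[0], block_m):
        kb = k_cache[start:start + block_m]   # (block_m, d)
        vb = v_cache[start:start + block_m]
        s = (kb @ q) * scale                  # (block_m,) scores at once
        m_new = max(m, float(s.max()))
        p = np.exp(s - m_new)
        l = l * np.exp(m - m_new) + float(p.sum())
        acc = acc * np.exp(m - m_new) + p @ vb
        m = m_new
    return acc / l
```

In the real kernel the per-block dot products and exponentials map onto parallel hardware lanes rather than NumPy vector ops, which is where the latency reduction comes from; both variants compute the same softmax-weighted output.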
Performance Results
After testing, the optimization reduced average decode-stage attention latency by ~2.5x, with no observed correctness regressions in the output.
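A latency comparison of this kind can be reproduced with a small harness like the following (a hypothetical helper, not part of either repo; timing on-GPU kernels would additionally require device synchronization around the timed region):

```python
import time
import statistics

def bench(fn, *args, warmup=5, iters=50):
    """Median wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):          # warm caches / JIT before timing
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Usage sketch: speedup = bench(baseline_decode, q, K, V) / bench(blocked_decode, q, K, V)
```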
Request
- Review the token-block parallelization approach for BLOCKM processing in the PagedAttention decode kernel to ensure alignment with MinivLLM's design principles.
- Consider merging the parallelized BLOCKM processing logic into the main branch to improve decode performance for MinivLLM.
Additional Context
- The only code change made to adapt to NanoVLLM was removing an extra scaling factor in the attention computation; the core logic of the PagedAttention kernel itself is unchanged.
- All performance tests were run on identical hardware and input configurations (batch size 1, sequence length 1024, model LLaMA-7B) to isolate the impact of the BLOCKM processing optimization.
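A common source of an "extra scaling factor" is applying the 1/sqrt(d) softmax scale both in the calling code and inside the kernel. Assuming that is the situation described above, the adaptation looks like this (the function name and `apply_scale` flag are illustrative, not NanoVLLM's actual API):

```python
import numpy as np

def attention_scores(q, k, apply_scale=True):
    """Decode-step attention scores for one query against cached keys.
    If the caller already pre-scales the query by 1/sqrt(d), the kernel
    must not scale again, or scores end up divided by d instead of sqrt(d)."""
    scale = 1.0 / np.sqrt(q.shape[-1]) if apply_scale else 1.0
    return (k @ q) * scale

d = 64
q = np.ones(d)
k = np.ones((8, d))
pre_scaled_q = q / np.sqrt(d)          # caller-side scaling (assumed)

double_scaled = attention_scores(pre_scaled_q, k, apply_scale=True)   # wrong: scaled twice
fixed = attention_scores(pre_scaled_q, k, apply_scale=False)          # correct: scaled once
```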