
Feature: Add Fused Attention Kernel #1

@ShlokVFX

Description

🚀 Feature: Fused Attention Kernel

Overview

Implement a fused attention kernel to improve performance by combining
QKV projection, attention score computation, softmax, and output projection
into a single optimized kernel.
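
For reference, the unfused path being targeted roughly corresponds to the sketch below. Tensor shapes, weight layout, and the function name are illustrative assumptions, not taken from the existing code; each numbered step is currently at least one separate kernel launch, which is what the fusion removes.

```python
# Illustrative unfused reference path (names and shapes are assumptions,
# not taken from the existing codebase). Each numbered step is a separate
# kernel launch today; the proposal is to fuse them into one kernel.
import torch

def unfused_attention(x, w_qkv, w_out, num_heads):
    # x: (batch, seq, d_model), w_qkv: (d_model, 3*d_model), w_out: (d_model, d_model)
    b, s, d = x.shape
    head_dim = d // num_heads

    qkv = x @ w_qkv                                      # 1. QKV projection
    q, k, v = qkv.chunk(3, dim=-1)
    q = q.view(b, s, num_heads, head_dim).transpose(1, 2)
    k = k.view(b, s, num_heads, head_dim).transpose(1, 2)
    v = v.view(b, s, num_heads, head_dim).transpose(1, 2)

    scores = (q @ k.transpose(-2, -1)) / head_dim**0.5  # 2. attention scores
    probs = torch.softmax(scores, dim=-1)                # 3. softmax
    ctx = probs @ v                                      # 4. weighted values
    ctx = ctx.transpose(1, 2).reshape(b, s, d)
    return ctx @ w_out                                   # 5. output projection
```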

Motivation

  • Reduce memory bandwidth overhead
  • Minimize kernel launch latency
  • Improve end-to-end attention throughput
  • Align with FlashAttention-style optimizations

Scope

  • Single-GPU support initially
  • FP16 / BF16 (FP8 optional later)
  • CUDA kernel (Triton or CUDA C++)
  • Drop-in replacement for existing attention path (see the dispatch sketch after this list)
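
One possible shape for the drop-in dispatch is sketched below; `attention` and `fused_attention_forward` are hypothetical placeholder names rather than the project's current API, and the fallback reuses the `unfused_attention` reference from the sketch above.

```python
# Hypothetical drop-in wrapper (names are placeholders, not the project's
# actual API). The fused kernel is used only when it applies; otherwise the
# existing unfused path runs, so call sites do not need to change.
import torch

def fused_attention_forward(x, w_qkv, w_out, num_heads):
    # Placeholder for the kernel proposed in this issue.
    raise NotImplementedError("fused attention kernel not implemented yet")

def attention(x, w_qkv, w_out, num_heads, use_fused=True):
    fusable = (
        use_fused
        and x.is_cuda
        and x.dtype in (torch.float16, torch.bfloat16)
    )
    if fusable:
        return fused_attention_forward(x, w_qkv, w_out, num_heads)
    return unfused_attention(x, w_qkv, w_out, num_heads)  # reference path from above
```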

Planned Approach

  • Implement fused QKV + attention + output
  • Use shared memory / SRAM tiling (see the tiling sketch after this list)
  • Optimize for RTX 3060 (SM86)
  • Benchmark against baseline attention
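
As a rough illustration of the tiling step, the sketch below runs a FlashAttention-style online softmax over key/value blocks in plain PyTorch; the block size and names are assumptions, and the real kernel would keep each tile in shared memory/registers instead of slicing tensors. The running max/denominator trick is what lets the fused kernel avoid materializing the full seq x seq score matrix in HBM.

```python
# Pure-PyTorch sketch of the tiling / online-softmax scheme the kernel would
# use (FlashAttention-style). Block size and names are illustrative only.
import torch

def tiled_attention(q, k, v, block=128):
    # q, k, v: (batch, heads, seq, head_dim)
    scale = q.shape[-1] ** -0.5
    seq = q.shape[2]
    out = torch.empty_like(q)
    for qs in range(0, seq, block):
        q_blk = q[:, :, qs:qs + block] * scale
        row_max = torch.full(q_blk.shape[:-1], float("-inf"),
                             device=q.device, dtype=q.dtype)   # running max
        denom = torch.zeros_like(row_max)                      # running softmax denominator
        acc = torch.zeros_like(q_blk)                          # running weighted sum of V
        for ks in range(0, seq, block):
            s = q_blk @ k[:, :, ks:ks + block].transpose(-2, -1)  # scores for this tile
            new_max = torch.maximum(row_max, s.amax(dim=-1))
            p = torch.exp(s - new_max.unsqueeze(-1))
            alpha = torch.exp(row_max - new_max)               # rescale previous statistics
            denom = denom * alpha + p.sum(dim=-1)
            acc = acc * alpha.unsqueeze(-1) + p @ v[:, :, ks:ks + block]
            row_max = new_max
        out[:, :, qs:qs + block] = acc / denom.unsqueeze(-1)
    return out
```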

Acceptance Criteria

  • Correctness matches reference attention
  • Lower latency than unfused implementation
  • Reduced memory reads/writes
  • Benchmark CSV generated (see the harness sketch after this list)
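
A minimal sketch of the benchmark/correctness harness implied by these criteria is below; the shapes, output file name, and the `baseline`/`candidate` callables are assumptions (both are expected to take `(q, k, v)` and run on a CUDA device), e.g. the tiled sketch above against the current implementation.

```python
# Sketch of a benchmark + correctness harness for the acceptance criteria.
# Shapes, file name, and the candidate function are assumptions; the real
# harness would compare the fused kernel against the existing attention path.
import csv
import time
import torch

def bench(fn, *args, iters=50):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call

def run(baseline, candidate, shapes, out_path="attention_bench.csv"):
    rows = []
    for b, h, s, d in shapes:
        # Assumes a CUDA device and FP16 inputs, matching the issue's scope.
        q, k, v = (torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
                   for _ in range(3))
        ref, test = baseline(q, k, v), candidate(q, k, v)
        rows.append({
            "batch": b, "heads": h, "seq": s, "head_dim": d,
            "baseline_ms": bench(baseline, q, k, v),
            "candidate_ms": bench(candidate, q, k, v),
            "max_abs_err": (ref - test).abs().max().item(),  # correctness vs reference
        })
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```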

Out of Scope (for now)

  • Multi-GPU
  • Backward pass
  • Training support

Notes

This is a planned feature and will be implemented in a future PR.

Metadata

Assignees: No one assigned
Labels: feature (Planned feature request)
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
