Conversation

@guoqingbao
Contributor

No description provided.

@DrJesseGlass
Contributor

I'll review this PR to make sure I built along the same lines. I had integrated GPU and CPU flash attention into Qwen3 (but not the MoE variant), but while doing so I found substantial reason to upgrade the CPU flash attention, so I made an early, simple submission (#3254) to get an improved CPU flash structure in place before starting serious optimizations.

However, someone else was simultaneously working on #3250 to integrate varlen, and the agreement at the time was that once varlen was integrated I would tweak mine to incorporate it. That work has been sitting unfinished.

@guoqingbao
Contributor Author

> I'll review this PR to make sure I built along the same lines. I had integrated GPU and CPU flash attention into Qwen3 (but not the MoE variant), but while doing so I found substantial reason to upgrade the CPU flash attention, so I made an early, simple submission (#3254) to get an improved CPU flash structure in place before starting serious optimizations.

This uses the previously defined flash attention interface and remains compatible with existing implementations.

@DrJesseGlass
Contributor

Yes. It makes sense to get this integrated sooner if possible. I just meant to call out the redundancy.

@guoqingbao
Contributor Author

> Yes. It makes sense to get this integrated sooner if possible. I just meant to call out the redundancy.

It would be better to have a unified entry point for varlen attention on both CPU and GPU. I think I can help with GPU varlen attention, but it seems we are missing corresponding use cases in candle: varlen attention needs parallel requests (though it can run on a single request) to meaningfully demonstrate its performance.
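To illustrate why parallel requests matter: varlen attention typically packs a batch of variable-length sequences into one flat buffer with a cumulative-lengths array (often called `cu_seqlens`), instead of padding every request to the longest length. The sketch below is a minimal, hypothetical demo of that packing step only; it is not candle's API and `pack_varlen` is a name invented for illustration.

```python
# Hypothetical sketch of varlen-style batch packing (not candle's API).
# Tokens from all requests are concatenated into one flat buffer, and
# cu_seqlens records the boundary offsets between requests.

def pack_varlen(requests):
    """requests: list of token lists, each with a different length."""
    packed = []
    cu_seqlens = [0]
    for tokens in requests:
        packed.extend(tokens)
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed, cu_seqlens

# Three parallel requests of lengths 3, 1, and 2.
packed, cu = pack_varlen([[10, 11, 12], [20], [30, 31]])
print(packed)  # [10, 11, 12, 20, 30, 31]
print(cu)      # [0, 3, 4, 6]
```

The attention kernel then uses `cu_seqlens` to restrict each query to the tokens of its own request, so no compute is spent on padding; with only a single request, `cu_seqlens` degenerates to `[0, n]` and the varlen path shows no advantage over the dense one, which is why a batched use case is needed to demonstrate the benefit.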
