Great job!
However, the biggest problem with this type of work is integrating it efficiently into existing inference engines such as SGLang or vLLM. The additional overhead it introduces and the damage to existing features (paged attention, prefix cache) are hard to avoid, which can make such methods impractical. How did you address this issue?