Hello,
Thanks for the excellent paper and code.
I noticed the current repository covers the kernel optimizations (Sec 3–5) but seems to lack the Overlapped Streaming Inference implementation described in Section 6.1. Specifically, I am looking for the logic that manages concurrent VLM and Action Expert execution via separate CUDA streams.
Do you plan to release the code for this concurrent execution mode? It would be very helpful to see how the multi-stream synchronization is handled to reproduce the real-time throughput results.
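In case it helps clarify what I mean by the synchronization pattern: below is a rough stand-in sketch of the producer/consumer overlap I imagine, written with Python threads and a queue instead of CUDA streams (the queue handoff playing the role of `cudaEventRecord` / `cudaStreamWaitEvent`). The function names `vlm_step` and `action_expert_step` are placeholders, not names from the paper or repo — this is only to illustrate the question, not a claim about your implementation.

```python
import threading
import queue

def vlm_step(frame):
    # Placeholder for the VLM forward pass (would run on CUDA stream A).
    return f"features({frame})"

def action_expert_step(features):
    # Placeholder for the Action Expert forward pass (CUDA stream B).
    return f"action({features})"

def overlapped_inference(frames):
    # A size-1 queue models the event handoff between the two streams:
    # the Action Expert waits on it like a recorded CUDA event.
    feat_q = queue.Queue(maxsize=1)
    actions = []

    def vlm_worker():
        for f in frames:
            feat_q.put(vlm_step(f))  # "record event" once VLM output is ready
        feat_q.put(None)             # sentinel: no more frames

    t = threading.Thread(target=vlm_worker)
    t.start()
    while True:
        feats = feat_q.get()         # "wait on event" before acting
        if feats is None:
            break
        actions.append(action_expert_step(feats))
    t.join()
    return actions
```

The part I am most interested in seeing from the actual release is how the real version avoids stalls at this handoff point (e.g. double-buffering of the VLM features, or event-based waits on the Action Expert stream).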
Thank you for your work.