Conversation
@ztang2370 Thanks for the great work! The direction this PR is heading looks good to me. To show the benefits of kvcached, I think the test needs to run at least two models concurrently using ollama. The README has some repeated wording generated by AI, and the setup script has some AI-added symbols that should be cleaned up. We also need a cool example to show this off. For example, with webui https://github.com/open-webui/open-webui, we could have two models running together in the model list. Just some quick thoughts---you could think about the most reasonable and easiest way to show this.
The setup script has changed a lot for vllm and sglang. Maybe we can have a separate script just for ollama.
ivanium left a comment
Good job! In general, I also like the direction of this PR. A key thing to add is a running example of co-running two models on the same GPU, along with performance numbers for their throughput, P99 TTFT, and P99 ITL.
I left some comments, but they are minor.
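For reference, a minimal sketch of how P99 TTFT and P99 ITL could be computed from per-request timing samples (the `p99` helper, nearest-rank method, and all sample values below are illustrative, not taken from this PR's benchmark code):

```python
def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[idx]

# TTFT: time to first token per request; ITL: latency between consecutive tokens.
# Made-up numbers purely to show the shape of the computation.
ttft_samples = [0.12, 0.15, 0.11, 0.42, 0.13]
itl_samples = [0.021, 0.019, 0.025, 0.080]

print(f"P99 TTFT: {p99(ttft_samples):.3f}s")  # dominated by the slowest request
print(f"P99 ITL:  {p99(itl_samples):.3f}s")
```

In a real co-running benchmark, each model's client would log these timestamps separately so the two models' tail latencies can be compared.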
Force-pushed from 7d692f9 to f4031de
Signed-off-by: zt2370 <ztang2370@gmail.com>
Force-pushed from 8a32284 to 90db4fe
Issue #81
Marked as WIP. Feedback on the design and direction is welcome.
9.16 update:
https://docs.google.com/document/d/1mDTKBoCZslLcSu2OsgCNVzl-J6HeY-Vl7s19V938PHY/edit?tab=t.0
9.17 update:
Test branch:
https://github.com/ztang2370/kvcached/tree/ztang/test-ollama-integration
https://github.com/ztang2370/ollama/tree/my-v0.11.8
9.21 update:
webui: https://drive.google.com/file/d/1ZUGWDK3JleCciizZyTybe33inmGvAmVS/view?usp=sharing
TODO: