Wonderful work! I have the following questions and look forward to your reply.
- I am curious about the method in your paper that copies the KV cache from CPU memory to GPU memory.
I ask because I have tested the following code:
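import torch.nn as nn
model = nn.Linear(4096, 4096)  # placeholder sizes; nn.Linear requires in/out features
model.to('cuda')  # copies the weights from CPU memory to GPU memory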
When the model is large, this transfer takes a long time.
- Given that, is the method in your paper time-efficient?
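
To make the cost concern concrete, here is a rough back-of-envelope sketch of how the copy time might scale with KV cache size. This is not the paper's method. The function names, the model shape, and the 25 GB/s effective host-to-device bandwidth are all assumptions for illustration; real throughput depends on the PCIe generation and on whether the host memory is pinned.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of a standard transformer KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def transfer_seconds(num_bytes, bandwidth_gb_per_s=25.0):
    """Ideal copy time at an ASSUMED effective bandwidth (GB/s, 1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache: {size / 2**30:.2f} GiB, "
      f"estimated copy time: {transfer_seconds(size) * 1e3:.1f} ms")
```

Under these assumptions the copy time grows linearly with sequence length and batch size, which is why I would like to know how the paper keeps this transfer off the critical path.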