The core code in fastv_kvcache.py is shown below. To apply FastV at layer K, we need the attention matrix from layer K-1. The code obtains it by running an extra forward pass through the (K-1)-th LlamaDecoderLayer with output_attentions=True. However, this extra forward also updates the KV cache, so layer K-1 ends up with a duplicated KV cache entry, which I verified with single-step debugging. Could you explain this behavior? Is it a bug?
```python
K = 3
ratio = 0.5

# At layer K (during prefill, seq_length > 1), prune image tokens using the
# attention map recorded at layer K-1. Image tokens occupy positions 35..610
# (576 tokens); keep the top `ratio` fraction ranked by the last token's attention.
if decoder_layer.self_attn.layer_idx == K and seq_length > 1:
    device = hidden_states.device
    image_attention_score = self.last_attention.mean(dim=1)[0][-1][35:611]
    top_attention_rank_index = image_attention_score.topk(int(576 * ratio)).indices + 35
    keep_indexs = torch.cat((
        torch.arange(35, device=device),
        top_attention_rank_index,
        torch.arange(611, seq_length, device=device),
    ))
    keep_indexs = keep_indexs.sort().values
    hidden_states = hidden_states[:, keep_indexs, :]
    if attention_mask is not None:
        attention_mask = attention_mask[:, :, :hidden_states.shape[1], :hidden_states.shape[1]]
    position_ids = keep_indexs.unsqueeze(0)

# At layer K-1, run an extra forward pass with output_attentions=True
# to record the attention map used for pruning at layer K.
if decoder_layer.self_attn.layer_idx == K - 1:
    temp_layer_outputs = decoder_layer(
        hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_values,
        output_attentions=True,
        use_cache=use_cache,
    )
    self.last_attention = temp_layer_outputs[1]

# Regular forward pass for the current layer.
layer_outputs = decoder_layer(
    hidden_states,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_value=past_key_values,
    output_attentions=output_attentions,
    use_cache=use_cache,
)
```