Description
@andy-yang-1 My team really appreciates your work on Double Sparsity! Thank you for everything you do!
Question: In group_channel_config.py, the scores for identifying outlier channels are computed from q * k, which is the dot product of an individual token's own query and key vectors.
However, an outlier (or feature detection) could occur in some channel C' only during the interaction between the query vector of token T and the key vector of, say, token T-3, while C' shows no strong outlier behavior in the same-token query-key product of either token T or token T-3.
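To make the concern concrete, here is a minimal toy sketch in plain Python. It is not the code from group_channel_config.py: the function names, the toy data, and the choice to score a channel by summing absolute per-channel products are all assumptions made for illustration. It compares a score built only from same-token pairs (q_t, k_t) with one built from all causal pairs (q_t, k_s), s <= t.

```python
# Hypothetical illustration only; not the repository's calibration code.

def same_token_scores(Q, K):
    """Per-channel score using only q_t[c] * k_t[c] for the same token t."""
    n_channels = len(Q[0])
    return [sum(abs(q[c] * k[c]) for q, k in zip(Q, K)) for c in range(n_channels)]


def cross_token_scores(Q, K):
    """Per-channel score over all causal pairs (t, s) with s <= t."""
    n_channels = len(Q[0])
    return [
        sum(abs(Q[t][c] * K[s][c]) for t in range(len(Q)) for s in range(t + 1))
        for c in range(n_channels)
    ]


def top_channel(scores):
    """Index of the channel with the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: 4 tokens, 2 channels (assumed values).
# Channel 1 is large only in the key of token 0 and the query of token 3,
# so it matters only for the cross-token pair (T=3, T-3=0).
Q = [[2.0, 0.1], [2.0, 0.1], [2.0, 0.1], [2.0, 9.0]]
K = [[2.0, 9.0], [2.0, 0.1], [2.0, 0.1], [2.0, 0.1]]

print(top_channel(same_token_scores(Q, K)))   # channel 0
print(top_channel(cross_token_scores(Q, K)))  # channel 1
```

In this toy case the same-token score ranks channel 0 highest, while the causal cross-token score ranks channel 1 highest, which is the situation described above.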
I am curious how the offline calibration approach works if only interactions between the query and key vectors of the same token are considered when measuring the strength of outlier channels. Thanks!