Thanks for your paper. Is there an error in the figure? Why are the keys scaled by `(s_k^i)^{-1}` instead of `s_k^i` in the attention layer?
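
To make sure we're talking about the same thing, here is a minimal sketch of the two key-scaling variants I mean. All names (`attention`, `s_k`, `invert_scale`) and the per-dimension scale are my own assumptions for illustration, not taken from the paper:

```python
# Minimal sketch contrasting the two scalings in question.
# `s_k` is a hypothetical learned per-dimension key scale; nothing
# here is claimed to match the paper's actual implementation.
import numpy as np

def attention(q, k, v, s_k, invert_scale=True):
    """Scaled dot-product attention with an extra key scale.

    If invert_scale is True, keys are multiplied by 1 / s_k
    (the `(s_k^i)^{-1}` variant shown in the figure);
    otherwise they are multiplied by s_k directly.
    """
    scale = 1.0 / s_k if invert_scale else s_k
    k = k * scale                                  # element-wise key scaling
    logits = q @ k.T / np.sqrt(q.shape[-1])        # standard dot-product logits
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
s_k = rng.uniform(0.5, 2.0, size=8)                # hypothetical per-dim scale
out_inverse = attention(q, k, v, s_k, invert_scale=True)   # figure's version
out_direct = attention(q, k, v, s_k, invert_scale=False)   # what I expected
```

Is the inverse in the figure intentional (e.g. so the scale cancels against a matching factor on the queries), or is it a typo?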