AI 计算集群概述中code1中,Attention 层参数量计算时,公式是否有问题?标准多头注意力(不考虑GQA等技术),参数量是否应该是P_{attn_per_layer} = (d_{model} \times d_{model})Q8 + (d*{model} \times d_{model})K8 + (d*{model} \times d_{model})V8 + (d*{model} \times d_{model})O,也就是需要QKV的参数量应该是d_modeld_model*n_heads