Hello:)
Thank you so much for sharing your code. It has been very useful in understanding the paper.
There is still something I don't quite get from the paper and the code. From my understanding, the paper says that attention is calculated separately on multiple heads before the results are combined. However, I could not find where the number of heads is set in the code (I may be confused about the multi-head part). Could you point me to where you define the number of heads? I understand that in the paper, an experiment analysed the number of attention heads for MHSA from 1 to 8.
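To clarify what I mean by "calculated separately on multiple heads", here is a minimal NumPy sketch of multi-head self-attention; all names here (`mhsa`, `num_heads`, the weight matrices) are my own illustration, not taken from your repository:

```python
import numpy as np

def mhsa(x, w_q, w_k, w_v, w_o, num_heads):
    """Minimal multi-head self-attention sketch (illustrative only;
    parameter names are assumptions, not from the repo's code)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # (seq_len, d_model) each
    # Split each projection into heads: (num_heads, seq_len, d_head)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # attention per head
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)            # softmax over keys
    out = weights @ v                                    # (num_heads, seq_len, d_head)
    # Concatenate the heads back together, then apply the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = mhsa(rng.standard_normal((seq_len, d_model)), *ws, num_heads)
print(y.shape)  # (4, 8)
```

So somewhere in the code I would expect a `num_heads`-style hyperparameter that controls this split, which is what I am failing to locate.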
Thank you.
Charlene