Understand the attention design #124

@HelloWorldLTY

Description

Hi, thanks for your great work. I intend to compute the attention scores between tokens, and here is my code:

import torch
from transformers import BertModel, BertConfig, DNATokenizer

dir_to_pretrained_model = "./6-new-12w-0/"

config = BertConfig.from_pretrained('../src/transformers/dnabert-config/bert-config-6/config.json')
tokenizer = DNATokenizer.from_pretrained('dna6')
print(config)

model = BertModel.from_pretrained(dir_to_pretrained_model, config=config).cuda()

sequence = "AATCTAATCTAGTCTAGCCTAGCA"
model_input = tokenizer.encode_plus(sequence, add_special_tokens=True, max_length=512)["input_ids"]

model_input = torch.tensor(model_input, dtype=torch.long).cuda()
model_input = model_input.unsqueeze(0)   # to generate a fake batch with batch size one

output = model(model_input)

# output[-1] holds the per-layer attention maps when config.output_attentions is True;
# the last entry is the final layer's map, shape (batch, num_heads, seq_len, seq_len)
print(output[-1][-1])
print(output[-1][-1].shape)

I think output[-1] contains the attention matrices, and I took out the last item, whose shape is [1, 12, 3, 3]. Does the 12 mean the 12 attention heads? And do the two 3s represent the tokens attending to each other? May I know how to compute the correct attention between these tokens? Just by averaging the attention across each layer? Thanks.
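For what it's worth, the indexing described above can be sketched against dummy tensors. When attentions are returned, each layer yields a (batch, num_heads, seq_len, seq_len) map, and a common (though lossy) way to get a single token-to-token map is to average over heads, and optionally over layers too. A minimal sketch with NumPy, using the shapes from the question (12 layers, 12 heads, 3 tokens) and random data in place of real model output:

```python
import numpy as np

num_layers, batch, num_heads, seq_len = 12, 1, 12, 3

# Simulated per-layer attention maps, mimicking what a BERT-style model returns
# when attention outputs are enabled: a tuple of num_layers arrays,
# each of shape (batch, num_heads, seq_len, seq_len).
rng = np.random.default_rng(0)
raw = [rng.random((batch, num_heads, seq_len, seq_len)) for _ in range(num_layers)]
# Normalize rows so each behaves like a softmax attention distribution.
attentions = [a / a.sum(axis=-1, keepdims=True) for a in raw]

# Average over the head axis to get one (seq_len, seq_len) map per layer...
per_layer = np.stack([a.mean(axis=1) for a in attentions])  # (layers, batch, seq, seq)
# ...and optionally over layers as well, for a single summary map.
mean_map = per_layer.mean(axis=0)[0]                        # (seq_len, seq_len)

print(per_layer.shape)  # (12, 1, 3, 3)
print(mean_map.shape)   # (3, 3)
# Averaging distributions is a convex combination, so rows still sum to 1.
print(np.allclose(mean_map.sum(axis=-1), 1.0))  # True
```

Note that averaging discards per-head structure; inspecting individual heads or layers separately is often more informative than a single averaged map.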
