
ChemGPT padding #2

@davidegraff

Hey @ncfrey!

Thanks for the great repo! I'm currently trying to use the ChemGPT model to generate molecular embeddings, but I'm running into issues with batched input. Because it's a GPT-style model, ChemGPT should pad on the left, yet the model pulled from the Hugging Face Hub loads with right-side padding:

from transformers import pipeline
featurizer = pipeline(
    "feature-extraction",
    model="ncfrey/ChemGPT-1.2B",
    framework="pt",
    return_tensors=True
)
print(featurizer.tokenizer.padding_side)
# "right"

When not batching inputs, the padding side (naturally) has no impact:

import torch

featurizer = pipeline(
    "feature-extraction", model="ncfrey/ChemGPT-1.2B", framework="pt", return_tensors=True,
)
featurizer.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
sfs = [
    '[C][C][N][=C][Branch1_1][O][N][C][C][C][C][C][C][C][Ring1][Branch1_3][S][C][Expl=Ring1][=N][C][Branch1_2][C][=O][O-expl]',
    '[C][N][Branch1_1][Branch2_2][C][C][C][C][C][C][Ring1][Branch1_2][S][Branch1_2][C][=O][Branch1_2][C][=O][C][=C][C][=C][Branch2_1][Ring1][Branch2_2][N][C][Branch1_2][C][=O][C][C][N][C][Branch1_2][C][=O][C@Hexpl][C][C][=C][C][C@@Hexpl][Ring1][Branch1_2][C][Ring1][Branch2_3][=O][C][=C][Ring2][Ring1][Branch1_2]',
    '[C][C@Hexpl][C][C][C@Hexpl][Branch2_1][Ring1][Branch1_3][NH+expl][C][C][C][C@Hexpl][Branch1_1][=N][C@Hexpl][Branch1_1][C][O][C][=N][C][=C][N][Ring1][Branch1_1][C][C][Ring1][=C][C][Ring2][Ring1][Ring1]',
    '[N][/C][Branch2_1][Ring1][Ring2][C][N][C][Branch1_2][C][=O][C@@Hexpl][C][C][=C][C][=C][C][=C][Ring1][Branch1_2][S][Ring1][Branch2_2][=N][\\O]'
]
featurizer.tokenizer.padding_side = 'right'
X_unpadded_r = torch.stack([H[0, -1, :] for H in featurizer(sfs)])
featurizer.tokenizer.padding_side = 'left'
X_unpadded_l = torch.stack([H[0, -1, :] for H in featurizer(sfs)])
print(torch.allclose(X_unpadded_r, X_unpadded_l))
# True

However, upon batching, I start to run into some trouble. As expected, left vs. right padding results in different outputs:

featurizer.tokenizer.padding_side = 'right'
X_padded_r = torch.stack([H[0, -1, :] for H in featurizer(sfs, batch_size=4)])
featurizer.tokenizer.padding_side = 'left'
X_padded_l = torch.stack([H[0, -1, :] for H in featurizer(sfs, batch_size=4)])
print(torch.allclose(X_padded_r, X_padded_l))
# False

But the confusing part is that, regardless of the padding side, the batched output differs from the unbatched output:

print(torch.allclose(X_unpadded_r, X_padded_l), torch.allclose(X_unpadded_r, X_padded_r))
# (False, False)
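For what it's worth, a padding-agnostic way to pull the final token's embedding is to index with the attention mask instead of hard-coding position -1. A minimal sketch in plain torch (the tensor shapes mimic the pipeline's [batch, seq_len, hidden] output; the masks here are made up purely for illustration):

```python
import torch

def last_token_embedding(hidden, attention_mask):
    """Select each sequence's final *non-pad* hidden state.

    hidden:         [batch, seq_len, hidden_dim]
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for pads
    Works for both left- and right-side padding.
    """
    seq_len = attention_mask.shape[1]
    positions = torch.arange(seq_len).unsqueeze(0)          # [1, seq_len]
    # Zero out pad positions, then take the largest remaining index,
    # i.e. the position of the last real token in each row.
    last_idx = (positions * attention_mask).argmax(dim=1)   # [batch]
    return hidden[torch.arange(hidden.shape[0]), last_idx]  # [batch, hidden_dim]

# toy example: batch of 2, seq_len 4, hidden_dim 3
hidden = torch.arange(24, dtype=torch.float).reshape(2, 4, 3)
mask_right = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])     # right padding
mask_left  = torch.tensor([[0, 1, 1, 1], [1, 1, 1, 1]])     # left padding
print(last_token_embedding(hidden, mask_right)[0])  # row 0, position 2
print(last_token_embedding(hidden, mask_left)[0])   # row 0, position 3
```

This sidesteps the padding side for the indexing itself, though it of course can't undo any positional-encoding differences the model sees between left- and right-padded batches.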

I've looked through this repo and can't find any mention of left-side padding during ChemGPT training. Additionally, uploading a model/tokenizer to the Hub preserves its padding-side setting. Given that (1) the Hugging Face default is right-side padding and (2) the model loads with right-side padding, is it correct to assume that ChemGPT was trained with right-side padding?

Thanks for the help and the great work!
