Hey @ncfrey!

Thanks for the great repo! I'm currently trying to use the ChemGPT model to generate molecular embeddings, but I'm having some issues with batched input. Because it's a GPT-style model, ChemGPT should pad on the left, but the model pulled from the Transformers hub seems to load with right-side padding:
```python
from transformers import pipeline

featurizer = pipeline(
    "feature-extraction",
    model="ncfrey/ChemGPT-1.2B",
    framework="pt",
    return_tensors=True,
)
print(featurizer.tokenizer.padding_side)
# "right"
```

When not batching inputs, the padding side (naturally) has no impact:
```python
import torch
from transformers import pipeline

featurizer = pipeline(
    "feature-extraction", model="ncfrey/ChemGPT-1.2B", framework="pt", return_tensors=True,
)
featurizer.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
sfs = [
    '[C][C][N][=C][Branch1_1][O][N][C][C][C][C][C][C][C][Ring1][Branch1_3][S][C][Expl=Ring1][=N][C][Branch1_2][C][=O][O-expl]',
    '[C][N][Branch1_1][Branch2_2][C][C][C][C][C][C][Ring1][Branch1_2][S][Branch1_2][C][=O][Branch1_2][C][=O][C][=C][C][=C][Branch2_1][Ring1][Branch2_2][N][C][Branch1_2][C][=O][C][C][N][C][Branch1_2][C][=O][C@Hexpl][C][C][=C][C][C@@Hexpl][Ring1][Branch1_2][C][Ring1][Branch2_3][=O][C][=C][Ring2][Ring1][Branch1_2]',
    '[C][C@Hexpl][C][C][C@Hexpl][Branch2_1][Ring1][Branch1_3][NH+expl][C][C][C][C@Hexpl][Branch1_1][=N][C@Hexpl][Branch1_1][C][O][C][=N][C][=C][N][Ring1][Branch1_1][C][C][Ring1][=C][C][Ring2][Ring1][Ring1]',
    '[N][/C][Branch2_1][Ring1][Ring2][C][N][C][Branch1_2][C][=O][C@@Hexpl][C][C][=C][C][=C][C][=C][Ring1][Branch1_2][S][Ring1][Branch2_2][=N][\\O]'
]
featurizer.tokenizer.padding_side = 'right'
X_unpadded_r = torch.stack([H[0, -1, :] for H in featurizer(sfs)])
featurizer.tokenizer.padding_side = 'left'
X_unpadded_l = torch.stack([H[0, -1, :] for H in featurizer(sfs)])
print(torch.allclose(X_unpadded_r, X_unpadded_l))
# True
```

However, upon batching, I start to run into some trouble. As expected, left vs. right padding results in different outputs:
```python
featurizer.tokenizer.padding_side = 'right'
X_padded_r = torch.stack([H[0, -1, :] for H in featurizer(sfs, batch_size=4)])
featurizer.tokenizer.padding_side = 'left'
X_padded_l = torch.stack([H[0, -1, :] for H in featurizer(sfs, batch_size=4)])
print(torch.allclose(X_padded_r, X_padded_l))
# False
```

But the confusing thing is that, regardless of the padding side, the batched output is different from the unpadded output:
```python
print(torch.allclose(X_unpadded_r, X_padded_l), torch.allclose(X_unpadded_r, X_padded_r))
# (False, False)
```

I've looked through this repo and can't find any mention of left-side padding for ChemGPT training. Additionally, uploading a model/tokenizer to the hub preserves the padding-side information. Given that (1) the default in Hugging Face is right-side padding and (2) the model loads with right-side padding, is it correct to assume that ChemGPT was trained with right-side padding?
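For what it's worth, one way to sidestep the padding-side question when extracting last-token embeddings from a batch is to index each sequence's last *real* token via the attention mask, rather than always taking position `-1`. A minimal sketch with toy tensors (`hidden` and `mask` are hypothetical stand-ins for the model's last hidden state and the tokenizer's attention mask, not ChemGPT output):

```python
import torch

# Toy hidden states: (batch=2, seq_len=4, dim=3).
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)

# Attention mask for a right-padded batch: 1 = real token, 0 = pad.
mask = torch.tensor([[1, 1, 1, 0],   # sequence 1: 3 real tokens, 1 pad
                     [1, 1, 0, 0]])  # sequence 2: 2 real tokens, 2 pads

# Index of the last real token in each sequence.
last_idx = mask.sum(dim=1) - 1

# Gather the hidden state at that index for each batch element.
emb = hidden[torch.arange(hidden.size(0)), last_idx]
print(last_idx.tolist())  # [2, 1]
print(emb.shape)          # torch.Size([2, 3])
```

With left-side padding, position `-1` already is the last real token for every sequence, which is why left padding is the usual recommendation for GPT-style feature extraction; the mask-based gather above gives the same per-sequence embedding either way.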
Thanks for the help and the great work!