
ChemGPT padding #2

@davidegraff

Hey @ncfrey!

Thanks for the great repo! I'm currently trying to use the ChemGPT model to generate molecular embeddings, but I'm running into issues with batched input. Because it's a GPT-style model, ChemGPT should pad on the left, yet the model pulled from the Hugging Face Hub loads with right-side padding:

from transformers import pipeline
featurizer = pipeline(
    "feature-extraction",
    model="ncfrey/ChemGPT-1.2B",
    framework="pt",
    return_tensors=True
)
print(featurizer.tokenizer.padding_side)
# "right"

When not batching inputs, the padding side (naturally) has no impact:

import torch

featurizer = pipeline(
    "feature-extraction", model="ncfrey/ChemGPT-1.2B", framework="pt", return_tensors=True,
)
featurizer.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
sfs = [
    '[C][C][N][=C][Branch1_1][O][N][C][C][C][C][C][C][C][Ring1][Branch1_3][S][C][Expl=Ring1][=N][C][Branch1_2][C][=O][O-expl]',
    '[C][N][Branch1_1][Branch2_2][C][C][C][C][C][C][Ring1][Branch1_2][S][Branch1_2][C][=O][Branch1_2][C][=O][C][=C][C][=C][Branch2_1][Ring1][Branch2_2][N][C][Branch1_2][C][=O][C][C][N][C][Branch1_2][C][=O][C@Hexpl][C][C][=C][C][C@@Hexpl][Ring1][Branch1_2][C][Ring1][Branch2_3][=O][C][=C][Ring2][Ring1][Branch1_2]',
    '[C][C@Hexpl][C][C][C@Hexpl][Branch2_1][Ring1][Branch1_3][NH+expl][C][C][C][C@Hexpl][Branch1_1][=N][C@Hexpl][Branch1_1][C][O][C][=N][C][=C][N][Ring1][Branch1_1][C][C][Ring1][=C][C][Ring2][Ring1][Ring1]',
    '[N][/C][Branch2_1][Ring1][Ring2][C][N][C][Branch1_2][C][=O][C@@Hexpl][C][C][=C][C][=C][C][=C][Ring1][Branch1_2][S][Ring1][Branch2_2][=N][\\O]'
]
featurizer.tokenizer.padding_side = 'right'
X_unpadded_r = torch.stack([H[0, -1, :] for H in featurizer(sfs)])
featurizer.tokenizer.padding_side = 'left'
X_unpadded_l = torch.stack([H[0, -1, :] for H in featurizer(sfs)])
print(torch.allclose(X_unpadded_r, X_unpadded_l))
# True

However, upon batching, I start to run into some trouble. As expected, left vs. right padding results in different outputs:

featurizer.tokenizer.padding_side = 'right'
X_padded_r = torch.stack([H[0, -1, :] for H in featurizer(sfs, batch_size=4)])
featurizer.tokenizer.padding_side = 'left'
X_padded_l = torch.stack([H[0, -1, :] for H in featurizer(sfs, batch_size=4)])
print(torch.allclose(X_padded_r, X_padded_l))
# False

But the confusing part is that, regardless of the padding side, the batched output differs from the unbatched output:

print(torch.allclose(X_unpadded_r, X_padded_l), torch.allclose(X_unpadded_r, X_padded_r))
# (False, False)
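For what it's worth, a padding-agnostic way to pull the final token's embedding is to index with the attention mask instead of hard-coding position -1. A minimal sketch in plain torch (the tensor shapes mimic the pipeline's [batch, seq_len, hidden] output; the masks here are made up purely for illustration):

```python
import torch

def last_token_embedding(hidden, attention_mask):
    """Select each sequence's final *non-pad* hidden state.

    hidden:         [batch, seq_len, hidden_dim]
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for pads
    Works for both left- and right-side padding.
    """
    seq_len = attention_mask.shape[1]
    positions = torch.arange(seq_len).unsqueeze(0)          # [1, seq_len]
    # Zero out pad positions, then take the largest remaining index,
    # i.e. the position of the last real token in each row.
    last_idx = (positions * attention_mask).argmax(dim=1)   # [batch]
    return hidden[torch.arange(hidden.shape[0]), last_idx]  # [batch, hidden_dim]

# toy example: batch of 2, seq_len 4, hidden_dim 3
hidden = torch.arange(24, dtype=torch.float).reshape(2, 4, 3)
mask_right = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])     # right padding
mask_left  = torch.tensor([[0, 1, 1, 1], [1, 1, 1, 1]])     # left padding
print(last_token_embedding(hidden, mask_right)[0])  # row 0, position 2
print(last_token_embedding(hidden, mask_left)[0])   # row 0, position 3
```

This sidesteps the padding side for the indexing itself, though it of course can't undo any positional-encoding differences the model sees between left- and right-padded batches.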

I've looked through this repo and can't find any mention of left-side padding during ChemGPT training. Additionally, uploading a model/tokenizer to the Hub preserves its padding-side setting. Given that (1) the Hugging Face default is right-side padding and (2) the model loads with right-side padding, is it correct to assume that ChemGPT was trained with right-side padding?

Thanks for the help and the great work!
