The `read_data` method in `ChemDataReader` currently parses any string that contains at least one valid token, even if the rest of the string (or the string as a whole) is invalid. As a result, invalid strings are fed to the model.
```python
# ... (surrounding method body elided) ...
try:
    return [self._get_token_index(v[1]) for v in _tokenize(raw_data)]
except ValueError as e:
    print(f"could not process {raw_data}")
    print(f"\t{e}")
    return None
```
Because of this logic, an otherwise invalid string that contains even a single valid token is accepted and fed to the model for classification. For example, the string `ADASDAD` is parsed because it contains the valid token `S`, so the entire string is represented as `[64]`, where 64 is the index of the `S` token.
```python
>>> [print(v) for v in _tokenize("ADASDAD")]
(<TokenType.ATOM: 1>, 'S')
>>> self._get_token_index('S')
64
```
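The failure mode can be reproduced in isolation. The sketch below uses a hypothetical regex-based tokenizer (`tokenize_lenient` and `TOKEN_RE` are illustrative stand-ins, not the real `_tokenize` implementation) to show how a scanner that silently skips unmatched characters extracts the lone valid token from an invalid string:

```python
import re

# Hypothetical token pattern: a small subset of atom symbols, for illustration only.
TOKEN_RE = re.compile(r"Cl|Br|[BCNOSPFI]")

def tokenize_lenient(raw: str) -> list[str]:
    # finditer simply skips characters that match no token,
    # which mirrors the lenient behavior described above.
    return [m.group(0) for m in TOKEN_RE.finditer(raw)]

print(tokenize_lenient("ADASDAD"))  # ['S'] -- invalid input still yields a token
```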
This should be avoided: the reader should return `None` for such strings. Hence, this logic must be improved.