The `read_data` method in `ChemDataReader` currently parses any string that contains at least one valid token, even if the rest of the string (or the string as a whole) is invalid. As a result, invalid strings are fed to the model.
```python
# ... (surrounding method body elided) ...
try:
    return [self._get_token_index(v[1]) for v in _tokenize(raw_data)]
except ValueError as e:
    print(f"could not process {raw_data}")
    print(f"\t{e}")
    return None
```
Because of this logic, an otherwise invalid string that contains even a single valid token is accepted and fed to the model for classification. For example, the string `ADASDAD` is parsed because it contains the valid token `S`, so the entire string is represented as `[64]`, where 64 is the index of the `S` token.
```python
>>> [print(v) for v in _tokenize("ADASDAD")]
(<TokenType.ATOM: 1>, 'S')
>>> self._get_token_index('S')
64
```
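The failure mode can be reproduced in isolation. The sketch below uses a hypothetical regex-based tokenizer (`tokenize_lenient` and `TOKEN_RE` are illustrative stand-ins, not the real `_tokenize` implementation) to show how a scanner that silently skips unmatched characters extracts the lone valid token from an invalid string:

```python
import re

# Hypothetical token pattern: a small subset of atom symbols, for illustration only.
TOKEN_RE = re.compile(r"Cl|Br|[BCNOSPFI]")

def tokenize_lenient(raw: str) -> list[str]:
    # finditer simply skips characters that match no token,
    # which mirrors the lenient behavior described above.
    return [m.group(0) for m in TOKEN_RE.finditer(raw)]

print(tokenize_lenient("ADASDAD"))  # ['S'] -- invalid input still yields a token
```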
This should be avoided: the reader should return `None` for such strings. Hence, this logic must be improved.