Buffer overflow error (C tokenizer error) #5

@karim-sharkawy

Description

I ran into an issue when loading some of the Reddit files, and especially when breaking up the Twitter files. Loading fails with this error:

Error loading [file_name]: Error tokenizing data.
C error: Buffer overflow caught - possible malformed input file.

It looks like the file might contain malformed rows or unusually long lines that the parser can't handle. I'm assuming the cause is either corrupted or malformed rows (as the error message itself suggests) or an encoding issue.

Edit: Long lines can't be the cause, because this also happens with the Twitter datasets when they're broken up, and most tweets are 140 characters or fewer.
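
To narrow down which rows are malformed, here is a rough diagnostic sketch. It assumes the files are delimited text (CSV-like) and that the usual suspects are stray carriage returns, NUL bytes, or unbalanced double quotes; none of that is confirmed for these datasets. The function name `find_suspect_lines` is hypothetical and not part of this project.

```python
def find_suspect_lines(text):
    """Return (line_number, reason) pairs for lines that may confuse a CSV tokenizer.

    Heuristic only: a quoted field that legitimately spans several lines
    will also be reported as having an unbalanced quote.
    """
    suspects = []
    # Split on "\n" only, so a lone "\r" inside a line stays visible.
    for number, line in enumerate(text.split("\n"), start=1):
        # Tolerate Windows-style "\r\n" endings by dropping one trailing "\r".
        body = line[:-1] if line.endswith("\r") else line
        if "\r" in body:
            suspects.append((number, "stray carriage return"))
        if "\x00" in body:
            suspects.append((number, "NUL byte"))
        if body.count('"') % 2 == 1:
            suspects.append((number, "unbalanced double quote"))
    return suspects


if __name__ == "__main__":
    # Made-up sample data, not taken from the Reddit or Twitter files.
    sample = 'id,text\n1,"fine"\n2,"never closed\n3,line one\rline two\n'
    for number, reason in find_suspect_lines(sample):
        print(number, reason)
```

To run it on a real file, read the text with `open(path, newline="", encoding="utf-8", errors="replace")` so Python does not translate line endings before the scan. If the loader turns out to be pandas (an assumption based on the wording of the error), `read_csv` options such as `engine="python"`, `on_bad_lines="skip"`, or `lineterminator="\n"` may be worth trying once the offending lines are known.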

Metadata

Labels

bug: Something isn't working
