CodeBert-Preprocessor fails on unicode references #3

@lapplislazuli

Description

The CodeBert-Preprocessor fails to preprocess Java files that contain character sequences starting with "\",
such as "\u00".
The resulting JSONL contains an unescaped "\" and cannot be parsed.

To Reproduce

  1. Change into the CodeBert preprocessing folder
  2. Add the content of error.txt to the example Java file
  3. Run the preprocessing on the example Java file using docker-compose
  4. Inspect altered_java.jsonl for \u00 sequences
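
Step 4 can also be checked automatically. The following is a minimal sketch, not part of the preprocessor: the helper name `find_invalid_jsonl_lines` is hypothetical, and it only assumes that each non-empty line of the output is supposed to be one JSON document.

```python
import json


def find_invalid_jsonl_lines(lines):
    """Return (line_number, error_message) for every line that is not valid JSON.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, err.msg))
    return problems
```

It could be run against the output with `with open("altered_java.jsonl", encoding="utf-8") as fh: print(find_invalid_jsonl_lines(fh))`; any reported line is one that a JSON parser rejects.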

Expected behavior

The backslash should be properly escaped, so that \u00 is written as \\u00 in the JSON output.
In any case, the resulting JSON must be valid.
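
One way to get this behavior in Python is to let the standard `json` module serialize each record instead of assembling the line by hand. This is a sketch under assumptions, not the preprocessor's actual code: the function name `to_jsonl_line` and the field name `"code"` are hypothetical.

```python
import json


def to_jsonl_line(code: str) -> str:
    """Serialize one snippet as a single JSONL line.

    json.dumps escapes backslashes and quotes, so a literal \\u00 in the
    Java source is emitted as \\\\u00 in the JSON text.
    "code" is a hypothetical field name chosen for this sketch.
    """
    return json.dumps({"code": code})


# A Java fragment containing a backslash followed by "u00" (raw string, so
# Python keeps the backslash as-is).
java_snippet = r'String s = "\u00";'
line = to_jsonl_line(java_snippet)

# The line parses again and yields the original text unchanged.
assert json.loads(line)["code"] == java_snippet
```

The design choice here is simply to leave escaping to the serializer, which avoids having to special-case individual characters.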

Additional context

This was needed for the GridExperiment and is currently worked around by removing the three data points that contain a \u from the test data.

Metadata

Labels: Bug, Python
