CodeBert-Preprocessor fails on unicode references

The CodeBert-Preprocessor fails to preprocess Java files that contain backslash sequences starting with "\", 
such as "\u00". 
The resulting JSONL contains an unescaped "\" and cannot be parsed.

**To Reproduce**
1. Change into the CodeBert preprocessing folder
2. Add the content of [error.txt](https://github.com/ciselab/Lampion/files/5841001/error.txt) to the example Java file
3. Run the preprocessing on the example Java file using docker-compose
4. Inspect altered_java.jsonl for \u00 sequences
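
Step 4 can also be done programmatically instead of by eye. The following is a minimal sketch, not part of the preprocessor: the helper name `find_invalid_jsonl_lines` is hypothetical, and it assumes the output file holds one JSON object per line.

```python
import json


def find_invalid_jsonl_lines(lines):
    """Return (line_number, error_message) for every line that is not valid JSON.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    for example an open file handle.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, err.msg))
    return problems
```

Passing it the lines of altered_java.jsonl (opened with `encoding="utf-8"`) should list the entries that contain an unescaped "\u00".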

**Expected behavior**

The sequence should be properly escaped as \\u00. 
In any case, the resulting JSON must be valid.
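
As an illustration of the expected escaping, here is a minimal sketch that writes a record through a JSON serializer rather than by string concatenation. How the preprocessor actually builds its output is not shown in this issue, so the function `to_jsonl_line` and the field name `code` are assumptions made only for this example.

```python
import json


def to_jsonl_line(record):
    """Serialize one record as a single JSONL line.

    Hypothetical helper: json.dumps escapes backslashes and quotes itself,
    so a literal "\\u" in the Java source is written as "\\\\u" in the file.
    """
    return json.dumps(record)


# Raw string: the Java snippet contains a literal backslash followed by "u00e9".
java_source = r"char c = '\u00e9';"

# "code" is an assumed field name, not taken from the preprocessor.
line = to_jsonl_line({"code": java_source})

# The line parses again and yields the original source unchanged.
restored = json.loads(line)["code"]
```

With this approach the backslash is doubled in the serialized line, and parsing the line returns the Java source unchanged.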

**Additional context**

This was needed for the GridExperiment and is currently worked around by removing the 3 datapoints that contain a \u from the test data.

