Duplicate example in train and test set #6

@BramVanroy

Hi

I was doing some sanity checking and found a duplicate item in the train and test set:

  • DBRD/train/neg/2074_2.txt
  • DBRD/test/neg/20602_2.txt

Content-wise they are identical; the only difference is that the file in the train set has more newlines. Since newlines are filtered out during model training anyway (or at least I do that, replacing them with single spaces), the two files are effectively the same example.
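A minimal sketch of how such a check could be run, not necessarily the exact script I used. It assumes the `train/` and `test/` folders sit directly under the dataset root, that the files are UTF-8 encoded, and that "identical" means equal after collapsing all whitespace to single spaces. The names `normalize` and `find_cross_split_duplicates` are hypothetical helpers introduced for illustration.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path


def normalize(text: str) -> str:
    """Collapse every run of whitespace (newlines included) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def _digest(path: Path) -> str:
    # Assumption: files are UTF-8 encoded.
    text = path.read_text(encoding="utf-8")
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_cross_split_duplicates(root: Path) -> list:
    """Return (train_file, test_file) pairs whose normalized text is identical."""
    train_by_digest = defaultdict(list)
    for path in sorted((root / "train").rglob("*.txt")):
        train_by_digest[_digest(path)].append(path)

    pairs = []
    for path in sorted((root / "test").rglob("*.txt")):
        for train_path in train_by_digest.get(_digest(path), []):
            pairs.append(
                (
                    train_path.relative_to(root).as_posix(),
                    path.relative_to(root).as_posix(),
                )
            )
    return pairs
```

Calling `find_cross_split_duplicates(Path("DBRD"))` would then list every train/test pair that collapses to the same text. Hashing the normalized text keeps memory use low, but it only catches exact matches after whitespace normalization, not near-duplicates.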

This seems important enough to warrant a revised version 3.1 with the duplicate removed, as it impacts model training. Together with language filtering (#2), this might even warrant a v4. Alternatively, I can make a fork and rework the whole thing, with acknowledgments to this repo of course.
