Skip to content

Exact dedup details #115

@jordane95

Description

@jordane95

Hi, from your blog post it seems that the redpajama-v2 has performed an exact dedup for all dumps.

My question is: did you perform dedup for each dump individually or, is it done across different dumps? In the latter case, wouldn't there be a large memory-overhead to load all previous text hashes in the memory? Thanks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions