Open
Description
Hi, from your blog post it seems that exact dedup was performed for all dumps in RedPajama-V2.
My question is: did you perform dedup within each dump individually, or across different dumps? In the latter case, wouldn't there be a large memory overhead from loading all previous text hashes into memory? Thanks.
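To make the memory concern concrete, here is a rough sketch of one common way to keep cross-dump exact dedup within a fixed memory budget: a Bloom filter instead of a growing set of hashes. This is only an illustration of the trade-off, not a claim about how the RedPajama-V2 pipeline works; the names `BloomFilter`, `add_if_new`, and `dedup_across_dumps` are hypothetical, and the toy "dumps" are just lists of strings.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size bit array: memory is set up front and does not grow with documents seen."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, text: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, text: str) -> bool:
        """Record text; return True if it was (probably) not seen before."""
        is_new = False
        for pos in self._positions(text):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def dedup_across_dumps(dumps, capacity, error_rate=0.01):
    """Keep the first occurrence of each document across all dumps, in order."""
    seen = BloomFilter(capacity, error_rate)  # one filter shared by every dump
    return [[doc for doc in dump if seen.add_if_new(doc)] for dump in dumps]


if __name__ == "__main__":
    toy_dumps = [["a", "b", "a"], ["b", "c"]]
    print(dedup_across_dumps(toy_dumps, capacity=1000))
```

With this sizing the filter needs roughly 9.6 bits per expected document at a 1% false-positive rate, regardless of document length. The cost is that a false positive drops a small fraction of documents that were actually unique, whereas a plain set of hashes is exact but grows with every dump processed.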