Hello,
I’m trying to replicate the 400M-1x experiment using the same dataset on HuggingFace (https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x). My end-to-end retention is 1.64% (1.49% with the exact 10% cut) vs. the 1.4% reported in the paper.
- Dedup (BFF) step
  My retention after dedup is 6.2% vs. your reported 4.88%.
  I’ve tried `expected-ngram-count` values of 250B, 100B, and 65B, but the results are similar. Could you please share the exact BFF parameters you used?
- FastText filter
  I calculated a threshold so that the filter keeps the top 10% of docs. Is that correct? With it, my overall retention goes from 1.64% to 1.49%.
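For reference, this is roughly how I pick the cutoff (a sketch only; `scores` stands in for the per-document fastText classifier scores, random here for illustration):

```python
import random

# Hypothetical per-document fastText "high quality" scores; random stand-ins here.
random.seed(0)
scores = [random.random() for _ in range(100_000)]

# To keep the top 10% of docs, use the 90th-percentile score as the cutoff.
cutoff = sorted(scores)[int(0.90 * len(scores))]
kept = [s for s in scores if s >= cutoff]

retention = len(kept) / len(scores)
print(f"cutoff={cutoff:.4f} retention={retention:.2%}")
```

In other words, I threshold at the 90th percentile of the score distribution rather than using a fixed score value.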
- Token count (8.2B)
How was the 8.2B token count obtained? Is it:
(a) the actual token count after the full pipeline, or
(b) derived from a scaling law for the 412M model?
Any clarification on these points would be very helpful. Thank you.
Attached are the current breakdowns, with and without the exact 10% threshold:
With fasttext_10pct (exact 10% cut):

| stage | input | output | ret% |
|---|---:|---:|---:|
| Step3a lang_en | 439,021,781 | 181,436,613 | 41.33% |
| Step3b heuristic | 181,436,613 | 86,895,922 | 47.89% |
| Step4 dedup (BFF) | 86,895,922 | 65,463,613 | 75.34% |
| Step5 fasttext_10pct | 65,463,613 | 6,546,370 | 10.00% |
| END-TO-END | 439,021,781 | 6,546,370 | 1.49% |

With fasttext (threshold as calculated):

| stage | input | output | ret% |
|---|---:|---:|---:|
| Step3a lang_en | 439,021,781 | 181,436,613 | 41.33% |
| Step3b heuristic | 181,436,613 | 86,895,922 | 47.89% |
| Step4 dedup (BFF) | 86,895,922 | 65,463,613 | 75.34% |
| Step5 fasttext | 65,463,613 | 7,192,242 | 10.99% |
| END-TO-END | 439,021,781 | 7,192,242 | 1.64% |
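As a sanity check, the end-to-end retention above is just the product of the per-stage retentions (doc counts taken from the 10%-cut breakdown):

```python
# Per-stage (input, output) doc counts from the breakdown above (10%-cut run).
stages = [
    ("lang_en",        439_021_781, 181_436_613),
    ("heuristic",      181_436_613,  86_895_922),
    ("dedup_bff",       86_895_922,  65_463_613),
    ("fasttext_10pct",  65_463_613,   6_546_370),
]

end_to_end = 1.0
for name, n_in, n_out in stages:
    ret = n_out / n_in
    end_to_end *= ret
    print(f"{name:15s} {ret:.2%}")

print(f"end-to-end: {end_to_end:.2%}")  # equals 6,546,370 / 439,021,781, i.e. 1.49%
```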