Hello,
I’m trying to replicate the 400M-1x experiment using the same dataset on HuggingFace (https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x). My end-to-end retention is 1.64% (1.49% with the exact 10% cut) vs. the 1.4% reported in the paper.
- Dedup (BFF) step
  My retention after dedup is 6.2% vs. your reported 4.88%.
  I’ve tried `expected-ngram-count` values of 250B, 100B, and 65B, but the results are similar. Could you please share the exact BFF parameters you used?
- FastText filter
  I calculated a threshold so that the filter keeps the top 10% of docs. Is that correct? With it, my overall retention goes from 1.64% to 1.49%.
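For reference, this is roughly how I pick the cutoff (a sketch only; `scores` stands in for the per-document fastText classifier scores, random here for illustration):

```python
import random

# Hypothetical per-document fastText "high quality" scores; random stand-ins here.
random.seed(0)
scores = [random.random() for _ in range(100_000)]

# To keep the top 10% of docs, use the 90th-percentile score as the cutoff.
cutoff = sorted(scores)[int(0.90 * len(scores))]
kept = [s for s in scores if s >= cutoff]

retention = len(kept) / len(scores)
print(f"cutoff={cutoff:.4f} retention={retention:.2%}")
```

In other words, I threshold at the 90th percentile of the score distribution rather than using a fixed score value.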
- Token count (8.2B)
How was the 8.2B token count obtained? Is it:
(a) the actual token count after the full pipeline, or
(b) derived from a scaling law for the 412M model?
Any clarification on these points would be very helpful. Thank you.
Attached are the current breakdowns, with and without the exact 10% threshold:
With fasttext_10pct (exact 10% cut):

| stage | input | output | ret% |
|---|---:|---:|---:|
| Step3a lang_en | 439,021,781 | 181,436,613 | 41.33% |
| Step3b heuristic | 181,436,613 | 86,895,922 | 47.89% |
| Step4 dedup (BFF) | 86,895,922 | 65,463,613 | 75.34% |
| Step5 fasttext_10pct | 65,463,613 | 6,546,370 | 10.00% |
| END-TO-END | 439,021,781 | 6,546,370 | 1.49% |

With fasttext (threshold as calculated):

| stage | input | output | ret% |
|---|---:|---:|---:|
| Step3a lang_en | 439,021,781 | 181,436,613 | 41.33% |
| Step3b heuristic | 181,436,613 | 86,895,922 | 47.89% |
| Step4 dedup (BFF) | 86,895,922 | 65,463,613 | 75.34% |
| Step5 fasttext | 65,463,613 | 7,192,242 | 10.99% |
| END-TO-END | 439,021,781 | 7,192,242 | 1.64% |
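As a sanity check, the end-to-end retention above is just the product of the per-stage retentions (doc counts taken from the 10%-cut breakdown):

```python
# Per-stage (input, output) doc counts from the breakdown above (10%-cut run).
stages = [
    ("lang_en",        439_021_781, 181_436_613),
    ("heuristic",      181_436_613,  86_895_922),
    ("dedup_bff",       86_895_922,  65_463_613),
    ("fasttext_10pct",  65_463_613,   6_546_370),
]

end_to_end = 1.0
for name, n_in, n_out in stages:
    ret = n_out / n_in
    end_to_end *= ret
    print(f"{name:15s} {ret:.2%}")

print(f"end-to-end: {end_to_end:.2%}")  # equals 6,546,370 / 439,021,781, i.e. 1.49%
```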