
Problems when replicating the 400M-1x experiment #120

@JerryYan123

Description


Hello,

I’m trying to replicate the 400M-1x experiment using the same dataset from HuggingFace (https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x). My end-to-end retention is 1.64% (1.49% after recalculating the fastText threshold) vs. the 1.4% reported in the paper.

  1. Dedup (BFF) step
    My retention after dedup is 6.2% vs. your reported 4.88%. I’ve tried expected-ngram-count values of 250B, 100B, and 65B, but the results are similar. Could you please share the exact BFF parameters you used?
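    For context on why I expected this setting to matter: under the textbook Bloom filter sizing formula, the expected element count directly determines the filter size for a given false-positive rate. A minimal sketch (the 1% false-positive rate is an assumption on my part, and BFF’s actual internal sizing may differ):

    ```python
    import math

    def bloom_filter_bits(expected_ngrams: int, fp_rate: float) -> int:
        """Textbook optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits."""
        return math.ceil(-expected_ngrams * math.log(fp_rate) / math.log(2) ** 2)

    # The three expected-ngram-count values tried above, at an assumed 1% FP rate.
    for n in (250_000_000_000, 100_000_000_000, 65_000_000_000):
        gib = bloom_filter_bits(n, 0.01) / 8 / 2**30
        print(f"{n:,} ngrams -> ~{gib:,.0f} GiB filter")
    ```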

  2. FastText filter
    I calculated a score threshold so that the filter keeps the top 10% of documents. Is that correct? This changes my overall retention from 1.64% to 1.49%.
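    To be concrete, this is the calibration I mean; a minimal sketch (function name and toy scores are illustrative, and ties at the cutoff would need a tie-breaking rule on real data):

    ```python
    def top_k_threshold(scores, keep_frac=0.10):
        """Score cutoff that keeps the highest-scoring keep_frac of documents."""
        ranked = sorted(scores, reverse=True)
        k = max(1, int(len(ranked) * keep_frac))
        return ranked[k - 1]

    # Toy example: 100 documents with scores 0..99; keeping 10% means
    # retaining every document scoring at or above the cutoff.
    scores = list(range(100))
    cutoff = top_k_threshold(scores, 0.10)
    kept = [s for s in scores if s >= cutoff]
    print(cutoff, len(kept))  # 90 10
    ```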

  3. Token count (8.2B)
    How was the 8.2B token count obtained? Is it:
    (a) the actual token count after the full pipeline, or
    (b) derived from a scaling law for the 412M model?

Any clarification on these points would be very helpful. Thank you.

Attached are the current breakdowns, with and without the recalculated 10% threshold:
With the recalculated threshold (keep 10%):

| Stage | Filter | Input docs | Output docs | Retention |
| --- | --- | --- | --- | --- |
| 3a | lang_en | 439,021,781 | 181,436,613 | 41.33% |
| 3b | heuristic | 181,436,613 | 86,895,922 | 47.89% |
| 4 | dedup (BFF) | 86,895,922 | 65,463,613 | 75.34% |
| 5 | fasttext_10pct | 65,463,613 | 6,546,370 | 10.00% |
| End-to-end | | 439,021,781 | 6,546,370 | 1.49% |

With the original threshold:

| Stage | Filter | Input docs | Output docs | Retention |
| --- | --- | --- | --- | --- |
| 3a | lang_en | 439,021,781 | 181,436,613 | 41.33% |
| 3b | heuristic | 181,436,613 | 86,895,922 | 47.89% |
| 4 | dedup (BFF) | 86,895,922 | 65,463,613 | 75.34% |
| 5 | fasttext | 65,463,613 | 7,192,242 | 10.99% |
| End-to-end | | 439,021,781 | 7,192,242 | 1.64% |
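As a sanity check, the per-stage retentions in each breakdown multiply out to the end-to-end figure; for the second breakdown:

```python
# Document counts after each pipeline stage (pool -> lang_en -> heuristic
# -> dedup -> fasttext), from the second breakdown above.
counts = [439_021_781, 181_436_613, 86_895_922, 65_463_613, 7_192_242]

stage_ret = [b / a for a, b in zip(counts, counts[1:])]
end_to_end = counts[-1] / counts[0]

print([f"{r:.2%}" for r in stage_ret])  # ['41.33%', '47.89%', '75.34%', '10.99%']
print(f"{end_to_end:.2%}")              # 1.64%
```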
