Question about data generation process for olmo-mix-1124. #859

@WwwwYz666

❓ The question

Hi allenai team!

I was exploring the allenai/olmo-mix-1124 dataset and noticed references to preprocessing scripts in the repository. Specifically, I found prepare_memmap_dataset.py, which seems to be related to data preparation.

Could you kindly clarify:

1. How was the olmo-mix-1124 dataset generated?
2. Is prepare_memmap_dataset.py the primary script used? If so, could you provide a basic reference command for generating a similar dataset?
3. If other scripts/configurations were involved, would you mind sharing any pointers?

Thank you for your work and guidance!
