❓ The question
Hi allenai team!
I was exploring the allenai/olmo-mix-1124 dataset and noticed references to preprocessing scripts in the repository. Specifically, I found prepare_memmap_dataset.py, which seems to be related to data preparation.
Could you kindly clarify:
1. How was the olmo-mix-1124 dataset generated?
2. Is prepare_memmap_dataset.py the primary script used? If so, could you provide a basic reference command for generating a similar dataset?
3. If other scripts/configurations were involved, would you mind sharing any pointers?
Thank you for your work and guidance!