Hi! I'm currently trying to tokenize the processed 400m-1x data, but I'm running into object store memory issues: the tokenize_shuffle.py script seems to be trying to hold and tokenize the entire processed dataset in memory instead of periodically flushing results to disk. For context, I don't have S3 access, so I modified the script slightly to save to a local disk. I tried enabling --no_shuffle, in case shuffling was what prevented the periodic writes, and also played around with force_num_cores, num_writer_per_node, and allow_imbalanced_write, but to little effect.
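For reference, after my local-disk modification the invocation I'm using looks roughly like this (paths and the input/output argument names are placeholders for how I'm calling it locally, the numeric values are just one of the combinations I tried, and the remaining flags are the ones mentioned above):

```bash
python tokenize_shuffle.py \
  --input /local/path/to/processed_400m_1x \
  --output /local/path/to/tokenized \
  --no_shuffle \
  --force_num_cores 16 \
  --num_writer_per_node 1 \
  --allow_imbalanced_write
```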
Are there any other tips for managing memory usage with the tokenize_shuffle script on the 400m-1x data, or is it by design that the output is only written to disk at the end? Thanks!