tokenization memory usage #88

@brian-ham

Hi! I'm currently trying to tokenize the processed 400m-1x data, but I'm running into object store memory issues: the tokenize_shuffle.py script seems to hold the entire processed dataset in memory rather than periodically writing it to disk. For context, I don't have S3 access, so I modified the script slightly to write to a local disk instead. I tried enabling --no_shuffle, in case shuffling was what prevented periodic writes, and also experimented with force_num_cores, num_writer_per_node, and allow_imbalanced_write, to little effect.

Are there any other tips for managing memory usage with the tokenize_shuffle script on the 400m-1x data, or is it by design that the write to disk happens only at the end? Thanks!
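
For reference, here is a minimal sketch of the kind of periodic flushing I'm asking about. This is not the tokenize_shuffle.py implementation, and all names, paths, and the shard size are hypothetical; it just shows tokenized sequences being buffered and written out as shards so the full dataset never has to sit in memory at once.

```python
# Minimal sketch of periodic shard flushing; NOT the actual tokenize_shuffle.py
# logic. Every name, path, and constant here is hypothetical.
import json
import os

SHARD_SIZE = 100_000                          # sequences per shard (arbitrary)
OUTPUT_DIR = "/mnt/local/tokenized/400m-1x"   # local-disk stand-in for an S3 output


def flush_shard(buffer, shard_idx):
    """Write one shard of tokenized sequences to local disk and clear the buffer."""
    path = os.path.join(OUTPUT_DIR, f"shard_{shard_idx:05d}.jsonl")
    with open(path, "w") as f:
        for tokens in buffer:
            f.write(json.dumps({"tokens": tokens}) + "\n")
    buffer.clear()


def tokenize_to_shards(documents, tokenizer):
    """Tokenize documents one at a time, flushing to disk every SHARD_SIZE sequences."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    buffer, shard_idx = [], 0
    for doc in documents:
        buffer.append(tokenizer.encode(doc))
        if len(buffer) >= SHARD_SIZE:
            flush_shard(buffer, shard_idx)
            shard_idx += 1
    if buffer:  # write whatever is left over
        flush_shard(buffer, shard_idx)
```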
