Big OpenLM/DCLM <-> AI2 PR # 1 #12

Open
revbucket wants to merge 67 commits into allenai:main from revbucket:ai2


@revbucket

Lots of changes here (this may be closer to a refactor than a regular PR, but it will still need some heavy code review and discussion about which changes to keep or fold in).

Summary of changes:

  • Added commands for bff and sysreq, to get a sense of how much memory a given BFF run will require
  • Changed some defaults of arguments:
    • min-ngram/max-ngram now default to [20,20]
    • by default the bloom filter file is not saved (this can be specified)
    • annotations have been merged into a single argument
  • a progress bar is now shown (a no-progress-bar arg is also available)
  • some more abstraction/functions to break things up and avoid repeating code once I push the S3 PR
  • added a BOTH-level removal type (there is some discussion of what this does in the RemoveType enum)
  • Added some printouts with BFF sparsity, removal rates, time
  • misc performance-y things, like parallel iteration in some places
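
For reviewers, here is a rough sketch of the kind of estimate a sysreq-style command could report, and of the sparsity printout mentioned above. It uses the standard Bloom filter sizing formulas and is not the code in this PR: the function names, the Python language choice, and the example numbers are all hypothetical, and the actual command may compute things differently.

```python
import math


def bloom_filter_size_bits(expected_items: int, fp_rate: float) -> int:
    """Bits needed for a target false-positive rate: m = -n * ln(p) / (ln 2)^2."""
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not 0.0 < fp_rate < 1.0:
        raise ValueError("fp_rate must be strictly between 0 and 1")
    return math.ceil(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_num_hashes(size_bits: int, expected_items: int) -> int:
    """Hash count that minimizes false positives: k = (m / n) * ln 2, at least 1."""
    return max(1, round(size_bits / expected_items * math.log(2)))


def expected_fill_fraction(size_bits: int, num_hashes: int, inserted_items: int) -> float:
    """Approximate fraction of bits set after inserting items: 1 - exp(-k * n / m)."""
    return 1.0 - math.exp(-num_hashes * inserted_items / size_bits)


if __name__ == "__main__":
    # Hypothetical run: one million n-grams at a 1% false-positive rate.
    n_items, fp = 1_000_000, 0.01
    m_bits = bloom_filter_size_bits(n_items, fp)
    k = optimal_num_hashes(m_bits, n_items)
    print(f"filter size: {m_bits / 8 / 2**20:.2f} MiB, hashes: {k}")
    print(f"expected fill: {expected_fill_fraction(m_bits, k, n_items):.3f}")
```

The fill fraction is the complement of sparsity: a filter that is much fuller than about half at the optimal hash count is likely undersized for the data it has seen.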
