Skip to content

Added i) directory support, ii) FP rate args, iii) No-Save option#11

Open
revbucket wants to merge 5 commits intoallenai:mainfrom
revbucket:tolerance-dir
Open

Added i) directory support, ii) FP rate args, iii) No-Save option#11
revbucket wants to merge 5 commits intoallenai:mainfrom
revbucket:tolerance-dir

Conversation

@revbucket
Copy link
Copy Markdown

Several changes to main.rs:

  1. Added progress bar printouts vs printouts at each filename (tried to use similar formatting as in wimbd)
  2. Added directory support for inputs (can pass dir as input now, only selects .json*.gz files). Same as wimbd
  3. Added option to not save bloom filter at the end (but update during)
  4. Added option to compute filter size dynamically given a desired false-positive rate

@revbucket
Copy link
Copy Markdown
Author

Oh and some more notes about point 4:
We search for a size of bloom filter using an accurate ngram count and then find the smallest filter that results in that fp_rate. We return min(filter_size, 0.90 * systemRAM) (so we don't allocate too much memory -- maybe I should warn if we get to this point though!)

Copy link
Copy Markdown
Contributor

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with the progress bar if the output of bff is piped into a file? Or the terminal is in a weird state?

This is why I generally prefer sticking to writing to stdout and stderr, one line at a time, and that's all. It causes the least amount of trouble in unexpected settings.

Copy link
Copy Markdown
Contributor

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one suggestion. If that's too hard, let's ship it like this.

src/main.rs Outdated
threadpool.execute(move || {
if args.no_progress {
println!("Processing {input:?}...");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra newline?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅
Yes will fix this before shipping

src/main.rs Outdated
Comment on lines +638 to +639
&pbar,
args.no_progress,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can Arc pointers be null? Then you could save a parameter and pass in null here instead.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes...? Pretty sure we can wrap with an option and then we don't have to pass the second parameter

@revbucket revbucket requested a review from dirkgr March 13, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants