
Barebones python/S3 implementation#10

Open
revbucket wants to merge 9 commits into allenai:main from revbucket:s3_python

Conversation

@revbucket

Added bff_v0.py, a simple python script to:

  1. Download all .jsonl.gz files from a specified S3 directory
  2. Run BFF on the downloaded files
  3. Upload the outputs back to S3

Also modified main.rs to use the faster init for bloom filters:

for _ in 0..number_of_u32 {
    bits.push(AtomicU32::new(0));
}
let mut bits = {
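The download/dedup/upload steps listed above could be sketched in Python along these lines. This is a hypothetical sketch, not the actual bff_v0.py: the bucket paths and helper names are made up, and the real BFF deduplicates with an n-gram bloom filter, whereas an exact-line set stands in here purely for illustration.

```python
import gzip
import subprocess
from pathlib import Path


def s3_copy(src: str, dst: str) -> None:
    """Copy .jsonl.gz files between S3 and local disk via the AWS CLI.

    Assumes the `aws` CLI is installed and credentials are configured.
    """
    subprocess.run(
        ["aws", "s3", "cp", src, dst, "--recursive",
         "--exclude", "*", "--include", "*.jsonl.gz"],
        check=True,
    )


def dedup_lines(lines, seen):
    """Stand-in for BFF: keep only lines not seen before.

    The real bff uses an n-gram bloom filter; an exact-match set is
    used here only to keep the sketch self-contained.
    """
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def process_file(in_path: Path, out_path: Path, seen: set) -> None:
    # Stream one gzipped JSONL file through the dedup step.
    with gzip.open(in_path, "rt") as f:
        kept = dedup_lines(f, seen)
    with gzip.open(out_path, "wt") as f:
        f.writelines(kept)


def main() -> None:
    # Hypothetical bucket layout; substitute real input/output prefixes.
    s3_copy("s3://bucket/input_dir/", "./input_files")
    out_dir = Path("./output_files")
    out_dir.mkdir(exist_ok=True)
    seen = set()
    for path in sorted(Path("./input_files").glob("*.jsonl.gz")):
        process_file(path, out_dir / path.name, seen)
    s3_copy("./output_files", "s3://bucket/output_dir")
```

The `seen` set is shared across files so duplicates are removed corpus-wide, mirroring how a single bloom filter would span the whole input directory.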

We definitely want this change!

@@ -0,0 +1,169 @@
""" Quick'n'dirty mapping of bff for python before I can make it pure-rust

This is fine if you want to put it into a scripts/ directory, but I feel like it's not really necessary. You could replace the entire thing with a few lines of bash:

aws s3 cp "s3://bucket/input_dir/" ./input_files --recursive --exclude "*" --include "*.jsonl.gz"
bff --bloom-filter-file filter.bff --bloom-filter-size <something> --expected-ngram-count <something> --output-directory ./output_files ./input_files/*.jsonl.gz
rm -r input_files
aws s3 cp ./output_files s3://bucket/output_dir --recursive
