
Barebones python/S3 implementation#10

Open
revbucket wants to merge 9 commits into allenai:main from revbucket:s3_python

Conversation

@revbucket

Added bff_v0.py, a simple python script to:

  1. Download all .jsonl.gz files from a specified S3 directory
  2. Run BFF on the downloaded files
  3. Upload the outputs back to S3

Also modified main.rs to use the faster init for bloom filters:

for _ in 0..number_of_u32 {
    bits.push(AtomicU32::new(0));
}
let mut bits = {
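The download/dedup/upload steps listed above could be sketched in Python along these lines. This is a hypothetical sketch, not the actual bff_v0.py: the bucket paths and helper names are made up, and the real BFF deduplicates with an n-gram bloom filter, whereas an exact-line set stands in here purely for illustration.

```python
import gzip
import subprocess
from pathlib import Path


def s3_copy(src: str, dst: str) -> None:
    """Copy .jsonl.gz files between S3 and local disk via the AWS CLI.

    Assumes the `aws` CLI is installed and credentials are configured.
    """
    subprocess.run(
        ["aws", "s3", "cp", src, dst, "--recursive",
         "--exclude", "*", "--include", "*.jsonl.gz"],
        check=True,
    )


def dedup_lines(lines, seen):
    """Stand-in for BFF: keep only lines not seen before.

    The real bff uses an n-gram bloom filter; an exact-match set is
    used here only to keep the sketch self-contained.
    """
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def process_file(in_path: Path, out_path: Path, seen: set) -> None:
    # Stream one gzipped JSONL file through the dedup step.
    with gzip.open(in_path, "rt") as f:
        kept = dedup_lines(f, seen)
    with gzip.open(out_path, "wt") as f:
        f.writelines(kept)


def main() -> None:
    # Hypothetical bucket layout; substitute real input/output prefixes.
    s3_copy("s3://bucket/input_dir/", "./input_files")
    out_dir = Path("./output_files")
    out_dir.mkdir(exist_ok=True)
    seen = set()
    for path in sorted(Path("./input_files").glob("*.jsonl.gz")):
        process_file(path, out_dir / path.name, seen)
    s3_copy("./output_files", "s3://bucket/output_dir")
```

The `seen` set is shared across files so duplicates are removed corpus-wide, mirroring how a single bloom filter would span the whole input directory.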

We definitely want this change!

@@ -0,0 +1,169 @@
""" Quick'n'dirty mapping of bff for python before I can make it pure-rust

This is fine if you want to put it into a scripts/ directory, but I feel like it's not really necessary. You could replace the entire thing with a few lines of bash:

aws s3 cp "s3://bucket/input_dir/" ./input_files --recursive --exclude "*" --include "*.jsonl.gz"
bff --bloom-filter-file filter.bff --bloom-filter-size <something> --expected-ngram-count <something> --output-directory ./output_files ./input_files/*.jsonl.gz
rm -r input_files
aws s3 cp ./output_files s3://bucket/output_dir --recursive
