split S3 files into smaller files to send large union file#77
Open
yuyashiraki wants to merge 3 commits intofacebookresearch:mainfrom
Conversation
Contributor
This pull request was exported from Phabricator. Differential Revision: D39219674
yuyashiraki pushed a commit to yuyashiraki/Private-ID that referenced this pull request on Sep 4, 2022
…esearch#77)

Summary:
Pull Request resolved: facebookresearch#77

# Context
We found that the AWS-SDK S3 API fails when we try to write more than 5 GB of data. This blocks capacity testing for a larger FARGATE container. In this diff, as mentioned in [the post](https://fb.workplace.com/groups/pidmatchingxfn/posts/493743615908631), we split the union file based on its number of rows.

# Description
We have made the following changes.
- Added a new arg `s3api_max_rows` to the private-id-multi-key-client and private-id-multi-key-server binaries. We use this to split a file for S3 upload.
- Added an optional arg `num_split` to save_id_map() and writer_helper(). When `num_split` is specified, the arg `path` is used as a prefix and files are saved as `{path}_0`, `{path}_1`, etc.
- In rpc_server.rs and client.rs, compute `num_split` based on `s3api_max_rows` and pass the `num_split` arg for S3 only. Then, for each split file, call copy_from_local().

Differential Revision: D39219674
fbshipit-source-id: 82dc1788b0d4db5cf9c3de07178b52a8cc11633c
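The split count described in the commit reduces to a ceiling division over the row count. A minimal sketch, assuming a hypothetical `num_split` helper (the actual code in rpc_server.rs and client.rs may be structured differently):

```rust
/// Hypothetical helper: how many split files are needed so that no
/// single file holds more than `s3api_max_rows` rows.
fn num_split(total_rows: usize, s3api_max_rows: usize) -> usize {
    // Ceiling division: e.g. 10 rows with a 4-row cap -> 3 files.
    (total_rows + s3api_max_rows - 1) / s3api_max_rows
}
```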
4ba74f7 to 4f28018
Summary:

# What
* Add unit tests for the encrypt and create_id_map functions on the partner side.
* Add a create_key function to create fixed keys for testing.
* The encrypt and create_id_map functions both use partner.private_keys.1 to encrypt.
* self_permutation also needs to be fixed when we test create_id_map().

# Why
* Need to improve code coverage.

Differential Revision: https://internalfb.com/D39127178
fbshipit-source-id: 22acb4c9d2d642b8df1348547098a7539f6ce7df
Summary:
Pull Request resolved: facebookresearch#76

# What
* Add unit tests for the save_id_map function on the partner side.
* The save_id_map function is called after create_id_map().
* Add a create_key function to create fixed keys for testing.
* The create_id_map function uses partner.private_keys.1 to encrypt.
* self_permutation also needs to be fixed when we test create_id_map().
* Create a temp file, pass its path to save_id_map(), and check that the string in the file is correct.

# Why
* Need to improve code coverage.

Differential Revision: D39142927
fbshipit-source-id: 82884647935873fe1f2feef5b061f3cc5385bba2
…esearch#77)

Summary:
Pull Request resolved: facebookresearch#77

# Context
We found that the AWS-SDK S3 API fails when we try to write more than 5 GB of data. This blocks capacity testing for a larger FARGATE container. In this diff, as mentioned in [the post](https://fb.workplace.com/groups/pidmatchingxfn/posts/493743615908631), we split the union file based on its number of rows.

# Description
We have made the following changes.
- Added a new arg `s3api_max_rows` to the private-id-multi-key-client and private-id-multi-key-server binaries. We use this to split a file for S3 upload.
- Added an optional arg `num_split` to save_id_map() and writer_helper(). When `num_split` is specified, the arg `path` is used as a prefix and files are saved as `{path}_0`, `{path}_1`, etc.
- In rpc_server.rs and client.rs, compute `num_split` based on `s3api_max_rows` and pass the `num_split` arg for S3 only. Then, for each split file, call copy_from_local().

Differential Revision: D39219674
fbshipit-source-id: 871df40d1a377ef8115422e39a868a26e09e027d
4f28018 to 48c8aa6
Summary:

Context

We found that the AWS-SDK S3 API fails when we try to write more than 5 GB of data. This blocks capacity testing for a larger FARGATE container.

In this diff, as mentioned in the post, we split the union file based on its number of rows.

Description

We have made the following changes.
- Added a new arg `s3api_max_rows` to the private-id-multi-key-client and private-id-multi-key-server binaries. We use this to split a file for S3 upload.
- Added an optional arg `num_split` to save_id_map() and writer_helper(). When `num_split` is specified, the arg `path` is used as a prefix and files are saved as `{path}_0`, `{path}_1`, etc.
- In rpc_server.rs and client.rs, compute `num_split` based on `s3api_max_rows` and pass it for S3 uploads only; for each split file, call copy_from_local().

Differential Revision: D39219674
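The split-and-name scheme above (`{path}_0`, `{path}_1`, …) can be sketched as follows. This is a hypothetical illustration, not the actual writer_helper() from this PR; it only writes plain lines to local files, with no S3 upload step:

```rust
use std::fs::File;
use std::io::{BufWriter, Result, Write};

/// Hypothetical sketch of row-based splitting: write `rows` across
/// `num_split` files named `{path}_0`, `{path}_1`, ... and return the paths.
fn write_splits(rows: &[String], path: &str, num_split: usize) -> Result<Vec<String>> {
    // Rows per file, rounded up so every row lands in some split.
    let per_file = (rows.len() + num_split - 1) / num_split;
    let mut paths = Vec::with_capacity(num_split);
    for (i, part) in rows.chunks(per_file).enumerate() {
        let split_path = format!("{}_{}", path, i);
        let mut w = BufWriter::new(File::create(&split_path)?);
        for line in part {
            writeln!(w, "{}", line)?;
        }
        w.flush()?;
        paths.push(split_path);
    }
    Ok(paths)
}
```

Per the PR description, each returned path would then be handed to copy_from_local() for its own S3 upload, keeping every object under the size cap.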