Add MaxKeys configurable to respect constraints for rate-limited Ingestion use cases #88
Description
I was excited about using S3P for its parallel-processing benefits. My use case has a hard constraint: bulk-move exactly 300 XML files at a time from one prefix to another within the same S3 bucket, and only once every hour.
I had thought I would be able to leverage S3P in a cron'd workflow of "npx s3p: ls -> map -> cp -> compare -> each -> s3api: rm -> npx s3p: summarize (as report)". But after losing a full day trying, unsuccessfully, to get it to process an expressly limited number of files per run (i.e. 300 keys per batch, per hour), I felt let down and ended up going back to the slower AWS s3api directly instead.
It would be perfect if S3P offered a `--max-keys` config option (an integer, with 1 as the floor and 1000 as the ceiling), which really is one of the best ways to limit a result set for anything that ultimately calls the ListObjectsV2 API.
I can see in your codebase that `MaxKeys` is not exposed as a configurable item, even though for AWS that is the whole purpose of its existence.
Master branch, `source/S3Parallel/Lib/S3.caf`, starting at line 38:

```coffeescript
@list: ({bucket, prefix, limit=1000, fetchOwner, startAfter}) =>
  startTime = currentSecond()
  @s3.listObjectsV2
    Bucket:     bucket
    Prefix:     prefix
    MaxKeys:    limit
    StartAfter: startAfter
    FetchOwner: fetchOwner
  .tapCatch (error) ->
    log.error S3.list-error: {} bucket, prefix, startAfter, limit, error
```
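For comparison, here is a minimal sketch of how a user-supplied max-keys value could flow into the request parameters. This is plain JavaScript against the AWS SDK for JavaScript v3, and the helper name `buildListParams` is mine, not S3P's:

```javascript
// Hypothetical helper: builds ListObjectsV2 parameters, honoring a
// user-supplied maxKeys override instead of a hard-coded 1000.
function buildListParams({ bucket, prefix, maxKeys = 1000, startAfter, fetchOwner }) {
  return {
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: maxKeys,        // AWS accepts 1-1000 for ListObjectsV2
    StartAfter: startAfter,
    FetchOwner: fetchOwner,
  };
}

// With the AWS SDK v3 this would be passed to a ListObjectsV2Command, e.g.:
//   const { S3Client, ListObjectsV2Command } = require("@aws-sdk/client-s3");
//   const out = await new S3Client({}).send(
//     new ListObjectsV2Command(buildListParams({ bucket: "my-bucket", maxKeys: 300 }))
//   );
```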
IMHO, you should not hard-code the limit: pinning `MaxKeys` to a static value disadvantages S3P users.
I get that S3P should respect AWS's rule that a list "request/response action returns up to 1,000 key names", but you could do that by defaulting to 1000 when no config override is supplied, and by ignoring any override that falls outside the 1-1000 spec, so users can never exceed AWS's hard result limit.
Examples:
For setting the default and override in `@commonOptions` for `copyObject` and `largeCopy`:

```
maxKeys: max-keys (default: 1000)
```

And wherever `limit` is instantiated:

```js
limit = (maxKeys > 0 && maxKeys <= 1000) ? maxKeys : 1000;
```
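A runnable version of that guard (the function name `resolveLimit` is illustrative, not from S3P): default to 1000, and silently ignore any override that doesn't meet spec:

```javascript
// Clamp a user-supplied max-keys override to AWS's valid range (1-1000).
// Invalid, non-integer, or missing values fall back to the default of 1000.
function resolveLimit(maxKeys) {
  const isValid = Number.isInteger(maxKeys) && maxKeys > 0 && maxKeys <= 1000;
  return isValid ? maxKeys : 1000;
}
```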
I haven't been able to figure out how to do this with what S3P offers today. I tried combinations of `s3p ls`, `cp`, `each`, and `sync` (with the `map`/`reduce`, `map-list`/`reduce`, and `filter` options too).
The biggest issue I bump up against here is my own ignorance: I don't understand how to get `obj.indexOf(${item})` for the keys returned during reducer processing, so that I can capture the key of every 300th file to use in the `start-at` and `stop-at` config setters.
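As a workaround sketch (pure illustration, not an S3P API): given a full, sorted key listing, the boundary keys for 300-key batches can be computed with plain array indexing, which is the index-lookup step described above:

```javascript
// Split a sorted list of S3 keys into batches of `size` and return the
// first and last key of each batch, usable as start-at / stop-at bounds.
function batchBounds(keys, size = 300) {
  const bounds = [];
  for (let i = 0; i < keys.length; i += size) {
    const batch = keys.slice(i, i + size);
    bounds.push({ startAt: batch[0], stopAt: batch[batch.length - 1] });
  }
  return bounds;
}
```

Each `{ startAt, stopAt }` pair could then drive one hourly run over exactly one batch.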
NOTE: If anybody knows of a way to limit the keys returned per request/response to a dynamic value of the user's choice, instead of the statically set 1000, I would love to hear about it. Ultimately, though, I think `--max-keys` would be the easiest user-facing method for gating request/response scope.