Description
Hi! I am trying to run a script that can be parallelized trivially - each line of the input is independent of the others. I have a 15k-line input file that takes around 10 minutes to process on a single core; a single line takes ~1.4 s to run. I split the input file into 15k files containing a single line each, generated the makefile, and passed it to Lambda through gg with `-j 2000`. I encountered a few issues:
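For context, the splitting step I describe is roughly the following (file naming is illustrative, not what gg requires):

```python
from pathlib import Path

def split_lines(input_path, out_dir):
    """Write each line of input_path to its own single-line file in out_dir.

    Returns the list of created paths, in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, line in enumerate(Path(input_path).read_text().splitlines()):
        # Zero-padded index keeps the 15k files lexically sorted.
        p = out / f"single_line_input_{i:05d}.txt"
        p.write_text(line + "\n")
        paths.append(p)
    return paths
```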
- the full job took almost 3 minutes to run. It should have taken a few seconds plus the download of around 200 MB from S3 to Lambda, which shouldn't take long. Is there a way to debug which step is taking time on Lambda, and why? It is probably the file download, but it could be something else.
- the job never runs to completion the first time; there are always ~1% of jobs stuck. Killing the process and restarting it once or twice always completes the execution, though. Is there a way to specify retries in case of error? Or something else I can do to fix this?
- the creation of thunks takes around 45 seconds using `gg infer make -j$(nproc)` on a 6-core machine, and the wrapper uses `gg create-thunk` because I used PyInstaller to build the binary. The wrapper does just two things: it collects the single-line input file and creates a thunk for the command. The binary and the common input file are collected just once, manually, outside the wrapper function. I was considering using gg to get the whole process down to around 5-10 s max so that it can be run as an API. Is there a faster way to create these thunks?
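As a workaround for the stuck jobs, I am currently just re-running the force step by hand; that could be automated with a small retry wrapper like the sketch below. Since gg caches completed thunks, a restarted run should only re-execute the stuck ones. The `gg force` flags in the comment are taken from the gg README and may need adjusting for your setup.

```python
import subprocess
import time

def run_with_retries(cmd, max_attempts=3, delay=5):
    """Re-invoke cmd until it exits 0 or max_attempts is reached.

    Useful when a small fraction of jobs gets stuck and a restart
    finishes them, because already-forced thunks are served from cache.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        if attempt < max_attempts:
            time.sleep(delay)  # brief pause before re-forcing
    return False

# Example invocation (flags as in the gg README; targets are the outputs):
# run_with_retries(["gg", "force", "--jobs", "2000", "--engine", "lambda",
#                   *output_files])
```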
The command I use is of the format:

```
command single_line_input_file.txt big_constant_input_file.txt single_line_output_file.txt
```
The `single_line_input_file.txt` and `single_line_output_file.txt` change for each execution; the command and `big_constant_input_file.txt` are the same for every execution. The command binary is around 70 MB and `big_constant_input_file.txt` is around 140 MB, so the same two files are being downloaded by 2k Lambdas in parallel. I remember, @sadjad, you mentioned in a talk that gg does each file transfer in parallel chunks. Perhaps this, combined with the 2k parallel Lambdas trying to download the same two files, is hitting the S3 concurrency limit?
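To make the chunked-transfer concern concrete: I don't know gg's actual transfer code, but a range-based parallel fetch would split each object into byte ranges roughly like this, so 2k Lambdas pulling a 140 MB file in, say, 8 MB chunks would multiply the request count accordingly. This is only an illustrative sketch, not gg's implementation:

```python
def byte_ranges(total_size, chunk_size):
    """Split an object of total_size bytes into inclusive (start, end)
    pairs suitable for HTTP Range requests (bytes=start-end)."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

# Each range would then be fetched in parallel, e.g. with boto3:
#   s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
# so total S3 requests ~= num_lambdas * len(byte_ranges(size, chunk)).
```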