Issues with parallelizing thousands of tasks #49

@drunksaint

Hi! I'm trying to run a script that parallelizes trivially: each line of the input is independent of the others. I have a 15k-line input file that takes around 10 minutes to process on a single core; a single line takes ~1.4s to run. I split the input file into 15k single-line files, generated the makefile, and passed it to Lambda through gg with -j=2000. I encountered a few issues:

  • The full job took almost 3 minutes to run. It should have taken a few seconds, plus the time to download around 200 MB from S3 to Lambda, which shouldn't be much. Is there a way to debug which step on Lambda is taking time, and why? It's probably the file download, but it could be something else.
  • The job never runs to completion on the first try; ~1% of the tasks always get stuck. Killing the process and restarting it once or twice always completes the execution, though. Is there a way to specify retries on error? Or something else I can do to fix this?
  • Creating the thunks takes around 45 seconds using gg infer make -j$(nproc) on a 6-core machine. The wrapper uses gg create-thunk because I built the binary with PyInstaller. The wrapper does just two things: it collects the single-line input file and creates a thunk for the command. The binary and the common input file are collected just once, manually, outside the wrapper function. I was considering using gg to convert the whole process into something that takes 5-10 seconds max to run, so it could be exposed as an API. Is there a faster way to create these thunks?
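On the stuck-tasks point above: in the absence of a built-in retry option, one crude workaround is a shell wrapper that re-runs the driving command until it exits cleanly. This is only a sketch; "gg force ..." below is a placeholder for whatever invocation you actually use, and the flag names are not taken from gg's documentation.

```shell
# Re-run a command until it succeeds, up to a maximum number of attempts.
retry() {
  local max=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    echo "retrying (attempt $attempt)..." >&2
  done
}

# Example (placeholder invocation):
#   retry 3 gg force --jobs=2000 --engine=lambda <targets>
```

This mirrors the manual kill-and-restart workflow described above, relying on gg skipping already-forced thunks on re-execution.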

The command I use has the format:

command single_line_input_file.txt big_constant_input_file.txt single_line_output_file.txt

The single_line_input_file.txt and single_line_output_file.txt change for each execution; the command and big_constant_input_file.txt are the same for every execution. The command binary is around 70 MB and big_constant_input_file.txt is around 140 MB, so the same two files are being downloaded to 2k Lambdas in parallel. I remember @sadjad, you mentioned in a talk that gg does each file transfer in parallel chunks. Perhaps this, combined with 2k parallel Lambdas trying to download the same two files, is hitting the S3 concurrency limit?
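For reference, the per-line splitting step described earlier can be done with coreutils split; the file names here are illustrative, not what the generated makefile necessarily expects.

```shell
# Sample 3-line input standing in for the real 15k-line file.
printf 'a\nb\nc\n' > input.txt

# Split into one file per line: line_00000, line_00001, line_00002
# (-d = numeric suffixes, -a 5 = five-digit suffix width; GNU coreutils).
split -l 1 -d -a 5 input.txt line_
```

Each resulting line_NNNNN file then becomes the single_line_input_file.txt of one task.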
