Issues with parallelizing thousands of tasks #49

@drunksaint

Hi! I'm trying to run a script that parallelizes trivially: each line of the input is independent of the others. I have a 15k-line input file that takes around 10 minutes to process on a single core; a single line takes ~1.4s to run. I split the input file into 15k single-line files, generated the makefile, and passed it to Lambda through gg with -j=2000. I encountered a few issues:

  • The full job took almost 3 minutes to run. It should have taken a few seconds, plus the time to download around 200 MB from S3 to Lambda, which shouldn't be much. Is there a way to debug which step on Lambda is taking time, and why? It's probably the file download, but it could be something else.
  • The job never runs to completion on the first try; ~1% of the tasks always get stuck. Killing the process and restarting it once or twice always completes the execution, though. Is there a way to specify retries on error? Or something else I can do to fix this?
  • Creating the thunks takes around 45 seconds using gg infer make -j$(nproc) on a 6-core machine. The wrapper uses gg create-thunk because I built the binary with PyInstaller. The wrapper does just two things: it collects the single-line input file and creates a thunk for the command. The binary and the common input file are collected just once, manually, outside the wrapper function. I was considering using gg to convert the whole process into something that takes 5-10 seconds max to run, so it could be exposed as an API. Is there a faster way to create these thunks?
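On the stuck-tasks point above: in the absence of a built-in retry option, one crude workaround is a shell wrapper that re-runs the driving command until it exits cleanly. This is only a sketch; "gg force ..." below is a placeholder for whatever invocation you actually use, and the flag names are not taken from gg's documentation.

```shell
# Re-run a command until it succeeds, up to a maximum number of attempts.
retry() {
  local max=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    echo "retrying (attempt $attempt)..." >&2
  done
}

# Example (placeholder invocation):
#   retry 3 gg force --jobs=2000 --engine=lambda <targets>
```

This mirrors the manual kill-and-restart workflow described above, relying on gg skipping already-forced thunks on re-execution.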

The command I use has the format:

command single_line_input_file.txt big_constant_input_file.txt single_line_output_file.txt

The single_line_input_file.txt and single_line_output_file.txt change for each execution; the command and big_constant_input_file.txt are the same for every execution. The command binary is around 70 MB and big_constant_input_file.txt is around 140 MB, so the same two files are being downloaded to 2k Lambdas in parallel. I remember @sadjad, you mentioned in a talk that gg does each file transfer in parallel chunks. Perhaps this, combined with 2k parallel Lambdas trying to download the same two files, is hitting the S3 concurrency limit?
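For reference, the per-line splitting step described earlier can be done with coreutils split; the file names here are illustrative, not what the generated makefile necessarily expects.

```shell
# Sample 3-line input standing in for the real 15k-line file.
printf 'a\nb\nc\n' > input.txt

# Split into one file per line: line_00000, line_00001, line_00002
# (-d = numeric suffixes, -a 5 = five-digit suffix width; GNU coreutils).
split -l 1 -d -a 5 input.txt line_
```

Each resulting line_NNNNN file then becomes the single_line_input_file.txt of one task.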
