Improvement of unbalanced datasets in multiprocessing #47

@maxgalli

Description

As noticed during the last benchmark runs, unbalanced datasets are handled suboptimally when multiprocessing is enabled: if one of the RDataFrames is built on top of a dataset that is much larger than the others, the worker that processes it ends up being a bottleneck for the entire analysis. Several approaches (to be investigated and implemented separately) could fix this issue:

  • combine multiprocessing and multithreading: detect the larger datasets in advance and let the workers that process them run with multiple threads; so that the number of cores used does not increase, the overall number of workers is reduced;
  • use only multiprocessing: detect the larger datasets in advance and split each of them into several RDataFrames, so that they are picked up by different workers; the partial results can easily be merged at the end to get the proper histograms; this solution also requires a check that the largest RDataFrames are the first ones sent to the workers.
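The second option could look roughly like the sketch below. It is only an illustration under stated assumptions, not the project's implementation: it uses no ROOT at all, a dataset is reduced to a name and an entry count, a "histogram" is a plain list of bin counts, and all names (`split_dataset`, `build_tasks`, `process_chunk`, `merge`, `run`, `N_BINS`, `max_chunk`) are hypothetical. In the real code each entry range would correspond to an RDataFrame built on part of the large dataset.

```python
from multiprocessing import Pool

N_BINS = 4  # toy histogram size (assumption, for illustration only)


def split_dataset(name, n_entries, max_chunk):
    """Split one dataset into (name, start, stop) ranges of at most max_chunk entries."""
    return [(name, start, min(start + max_chunk, n_entries))
            for start in range(0, n_entries, max_chunk)]


def build_tasks(datasets, max_chunk):
    """Split every dataset, then order the chunks from largest to smallest."""
    tasks = []
    for name, n_entries in datasets.items():
        tasks.extend(split_dataset(name, n_entries, max_chunk))
    # Largest chunks first, so that no big chunk is started last.
    tasks.sort(key=lambda t: t[2] - t[1], reverse=True)
    return tasks


def process_chunk(task):
    """Stand-in for the per-worker event loop: fill a toy histogram for one range."""
    name, start, stop = task
    hist = [0] * N_BINS
    for entry in range(start, stop):
        hist[entry % N_BINS] += 1
    return name, hist


def merge(results):
    """Sum the partial histograms bin by bin, per dataset."""
    merged = {}
    for name, hist in results:
        total = merged.setdefault(name, [0] * N_BINS)
        for i, count in enumerate(hist):
            total[i] += count
    return merged


def run(datasets, max_chunk, n_workers):
    """Process all chunks in a pool of workers and merge the partial results."""
    tasks = build_tasks(datasets, max_chunk)
    with Pool(n_workers) as pool:
        # chunksize=1 hands out one task at a time, in the sorted order,
        # to whichever worker becomes free.
        results = pool.imap_unordered(process_chunk, tasks, chunksize=1)
        return merge(results)
```

Calling, for example, `run({"big": 1_000_000, "small": 10_000}, max_chunk=100_000, n_workers=4)` from a script's `if __name__ == "__main__":` block would spread the large dataset over ten tasks instead of leaving it to a single worker. The chunk size is a free parameter here; how to choose it, and how to detect which datasets count as "large", is exactly what remains to be investigated.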
