Improvement of unbalanced datasets in multiprocessing #47

As noticed during the last benchmark run, unbalanced datasets are handled suboptimally when multiprocessing is enabled: if one of the RDataFrames is built on top of a dataset that is much larger than the others, the worker that processes it becomes a bottleneck for the entire analysis. Several approaches (to be investigated and implemented separately) could fix this issue:
- combine multiprocessing and multithreading: detect the larger datasets in advance and let the workers that process them run on multiple threads; to avoid increasing the number of cores used, the overall number of workers is reduced accordingly;
- use only multiprocessing: detect the larger datasets in advance and split them into separate RDataFrames, so that they are picked up by different workers; the partial results can easily be merged at the end to obtain the proper histograms; this solution also requires a mechanism to ensure that the largest RDataFrames are sent to the workers first.
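
As a rough illustration of the second option, the scheduling logic could look like the sketch below. It is not the framework's actual code: the `Chunk` type, the function names, the dataset names and the entry counts are all hypothetical. Datasets are modelled only by their entry counts, histograms are plain lists of bin contents, and turning an entry range into an RDataFrame is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A contiguous entry range [start, stop) of one dataset (hypothetical type)."""
    dataset: str
    start: int
    stop: int

    @property
    def size(self) -> int:
        return self.stop - self.start


def split_large_datasets(sizes: dict[str, int], max_entries: int) -> list[Chunk]:
    """Split every dataset into chunks of at most max_entries entries."""
    chunks = []
    for name, n_entries in sizes.items():
        for start in range(0, n_entries, max_entries):
            chunks.append(Chunk(name, start, min(start + max_entries, n_entries)))
    return chunks


def largest_first(chunks: list[Chunk]) -> list[Chunk]:
    """Order chunks so that the biggest ones are handed to the workers first."""
    return sorted(chunks, key=lambda c: c.size, reverse=True)


def merge_histograms(partials: list[list[float]]) -> list[float]:
    """Merge per-chunk histograms with identical binning by summing bin contents."""
    return [sum(bins) for bins in zip(*partials)]


# Hypothetical entry counts: one dataset dominates the others.
sizes = {"ttbar": 1_000_000, "wjets": 120_000, "data": 80_000}
schedule = largest_first(split_large_datasets(sizes, max_entries=250_000))
```

The ordered `schedule` could then be passed to something like `multiprocessing.Pool.imap_unordered` with `chunksize=1`, so that tasks are handed out one at a time in that order and the small chunks fill the gaps at the end. The value of `max_entries` is a tuning parameter that would have to be chosen from benchmarks.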
