Conversation

@ZJONSSON
Contributor

Offers a significant speed improvement when the reader has slow I/O (e.g. reading over the network instead of from disk).

@asmuth
Contributor

asmuth commented Feb 11, 2018

LGTM. If I/O reduction over the network is a concern, we could also optionally disable reading the header entirely as it is just a sanity check and not required to understand the file.
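To make the suggestion concrete, here is a minimal sketch of an optional header sanity check. The function and option names (`checkHeader`, `skipHeaderCheck`, `read`) are illustrative, not the parquetjs API; the only hard fact used is that a Parquet file begins with the 4-byte magic `PAR1`, so skipping the check saves one small read over slow I/O.

```javascript
// Parquet files start (and end) with the 4-byte magic string "PAR1".
const PARQUET_MAGIC = 'PAR1';

// read(offset, length) is assumed to return a Promise<Buffer>.
// Passing { skipHeaderCheck: true } makes the sanity check a no-op.
async function checkHeader(read, opts = {}) {
  if (opts.skipHeaderCheck) return;
  const buf = await read(0, PARQUET_MAGIC.length);
  if (buf.toString('utf8') !== PARQUET_MAGIC) {
    throw new Error('invalid parquet file: bad header magic');
  }
}
```

Since the header is only a sanity check, a reader that trusts its input could skip it and rely on the footer alone to understand the file.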

@asmuth
Contributor

asmuth commented Feb 11, 2018

However, we might also want to benchmark this on files backed by a spinning disk, and/or give the user the option to disable parallel/out-of-order reading. I'm not sure off the top of my head whether our writer does this, but other writers might write out the column chunks in order (in the data file) so that readers can benefit from read-ahead optimization.

@kessler
Contributor

kessler commented Feb 11, 2018

@ZJONSSON,
@asmuth and I were discussing this and we want to add two new classes, one for parallel reading and the other for sequential reading. @asmuth has serious concerns regarding disk performance.

We also need a benchmark test suite to make sure we are actually improving things, and in which scenarios.

We're gonna spend some time tomorrow morning doing this.

@ZJONSSON
Contributor Author

ZJONSSON commented Feb 12, 2018

I agree that concurrency should not be unbounded. However, I think there are better ways to control it than hard-coding sequential execution for tasks that could run in parallel. One way to limit concurrent reads would be to wrap the get method in a simple queue where the maximum concurrency is defined in options (with a default value).
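A minimal sketch of that queue idea (the names `limitConcurrency`, `rawGet`, and `maxConcurrentReads` are illustrative, not parquetjs API): wrap the reader's get in a function that caps the number of reads in flight and queues the rest.

```javascript
// Wrap an async function so that at most `maxConcurrent` calls run at once;
// extra calls wait in a FIFO queue until a slot frees up.
function limitConcurrency(fn, maxConcurrent = 8) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { args, resolve, reject } = waiting.shift();
    fn(...args)
      .then(resolve, reject)
      .finally(() => { active--; next(); });
  };
  return (...args) => new Promise((resolve, reject) => {
    waiting.push({ args, resolve, reject });
    next();
  });
}
```

Usage would look something like `const get = limitConcurrency(rawGet, opts.maxConcurrentReads || 8);`, keeping the parallel code path while giving the user a knob (and a sane default) for how hard the reader hits the disk or network.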

Additionally, the number of actual requests could be reduced by inspecting any simultaneous requests (in the get wrapper) and checking whether any of the requested buffers overlap (or are sufficiently close). In that case a single underlying get command is executed, and the individual get promises are resolved from the corresponding parts of the incoming buffer. Ideally we would stream the buffer and resolve the parts as we receive them, instead of waiting for the whole buffer to load.

@ZJONSSON
Contributor Author

On the second point, here is a quick branch (very much WIP) that optimizes simultaneous requests. Any reads with close-to-consecutive segments, i.e. where the offset of the next request is close to the offset + length of the previous one, are bundled into a single read.
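The bundling idea can be sketched roughly as follows (names like `coalescedRead` and `maxGap` are hypothetical, not taken from the branch): sort the pending `(offset, length)` requests, merge any whose start falls within `maxGap` bytes of the previous request's end, issue one underlying read per merged range, and slice each caller's answer back out of the shared buffer.

```javascript
// Merge near-consecutive read requests into larger underlying reads.
// read(offset, length) is assumed to return a Promise<Buffer>.
// requests is an array of { offset, length }; the result preserves order.
async function coalescedRead(read, requests, maxGap = 4096) {
  const sorted = requests
    .map((r, i) => ({ offset: r.offset, length: r.length, i }))
    .sort((a, b) => a.offset - b.offset);

  // Group requests whose gap to the previous group's end is <= maxGap.
  const groups = [];
  for (const req of sorted) {
    const last = groups[groups.length - 1];
    if (last && req.offset - (last.offset + last.length) <= maxGap) {
      last.length = Math.max(last.length, req.offset + req.length - last.offset);
      last.members.push(req);
    } else {
      groups.push({ offset: req.offset, length: req.length, members: [req] });
    }
  }

  // One underlying read per group; resolve each member from its slice.
  const out = new Array(requests.length);
  await Promise.all(groups.map(async (g) => {
    const buf = await read(g.offset, g.length);
    for (const m of g.members) {
      const start = m.offset - g.offset;
      out[m.i] = buf.slice(start, start + m.length);
    }
  }));
  return out;
}
```

The trade-off is the one raised above: over a network, fewer and larger reads win, while over a local disk the extra bytes fetched in the gaps may or may not pay off, which is exactly what the benchmark suite should measure.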

Read both header and footer concurrently, but make header error the first one to throw (if there are errors)
jeffbski-rga pushed a commit to jeffbski/parquetjs that referenced this pull request Mar 2, 2020
Update readme with correct package name