Conversation

@ZJONSSON
Contributor

Offers a significant speed improvement when the reader has slow I/O (e.g. reading over the network instead of from disk).

@asmuth
Contributor

asmuth commented Feb 11, 2018

LGTM. If I/O reduction over the network is a concern, we could also optionally disable reading the header entirely as it is just a sanity check and not required to understand the file.
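To make the suggestion concrete, here is a minimal sketch of an optional header sanity check. The function and option names (`checkHeader`, `skipHeaderCheck`, `read`) are illustrative, not the parquetjs API; the only hard fact used is that a Parquet file begins with the 4-byte magic `PAR1`, so skipping the check saves one small read over slow I/O.

```javascript
// Parquet files start (and end) with the 4-byte magic string "PAR1".
const PARQUET_MAGIC = 'PAR1';

// read(offset, length) is assumed to return a Promise<Buffer>.
// Passing { skipHeaderCheck: true } makes the sanity check a no-op.
async function checkHeader(read, opts = {}) {
  if (opts.skipHeaderCheck) return;
  const buf = await read(0, PARQUET_MAGIC.length);
  if (buf.toString('utf8') !== PARQUET_MAGIC) {
    throw new Error('invalid parquet file: bad header magic');
  }
}
```

Since the header is only a sanity check, a reader that trusts its input could skip it and rely on the footer alone to understand the file.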

@asmuth
Contributor

asmuth commented Feb 11, 2018

However, we might also want to benchmark this on files backed by a spinning disk, and/or give the user the option to disable parallel/out-of-order reading. I'm not sure off the top of my head whether our writer does this, but other writers might write out the column chunks in order (in the data file) so that readers can benefit from read-ahead optimization.

@kessler
Contributor

kessler commented Feb 11, 2018

@ZJONSSON,
@asmuth and I were discussing this and we want to add two new classes, one for parallel reading and the other for sequential reading. @asmuth has serious concerns regarding disk performance.

We also need a benchmark test suite to make sure we are actually improving things, and in which scenarios.

We're gonna spend some time tomorrow morning doing this.

@ZJONSSON
Contributor Author

ZJONSSON commented Feb 12, 2018

I agree that concurrency should not be unbounded. However, I think there are better ways to control it than hard-coding sequential execution for tasks that could run in parallel. One way to limit concurrent reads would be to wrap the get method in a simple queue where the maximum concurrency is defined in options (with a default value).
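A minimal sketch of that queue idea (the names `limitConcurrency`, `rawGet`, and `maxConcurrentReads` are illustrative, not parquetjs API): wrap the reader's get in a function that caps the number of reads in flight and queues the rest.

```javascript
// Wrap an async function so that at most `maxConcurrent` calls run at once;
// extra calls wait in a FIFO queue until a slot frees up.
function limitConcurrency(fn, maxConcurrent = 8) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { args, resolve, reject } = waiting.shift();
    fn(...args)
      .then(resolve, reject)
      .finally(() => { active--; next(); });
  };
  return (...args) => new Promise((resolve, reject) => {
    waiting.push({ args, resolve, reject });
    next();
  });
}
```

Usage would look something like `const get = limitConcurrency(rawGet, opts.maxConcurrentReads || 8);`, keeping the parallel code path while giving the user a knob (and a sane default) for how hard the reader hits the disk or network.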

Additionally, the number of actual requests could be reduced by inspecting any simultaneous requests (in the get wrapper) and checking whether any of the requested buffers overlap (or are sufficiently close). In that case a single underlying get command is executed, and the individual get promises are resolved from the corresponding parts of the incoming buffer. Ideally we would stream the buffer and resolve the parts as we receive them, instead of waiting for the whole buffer to load.

@ZJONSSON
Contributor Author

On the second point, here is a quick branch (very much WIP) that optimizes simultaneous requests. Any reads with close-to-consecutive segments, i.e. where the offset of the next request is close to the offset + length of the previous one, are bundled into a single read.
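The bundling idea can be sketched roughly as follows (names like `coalescedRead` and `maxGap` are hypothetical, not taken from the branch): sort the pending `(offset, length)` requests, merge any whose start falls within `maxGap` bytes of the previous request's end, issue one underlying read per merged range, and slice each caller's answer back out of the shared buffer.

```javascript
// Merge near-consecutive read requests into larger underlying reads.
// read(offset, length) is assumed to return a Promise<Buffer>.
// requests is an array of { offset, length }; the result preserves order.
async function coalescedRead(read, requests, maxGap = 4096) {
  const sorted = requests
    .map((r, i) => ({ offset: r.offset, length: r.length, i }))
    .sort((a, b) => a.offset - b.offset);

  // Group requests whose gap to the previous group's end is <= maxGap.
  const groups = [];
  for (const req of sorted) {
    const last = groups[groups.length - 1];
    if (last && req.offset - (last.offset + last.length) <= maxGap) {
      last.length = Math.max(last.length, req.offset + req.length - last.offset);
      last.members.push(req);
    } else {
      groups.push({ offset: req.offset, length: req.length, members: [req] });
    }
  }

  // One underlying read per group; resolve each member from its slice.
  const out = new Array(requests.length);
  await Promise.all(groups.map(async (g) => {
    const buf = await read(g.offset, g.length);
    for (const m of g.members) {
      const start = m.offset - g.offset;
      out[m.i] = buf.slice(start, start + m.length);
    }
  }));
  return out;
}
```

The trade-off is the one raised above: over a network, fewer and larger reads win, while over a local disk the extra bytes fetched in the gaps may or may not pay off, which is exactly what the benchmark suite should measure.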

Read both header and footer concurrently, but make header error the first one to throw (if there are errors)
jeffbski-rga pushed a commit to jeffbski/parquetjs that referenced this pull request Mar 2, 2020
Update readme with correct package name