Execute all readColumnChunk concurrently for a given RowGroup #33
LGTM. If I/O reduction over the network is a concern, we could also optionally disable reading the header entirely, as it is just a sanity check and not required to understand the file.
However, we might also want to benchmark this on files backed by a spinning disk and/or give the user the option to disable parallel/out-of-order reading. I'm not sure off the top of my head whether our writer does it, but other writers might write out the column chunks in order (in the data file) so that readers can benefit from read-ahead optimization.
@ZJONSSON, we also need a benchmark test suite to make sure we are indeed improving things, and to see in which scenarios. We're going to spend some time tomorrow morning on this.
I agree that concurrency should not be infinite. However, I think there are better ways to control it than hard-coding sequential execution for tasks that could run in parallel. One way to create controls around maximum concurrent reads would be to wrap the get method in a simple queue where maximum concurrency is defined in options (with a default value). Additionally, the number of actual requests could be optimized by inspecting any simultaneous requests (in the
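To make the queue idea concrete, here is a minimal sketch of wrapping an async read function so that only a bounded number of calls are in flight at once. The helper name `limitConcurrency` and the `maxConcurrent` parameter are assumptions for illustration and not part of the library's API.

```javascript
// Hypothetical helper (not part of the library): wraps an async function so
// that at most `maxConcurrent` calls run at the same time. Extra calls wait
// in a FIFO queue until a slot frees up.
function limitConcurrency(fn, maxConcurrent = 4) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { args, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(() => fn(...args)) // also catches synchronous throws from fn
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // start the next queued call, if any
      });
  };

  // The wrapper has the same call shape as fn and always returns a promise.
  return (...args) =>
    new Promise((resolve, reject) => {
      waiting.push({ args, resolve, reject });
      next();
    });
}
```

A reader could then wrap its get method once, for example `limitConcurrency((offset, length) => source.get(offset, length), 8)`, where `source.get` stands in for whatever read function the reader is configured with. Setting the limit to 1 gives strictly sequential reads, which would cover the spinning-disk case without a separate code path.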
On the second point, here is a quick branch (very much WIP) on the optimization of simultaneous requests. Any reads with close-to-consecutive segments, i.e. the
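The branch itself is not shown here, so the following is only a sketch of one way nearby reads could be merged, not the branch's actual logic. It assumes each pending read is described as `{ offset, length }`; the names `coalesceRanges`, `splitBuffer`, and `maxGap` are hypothetical.

```javascript
// Hypothetical helper: merges byte ranges that overlap or whose gap is at
// most `maxGap` bytes. Each merged range remembers the original requests it
// covers in `parts`, so the result can be split back up after one read.
function coalesceRanges(requests, maxGap = 0) {
  const sorted = [...requests].sort((a, b) => a.offset - b.offset);
  const merged = [];
  for (const req of sorted) {
    const last = merged[merged.length - 1];
    if (last && req.offset - (last.offset + last.length) <= maxGap) {
      // Close enough to the previous range: extend it to cover this request.
      const end = Math.max(last.offset + last.length, req.offset + req.length);
      last.length = end - last.offset;
      last.parts.push(req);
    } else {
      merged.push({ offset: req.offset, length: req.length, parts: [req] });
    }
  }
  return merged;
}

// Hypothetical helper: given the bytes read for one merged range, returns
// one view per original request, in sorted-offset order.
function splitBuffer(mergedRange, buffer) {
  return mergedRange.parts.map((p) => {
    const start = p.offset - mergedRange.offset;
    return buffer.subarray(start, start + p.length);
  });
}
```

The trade-off is that a non-zero `maxGap` reads some bytes nobody asked for in exchange for fewer round trips, which tends to pay off when per-request latency dominates, as with network-backed files.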
Offers significant speed improvement when the reader has slow i/o (over network instead of from disk)
Read both header and footer concurrently, but make header error the first one to throw (if there are errors)
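As a rough illustration of that ordering, the sketch below starts both reads at once and only then decides which error to surface. It assumes `readHeader` and `readFooter` are caller-supplied async functions; these names and the wrapper `readHeaderAndFooter` are hypothetical and may not match the actual commit.

```javascript
// Hypothetical sketch: run both reads concurrently, wait for both to settle,
// and report a header failure in preference to a footer failure.
async function readHeaderAndFooter(readHeader, readFooter) {
  const [header, footer] = await Promise.allSettled([
    readHeader(),
    readFooter(),
  ]);
  // Check the header first so its error wins when both reads fail.
  if (header.status === 'rejected') throw header.reason;
  if (footer.status === 'rejected') throw footer.reason;
  return { header: header.value, footer: footer.value };
}
```

Using `Promise.allSettled` rather than `Promise.all` matters here: `Promise.all` rejects with whichever read fails first in time, so a fast footer failure could mask the header error.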
Update readme with correct package name