First findings
- It seems that DuckDB might be able to read multiple Parquet files concurrently -- but not a single file concurrently (see the sketch below for a quick way to check)
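
A quick way to check this, assuming the DuckDB C++ client API; the file paths and thread count are placeholders:

```cpp
#include "duckdb.hpp"

// Reproduction sketch: compare a scan over many Parquet files with a scan
// over a single file of the same total size. File paths are placeholders.
int main() {
	duckdb::DuckDB db(nullptr);
	duckdb::Connection con(db);

	con.Query("SET threads TO 8;");

	// Glob over multiple files -- DuckDB can distribute the files over threads.
	con.Query("SELECT count(*) FROM read_parquet('parts/*.parquet');")->Print();

	// Single file -- whether this parallelizes may depend on its row-group layout.
	con.Query("SELECT count(*) FROM read_parquet('single.parquet');")->Print();
	return 0;
}
```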
Thoughts
- In theory, we could do this with COPY FROM using exactly the same number of threads, where each thread uses the location info of its corresponding sheetreader thread.
- Would it be possible to partition the Excel sheet into chunks of 2048 / (number of threads) rows and make the buffers that size? Probably tricky, because we would have to know the number of columns up front (buffer size / number of columns gives the number of rows that fit into one buffer -- see the back-of-the-envelope sketch below).
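
A back-of-the-envelope version of that constraint (all constants here are illustrative assumptions, not values taken from sheetreader-core):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
	const size_t vector_size = 2048; // DuckDB's standard vector size
	const size_t num_threads = 4;
	const size_t num_columns = 16;   // would need to be known up front

	// Target: each thread fills one buffer with vector_size / num_threads rows.
	const size_t rows_per_buffer = vector_size / num_threads; // 512

	// A buffer holding N rows must hold N * num_columns cells, so the buffer
	// cannot be sized without knowing the column count first.
	const size_t cells_per_buffer = rows_per_buffer * num_columns; // 8192

	printf("rows/buffer: %zu, cells/buffer: %zu\n", rows_per_buffer, cells_per_buffer);
	return 0;
}
```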
TODO
A multi-threaded scan would be interesting, since our copy/scan function takes some time.
Have a look at:
https://github.com/duckdb/duckdb_delta/blob/main/src/functions/delta_scan.cpp
According to the README, it supports a multi-threaded scan. I suspect that this doesn't need any new implementation on their side, since they are just reading Parquet files.
- Find out whether this is due to the Parquet files
- Find out whether DuckDB also supports a multi-threaded scan of the Apache Arrow format
- Have a look at how the multi-threaded scan is implemented (see the sketch after this list)
- Find out whether we could copy concurrently -- this might not be possible, because sheetreader-core saves the data in a special way (per thread & some rows are split across multiple threads -- and there is only an implicit order)
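
For orientation, this is roughly the shape of DuckDB's parallel table-function API that delta_scan.cpp builds on: init_global creates a shared state whose MaxThreads() tells the scheduler how many threads the scan may use, init_local creates per-thread state, and each call to the scan function claims a unit of work from the global state. A minimal sketch under that assumption; all Sheet* names and the row-range partitioning are hypothetical and do not reflect how sheetreader-core actually stores its data:

```cpp
#include "duckdb.hpp"

using namespace duckdb;

// Hypothetical bind data; in reality total_rows would come from sheet metadata.
struct SheetBindData : public TableFunctionData {
	idx_t total_rows = 1000000;
	idx_t partition_size = 2048;
};

// Shared global state: hands out disjoint row partitions to the scan threads.
struct SheetGlobalState : public GlobalTableFunctionState {
	mutex lock;
	idx_t next_row = 0;
	idx_t total_rows = 0;
	idx_t partition_size = 2048;

	// DuckDB asks the global state how many threads it may schedule.
	idx_t MaxThreads() const override {
		return MaxValue<idx_t>(1, total_rows / partition_size);
	}
};

// Per-thread state: the row range this thread is currently emitting.
struct SheetLocalState : public LocalTableFunctionState {
	idx_t begin = 0;
	idx_t end = 0;
};

static unique_ptr<FunctionData> SheetBind(ClientContext &, TableFunctionBindInput &,
                                          vector<LogicalType> &return_types, vector<string> &names) {
	return_types.push_back(LogicalType::BIGINT);
	names.push_back("row_id");
	return make_uniq<SheetBindData>();
}

static unique_ptr<GlobalTableFunctionState> SheetInitGlobal(ClientContext &, TableFunctionInitInput &input) {
	auto &bind = input.bind_data->Cast<SheetBindData>();
	auto state = make_uniq<SheetGlobalState>();
	state->total_rows = bind.total_rows;
	state->partition_size = bind.partition_size;
	return std::move(state);
}

static unique_ptr<LocalTableFunctionState> SheetInitLocal(ExecutionContext &, TableFunctionInitInput &,
                                                          GlobalTableFunctionState *) {
	return make_uniq<SheetLocalState>();
}

// Each call claims the next partition under the lock, then emits it.
static void SheetScan(ClientContext &, TableFunctionInput &data, DataChunk &output) {
	auto &gstate = data.global_state->Cast<SheetGlobalState>();
	auto &lstate = data.local_state->Cast<SheetLocalState>();
	{
		lock_guard<mutex> guard(gstate.lock);
		lstate.begin = gstate.next_row;
		lstate.end = MinValue<idx_t>(gstate.next_row + gstate.partition_size, gstate.total_rows);
		gstate.next_row = lstate.end;
	}
	idx_t count = lstate.end - lstate.begin;
	if (count == 0) {
		output.SetCardinality(0); // no partitions left: this thread is done
		return;
	}
	// Dummy payload: emit the row ids of the claimed partition.
	auto ids = FlatVector::GetData<int64_t>(output.data[0]);
	for (idx_t i = 0; i < count; i++) {
		ids[i] = int64_t(lstate.begin + i);
	}
	output.SetCardinality(count);
}

// Registration inside an extension's Load() would look roughly like:
//   TableFunction fn("sheetreader_parallel", {}, SheetScan,
//                    SheetBind, SheetInitGlobal, SheetInitLocal);
//   ExtensionUtil::RegisterFunction(instance, fn);
```

The implicit-order problem noted in the last item would show up exactly here: the partitions handed out by the global state would have to map onto sheetreader-core's per-thread buffers, which don't expose an explicit row order.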