
Multi-threaded scan #47

@freddie-freeloader

Description

First findings

  • It seems that DuckDB might be able to read multiple Parquet files concurrently -- but not a single file concurrently

Thoughts

  • In theory, we could do this by running the COPY FROM with exactly the same number of threads and letting each thread use the location info of the corresponding SheetReader thread.
  • Would it be possible to partition the Excel sheet into chunks of 2048 / (number of threads) rows and make the buffers that size? Probably tricky, because we would have to know the number of columns beforehand (buffer size / number of columns is the number of rows that fit into one buffer); see the sketch after this list.
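
To make the buffer-sizing concern concrete, here is a small sketch of the arithmetic, assuming DuckDB's STANDARD_VECTOR_SIZE of 2048 rows per vector; the `PlanPartitions` helper and its parameters are hypothetical and not part of sheetreader-core or the extension.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper illustrating the chicken-and-egg problem above:
// a per-thread buffer can only be sized in cells once the column count
// of the sheet is known, because rows per buffer = buffer size / columns.
struct PartitionPlan {
	std::size_t rows_per_thread;  // rows each thread should produce per vector
	std::size_t cells_per_buffer; // required buffer capacity in cells
};

PartitionPlan PlanPartitions(std::size_t num_threads, std::size_t num_columns) {
	constexpr std::size_t kVectorSize = 2048; // DuckDB's STANDARD_VECTOR_SIZE
	PartitionPlan plan;
	plan.rows_per_thread = kVectorSize / num_threads;
	plan.cells_per_buffer = plan.rows_per_thread * num_columns;
	return plan;
}

int main() {
	// With 4 threads and 10 columns: 512 rows per thread, 5120 cells per buffer.
	PartitionPlan plan = PlanPartitions(4, 10);
	std::printf("rows per thread: %zu, cells per buffer: %zu\n",
	            plan.rows_per_thread, plan.cells_per_buffer);
	return 0;
}
```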

TODO

A multi-threaded scan would be interesting, since our copy/scan function takes some time.

Have a look at:

https://github.com/duckdb/duckdb_delta/blob/main/src/functions/delta_scan.cpp

According to the README, it supports a multi-threaded scan. I suspect that this doesn't need any new implementation on their side, since they are reading Parquet files.

  • Find out whether this is due to the parquet files
  • Find out whether DuckDB also supports a multi-threaded scan of the Apache Arrow format
  • Have a look at how the multi-threaded scan is implemented (see the sketch after this list)
  • Find out whether we could copy concurrently -- this might not be possible, because sheetreader-core stores the data in a special way (per thread, and some rows are split across multiple threads -- with only an implicit ordering)
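
For the third item, the relevant mechanism is DuckDB's table-function API: a scan becomes parallel when its global state reports MaxThreads() > 1 and worker threads pull work units from it (the Parquet reader hands out row groups this way). Below is a minimal, hypothetical sketch of that shape for a SheetReader scan; the names (SheetScanGlobalState, sheetreader_scan_mt, the fixed block count of 8, the single VARCHAR column) are invented for illustration, and exact DuckDB signatures may differ between versions.

```cpp
#include "duckdb.hpp"
#include <atomic>

using namespace duckdb;

// Each work unit is a contiguous block of sheet rows. Whether sheetreader-core
// can hand out such per-thread blocks at all is exactly the open question above.
struct SheetScanGlobalState : public GlobalTableFunctionState {
	explicit SheetScanGlobalState(idx_t total_blocks_p) : total_blocks(total_blocks_p) {
	}
	// DuckDB asks the global state how many threads may participate in this scan.
	idx_t MaxThreads() const override {
		return total_blocks;
	}

	idx_t total_blocks;
	std::atomic<idx_t> next_block {0};
};

struct SheetScanLocalState : public LocalTableFunctionState {
	idx_t current_block = 0;
};

static unique_ptr<FunctionData> SheetScanBind(ClientContext &context, TableFunctionBindInput &input,
                                              vector<LogicalType> &return_types, vector<string> &names) {
	// Single VARCHAR column for the sketch; the real extension derives the schema from the sheet.
	return_types.push_back(LogicalType::VARCHAR);
	names.push_back("value");
	return make_uniq<TableFunctionData>();
}

static unique_ptr<GlobalTableFunctionState> SheetScanInitGlobal(ClientContext &context,
                                                                TableFunctionInitInput &input) {
	// Assumed: the bind phase discovered that the sheet splits into 8 row blocks.
	return make_uniq<SheetScanGlobalState>(8);
}

static unique_ptr<LocalTableFunctionState> SheetScanInitLocal(ExecutionContext &context, TableFunctionInitInput &input,
                                                              GlobalTableFunctionState *global_state) {
	return make_uniq<SheetScanLocalState>();
}

static void SheetScanFunction(ClientContext &context, TableFunctionInput &data, DataChunk &output) {
	auto &gstate = data.global_state->Cast<SheetScanGlobalState>();
	auto &lstate = data.local_state->Cast<SheetScanLocalState>();

	// Every worker thread atomically claims the next unprocessed block -- the same
	// pattern the Parquet reader uses with row groups. A real implementation would
	// keep emitting vectors from a claimed block across calls before taking the next one.
	idx_t block = gstate.next_block.fetch_add(1);
	if (block >= gstate.total_blocks) {
		output.SetCardinality(0); // no work left: this thread is done
		return;
	}
	lstate.current_block = block;
	// The rows of `block` would be written into `output` here; left empty in this sketch.
	output.SetCardinality(0);
}

// Registering init_global/init_local alongside the scan function is what makes
// the table function eligible for DuckDB's parallel execution.
static TableFunction GetSheetScanFunction() {
	return TableFunction("sheetreader_scan_mt", {LogicalType::VARCHAR}, SheetScanFunction, SheetScanBind,
	                     SheetScanInitGlobal, SheetScanInitLocal);
}
```

The sticking point from the last item remains: if sheetreader-core stores rows per thread with some rows split across threads and only an implicit ordering, there is no clean, independent work unit to hand out from the global state.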
