Dictionary unification fails across multiple batches in large files #3

@bmschmidt

When reading from a CSV where dictionary types are inferred, separate batches seem to produce dictionaries that can't be unified if a later batch contains entries not present in the first batch (or something like that).

I thought this was addressed by tile.remap_all_dicts, but it is not.

Not yet reproduced, but the log trace is below. In this case it is fixable by increasing csv_batch_size to float("inf") or equivalent, though that won't be possible for larger-than-memory data.

DEBUG:quadtiler:Opening overflow on (1, 0, 0)
INFO:quadtiler:Done inserting block 4 of 7
INFO:quadtiler:15 partially filled tiles buffered in memory and 2 flushing overflow directly to disk.
INFO:quadtiler:Inserting block 5 of 7
Traceback (most recent call last):
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 264, in main
    tiler.insert(tab, remaining_tiles)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 608, in insert
    child_tile.insert(subset, tiles_allowed - tiles_allowed_overflow)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 612, in insert
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 408, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
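
To make the failure mode concrete, here is a minimal, library-independent sketch of what unifying dictionaries across batches involves: merge the per-batch dictionaries into one, then remap each batch's indices onto it. It is not quadfeather's or pyarrow's implementation; the `Batch` type, the `unify_dictionaries` helper, and the sample data are all hypothetical and exist only for illustration.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Hypothetical stand-in for a dictionary-encoded batch:
# (dictionary of distinct values, indices pointing into that dictionary).
Batch = Tuple[List[Hashable], List[int]]


def unify_dictionaries(
    batches: Sequence[Batch],
) -> Tuple[List[Hashable], List[List[int]]]:
    """Merge per-batch dictionaries into one and remap every batch's indices."""
    unified: List[Hashable] = []
    position: Dict[Hashable, int] = {}  # value -> index in the unified dictionary
    remapped: List[List[int]] = []
    for dictionary, indices in batches:
        # Translation table from this batch's local codes to unified codes.
        local_to_unified: List[int] = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            local_to_unified.append(position[value])
        remapped.append([local_to_unified[i] for i in indices])
    return unified, remapped


# Illustrative data: the second batch introduces "c", unseen in the first.
batch_1: Batch = (["a", "b"], [0, 1, 0])  # decodes to a, b, a
batch_2: Batch = (["b", "c"], [0, 1, 1])  # decodes to b, c, c

unified, remapped = unify_dictionaries([batch_1, batch_2])
print(unified)   # ['a', 'b', 'c']
print(remapped)  # [[0, 1, 0], [1, 2, 2]]
```

In this sketch the complete dictionary is only known after every batch has been scanned, which matches the concern above about larger-than-memory data. pyarrow exposes a comparable operation as `unify_dictionaries` on tables and chunked arrays; whether it fits into the tiler's write path is not established here.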
