Skip to content

Conversation

@hypsakata
Copy link
Contributor

@hypsakata hypsakata commented Dec 31, 2025

Rationale for this change

When building an Arrow::Table from a Ruby Hash passed to Arrow::Table.new, nested Integer arrays are incorrectly inferred as list<uint8> or list<int8> regardless of the actual values contained. Nested integer arrays should be correctly inferred as the appropriate list type (e.g., list<int64>, list<uint64>) based on their values, similar to how flat arrays are handled, unless they contain values out of range for any integer type.

What changes are included in this PR?

This PR modifies the logic in detect_builder_info() to fix the inference issue. Specifically:

  • Persist sub_builder_info across sub-array elements: Previously, sub_builder_info was recreated for each sub-array element in the Array. The logic has been updated to accumulate and carry over the builder information across elements to ensure correct type inference for the entire list.
  • Refactor Integer builder logic: Following the pattern used for BigDecimal, the logic for determining the Integer builder has been moved to create_builder(). detect_builder_info() now calls this function.

Note: 



  • As a side effect of this refactoring, nested lists of BigDecimal (which were previously inferred as string) may now have their types inferred. However, comprehensive testing and verification for nested BigDecimal support will be addressed in a separate issue to keep this PR focused.
  • We stopped using IntArrayBuilder for inference logic to ensure correctness. This results in a performance overhead (array building is approximately 2x slower) as we can no longer rely on the specialized builder's detection.
                                           user     system      total        real
    array_builder int32 100000         0.085867   0.000194   0.086061 (  0.086369)
int_array_builder int32 100000         0.042163   0.001033   0.043196 (  0.043268)
    array_builder int64 100000         0.086799   0.000015   0.086814 (  0.086828)
int_array_builder int64 100000         0.044493   0.000973   0.045466 (  0.045469)
    array_builder uint32 100000        0.085748   0.000009   0.085757 (  0.085768)
int_array_builder uint32 100000        0.044463   0.001034   0.045497 (  0.045498)
    array_builder uint64 100000        0.084548   0.000987   0.085535 (  0.085537)
int_array_builder uint64 100000        0.044206   0.000017   0.044223 (  0.044225)

Are these changes tested?

Yes. ruby ruby/red-arrow/test/run-test.rb

Are there any user-facing changes?

Yes.

Github Issue: #48481

@hypsakata hypsakata requested a review from kou as a code owner December 31, 2025 06:59
@github-actions
Copy link

⚠️ GitHub issue #48481 has been automatically assigned in GitHub to PR creator.

@hypsakata
Copy link
Contributor Author

Benchmark code

require "benchmark"
require "arrow"

SIZES = [100_000].freeze

# Pre-generate arrays so we only measure builder cost.
DATASETS = SIZES.each_with_object({}) do |size, datasets|
  datasets[["int32", size]] = Array.new(size) {|i| (i % 2_000_000) - 1_000_000 }
  datasets[["int64", size]] = Array.new(size) {|i| (i % 2**40) - 2**39 }
  datasets[["uint32", size]] = Array.new(size) {|i| i % 4_000_000_000 }
  datasets[["uint64", size]] = Array.new(size) {|i| i % 2**40 }
end.freeze

Benchmark.bm(36) do |x|
  DATASETS.each do |(name, size), values|
    x.report("    array_builder #{name} #{size}") do
      Arrow::ArrayBuilder.build(values)
    end

    next unless name.start_with?("int", "uint")

    builder_class = name.start_with?("uint") ? Arrow::UIntArrayBuilder : Arrow::IntArrayBuilder
    x.report("int_array_builder #{name} #{size}") do
      builder = builder_class.new
      builder.build(values)
    end
  end
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant