Skip to content

Commit addf74d

Browse files
Jefffreyalamb
andauthored
Improve RunArray documentation (#9019)
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> N/A # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Whilst reviewing apache/datafusion#18981 I found it very confusing trying to follow along the logic for accounting for slicing of RunArrays. Decided to see if I can improve the documentation around it, especially to make it clear slicing acts logically not physically. # What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Improve docstrings for RunArrays and RunEndBuffers. Also add some tests for kernel operations on sliced RunArrays which seemed to be missing. # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Doc changes only. Added some (expected) failing tests to showcase #9018 too. # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> Doc changes only. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
1 parent b318293 commit addf74d

File tree

7 files changed

+285
-96
lines changed

7 files changed

+285
-96
lines changed

arrow-array/src/array/run_array.rs

Lines changed: 48 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -30,16 +30,15 @@ use crate::{
3030
types::{Int16Type, Int32Type, Int64Type, RunEndIndexType},
3131
};
3232

33-
/// An array of [run-end encoded values](https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout)
33+
/// An array of [run-end encoded values].
3434
///
35-
/// This encoding is variation on [run-length encoding (RLE)](https://en.wikipedia.org/wiki/Run-length_encoding)
36-
/// and is good for representing data containing same values repeated consecutively.
37-
///
38-
/// [`RunArray`] contains `run_ends` array and `values` array of same length.
39-
/// The `run_ends` array stores the indexes at which the run ends. The `values` array
40-
/// stores the value of each run. Below example illustrates how a logical array is represented in
41-
/// [`RunArray`]
35+
/// This encoding is variation on [run-length encoding (RLE)] and is good for representing
36+
/// data containing the same values repeated consecutively.
4237
///
38+
/// A [`RunArray`] consists of a `run_ends` buffer and a `values` array of equivalent
39+
/// lengths. The `run_ends` buffer stores the indexes at which the run ends. The
40+
/// `values` array stores the corresponding value of each run. The below example
41+
/// illustrates how a logical array is represented by a [`RunArray`]:
4342
///
4443
/// ```text
4544
/// ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐
@@ -60,6 +59,9 @@ use crate::{
6059
/// Logical array
6160
/// Contents
6261
/// ```
62+
///
63+
/// [run-end encoded values]: https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout
64+
/// [run-length encoding (RLE)]: https://en.wikipedia.org/wiki/Run-length_encoding
6365
pub struct RunArray<R: RunEndIndexType> {
6466
data_type: DataType,
6567
run_ends: RunEndBuffer<R::Native>,
@@ -77,8 +79,8 @@ impl<R: RunEndIndexType> Clone for RunArray<R> {
7779
}
7880

7981
impl<R: RunEndIndexType> RunArray<R> {
80-
/// Calculates the logical length of the array encoded
81-
/// by the given run_ends array.
82+
/// Calculates the logical length of the array encoded by treating the `run_ends`
83+
/// array as if it were a [`RunEndBuffer`].
8284
pub fn logical_len(run_ends: &PrimitiveArray<R>) -> usize {
8385
let len = run_ends.len();
8486
if len == 0 {
@@ -87,9 +89,13 @@ impl<R: RunEndIndexType> RunArray<R> {
8789
run_ends.value(len - 1).as_usize()
8890
}
8991

90-
/// Attempts to create RunArray using given run_ends (index where a run ends)
91-
/// and the values (value of the run). Returns an error if the given data is not compatible
92-
/// with RunEndEncoded specification.
92+
/// Attempts to create a [`RunArray`] using the given `run_ends` and `values`.
93+
///
94+
/// # Errors
95+
///
96+
/// - If `run_ends` and `values` have different lengths
97+
/// - If `run_ends` has any null values
98+
/// - If `run_ends` doesn't consist of strictly increasing positive integers
9399
pub fn try_new(run_ends: &PrimitiveArray<R>, values: &dyn Array) -> Result<Self, ArrowError> {
94100
let run_ends_type = run_ends.data_type().clone();
95101
let values_type = values.data_type().clone();
@@ -117,25 +123,29 @@ impl<R: RunEndIndexType> RunArray<R> {
117123
Ok(array_data.into())
118124
}
119125

120-
/// Returns a reference to [`RunEndBuffer`]
126+
/// Returns a reference to the [`RunEndBuffer`].
121127
pub fn run_ends(&self) -> &RunEndBuffer<R::Native> {
122128
&self.run_ends
123129
}
124130

125-
/// Returns a reference to values array
131+
/// Returns a reference to the values array.
126132
///
127-
/// Note: any slicing of this [`RunArray`] array is not applied to the returned array
128-
/// and must be handled separately
133+
/// Any slicing of this [`RunArray`] array is **not** applied to the returned
134+
/// values here and must be handled separately.
129135
pub fn values(&self) -> &ArrayRef {
130136
&self.values
131137
}
132138

133139
/// Returns the physical index at which the array slice starts.
140+
///
141+
/// See [`RunEndBuffer::get_start_physical_index`].
134142
pub fn get_start_physical_index(&self) -> usize {
135143
self.run_ends.get_start_physical_index()
136144
}
137145

138146
/// Returns the physical index at which the array slice ends.
147+
///
148+
/// See [`RunEndBuffer::get_end_physical_index`].
139149
pub fn get_end_physical_index(&self) -> usize {
140150
self.run_ends.get_end_physical_index()
141151
}
@@ -152,7 +162,6 @@ impl<R: RunEndIndexType> RunArray<R> {
152162
/// assert_eq!(typed.value(1), "b");
153163
/// assert!(typed.values().is_null(2));
154164
/// ```
155-
///
156165
pub fn downcast<V: 'static>(&self) -> Option<TypedRunArray<'_, R, V>> {
157166
let values = self.values.as_any().downcast_ref()?;
158167
Some(TypedRunArray {
@@ -161,22 +170,31 @@ impl<R: RunEndIndexType> RunArray<R> {
161170
})
162171
}
163172

164-
/// Returns index to the physical array for the given index to the logical array.
165-
/// This function adjusts the input logical index based on `ArrayData::offset`
166-
/// Performs a binary search on the run_ends array for the input index.
173+
/// Calls [`RunEndBuffer::get_physical_index`].
167174
///
168175
/// The result is arbitrary if `logical_index >= self.len()`
169176
pub fn get_physical_index(&self, logical_index: usize) -> usize {
170177
self.run_ends.get_physical_index(logical_index)
171178
}
172179

173-
/// Returns the physical indices of the input logical indices. Returns error if any of the logical
174-
/// index cannot be converted to physical index. The logical indices are sorted and iterated along
175-
/// with run_ends array to find matching physical index. The approach used here was chosen over
176-
/// finding physical index for each logical index using binary search using the function
177-
/// `get_physical_index`. Running benchmarks on both approaches showed that the approach used here
180+
/// Given the input `logical_indices`, return the corresponding physical index
181+
/// for each, according to the underlying [`RunEndBuffer`], taking into account
182+
/// any slicing that has occurred.
183+
///
184+
/// Returns an error if any of the provided logical indices is out of range.
185+
///
186+
/// # Implementation
187+
///
188+
/// The logical indices are sorted and iterated along with the `run_ends` buffer
189+
/// to find the matching physical index. The approach used here was chosen over
190+
/// finding the physical index for each logical index using binary search via
191+
/// the function [`RunEndBuffer::get_physical_index`].
192+
///
193+
/// Running benchmarks on both approaches showed that the approach used here
178194
/// scaled well for larger inputs.
195+
///
179196
/// See <https://github.com/apache/arrow-rs/pull/3622#issuecomment-1407753727> for more details.
197+
// TODO: this technically should be a method on RunEndBuffer
180198
#[inline]
181199
pub fn get_physical_indices<I>(&self, logical_indices: &[I]) -> Result<Vec<usize>, ArrowError>
182200
where
@@ -244,6 +262,10 @@ impl<R: RunEndIndexType> RunArray<R> {
244262
}
245263

246264
/// Returns a zero-copy slice of this array with the indicated offset and length.
265+
///
266+
/// # Panics
267+
///
268+
/// - Specified slice (`offset` + `length`) exceeds existing length
247269
pub fn slice(&self, offset: usize, length: usize) -> Self {
248270
Self {
249271
data_type: self.data_type.clone(),

0 commit comments

Comments
 (0)