53 changes: 53 additions & 0 deletions parquet/src/arrow/arrow_reader/mod.rs
@@ -508,6 +508,59 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
///
/// # Example: Preserving Dictionary Encoding
///
/// By default, Parquet string columns are read as `StringArray` (or `LargeStringArray`),
/// even if the underlying Parquet data uses dictionary encoding. You can preserve
/// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
///
/// ```
/// use std::sync::Arc;
Contributor:
It would be nice to prefix the setup of the example with # so it isn't rendered in the docs

So instead of

    /// use tempfile::tempfile;

Do

    /// # use tempfile::tempfile;

That still runs, but will not be shown
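For reference, the hidden-line convention looks like this in a complete rustdoc example. This is a generic, hypothetical `double` function, not code from this PR; the `#`-prefixed lines are compiled and executed by `cargo test` as part of the doctest, but omitted from the rendered documentation:

```rust
/// Doubles a value.
///
/// ```
/// # // Lines prefixed with `#` still compile and run as part of the
/// # // doctest, but are hidden from the rendered docs.
/// # let input = 21;
/// assert_eq!(2 * input, 42);
/// ```
pub fn double(x: i32) -> i32 {
    2 * x
}

fn main() {
    // Mirror the doctest's assertion when run as a plain binary
    assert_eq!(double(21), 42);
}
```

This keeps the rendered example focused on the API being demonstrated while the full setup still runs under `cargo test`.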

Contributor:
However, given this example follows the others in this file, I think it is fine to keep it this way and we can improve all the examples as a follow-on PR.

/// use tempfile::tempfile;
/// use arrow_array::{ArrayRef, RecordBatch, StringArray};
/// use arrow_schema::{DataType, Field, Schema};
/// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
/// use parquet::arrow::ArrowWriter;
///
/// // Write a Parquet file with string data
/// let file = tempfile().unwrap();
Contributor:
I also see that this follows the example above -- I think we could make these examples smaller (and thus easier to follow) if we wrote into an in-memory `Vec` rather than a file:

    let mut file = Vec::new();
    ...
    let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    ...
    // now read from the "file"

Since what you have here follows the same pattern as the other examples, I think it is good. Maybe we can improve the examples as a follow-on PR.

/// let schema = Arc::new(Schema::new(vec![
///     Field::new("city", DataType::Utf8, false)
/// ]));
/// let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]);
/// let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();
///
/// let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
/// writer.write(&batch).unwrap();
/// writer.close().unwrap();
///
/// // Read the file back, requesting dictionary encoding preservation
/// let dict_schema = Arc::new(Schema::new(vec![
///     Field::new("city", DataType::Dictionary(
///         Box::new(DataType::Int32),
///         Box::new(DataType::Utf8)
///     ), false)
/// ]));
/// let options = ArrowReaderOptions::new().with_schema(dict_schema);
/// let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
///     file.try_clone().unwrap(),
///     options
/// ).unwrap();
///
/// let mut reader = builder.build().unwrap();
/// let batch = reader.next().unwrap().unwrap();
///
/// // The column is now a DictionaryArray
/// assert!(matches!(
///     batch.column(0).data_type(),
///     DataType::Dictionary(_, _)
/// ));
/// ```
///
/// **Note**: Dictionary encoding preservation works best when:
/// 1. The original column was dictionary encoded (the default for string columns)
/// 2. There are a small number of distinct values
pub fn with_schema(self, schema: SchemaRef) -> Self {
Self {
supplied_schema: Some(schema),