docs(parquet): add example for preserving dictionary encoding #9116
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -508,6 +508,59 @@ impl ArrowReaderOptions { | |
| /// let mut reader = builder.build().unwrap(); | ||
| /// let _batch = reader.next().unwrap().unwrap(); | ||
| /// ``` | ||
| /// | ||
| /// # Example: Preserving Dictionary Encoding | ||
| /// | ||
| /// By default, Parquet string columns are read as `StringArray` (or `LargeStringArray`), | ||
| /// even if the underlying Parquet data uses dictionary encoding. You can preserve | ||
| /// the dictionary encoding by specifying a `Dictionary` type in the schema hint: | ||
| /// | ||
| /// ``` | ||
| /// use std::sync::Arc; | ||
| /// use tempfile::tempfile; | ||
| /// use arrow_array::{ArrayRef, RecordBatch, StringArray}; | ||
| /// use arrow_schema::{DataType, Field, Schema}; | ||
| /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder}; | ||
| /// use parquet::arrow::ArrowWriter; | ||
| /// | ||
| /// // Write a Parquet file with string data | ||
| /// let file = tempfile().unwrap(); | ||
|
Contributor
I also see that this follows the example above. I think we could make these examples smaller (and thus easier to follow) if we wrote into an in-memory buffer:
    let mut file = Vec::new();
    ...
    let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    ...
    // now read from the "file"
Since what you have here follows the same pattern as the other examples, I think it is good. Maybe we can improve the examples in a follow-on PR. |
||
| /// let schema = Arc::new(Schema::new(vec![ | ||
| /// Field::new("city", DataType::Utf8, false) | ||
| /// ])); | ||
| /// let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]); | ||
| /// let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap(); | ||
| /// | ||
| /// let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap(); | ||
| /// writer.write(&batch).unwrap(); | ||
| /// writer.close().unwrap(); | ||
| /// | ||
| /// // Read the file back, requesting dictionary encoding preservation | ||
| /// let dict_schema = Arc::new(Schema::new(vec![ | ||
| /// Field::new("city", DataType::Dictionary( | ||
| /// Box::new(DataType::Int32), | ||
| /// Box::new(DataType::Utf8) | ||
| /// ), false) | ||
| /// ])); | ||
| /// let options = ArrowReaderOptions::new().with_schema(dict_schema); | ||
| /// let builder = ParquetRecordBatchReaderBuilder::try_new_with_options( | ||
| /// file.try_clone().unwrap(), | ||
| /// options | ||
| /// ).unwrap(); | ||
| /// | ||
| /// let mut reader = builder.build().unwrap(); | ||
| /// let batch = reader.next().unwrap().unwrap(); | ||
| /// | ||
| /// // The column is now a DictionaryArray | ||
| /// assert!(matches!( | ||
| /// batch.column(0).data_type(), | ||
| /// DataType::Dictionary(_, _) | ||
| /// )); | ||
| /// ``` | ||
| /// | ||
| /// **Note**: Dictionary encoding preservation works best when: | ||
| /// 1. The original column was dictionary encoded (the default for string columns) | ||
| /// 2. There are a small number of distinct values | ||
| pub fn with_schema(self, schema: SchemaRef) -> Self { | ||
| Self { | ||
| supplied_schema: Some(schema), | ||
|
|
||
It would be nice to prefix the setup lines of the example with `#` so they aren't rendered in the docs. So instead of
    /// use tempfile::tempfile;
do
    /// # use tempfile::tempfile;
That still runs, but will not be shown.
However, given that this example follows the others in this file, I think it is fine to keep it this way, and we can improve all the examples in a follow-on PR.
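
For readers unfamiliar with the term, the note in the doc comment about "a small number of distinct values" follows from how dictionary encoding represents a column: each distinct string is stored once, and every row holds only an integer key into that list. The following is a minimal standalone sketch of that idea in plain Rust. It is an illustration only: `dictionary_encode` and `dictionary_decode` are hypothetical helpers, not part of the `arrow` or `parquet` crates, and the real `DictionaryArray` layout has more to it (validity bitmaps, typed key arrays).

```rust
use std::collections::HashMap;

/// Hypothetical helper (not an arrow/parquet API): dictionary-encode rows
/// into (keys, values). Each distinct string appears once in `values`, and
/// `keys[i]` is the index into `values` for row `i`.
fn dictionary_encode(rows: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut index_of: HashMap<&str, i32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<i32> = Vec::with_capacity(rows.len());
    for &row in rows {
        let key = match index_of.get(row).copied() {
            // Seen before: reuse the existing key.
            Some(k) => k,
            // First occurrence: append to the dictionary and assign a new key.
            None => {
                let k = values.len() as i32;
                index_of.insert(row, k);
                values.push(row.to_string());
                k
            }
        };
        keys.push(key);
    }
    (keys, values)
}

/// Hypothetical helper: reverse the encoding by looking each key up in `values`.
fn dictionary_decode(keys: &[i32], values: &[String]) -> Vec<String> {
    keys.iter().map(|&k| values[k as usize].clone()).collect()
}

fn main() {
    // Same data as the doc example in this PR.
    let cities = ["Berlin", "Berlin", "Paris", "Berlin", "Paris"];
    let (keys, values) = dictionary_encode(&cities);
    println!("keys = {:?}, values = {:?}", keys, values);

    // Decoding restores the original rows.
    let decoded = dictionary_decode(&keys, &values);
    assert_eq!(decoded, cities);
}
```

With the five-row `city` column from the example, the sketch stores two strings and five `i32` keys. When nearly every row is distinct, the dictionary is about as large as the column itself and the keys are pure overhead, which is one way to read the second point of the note.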