
Commit 4c35657 ("fix prettier")
1 parent 6b19fc8

1 file changed: 37 additions & 38 deletions

docs/source/library-user-guide/custom-table-providers.md
to produce results:

1. **[TableProvider]** -- Describes the table (schema, capabilities) and
   produces an execution plan when queried. This is part of the **Logical Plan**.
2. **[ExecutionPlan]** -- Describes _how_ to compute the result: partitioning,
   ordering, and child plan relationships. This is part of the **Physical Plan**.
3. **[SendableRecordBatchStream]** -- The async stream that _actually does the
   work_, yielding `RecordBatch`es one at a time.

Think of these as a funnel: `TableProvider::scan()` is called once during
planning to create an `ExecutionPlan`, then `ExecutionPlan::execute()` is called
once per partition to create a stream, and those streams are where rows are
actually produced during execution.
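The funnel can be sketched as driver-side pseudocode. This is not DataFusion's actual planner code, just an illustration of the call order; `provider`, `session`, `projection`, `filters`, `limit`, `task_ctx`, and `process` are assumed to be in scope:

```rust,ignore
use std::sync::Arc;
use datafusion::physical_plan::ExecutionPlan;
use futures::StreamExt;

// Planning: `scan()` is called once and returns a *description* of the work.
let plan: Arc<dyn ExecutionPlan> = provider
    .scan(&session, projection.as_ref(), &filters, limit)
    .await?;

// Setup: `execute()` is called once per partition, each call returning a stream.
let partition_count = plan.properties().output_partitioning().partition_count();
let mut streams: Vec<_> = (0..partition_count)
    .map(|i| plan.execute(i, Arc::clone(&task_ctx)))
    .collect::<Result<_, _>>()?;

// Execution: rows are produced only as the streams are polled.
while let Some(batch) = streams[0].next().await {
    process(batch?);
}
```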

[tableprovider]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html
[executionplan]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
[sendablerecordbatchstream]: https://docs.rs/datafusion/latest/datafusion/execution/type.SendableRecordBatchStream.html
[memtable]: https://docs.rs/datafusion/latest/datafusion/datasource/memory/struct.MemTable.html
[streamtable]: https://docs.rs/datafusion/latest/datafusion/datasource/stream/struct.StreamTable.html
[listingtable]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
[viewtable]: https://docs.rs/datafusion/latest/datafusion/datasource/view/struct.ViewTable.html
[planproperties]: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.PlanProperties.html
[streamingtableexec]: https://docs.rs/datafusion/latest/datafusion/physical_plan/streaming/struct.StreamingTableExec.html
[datasourceexec]: https://docs.rs/datafusion/latest/datafusion/datasource/source/struct.DataSourceExec.html

## Background: Logical and Physical Planning

SQL / DataFrame API

### Logical Planning

A **logical plan** describes _what_ the query computes without specifying _how_.
It is a tree of relational operators -- `TableScan`, `Filter`, `Projection`,
`Aggregate`, `Join`, `Sort`, `Limit`, and so on. The logical optimizer rewrites
this tree to reduce work while preserving the query's meaning. Some logical
to scan" are made. The physical optimizer then refines this tree further with re

Your `TableProvider` sits at the boundary between logical and physical planning.
During logical optimization, DataFusion determines which filters and projections
_could_ be pushed down to the source. When `scan()` is called during physical
planning, those hints are passed to you. By implementing capabilities like
`supports_filters_pushdown`, you influence what the optimizer can do -- and the
metadata you declare in your `ExecutionPlan` (partitioning, ordering) directly
Not every custom data source requires implementing all three layers from
scratch. DataFusion provides building blocks that let you plug in at whatever
level makes sense:

| If your data is...                                 | Start with                                                                | You implement                  |
| -------------------------------------------------- | ------------------------------------------------------------------------- | ------------------------------ |
| Already in `RecordBatch`es in memory               | [MemTable]                                                                | Nothing -- just construct it   |
| An async stream of batches                         | [StreamTable]                                                             | A stream factory               |
| A logical transformation of other tables           | [ViewTable] wrapping a logical plan                                       | The logical plan               |
| A variant of an existing file format               | [ListingTable] with a custom [FileFormat] wrapping an existing one        | A thin `FileFormat` wrapper    |
| Files in a custom format on disk or object storage | [ListingTable] with a custom [FileFormat], [FileSource], and [FileOpener] | The format, source, and opener |
| A custom source needing full control               | `TableProvider` + `ExecutionPlan` + stream                                | All three layers               |

[fileformat]: https://docs.rs/datafusion/latest/datafusion/datasource/file_format/trait.FileFormat.html
[filesource]: https://docs.rs/datafusion-datasource/latest/datafusion_datasource/file/trait.FileSource.html
[fileopener]: https://docs.rs/datafusion-datasource/latest/datafusion_datasource/file_stream/trait.FileOpener.html
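For the first row of the table, constructing and querying a [MemTable] is all that is needed. A minimal sketch (the table name `t` and column `id` are illustrative):

```rust,ignore
use std::sync::Arc;
use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)?;

// One partition holding a single batch; nothing else to implement.
let table = MemTable::try_new(schema, vec![vec![batch]])?;

let ctx = SessionContext::new();
ctx.register_table("t", Arc::new(table))?;
ctx.sql("SELECT id FROM t WHERE id > 1").await?.show().await?;
```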

If your data is file-based, `ListingTable` handles file discovery, partition
column inference, and plan construction -- you only need to implement

or [ParquetSource] and [ParquetOpener] for a full custom implementation to
use as a reference.

[custom_file_format example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/custom_data_source/custom_file_format.rs
[parquetsource]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html
[parquetopener]: https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/opener.rs

The rest of this post focuses on the full `TableProvider` + `ExecutionPlan` +
stream path, which gives you complete control and applies to any data source.
variant that provides additional pushdown information for other advanced use cases.

This is a critical point: **`scan()` runs during planning, not execution.** It
should return quickly. Best practice is to avoid performing I/O, network
calls, or heavy computation here. The `scan` method's job is to _describe_ how
the data will be produced, not to produce it. All the real work belongs in the
stream (Layer 3).
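A sketch of a `scan` that follows this rule, assuming a hypothetical `MyTable` provider and `MyExec` plan node (only the `scan` method is shown; the other required trait methods are omitted):

```rust,ignore
use std::sync::Arc;
use async_trait::async_trait;
use datafusion::catalog::{Session, TableProvider};
use datafusion::error::Result;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::{project_schema, ExecutionPlan};

#[async_trait]
impl TableProvider for MyTable {
    // as_any, schema, table_type, etc. omitted

    async fn scan(
        &self,
        _state: &dyn Session,
        projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Cheap work only: compute the projected schema and return a plan
        // that *describes* the read. No I/O happens here.
        let schema = project_schema(&self.schema, projection)?;
        Ok(Arc::new(MyExec::new(schema, limit)))
    }
}
```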

Consider how your data source naturally divides its data:
you can split it into ranges.

**Advanced: aligning with `target_partitions`.** Once you have something
working, you can tune further. Having _too many_ partitions is not free: each
partition adds scheduling overhead, and downstream operators may need to
repartition the data anyway. The session configuration exposes a
**target partition count** that reflects how many partitions the optimizer
APIs, transforming data, etc.

### Using RecordBatchStreamAdapter

The easiest way to create a `SendableRecordBatchStream` is with
[RecordBatchStreamAdapter]. It bridges any `futures::Stream<Item = Result<RecordBatch>>` into the `SendableRecordBatchStream` type:

```rust,ignore
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
// ...
fn execute(
// ...
}
```

[recordbatchstreamadapter]: https://docs.rs/datafusion/latest/datafusion/physical_plan/stream/struct.RecordBatchStreamAdapter.html
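The diff above elides the body of the example, but a typical `execute` wraps a stream in the adapter. A minimal sketch, assuming a hypothetical `MyExec` that holds pre-computed batches per partition:

```rust,ignore
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::execution::{SendableRecordBatchStream, TaskContext};
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;

fn execute(
    &self,
    partition: usize,
    _context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
    let schema = self.schema.clone();
    let batches = self.partitions[partition].clone();

    // Any `futures::Stream<Item = Result<RecordBatch>>` can be adapted.
    let stream = futures::stream::iter(batches.into_iter().map(Ok));
    Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
}
```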

### Blocking Work: Use a Separate Thread Pool

in the DataFusion repository.

This table summarizes what belongs at each layer:

| Layer                         | Runs During                    | Should Do                              | Should NOT Do                         |
| ----------------------------- | ------------------------------ | -------------------------------------- | ------------------------------------- |
| `TableProvider::scan()`       | Planning                       | Build an `ExecutionPlan` with metadata | I/O, network calls, heavy computation |
| `ExecutionPlan::execute()`    | Execution (once per partition) | Construct a stream, set up channels    | Block on async work, read data        |
| `RecordBatchStream` (polling) | Execution                      | All I/O, computation, data production  | --                                    |

The guiding principle: **push work as late as possible.** Planning should be
fast so the optimizer can do its job. Execution setup should be fast so all
need, rather than reading everything and filtering it afterward.

When DataFusion plans a query with a `WHERE` clause, it passes the filter
predicates to your `scan()` method as the `filters` parameter. By default,
DataFusion assumes your provider cannot handle any filters and inserts a
`FilterExec` node above your scan to apply them. But if your source _can_
evaluate some predicates during scanning -- for example, by skipping files,
partitions, or row groups that cannot match -- you can eliminate a huge amount
of unnecessary I/O.
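To opt in, a provider overrides `supports_filters_pushdown` and classifies each predicate. A sketch, where the `references_partition_column` helper is hypothetical:

```rust,ignore
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

fn supports_filters_pushdown(
    &self,
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|expr| {
            if references_partition_column(expr) {
                // We can prune partitions early, but the filter must still
                // be re-checked row-by-row, so report it as Inexact.
                TableProviderFilterPushDown::Inexact
            } else {
                TableProviderFilterPushDown::Unsupported
            }
        })
        .collect())
}
```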
