to produce results:

1. **[TableProvider]** -- Describes the table (schema, capabilities) and
   produces an execution plan when queried. This is part of the **Logical Plan**.
2. **[ExecutionPlan]** -- Describes _how_ to compute the result: partitioning,
   ordering, and child plan relationships. This is part of the **Physical Plan**.
3. **[SendableRecordBatchStream]** -- The async stream that _actually does the
   work_, yielding `RecordBatch`es one at a time.

Think of these as a funnel: `TableProvider::scan()` is called once during
planning to create an `ExecutionPlan`, then `ExecutionPlan::execute()` is called
once per partition to create a stream, and those streams are where rows are
actually produced during execution.
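In code, the funnel looks roughly like this. The signatures are abridged from the DataFusion traits, and `MyStream` is a placeholder name:

```rust,ignore
// Layer 1 (planning time): called once per query on your TableProvider.
async fn scan(
    &self,
    state: &dyn Session,
    projection: Option<&Vec<usize>>,
    filters: &[Expr],
    limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>>;

// Layer 2 (execution setup): called once per partition on your ExecutionPlan.
fn execute(
    &self,
    partition: usize,
    context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream>;

// Layer 3 (execution): the stream does the actual work as it is polled.
impl Stream for MyStream {
    type Item = Result<RecordBatch>;
    // poll_next() yields RecordBatches one at a time
}
```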
5151
52- [ TableProvider ] : https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html
53- [ ExecutionPlan ] : https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
54- [ SendableRecordBatchStream ] : https://docs.rs/datafusion/latest/datafusion/execution/type.SendableRecordBatchStream.html
55- [ MemTable ] : https://docs.rs/datafusion/latest/datafusion/datasource/memory/struct.MemTable.html
56- [ StreamTable ] : https://docs.rs/datafusion/latest/datafusion/datasource/stream/struct.StreamTable.html
57- [ ListingTable ] : https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
58- [ ViewTable ] : https://docs.rs/datafusion/latest/datafusion/datasource/view/struct.ViewTable.html
59- [ PlanProperties ] : https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.PlanProperties.html
60- [ StreamingTableExec ] : https://docs.rs/datafusion/latest/datafusion/physical_plan/streaming/struct.StreamingTableExec.html
61- [ DataSourceExec ] : https://docs.rs/datafusion/latest/datafusion/datasource/source/struct.DataSourceExec.html
52+ [ tableprovider ] : https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html
53+ [ executionplan ] : https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
54+ [ sendablerecordbatchstream ] : https://docs.rs/datafusion/latest/datafusion/execution/type.SendableRecordBatchStream.html
55+ [ memtable ] : https://docs.rs/datafusion/latest/datafusion/datasource/memory/struct.MemTable.html
56+ [ streamtable ] : https://docs.rs/datafusion/latest/datafusion/datasource/stream/struct.StreamTable.html
57+ [ listingtable ] : https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
58+ [ viewtable ] : https://docs.rs/datafusion/latest/datafusion/datasource/view/struct.ViewTable.html
59+ [ planproperties ] : https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.PlanProperties.html
60+ [ streamingtableexec ] : https://docs.rs/datafusion/latest/datafusion/physical_plan/streaming/struct.StreamingTableExec.html
61+ [ datasourceexec ] : https://docs.rs/datafusion/latest/datafusion/datasource/source/struct.DataSourceExec.html
6262
6363## Background: Logical and Physical Planning
6464
### Logical Planning

A **logical plan** describes _what_ the query computes without specifying _how_.
It is a tree of relational operators -- `TableScan`, `Filter`, `Projection`,
`Aggregate`, `Join`, `Sort`, `Limit`, and so on. The logical optimizer rewrites
this tree to reduce work while preserving the query's meaning. Some logical

Your `TableProvider` sits at the boundary between logical and physical planning.
During logical optimization, DataFusion determines which filters and projections
_could_ be pushed down to the source. When `scan()` is called during physical
planning, those hints are passed to you. By implementing capabilities like
`supports_filters_pushdown`, you influence what the optimizer can do -- and the
metadata you declare in your `ExecutionPlan` (partitioning, ordering) directly

Not every custom data source requires implementing all three layers from
scratch. DataFusion provides building blocks that let you plug in at whatever
level makes sense:

| If your data is...                                | Start with                                                                | You implement                  |
| ------------------------------------------------- | ------------------------------------------------------------------------- | ------------------------------ |
| Already in `RecordBatch`es in memory              | [MemTable]                                                                | Nothing -- just construct it   |
| An async stream of batches                        | [StreamTable]                                                             | A stream factory               |
| A logical transformation of other tables          | [ViewTable] wrapping a logical plan                                       | The logical plan               |
| A variant of an existing file format              | [ListingTable] with a custom [FileFormat] wrapping an existing one        | A thin `FileFormat` wrapper    |
| Files in a custom format on disk or object storage | [ListingTable] with a custom [FileFormat], [FileSource], and [FileOpener] | The format, source, and opener |
| A custom source needing full control              | `TableProvider` + `ExecutionPlan` + stream                                | All three layers               |

[fileformat]: https://docs.rs/datafusion/latest/datafusion/datasource/file_format/trait.FileFormat.html
[filesource]: https://docs.rs/datafusion-datasource/latest/datafusion_datasource/file/trait.FileSource.html
[fileopener]: https://docs.rs/datafusion-datasource/latest/datafusion_datasource/file_stream/trait.FileOpener.html

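The first row of the table needs no custom code at all. As a sketch, registering in-memory batches with [MemTable] (the table name `t` and its single column are illustrative; this assumes an async context):

```rust,ignore
use std::sync::Arc;
use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::SessionContext;

let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)?;

// Outer Vec = partitions, inner Vec = batches within a partition.
let table = MemTable::try_new(schema, vec![vec![batch]])?;

let ctx = SessionContext::new();
ctx.register_table("t", Arc::new(table))?;
let df = ctx.sql("SELECT a FROM t WHERE a > 1").await?;
```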
If your data is file-based, `ListingTable` handles file discovery, partition
column inference, and plan construction -- you only need to implement
or [ParquetSource] and [ParquetOpener] for a full custom implementation to
use as a reference.

[custom_file_format example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/custom_data_source/custom_file_format.rs
[parquetsource]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html
[parquetopener]: https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/opener.rs

The rest of this post focuses on the full `TableProvider` + `ExecutionPlan` +
stream path, which gives you complete control and applies to any data source.
variant that provides additional pushdown information for other advanced use cases.

This is a critical point: **`scan()` runs during planning, not execution.** It
should return quickly. Best practice is to avoid performing I/O, network
calls, or heavy computation here. The `scan` method's job is to _describe_ how
the data will be produced, not to produce it. All the real work belongs in the
stream (Layer 3).

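Concretely, a well-behaved `scan()` does little more than this. A sketch, where `MyExec` and its constructor are placeholder names:

```rust,ignore
use datafusion::common::project_schema;

async fn scan(
    &self,
    _state: &dyn Session,
    projection: Option<&Vec<usize>>,
    filters: &[Expr],
    limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>> {
    // Cheap work only: compute the projected schema and record the
    // pushed-down hints for later. No file opens, no network calls.
    let schema = project_schema(&self.schema(), projection)?;
    Ok(Arc::new(MyExec::new(schema, filters.to_vec(), limit)))
}
```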
Consider how your data source naturally divides its data:

   you can split it into ranges.

**Advanced: aligning with `target_partitions`.** Once you have something
working, you can tune further. Having _too many_ partitions is not free: each
partition adds scheduling overhead, and downstream operators may need to
repartition the data anyway. The session configuration exposes a
**target partition count** that reflects how many partitions the optimizer
APIs, transforming data, etc.

### Using RecordBatchStreamAdapter

The easiest way to create a `SendableRecordBatchStream` is with
[RecordBatchStreamAdapter]. It bridges any `futures::Stream<Item = Result<RecordBatch>>` into the `SendableRecordBatchStream` type:

```rust,ignore
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;

// ... (rest of the example, including fn execute(), elided in this excerpt)
}
```

[recordbatchstreamadapter]: https://docs.rs/datafusion/latest/datafusion/physical_plan/stream/struct.RecordBatchStreamAdapter.html

### Blocking Work: Use a Separate Thread Pool

in the DataFusion repository.

This table summarizes what belongs at each layer:

| Layer                         | Runs During                    | Should Do                              | Should NOT Do                         |
| ----------------------------- | ------------------------------ | -------------------------------------- | ------------------------------------- |
| `TableProvider::scan()`       | Planning                       | Build an `ExecutionPlan` with metadata | I/O, network calls, heavy computation |
| `ExecutionPlan::execute()`    | Execution (once per partition) | Construct a stream, set up channels    | Block on async work, read data        |
| `RecordBatchStream` (polling) | Execution                      | All I/O, computation, data production  | --                                    |

The guiding principle: **push work as late as possible.** Planning should be
fast so the optimizer can do its job. Execution setup should be fast so all
need, rather than reading everything and filtering it afterward.

When DataFusion plans a query with a `WHERE` clause, it passes the filter
predicates to your `scan()` method as the `filters` parameter. By default,
DataFusion assumes your provider cannot handle any filters and inserts a
`FilterExec` node above your scan to apply them. But if your source _can_
evaluate some predicates during scanning -- for example, by skipping files,
partitions, or row groups that cannot match -- you can eliminate a huge amount
of unnecessary I/O.
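The handshake happens in `supports_filters_pushdown`, where you classify each predicate your provider was offered. A sketch -- the `prunes_by_date_range` helper is hypothetical:

```rust,ignore
use datafusion::logical_expr::TableProviderFilterPushDown;

fn supports_filters_pushdown(
    &self,
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|f| {
            if self.prunes_by_date_range(f) {
                // We can skip non-matching data but may still emit some
                // non-matching rows, so DataFusion keeps a FilterExec above us.
                TableProviderFilterPushDown::Inexact
            } else {
                TableProviderFilterPushDown::Unsupported
            }
        })
        .collect())
}
```

Returning `Exact` instead would tell DataFusion to drop the `FilterExec` entirely, so only claim it when your source guarantees every emitted row satisfies the predicate.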