[Draft] Datafusion and Parquet plugin config interaction #47

Open
alchemist51 wants to merge 33 commits into bharath-techie:feature/datafusion from alchemist51:feature/datafusion-config
alchemist51 commented Oct 19, 2025

This PR introduces a pub-sub model for settings interaction between the Dataformat and Engine plugins.

Settings are resolved in the following order of precedence (highest first):
Override settings in Dataformat >> Defaults in Dataformat >> Settings defined in Engine plugin.

When the Dataformat does not override a setting, the Engine decides the value to use. When the two conflict, the Dataformat takes priority.
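
To make the precedence order concrete, here is a minimal sketch of a resolver. The class `SettingPrecedence`, the method `resolve`, and the use of plain maps are all hypothetical and chosen for illustration; the PR may represent settings differently.

```java
import java.util.Map;

/** Hypothetical helper, not part of the PR: resolves one setting by precedence. */
final class SettingPrecedence {
    private SettingPrecedence() {}

    /**
     * Lookup order: Dataformat override, then Dataformat default,
     * then the value defined by the Engine plugin.
     */
    static <T> T resolve(String key,
                         Map<String, T> formatOverrides,
                         Map<String, T> formatDefaults,
                         Map<String, T> engineSettings) {
        if (formatOverrides.containsKey(key)) {
            return formatOverrides.get(key);
        }
        if (formatDefaults.containsKey(key)) {
            return formatDefaults.get(key);
        }
        T engineValue = engineSettings.get(key);
        if (engineValue == null) {
            // No layer defines the setting at all.
            throw new IllegalArgumentException("unknown setting: " + key);
        }
        return engineValue;
    }
}
```

With an Engine value of 4096 and a Dataformat override of 8192 for `batchSize`, `resolve` returns 8192; with no Dataformat entries it falls back to 4096.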

The diagram below depicts engine initialisation:

┌─────────────────┐
│ DataFusion      │
│ Engine          │
└────────┬────────┘
         │ 1. Get base config
         │
         ▼
┌─────────────────────────┐
│ DataFusionEngineConfig  │
│ batchSize = 4096        │
└────────┬────────────────┘
         │ 2. Request override
         │
         ▼
┌─────────────────────────┐
│ ParquetDataSourceCodec  │
│ updateEngineConfig()    │
└────────┬────────────────┘
         │ 3. Override values
         │
         ▼
┌─────────────────────────┐
│ EngineConfig            │
│ batchSize = 8192        │
│ (Parquet overrides)     │
└─────────────────────────┘
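
The initialisation steps in the diagram could be sketched roughly as follows. The names `EngineConfig`, `ParquetDataSourceCodec`, `updateEngineConfig` and `batchSize` come from the diagram; the `DataSourceCodec` interface, the `initialise` method, and every signature are assumptions made for illustration.

```java
/** Sketch only: names follow the diagram, signatures are assumed. */
final class EngineInitSketch {
    private EngineInitSketch() {}

    /** Immutable engine configuration; only batchSize is modelled here. */
    static final class EngineConfig {
        final int batchSize;

        EngineConfig(int batchSize) {
            this.batchSize = batchSize;
        }
    }

    /** Hypothetical hook through which a data format may override engine values. */
    interface DataSourceCodec {
        EngineConfig updateEngineConfig(EngineConfig base);
    }

    /** Parquet overrides the engine's batch size, as in step 3 of the diagram. */
    static final class ParquetDataSourceCodec implements DataSourceCodec {
        @Override
        public EngineConfig updateEngineConfig(EngineConfig base) {
            return new EngineConfig(8192);
        }
    }

    static EngineConfig initialise(DataSourceCodec codec) {
        EngineConfig base = new EngineConfig(4096); // 1. Get base config
        return codec.updateEngineConfig(base);      // 2-3. Request and apply overrides
    }
}
```

A codec that returns `base` unchanged leaves the engine's own value (4096) in effect, which matches the case where the Dataformat defines no override.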

The diagram below depicts how settings flow in this model:

┌─────────────────┐
│ Parquet Plugin  │ 
│ (Publisher)     │
└────────┬────────┘
         │ 1. Settings change
         │
         ▼
┌─────────────────────────┐
│ ParquetSessionConfig    │
│ setBatchSize(8192)      │
└────────┬────────────────┘
         │ 2. Publish update
         │
         ▼
┌─────────────────────────┐
│ SessionConfigRegistry   │
│ publishSessionConfig()  │
└────────┬────────────────┘
         │ 3. Notify listeners
         │
         ▼
┌─────────────────────────┐
│ DataFusionPlugin        │
│ (Subscriber)            │
│ onSessionConfigUpdate() │
└────────┬────────────────┘
         │ 4. Merge config
         │
         ▼
┌─────────────────────────┐
│ DataFusionEngineConfig  │
│ updateSessionConfig()   │
└─────────────────────────┘
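
The publish and subscribe flow above might look like the following sketch. Class and method names are taken from the diagram; the `SessionConfigListener` interface, the `subscribe` method, the field layout, and the synchronous dispatch are assumptions, not the PR's actual code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Sketch only: names follow the diagram, everything else is assumed. */
final class SessionConfigPubSubSketch {
    private SessionConfigPubSubSketch() {}

    /** Settings published by the Parquet plugin. */
    static final class ParquetSessionConfig {
        private volatile int batchSize;

        ParquetSessionConfig(int batchSize) {
            this.batchSize = batchSize;
        }

        void setBatchSize(int batchSize) {
            this.batchSize = batchSize;
        }

        int getBatchSize() {
            return batchSize;
        }
    }

    /** Hypothetical listener contract for subscribers. */
    interface SessionConfigListener {
        void onSessionConfigUpdate(ParquetSessionConfig update);
    }

    static final class SessionConfigRegistry {
        // Copy-on-write keeps iteration safe if listeners register concurrently.
        private final List<SessionConfigListener> listeners = new CopyOnWriteArrayList<>();

        void subscribe(SessionConfigListener listener) {
            listeners.add(listener);
        }

        /** 3. Notify listeners, synchronously and in registration order. */
        void publishSessionConfig(ParquetSessionConfig update) {
            for (SessionConfigListener listener : listeners) {
                listener.onSessionConfigUpdate(update);
            }
        }
    }

    static final class DataFusionEngineConfig {
        private volatile int batchSize = 4096;

        /** 4. Merge config: the published value replaces the engine value. */
        void updateSessionConfig(ParquetSessionConfig update) {
            this.batchSize = update.getBatchSize();
        }

        int batchSize() {
            return batchSize;
        }
    }

    static final class DataFusionPlugin implements SessionConfigListener {
        private final DataFusionEngineConfig engineConfig = new DataFusionEngineConfig();

        @Override
        public void onSessionConfigUpdate(ParquetSessionConfig update) {
            engineConfig.updateSessionConfig(update);
        }

        DataFusionEngineConfig engineConfig() {
            return engineConfig;
        }
    }
}
```

In this sketch `publishSessionConfig` calls each listener in turn on the publishing thread, so the time to deliver one update grows linearly with the number of registered listeners.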

There are a few problems I can see with this approach:

  1. With 1000+ listeners from a dataformat, subscribers might see delayed updates.
  2. If index settings are introduced, we will need a subscriber method that listens for index setting updates, probably on a clusterStateUpdate event. I'm not sure what the impact would be, though, since it would be called on every update and would need to be propagated to all of the subscribers.

mch2 and others added 30 commits September 26, 2025 12:12
Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit cb75910)
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
(cherry picked from commit eb01905)
…project#19399)

Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
(cherry picked from commit c9d5b17)
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
(cherry picked from commit 98de93e)
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
(cherry picked from commit 5fef617)
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
(cherry picked from commit e4ebf59)
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Co-authored-by: Arpit Bandejiya <abandeji@amazon.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Co-authored-by: Arpit Bandejiya <abandeji@amazon.com>
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
* Abstracting lucene away: part 1

* initial abstractions to reduce indexing engine coupling

* Text backed engine testing

---------

Co-authored-by: Mohit Godwani <mgodwan@amazon.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
* Changes in dataformat for CSVEngine

Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>

* Changes for Reader to work

Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>

---------

Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
Co-authored-by: Bharathwaj G <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
* Integrate aggregators to convert result from datafusion

Signed-off-by: expani <anijainc@amazon.com>

* Initialised bigArrays and queryCollManagers for DatafusionContext

Signed-off-by: expani <anijainc@amazon.com>

* Refactored to set agg result within utility

Signed-off-by: expani <anijainc@amazon.com>

---------

Signed-off-by: expani <anijainc@amazon.com>
Co-authored-by: Arpit Bandejiya <abandeji@amazon.com>

Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
* Feature/datafusion 4 (#46)

* Composite document writer pool initial implementation

* Committer interface and lucene based commit engine implementation

* Catalog snapshot changes to create segment view during commit

---------

Co-authored-by: Shashank Gowri <shnkgo@amazon.com>

* fix build for commit integration

Signed-off-by: bharath-techie <bharath78910@gmail.com>

---------

Signed-off-by: bharath-techie <bharath78910@gmail.com>
Co-authored-by: Shashank Gowri <shnkgo@amazon.com>
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
@alchemist51 alchemist51 changed the title [Draft]Datafusion and Parquet plugin setting interaction [Draft]Datafusion and Parquet plugin config interaction Oct 19, 2025
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
@bharath-techie bharath-techie force-pushed the feature/datafusion branch 2 times, most recently from 31f431b to 53e2fa9 Compare January 22, 2026 11:46
6 participants