Skip to content

[GLUTEN-10113][VL] Refactor hive connector register to accept session level config#10585

Closed
zhouyuan wants to merge 1 commit intoapache:mainfrom
zhouyuan:wip_refactor_connector
Closed

[GLUTEN-10113][VL] Refactor hive connector register to accept session level config#10585
zhouyuan wants to merge 1 commit intoapache:mainfrom
zhouyuan:wip_refactor_connector

Conversation

@zhouyuan
Copy link
Member

@zhouyuan zhouyuan commented Aug 29, 2025

What changes are proposed in this pull request?

This patch adds a refactor on hive connector registration. instead of registering when instancing Velox, we defer it to when creating the wholestage context. This allows us to get the session level configurations, and pass to hive/velox

How was this patch tested?

local verified and pass existing GHA

@github-actions github-actions bot added the VELOX label Aug 29, 2025
@github-actions
Copy link

#10113

@zhouyuan zhouyuan force-pushed the wip_refactor_connector branch 7 times, most recently from 0f51d4b to 62eb36e Compare September 1, 2025 12:53
hiveConfMap[facebook::velox::connector::hive::HiveConfig::kOrcUseColumnNames] = "true";

return std::make_shared<facebook::velox::config::ConfigBase>(std::move(hiveConfMap));
return std::make_shared<facebook::velox::config::ConfigBase>(std::move(hiveConfMap), false/*mutable*/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value of parameter 'mutable' is false, no need to pass it.


// register the hive connectors
std::call_once(
gluten::VeloxBackend::get()->regFlag, [&]() { gluten::VeloxBackend::get()->initConnector(veloxCfg_); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if we could register with a unique connector ID for each WholeStageResultIterator and call velox::connector::unregisterConnector from the dtor?

The connector ID can use similar naming as the queryID

fmt::format(
          "Gluten_Stage_{}_TID_{}_VTID_{}",
          std::to_string(taskInfo_.stageId),
          std::to_string(taskInfo_.taskId),
          std::to_string(taskInfo_.vId))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like in Spark case, sometimes two stages will be scheduled to run at the same time.

void VeloxBackend::initConnector(const std::shared_ptr<velox::config::ConfigBase>& hiveConf) {
for (const auto& [key, value] : hiveConf->rawConfigs()) {
// always update to use new session level conf
backendConf_->set(key, value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the initConnector is only called from WholeStageResultIterator, is it possible to remove the filesystem related configurations from the backend conf?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users may pass/modify the filesystem related configurations at session level, by setting the keys here it will allow us to pick the latest values

@zhouyuan zhouyuan force-pushed the wip_refactor_connector branch 2 times, most recently from 3d4b7db to ff0de62 Compare September 2, 2025 21:15
@github-actions github-actions bot added the BUILD label Sep 2, 2025
@@ -294,15 +293,24 @@ void VeloxBackend::initCache() {
}

void VeloxBackend::initConnector(const std::shared_ptr<velox::config::ConfigBase>& hiveConf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the function to WholeStageResultIterator?

task_->setSpillDirectory(spillDir);

// register the hive connectors
std::call_once(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case, we may register twice?

const std::unordered_map<std::string, std::string>& conf) {
backendConf_ =
std::make_shared<facebook::velox::config::ConfigBase>(std::unordered_map<std::string, std::string>(conf));
backendConf_ = std::make_shared<facebook::velox::config::ConfigBase>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mutable true make copy the configs when iterating the configs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we generate the HiveConfig by veloxCfg_ in WholeStageResultIterator every time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, here veloxCfg_ contains only session level configurations, it does not cover the configurations defined in spark.conf

@github-actions github-actions bot removed the BUILD label Oct 10, 2025
@zhouyuan zhouyuan force-pushed the wip_refactor_connector branch from 8faee76 to f722074 Compare October 10, 2025 02:55
@zhouyuan zhouyuan force-pushed the wip_refactor_connector branch 3 times, most recently from 25dfdda to 0d6201b Compare November 14, 2025 21:03
Signed-off-by: Yuan <yuanzhou@apache.org>

remove duplicated call

Signed-off-by: Yuan <yuanzhou@apache.org>

revert to use inmutable config map

Signed-off-by: Yuan <yuanzhou@apache.org>

fix to use dynamic config

Signed-off-by: Yuan <yuanzhou@apache.org>

fix return config copy

Signed-off-by: Yuan <yuanzhou@apache.org>

unregister hive connector before register

Signed-off-by: Yuan <yuanzhou@apache.org>

register io connector before visit filesystem

Signed-off-by: Yuan <yuanzhou@apache.org>

refactor

Signed-off-by: Yuan <yuanzhou@apache.org>

Revert "refactor"

This reverts commit 007472e.

Revert "Revert "refactor""

This reverts commit 4d2acca.

omit register if same session config detected

Signed-off-by: Yuan <yuanzhou@apache.org>
@zhouyuan zhouyuan force-pushed the wip_refactor_connector branch from 0d6201b to 701e64c Compare November 14, 2025 21:10
@github-actions
Copy link

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale stale label Dec 30, 2025
@github-actions
Copy link

github-actions bot commented Jan 9, 2026

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

@github-actions github-actions bot closed this Jan 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants