
Conversation

@qjia7 qjia7 commented Nov 26, 2025

This pull request introduces a new utility for building, executing, and caching single-operator (1-op) ONNX models, which different execution providers (EPs) can leverage for efficient operator execution. The changes add complete infrastructure for dynamic 1-op model creation, session management, and execution, along with proper resource cleanup. The most important changes are grouped below.

1-op Model Infrastructure

  • Added new files one_op_model_builder.h/cpp and one_op_model_executor.h/cpp, implementing utilities that construct an ONNX protobuf model for a single operator and handle model configuration, encoding, and execution, including session caching and helpers for common ops like Cast. A usage sketch follows below.
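
A minimal usage sketch of the intended flow (this mirrors the ExecuteCastOp helper reviewed later in this thread; the call-site variables input_data, output_data, input_type, output_type, element_count, and memory_info are hypothetical):

// Describe the 1-op model: a Cast node with a dynamic shape and a "to" attribute.
OneOpModelConfig config("Cast");
config.inputs.push_back(TensorConfig("input", input_type, {-1}));
config.outputs.push_back(TensorConfig("output", output_type, {-1}));
config.attributes.push_back(AttributeValue::Int("to", static_cast<int64_t>(output_type)));

// Describe the actual buffers to bind, then build, cache, and run the session.
OneOpExecutionParams params("WebGPU", memory_info);
params.inputs.push_back(OneOpTensorSpec(input_data, input_type,
                                        {static_cast<int64_t>(element_count)},
                                        element_count * Ort::SizeOf(input_type)));
params.outputs.push_back(OneOpTensorSpec(output_data, output_type,
                                         {static_cast<int64_t>(element_count)},
                                         element_count * Ort::SizeOf(output_type)));
bool ok = OneOpModelExecutor::Execute(config, params);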

Integration and Resource Management

  • Integrated the new executor into the main generator pipeline by including one_op_model_executor.h in generators.cpp and adding an explicit destructor to OrtGlobals to ensure cached sessions are cleared before the ONNX environment is destroyed.

With this change, phi4 with graph capture improves from 116.2 tps to 130.4 tps on an NVIDIA 5080.

Copilot AI left a comment

Pull request overview

This PR implements efficient Cast operator support for the WebGPU backend by dynamically generating minimal ONNX models and caching inference sessions. The implementation enables type conversion operations to be performed on WebGPU devices without requiring external ONNX library dependencies.

  • Adds manual protobuf-based ONNX model generation for Cast operations (the encoding technique is sketched after this list)
  • Implements thread-safe session caching to avoid redundant model creation
  • Provides WebGPU-specific Cast method using ONNX Runtime's IOBinding
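
The "manual protobuf" part means cast_model_builder.cpp emits the ONNX wire format by hand rather than linking protobuf or ONNX. A sketch of the two primitives that requires (these follow the standard protobuf wire-format rules; the helper names are illustrative, not the PR's actual functions):

#include <cstdint>
#include <string>
#include <vector>

// Base-128 varint: 7 payload bits per byte, LSB first; MSB set = more bytes follow.
static void WriteVarint(std::vector<uint8_t>& out, uint64_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value) | 0x80);
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// A field key is (field_number << 3) | wire_type; wire type 2 = length-delimited.
static void WriteStringField(std::vector<uint8_t>& out, uint32_t field_number, const std::string& s) {
  WriteVarint(out, (static_cast<uint64_t>(field_number) << 3) | 2);
  WriteVarint(out, s.size());
  out.insert(out.end(), s.begin(), s.end());
}

Nested messages (a NodeProto inside a GraphProto inside a ModelProto) are written the same way: serialize the child into a buffer, then emit it as a length-delimited field of the parent.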

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File descriptions:
  • src/webgpu/cast_model_builder.h: Declares the function that creates ONNX Cast model bytes from input/output types
  • src/webgpu/cast_model_builder.cpp: Implements manual protobuf encoding to generate minimal ONNX Cast operator models without an ONNX library dependency
  • src/webgpu/interface.cpp: Adds CastSessionCache for thread-safe session reuse and implements the Cast method with an element size helper for tensor creation (a sketch of such a helper follows)
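
For context, an element size helper of the kind described for interface.cpp would map ONNXTensorElementDataType to byte widths roughly like this (a sketch; the enum members are the ORT C API's, but the function itself is hypothetical, and the generalized executor reviewed later uses Ort::SizeOf instead):

static size_t ElementSizeInBytes(ONNXTensorElementDataType type) {
  switch (type) {
    case ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8:
    case ONNX_TENSOR_ELEMENT_DATA_TYPE_INT8:    return 1;
    case ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16: return 2;
    case ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT:
    case ONNX_TENSOR_ELEMENT_DATA_TYPE_INT32:   return 4;
    case ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64:   return 8;
    default:
      throw std::runtime_error("Unsupported element type");  // keep failures loud
  }
}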

@qjia7 qjia7 marked this pull request as ready for review November 27, 2025 02:22
qjia7 commented Nov 27, 2025

@kunal-vaishnavi @fs-eire @guschmue In the latest commit, I moved the cached cast_sessions_ from the InterfaceImpl level to OrtGlobals, because I found that InterfaceImpl is released after env_, which clears the WebGPU context; releasing cast_session afterwards then triggers a "context is not found" error.
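
A minimal sketch of that ordering fix (assuming OrtGlobals owns env_; everything here except OneOpModelExecutor::ClearCache(), which this PR adds, is illustrative):

struct OrtGlobals {
  ~OrtGlobals() {
    // Cached 1-op sessions hold resources tied to the WebGPU context that
    // lives under env_, so drop them while env_ is still alive. Members are
    // destroyed after this body runs (in reverse declaration order), so env_
    // outlives the cleared cache.
    OneOpModelExecutor::ClearCache();
  }

  std::unique_ptr<OrtEnv> env_;
  // ... other globals ...
};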

@qjia7 qjia7 marked this pull request as draft December 4, 2025 02:05
@qjia7 qjia7 marked this pull request as ready for review January 7, 2026 05:00
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Comment on lines 1 to 234
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#include "one_op_model_executor.h"
#include "one_op_model_builder.h"
#include "../generators.h"
#include <cstring>
#include <functional>
#include <iostream>
#include <mutex>
#include <unordered_map>

namespace Generators {

// Global cache for 1-op model sessions
// Stored in OrtGlobals to ensure proper cleanup before OrtEnv destruction
struct OneOpSessionCache {
  std::unordered_map<uint64_t, std::unique_ptr<OrtSession>> sessions_;
  std::mutex mutex_;
};

// Get the global session cache (stored in OrtGlobals)
static OneOpSessionCache& GetOneOpSessionCache() {
  static OneOpSessionCache cache;
  return cache;
}

// Generate a cache key from the model configuration and EP name
uint64_t OneOpModelExecutor::GenerateCacheKey(const OneOpModelConfig& config, const std::string& ep_name) {
  // Simple hash combining op_type, input/output types, and EP name, following
  // the boost::hash_combine pattern (0x9e3779b9 is the 32-bit golden-ratio constant).
  // For more complex operators with attributes, we'd need a more sophisticated hash.
  std::hash<std::string> hasher;
  uint64_t key = hasher(config.op_type);

  // Hash EP name
  key ^= hasher(ep_name) + 0x9e3779b9 + (key << 6) + (key >> 2);

  // Hash input types
  for (const auto& input : config.inputs) {
    key ^= static_cast<uint64_t>(input.elem_type) + 0x9e3779b9 + (key << 6) + (key >> 2);
  }

  // Hash output types
  for (const auto& output : config.outputs) {
    key ^= static_cast<uint64_t>(output.elem_type) + 0x9e3779b9 + (key << 6) + (key >> 2);
  }

  // Hash attributes
  for (const auto& attr : config.attributes) {
    key ^= hasher(attr.name) + 0x9e3779b9 + (key << 6) + (key >> 2);

    switch (attr.type) {
      case AttributeType::INT:
        key ^= static_cast<uint64_t>(attr.int_value) + 0x9e3779b9 + (key << 6) + (key >> 2);
        break;
      case AttributeType::FLOAT: {
        uint32_t float_bits;
        std::memcpy(&float_bits, &attr.float_value, sizeof(float));
        key ^= static_cast<uint64_t>(float_bits) + 0x9e3779b9 + (key << 6) + (key >> 2);
        break;
      }
      case AttributeType::STRING:
        key ^= hasher(attr.string_value) + 0x9e3779b9 + (key << 6) + (key >> 2);
        break;
      case AttributeType::INTS:
        for (auto val : attr.ints_value) {
          key ^= static_cast<uint64_t>(val) + 0x9e3779b9 + (key << 6) + (key >> 2);
        }
        break;
      case AttributeType::FLOATS:
        for (auto val : attr.floats_value) {
          uint32_t float_bits;
          std::memcpy(&float_bits, &val, sizeof(float));
          key ^= static_cast<uint64_t>(float_bits) + 0x9e3779b9 + (key << 6) + (key >> 2);
        }
        break;
      case AttributeType::STRINGS:
        for (const auto& val : attr.strings_value) {
          key ^= hasher(val) + 0x9e3779b9 + (key << 6) + (key >> 2);
        }
        break;
    }
  }

  return key;
}

// Create a new session for the given model and EP
std::unique_ptr<OrtSession> OneOpModelExecutor::CreateSession(
    const std::vector<uint8_t>& model_bytes,
    const std::string& ep_name,
    const std::vector<const char*>& session_config_keys,
    const std::vector<const char*>& session_config_values) {
  auto session_options = OrtSessionOptions::Create();
  session_options->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

  // Apply session configuration entries
  for (size_t i = 0; i < session_config_keys.size(); i++) {
    session_options->AddConfigEntry(session_config_keys[i], session_config_values[i]);
  }

  // Append execution provider
  if (!ep_name.empty()) {
    session_options->AppendExecutionProvider(ep_name.c_str(), nullptr, nullptr, 0);
  }

  return OrtSession::Create(GetOrtEnv(), model_bytes.data(), model_bytes.size(), session_options.get());
}

// Get or create a cached session
OrtSession* OneOpModelExecutor::GetOrCreateSession(
    const OneOpModelConfig& config,
    const std::string& ep_name,
    const std::vector<const char*>& session_config_keys,
    const std::vector<const char*>& session_config_values) {
  auto& cache = GetOneOpSessionCache();
  uint64_t key = GenerateCacheKey(config, ep_name);

  std::lock_guard<std::mutex> lock(cache.mutex_);

  auto it = cache.sessions_.find(key);
  if (it != cache.sessions_.end()) {
    return it->second.get();
  }

  // Create new session
  auto model_bytes = OneOpModelBuilder::Build(config);
  auto session = CreateSession(model_bytes, ep_name, session_config_keys, session_config_values);

  OrtSession* session_ptr = session.get();
  cache.sessions_[key] = std::move(session);

  return session_ptr;
}

// Execute a 1-op model
bool OneOpModelExecutor::Execute(
    const OneOpModelConfig& model_config,
    const OneOpExecutionParams& exec_params) {
  try {
    // Get or create session
    OrtSession* session = GetOrCreateSession(
        model_config,
        exec_params.execution_provider_name,
        exec_params.session_config_keys,
        exec_params.session_config_values);

    // Create IOBinding for efficient execution
    auto io_binding = OrtIoBinding::Create(*session);

    // Bind inputs
    for (size_t i = 0; i < exec_params.inputs.size(); i++) {
      const auto& input_spec = exec_params.inputs[i];
      const auto& input_config = model_config.inputs[i];

      auto input_tensor = OrtValue::CreateTensor(
          *exec_params.memory_info,
          input_spec.data,
          input_spec.size_in_bytes,
          input_spec.shape,
          input_spec.elem_type);

      io_binding->BindInput(input_config.name.c_str(), *input_tensor);
    }

    // Bind outputs
    for (size_t i = 0; i < exec_params.outputs.size(); i++) {
      const auto& output_spec = exec_params.outputs[i];
      const auto& output_config = model_config.outputs[i];

      auto output_tensor = OrtValue::CreateTensor(
          *exec_params.memory_info,
          output_spec.data,
          output_spec.size_in_bytes,
          output_spec.shape,
          output_spec.elem_type);

      io_binding->BindOutput(output_config.name.c_str(), *output_tensor);
    }

    // Run inference
    session->Run(nullptr, *io_binding);

    return true;
  } catch (const std::exception& e) {
    // Log error or handle as needed
    std::cerr << "OneOpModelExecutor::Execute failed: " << e.what() << std::endl;
    return false;
  }
}

// Clear all cached sessions
void OneOpModelExecutor::ClearCache() {
  auto& cache = GetOneOpSessionCache();
  std::lock_guard<std::mutex> lock(cache.mutex_);
  cache.sessions_.clear();
}

// Helper function for Cast operation
bool ExecuteCastOp(
    void* input_data,
    void* output_data,
    ONNXTensorElementDataType input_type,
    ONNXTensorElementDataType output_type,
    size_t element_count,
    const std::string& execution_provider_name,
    const OrtMemoryInfo* memory_info,
    const std::vector<const char*>& session_config_keys,
    const std::vector<const char*>& session_config_values) {
  // Build Cast model configuration with dynamic shape (-1) to support any element count
  OneOpModelConfig config("Cast");
  config.inputs.push_back(TensorConfig("input", input_type, {-1}));
  config.outputs.push_back(TensorConfig("output", output_type, {-1}));
  config.attributes.push_back(AttributeValue::Int("to", static_cast<int64_t>(output_type)));

  // Build execution parameters
  OneOpExecutionParams params(execution_provider_name, memory_info);
  params.inputs.push_back(OneOpTensorSpec(
      input_data,
      input_type,
      {static_cast<int64_t>(element_count)},
      element_count * Ort::SizeOf(input_type)));
  params.outputs.push_back(OneOpTensorSpec(
      output_data,
      output_type,
      {static_cast<int64_t>(element_count)},
      element_count * Ort::SizeOf(output_type)));

  // Apply session config entries if provided
  params.session_config_keys = session_config_keys;
  params.session_config_values = session_config_values;

  return OneOpModelExecutor::Execute(config, params);
}

}  // namespace Generators

Copilot AI Jan 7, 2026

The new OneOpModelExecutor and OneOpModelBuilder classes lack test coverage. Given that this is core infrastructure for WebGPU Cast operations (and potentially other operations in the future), unit tests should be added to verify correct ONNX model generation, cache behavior, and session execution across different data types and operators.
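
For example, a cache-behavior test might look like this (a GTest sketch under the assumption that GetOrCreateSession is reachable from tests; the test name and CPU fallback via an empty EP name are assumptions):

TEST(OneOpModelExecutor, ReusesSessionForIdenticalConfig) {
  OneOpModelConfig config("Cast");
  config.inputs.push_back(TensorConfig("input", ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, {-1}));
  config.outputs.push_back(TensorConfig("output", ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16, {-1}));
  config.attributes.push_back(AttributeValue::Int("to", static_cast<int64_t>(ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16)));

  // Two lookups with the same config and EP should return the same session.
  OrtSession* first = OneOpModelExecutor::GetOrCreateSession(config, "", {}, {});
  OrtSession* second = OneOpModelExecutor::GetOrCreateSession(config, "", {}, {});
  EXPECT_EQ(first, second);

  OneOpModelExecutor::ClearCache();  // a fresh lookup after clearing builds a new session
}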

Comment on lines 181 to 207
bool Cast(void* input, void* output, ONNXTensorElementDataType input_type, ONNXTensorElementDataType output_type, size_t element_count) override {
  if (!ort_allocator_) {
    throw std::runtime_error("WebGPU allocator not initialized");
  }

  // Get WebGPU allocator's memory info
  const OrtMemoryInfo* webgpu_mem_info = nullptr;
  Ort::ThrowOnError(Ort::api->AllocatorGetInfo(ort_allocator_, &webgpu_mem_info));

  // WebGPU-specific session configuration
  static const char* webgpu_config_key = "ep.webgpuexecutionprovider.registerInt64Ops";
  static const char* webgpu_config_value = "1";
  std::vector<const char*> session_config_keys = {webgpu_config_key};
  std::vector<const char*> session_config_values = {webgpu_config_value};

  // Use the generalized ExecuteCastOp helper with WebGPU session config
  return ExecuteCastOp(
      input,
      output,
      input_type,
      output_type,
      element_count,
      "WebGPU",
      webgpu_mem_info,
      session_config_keys,
      session_config_values);
}

Copilot AI Jan 7, 2026

The new Cast method implementation for WebGPU lacks test coverage. Consider adding tests to verify that the Cast operation works correctly for various type conversions (e.g., float to float16, int32 to int64) on the WebGPU execution provider.

@kunal-vaishnavi kunal-vaishnavi changed the title from "Implement Cast for webgpu" to "Implement 1-op models for EPs" Jan 7, 2026
// Default is 17 which is widely supported and has been validated with this infrastructure.
// Can be overridden if a specific opset is required, but ensure the ONNX Runtime build
// supports it and the operator exists in that opset version.
int opset_version{17};
Contributor

Can we use opset 21 to match the opset used in the model builder?
