Conversation

@bachelor-dou
Contributor

Description

This PR fixes the ACL_ERROR_REPEAT_INITIALIZE error that occurs when onnxruntime-cann coexists with torch_npu, which causes the CANN provider to fail to initialize and fall back to CPU for inference.
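The diff itself is not part of this conversation, so the following is only a minimal sketch of one common way to handle a repeat-initialize code: treat it as "already initialized by another library in this process" rather than as a failure. `FakeAcl` and `ensure_acl_initialized` are hypothetical names used for illustration. The value 100002 comes from the error log, and `ACL_SUCCESS = 0` is an assumption.

```python
ACL_SUCCESS = 0                       # assumed success code
ACL_ERROR_REPEAT_INITIALIZE = 100002  # code reported in the error log


class FakeAcl:
    """Hypothetical stand-in for a runtime that allows one init per process."""

    def __init__(self):
        self.initialized = False

    def init(self):
        # A second init (e.g. after another library already initialized the
        # runtime) reports a repeat-initialize error instead of succeeding.
        if self.initialized:
            return ACL_ERROR_REPEAT_INITIALIZE
        self.initialized = True
        return ACL_SUCCESS


def ensure_acl_initialized(acl):
    """Return True if this call performed the init, False if it was already done.

    Any other return code is still treated as a hard failure.
    """
    ret = acl.init()
    if ret == ACL_SUCCESS:
        return True
    if ret == ACL_ERROR_REPEAT_INITIALIZE:
        return False
    raise RuntimeError(f"init failed with code {ret}")
```

Returning a flag rather than discarding the result lets the caller know whether it owns the initialization, which matters if it would otherwise also finalize the runtime on shutdown.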

Error log

2025-08-30 03:30:44.189484484 [E:onnxruntime:Default, provider_bridge_ort.cc:2279 TryGetProviderInfo_CANN] ~/code/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:143 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true]
~/code/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:137 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true] CANN failure 100002: ACL_ERROR_REPEAT_INITIALIZE ; NPU=0 ; hostname=coder-6a95445b-e353-4b0b-970b-41797b02ca23-566969c97-xlf2s ; expr=aclInit(nullptr); 


*************** EP Error ***************
EP Error /home/dou/code/onnxruntime/onnxruntime/python/onnxruntime_pybind_state.cc:1231 std::shared_ptr<onnxruntime::IExecutionProviderFactory> onnxruntime::python::CreateExecutionProviderFactoryInstance(const onnxruntime::SessionOptions&, const string&, const ProviderOptionsMap&) create CANN ExecutionProvider fail
 when using [('CANNExecutionProvider', {'device_id': 0, 'arena_extend_strategy': 'kNextPowerOfTwo', 'npu_mem_limit': 4294967296, 'enable_cann_graph': True}), 'CPUExecutionProvider']
Falling back to ['CPUExecutionProvider'] and retrying.
****************************************
================= before run
[array([[ -8.8664875,  -7.903085 ,  -6.529765 ,  -6.0811057,  -4.148506 ,
         -6.4790154,  -5.8431354,  -8.300879 ,  -6.77909  ,  -8.013498 ,
         -8.759175 ,  -7.237462 ,  -6.890937 ,  -8.367442 ,  -7.565303 ,
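
Because the session falls back to CPU and still produces output, the failure is easy to miss. A minimal sketch of a check a caller could run, assuming the provider list has the same shape as in the log above. `preferred_provider_missing` is a hypothetical helper, not part of onnxruntime.

```python
def preferred_provider_missing(requested, active):
    """Report whether the first requested provider is absent from the active ones.

    requested: provider list as passed when creating the session; entries are
               either names or (name, options) tuples, as in the log above.
    active:    provider names actually in use by the session.
    """
    first = requested[0]
    preferred = first if isinstance(first, str) else first[0]
    return preferred not in active
```

With a real session, `active` would come from `session.get_providers()` on the `onnxruntime.InferenceSession`; a `True` result here corresponds to the "Falling back to ['CPUExecutionProvider']" case shown in the log.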

@bachelor-dou bachelor-dou changed the title Fix the ACL_ERROR_REPEAT_INITIALIZE error that occurs when coexisting… [CANN] Fix the ACL_ERROR_REPEAT_INITIALIZE error that occurs when coexisting… Sep 29, 2025
@bachelor-dou
Contributor Author

@snnn Hi, could you please review this so it can be merged as soon as possible? Thanks.

Contributor

@snnn snnn left a comment

In onnxruntime/core/providers/cann/cann_utils.cc, the implementation of MatchFile has been changed from a regex-based search to a simple string search. The new implementation looks for a file that contains the given file_name as a substring and has a .om extension. Is this change intended and correct? It seems reasonable, but I want to double-check.

snnn previously requested changes Sep 29, 2025
Contributor

@snnn snnn left a comment

Regarding the snprintf usage in onnxruntime/core/providers/cann/cann_call.cc (around line 135):
The str buffer is declared as static. This means that if two threads enter the CannCall function at the same time, they will be writing to the same buffer, which can lead to garbled error messages. Could you please clarify if this function is expected to be called from multiple threads concurrently? If so, it would be better to declare str as a local variable on the stack to ensure thread safety.

@bachelor-dou
Contributor Author

In onnxruntime/core/providers/cann/cann_utils.cc, the implementation of MatchFile has been changed from a regex-based search to a simple string search. The new implementation looks for a file that contains the given file_name as a substring and has a .om extension. Is this change intended and correct? It seems reasonable, but I want to double-check.

Yes, this change is intentional. It reflects how CANN works: CANN converts ONNX models into offline .om files, and that conversion is time-consuming. The goal of the check is to confirm that the .om file has already been generated and is unique.
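
The matching rule described above (name contains `file_name` as a substring, extension is `.om`) can be sketched as follows. This is an illustrative Python version, not the C++ `MatchFile` implementation; `match_om_files` and its signature are hypothetical.

```python
from pathlib import Path


def match_om_files(directory, file_name):
    """Return the .om files in `directory` whose name contains `file_name`.

    Plain substring matching, no regex. The result is sorted so the output
    is stable across platforms.
    """
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix == ".om" and file_name in p.name
    )
```

A caller that wants the uniqueness guarantee mentioned above can check that exactly one path is returned and trigger a fresh conversion otherwise.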

@bachelor-dou
Contributor Author

Regarding the snprintf usage in onnxruntime/core/providers/cann/cann_call.cc (around line 135): The str buffer is declared as static. This means that if two threads enter the CannCall function at the same time, they will be writing to the same buffer, which can lead to garbled error messages. Could you please clarify if this function is expected to be called from multiple threads concurrently? If so, it would be better to declare str as a local variable on the stack to ensure thread safety.

Thank you for your suggestion. I noticed this as well. There is actually no concurrency requirement here, but I will fix it.
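
The hazard raised in the review can be illustrated without threads. The sketch below is a Python stand-in for the C++ distinction between a `static` buffer and a stack-local one: every call that formats into the shared buffer hands back the same storage, so a later message overwrites an earlier one. All names here are hypothetical, and the buffer size is arbitrary.

```python
BUF_SIZE = 64

# Stand-in for a C++ `static char str[...]`: one buffer shared by every call.
_shared_buf = bytearray(BUF_SIZE)


def _write(buf, msg):
    """Copy msg into buf, truncating and zero-padding like a fixed-size C buffer."""
    data = msg.encode("utf-8")[: len(buf) - 1]
    buf[:] = data.ljust(len(buf), b"\0")
    return buf


def read(buf):
    """Decode the buffer contents up to the first NUL byte."""
    return bytes(buf).split(b"\0", 1)[0].decode("utf-8", errors="replace")


def format_shared(msg):
    # Every caller receives the same underlying storage.
    return _write(_shared_buf, msg)


def format_local(msg):
    # Each call gets its own buffer, like a local variable on the stack.
    return _write(bytearray(BUF_SIZE), msg)
```

Calling `format_shared` twice and then reading the first result yields the second message, which is the single-threaded analogue of the garbled output two concurrent callers could produce. `format_local` keeps each message intact.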

@snnn
Contributor

snnn commented Oct 1, 2025

Please run clang-format and fix the "Lint / Python format (pull_request)" issue.

@bachelor-dou
Contributor Author

Please run clang-format and fix the "Lint / Python format (pull_request)" issue.

thanks

@bachelor-dou
Contributor Author

Please run clang-format and fix the "Lint / Python format (pull_request)" issue.

The format has been modified.

@bachelor-dou
Contributor Author

@snnn Could you kindly review it again when you have some free time, so it can be merged as soon as possible? Thank you.

@snnn
Contributor

snnn commented Oct 21, 2025

Sorry for the delay.
In the last release we added EP ABI support, so that hardware vendors can build EPs independently of the ONNX Runtime version. We recommend that every EP switch to it. See microsoft/onnxruntime-inference-examples#527 for an example.

@snnn snnn dismissed their stale review October 21, 2025 03:24

reset

@bachelor-dou
Contributor Author

Sorry for the delay. In the last release we added EP ABI support, so that hardware vendors can build EPs independently of the ONNX Runtime version. We recommend that every EP switch to it. See microsoft/onnxruntime-inference-examples#527 for an example.

After reviewing the information you provided, I think this is an excellent improvement. I will implement it in CANN. Thank you!

@bachelor-dou
Contributor Author

Sorry for the delay. In the last release we added EP ABI support, so that hardware vendors can build EPs independently of the ONNX Runtime version. We recommend that every EP switch to it. See microsoft/onnxruntime-inference-examples#527 for an example.

Updated. Thank you for your review.

@fffrog
Contributor

fffrog commented Oct 21, 2025

In the last release we added EP ABI support, so that hardware vendors can build EPs independently of the ONNX Runtime version. We recommend that every EP switch to it. See microsoft/onnxruntime-inference-examples#527 for an example.

Hey @snnn, sorry to bother you.

I have a few points to confirm about the EP ABI:

  • Migrate all EPs from the ONNXRuntime codebase to a separate repo for each EP.
  • The only thing left in the ONNXRuntime repo is the framework or public.
  • Each EP needs to publish its own EP plug-in, which can be used in conjunction with ONNXRuntime.

If so, from my point of view, this is an excellent change that will make ONNXRuntime easier to maintain and give the EP repos more flexibility.

In addition, is it possible to create a separate repo for each EP in the microsoft organization, called something like onnxruntime-cuda, onnxruntime-cann, and so on, which can bring several benefits to ONNX Runtime?

  • Provides an overview of the EPs for users of ONNXRuntime.
  • All EPs will be under the control of the community.
  • New EPs can be easily integrated by referencing other EPs.

Is this the right direction? Looking forward to your reply.

Thank you.

@snnn
Contributor

snnn commented Oct 21, 2025

In addition, is it possible to create a separate repo for each EP in the microsoft organization, called something like onnxruntime-cuda, onnxruntime-cann, and so on, which can bring several benefits to ONNX Runtime?

I prefer to host the repos in the independent hardware vendors' (IHVs') GitHub orgs, so the hardware companies will own the IP assets.

@snnn
Contributor

snnn commented Oct 22, 2025

@microsoft-github-policy-service rerun

@snnn snnn closed this Oct 22, 2025
@snnn snnn reopened this Oct 22, 2025
@fffrog
Contributor

fffrog commented Oct 23, 2025

I prefer to host the repos in the independent hardware vendors' (IHVs') GitHub orgs, so the hardware companies will own the IP assets.

Got it, thank you a lot.

It would be better if a new, separate organization could be created for onnxruntime, with all the IHVs placed in that one organization.

@snnn snnn merged commit f20df72 into microsoft:main Oct 25, 2025
173 of 176 checks passed
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025
…xisting… (microsoft#26193)
