[CANN] Fix the ACL_ERROR_REPEAT_INITIALIZE error that occurs when coexisting… #26193
@snnn Hi, could you please review these changes so they can be merged as soon as possible? Thanks.
In onnxruntime/core/providers/cann/cann_utils.cc, the implementation of MatchFile has been changed from a regex-based search to a simple string search. The new implementation looks for a file that contains the given file_name as a substring and has a .om extension. Is this change intended and correct? It seems reasonable, but I want to double-check.
snnn
left a comment
Regarding the snprintf usage in onnxruntime/core/providers/cann/cann_call.cc (around line 135):
The str buffer is declared as static. This means that if two threads enter the CannCall function at the same time, they will be writing to the same buffer, which can lead to garbled error messages. Could you please clarify if this function is expected to be called from multiple threads concurrently? If so, it would be better to declare str as a local variable on the stack to ensure thread safety.
Yes, this change is based on the features of CANN (which converts ONNX models into offline .om files). The goal is to ensure that the .om file has been generated and is unique, as converting to the .om file is a time-consuming process.
Thank you for your suggestion. I noticed this as well. There is no concurrency requirement here, but I will fix it anyway.
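The stack-local buffer suggested in the review could look roughly like the sketch below. It is not the actual CannCall code: the function name `FormatCannError`, the buffer size, and the reduced message format (modeled on the error log in this PR, without the hostname field) are assumptions.

```cpp
#include <cstdio>
#include <string>

// Hypothetical formatter (not the real CannCall): builds the error message in
// a stack-local buffer, so concurrent callers never share storage.
std::string FormatCannError(int err_code, const char* err_name, int device_id, const char* expr) {
  char str[1024];  // local, not static: every call gets its own buffer
  // snprintf always null-terminates and truncates if the message is too long.
  std::snprintf(str, sizeof(str), "CANN failure %d: %s ; NPU=%d ; expr=%s;",
                err_code, err_name, device_id, expr);
  return std::string(str);
}
```

Because the result is copied into a `std::string` before the function returns, the caller never holds a pointer into the temporary buffer.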
943e679 to 0f01db1
Please run clang-format and fix the "Lint / Python format (pull_request)" issue.
Thanks.
The formatting has been fixed.
@snnn Could you kindly review it again when you have some free time, so it can be merged as soon as possible? Thank you.
Sorry for the delay.
After reviewing the information you provided, I think this is an excellent improvement. I will implement it in CANN. Thank you!
Updated. Thank you for your review.
Hey @snnn, sorry to bother you. I have a few questions to confirm about the EP API:
If so, from my point of view, this is an excellent change that will make ONNX Runtime more maintainable and give EP repos more flexibility. In addition, is it possible to create a separate repo for each EP in the microsoft organization, called something like
Is this the right direction? Looking forward to your reply. Thank you.
I prefer to host the repos in the independent hardware vendors' (IHVs) GitHub orgs, so the hardware companies will own the IP assets.
@microsoft-github-policy-service rerun |
Got it, thanks a lot. It would be better if a new separate organization could be added for
…xisting… (microsoft#26193)

### Description

This PR fixes the ACL_ERROR_REPEAT_INITIALIZE error that occurs when onnxruntime-cann coexists with torch_npu, which causes the CANN provider to fail to initialize and fall back to CPU for inference.

### Error log

```
2025-08-30 03:30:44.189484484 [E:onnxruntime:Default, provider_bridge_ort.cc:2279 TryGetProviderInfo_CANN] ~/code/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:143 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true]
~/code/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:137 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true]
CANN failure 100002: ACL_ERROR_REPEAT_INITIALIZE ; NPU=0 ; hostname=coder-6a95445b-e353-4b0b-970b-41797b02ca23-566969c97-xlf2s ; expr=aclInit(nullptr);

*************** EP Error ***************
EP Error /home/dou/code/onnxruntime/onnxruntime/python/onnxruntime_pybind_state.cc:1231 std::shared_ptr<onnxruntime::IExecutionProviderFactory> onnxruntime::python::CreateExecutionProviderFactoryInstance(const onnxruntime::SessionOptions&, const string&, const ProviderOptionsMap&) create CANN ExecutionProvider fail when using [('CANNExecutionProvider', {'device_id': 0, 'arena_extend_strategy': 'kNextPowerOfTwo', 'npu_mem_limit': 4294967296, 'enable_cann_graph': True}), 'CPUExecutionProvider']
Falling back to ['CPUExecutionProvider'] and retrying.
****************************************
================= before run
[array([[ -8.8664875, -7.903085 , -6.529765 , -6.0811057, -4.148506 , -6.4790154, -5.8431354, -8.300879 , -6.77909 , -8.013498 , -8.759175 , -7.237462 , -6.890937 , -8.367442 , -7.565303 ,
```
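One general way to tolerate a runtime that another library has already initialized is to treat the repeat-initialize code as non-fatal. The sketch below illustrates that idea only and is not necessarily how this PR resolves the problem. The helper `InitializeAclOnce` and both constants are hypothetical names; the value 100002 is taken from the error log above, and the success value 0 is an assumption. The initializer is passed in as a callable so the sketch builds without the CANN headers; in real code it would wrap `aclInit(nullptr)`.

```cpp
#include <stdexcept>
#include <string>

constexpr int kAclSuccess = 0;                     // assumed success code
constexpr int kAclErrorRepeatInitialize = 100002;  // value seen in the error log

// Hypothetical helper: calls `init_fn` (a stand-in for aclInit) and returns
// true if this call performed the initialization, false if the runtime had
// already been initialized elsewhere (for example by torch_npu).
// Any other error code is reported as an exception.
template <typename InitFn>
bool InitializeAclOnce(InitFn&& init_fn) {
  const int ret = init_fn();
  if (ret == kAclSuccess) return true;
  if (ret == kAclErrorRepeatInitialize) return false;
  throw std::runtime_error("ACL init failed with code " + std::to_string(ret));
}
```

Returning whether this call owned the initialization lets the caller decide whether it should also be responsible for finalizing the runtime at shutdown.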