-
Notifications
You must be signed in to change notification settings - Fork 109
Add AppDomain debugging support, and fix some problems it turned up #644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm not entirely sure why this has just become a problem now.
|
This should be ready now. The AppDomain checking is disabled in non-Editor builds. Editor builds (even release ones) do the AppDomain checking, which makes ObjectHandles 50% bigger than they used to be and requires an extra comparison when dereferencing, but I think we can live with that. I still can't explain why we still sometimes get "invalid GC Handle" errors once in awhile when entering Play mode. As mentioned above, it's possible (but maybe not likely?) that this isn't even a Cesium bug. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kring ! I have a few comments but most are questions with potentially no action. Although I still see some invalid GC handle errors (as you acknowledged), they are significantly fewer than before, which is a great improvement!
| this->_handle = nullptr; | ||
| } | ||
| this->_appDomainIndex = _runningAppDomainIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the copies use runningAppDomainIndex while the moves use rhs._appDomainIndex -- is this in case the copy happens during an AppDomain reload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't actually matter... but I think of it like this:
- When moving, we're just moving some existing handle to a new place in memory. There's no particular reason to spend time evaluating the validity of that handle when we do that. We'll do that later when we dereference it or free it or whatever.
- When copying, we must have a valid handle. So we want to do all our appdomain validation logic. But if
rhs._appDomainIndex != runningAppDomainIndex, thenrhs.GetRaw()will return nullptr anyway and then the app domain index doesn't matter. So we could have initialized_appDomainIndex(rhs._appDomainIndex)in the copy constructor, too and it wouldn't change any behavior. I think it would be slightly less clear, though.
| pThis->_isLoadingProfile = false; | ||
| pThis->_profile = std::nullopt; | ||
| pThis->broadcastProfileUpdate(); | ||
| pThis->_profile.emplace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering what the effects of this were, and so I traced a potential issue to the function refreshProfileIfNeeded. But then I discovered that we never call that function in the codebase anyways! Maybe we could delete all those refresh___IfNeeded functions to confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, I made this change to avoid a refresh loop:
- The profile is loading.
- Unity wants to reload the AppDomain, so we cancel all requests, including the profile request.
- The canceled request causes this catchInMainThread continuation to be invoked.
- The continuation calls
BroadcastProfileUpdate, which accesses the profile, which causes it to be requested again. - We block forever waiting for the session to not be busy.
But yeah these refresh-if-needed functions look pretty useless. I'll remove them.
What kind of errors? If it was various flavors of "request canceled because the AppDomain is unloading", that's expected. Maybe we could hide those from the user, but that will take some doing. If you saw other errors, like GCHandle errors for example, that's more concerning.
No that's a valid thing to do. Does it happen consistently for you? If you see it again, can you attach the debugger (VS or VS Code will work fine) to the unity.exe process, click Retry, and tell me what your call stack looks like? |
It was definitely a flux of GCHandle errors. I also saw the "request canceled because the AppDomain is unloading" logs but I figured that was fine and intentional!
I just tried to reproduce it a couple times but I couldn't 🤔 I don't remember if I did anything specific to trigger it, but maybe it was just a fluke. I also tried reproducing the 45+ GCHandle errors, but now I'm only getting ~3. So I'll let you know if I ever run into it again, I suppose? |
🤦 And no call stack associated with any of them, I assume? |
3 is a lot! The most I've ever seen in this branch is one, and only rarely. It happens when entering play mode. |
|
I was able to reproduce the assertion when exiting Unity while in Play mode. The problem was that, while finalizing, it was attempting to dispose a UnityWebRequest that had already been disposed. This caused a managed exception to be thrown, and then we asserted when creating an ObjectHandle for the managed exception. Easy fix: don't dispose a request that has already been disposed. |
|
Thanks for the review @j9liu! I believe I've addressed everything, except that there may still be a lingering GCHandle issue. |
j9liu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @kring !

This PR includes and requires CesiumGS/cesium-native#1291 so merge that first.
Continuing the quest for AppDomain reload robustness...
First, this PR adds a system for detecting cases where we attempt to use a reference from one AppDomain in a different one. The idea is to capture which AppDomain we're in whenever we construct an
ObjectHandle. Then, later, when we use it, make sure we're still in the same AppDomain. It's simple in theory, but of course there are difficulties...The main problem is that it's hard to tell what AppDomain we're in. AppDomains have an
Idproperty, which uniquely identifies them. Perfect! Except it only uniquely identifies them among AppDomains that are currently active. A new AppDomain can have the same ID as one that has now been unloaded. 🙄 So that doesn't work at all for our purposes.I thought maybe we could use the AppDomain's
GetHashCode, and it would at least be mostly unique for each AppDomain, even if it's not guaranteed to be so. But no,GetHashCodeseems to be implemented to basically return a hash of the Id. So that doesn't work for our purposes either.My next idea was to track the AppDomain manually. In Unity, we know there's only one active at a time, even though this isn't guaranteed in .NET in general. So we can say that a new AppDomain starts when
initializeReinteropis called, and it ends when the AppDomain'sDomainUnloadevent is raised. So an ObjectHandle created inside one[initializeReinterop, DomainUnload]interval should never be used in a different such interval. We can work with that.But what about when a handle is created or destroyed outside of any such interval? Well, handle creation outside the interval should never happen. So we'll assert on that.
Destruction is not so simple. The problem is finalization. Mono calls
DomainUnloadand then it invokes finalizers for all objects in the AppDomain. In our case, all of our C# classes backed by a C++ Impl need to be finalized so that the Impls can be destroyed. In some cases, the Impls hold references to other managed objects. When the Impls are destroyed, these references will be released. Outside our happy interval. So we can't assert in this case.This isn't great, though, because actual bugs could look just like this. We create an ObjectHandle inside one AppDomain interval, and then release it too late, after the AppDomain is gone but before the
initializeReinteropis called for the next one. Can we detect this scenario somehow?Yes, because we can ask the current AppDomain if it is currently finalizing objects for unload. So the logic is, a handle is valid (using it does not assert) if:
In the latter case, we don't assert, but we also treat the reference as if it's null. Effectively, as soon as the DomainUnload event is raised, all managed references held by our native code immediately become null. I think this is a good idea because the purpose of a finalizer is solely to release native resources. Releasing a managed reference is pointless, because all managed objects are by definition dead at this stage. And using a managed reference is worse than pointless - it's dangerous! - because the object may have already been finalized. The .NET doc on finalizers says:
So with this debugging system in place, I did all kinds of terrible things to trigger AppDomain unloads at the worst times, and watched for assertions. I saw a couple, and those are fixed in this PR as well. Specifically:
_updateInEditorCallbackso it doesn't need to be finalized.Also fixed a case where
ein generatedcatch (Exception e)conflicted with the name of a parameter. I renamed it toreinteropManagedException.This PR is currently draft because this AppDomain tracking probably adds some overhead. We should at least disable it in non-Editor builds. Maybe it should be optional even in Debug/Editor builds?
Also draft because, after all this, I still (rarely) get this error:
I can't really think of how this can possibly happen anymore. With the debugging system in place, any error like this should trigger an assertion failure, not a stackless error message like this. Perhaps it's overly optimistic, but I kind of want to say this isn't us! That some other part of Unity is attempting to use a reference across AppDomains.
But maybe there's some subtle way this can still happen, and I just haven't thought of it yet...