Skip to content

Conversation

@kring
Copy link
Member

@kring kring commented Dec 18, 2025

This PR includes and requires CesiumGS/cesium-native#1291 so merge that first.

Continuing the quest for AppDomain reload robustness...

First, this PR adds a system for detecting cases where we attempt to use a reference from one AppDomain in a different one. The idea is to capture which AppDomain we're in whenever we construct an ObjectHandle. Then, later, when we use it, make sure we're still in the same AppDomain. It's simple in theory, but of course there are difficulties...

The main problem is that it's hard to tell what AppDomain we're in. AppDomains have an Id property, which uniquely identifies them. Perfect! Except it only uniquely identifies them among AppDomains that are currently active. A new AppDomain can have the same ID as one that has now been unloaded. 🙄 So that doesn't work at all for our purposes.

I thought maybe we could use the AppDomain's GetHashCode, and it would at least be mostly unique for each AppDomain, even if it's not guaranteed to be so. But no, GetHashCode seems to be implemented to basically return a hash of the Id. So that doesn't work for our purposes either.

My next idea was to track the AppDomain manually. In Unity, we know there's only one active at a time, even though this isn't guaranteed in .NET in general. So we can say that a new AppDomain starts when initializeReinterop is called, and it ends when the AppDomain's DomainUnload event is raised. So an ObjectHandle created inside one [initializeReinterop, DomainUnload] interval should never be used in a different such interval. We can work with that.

But what about when a handle is created or destroyed outside of any such interval? Well, handle creation outside the interval should never happen. So we'll assert on that.

Destruction is not so simple. The problem is finalization. Mono calls DomainUnload and then it invokes finalizers for all objects in the AppDomain. In our case, all of our C# classes backed by a C++ Impl need to be finalized so that the Impls can be destroyed. In some cases, the Impls hold references to other managed objects. When the Impls are destroyed, these references will be released. Outside our happy interval. So we can't assert in this case.

This isn't great, though, because actual bugs could look just like this. We create an ObjectHandle inside one AppDomain interval, and then release it too late, after the AppDomain is gone but before the initializeReinterop is called for the next one. Can we detect this scenario somehow?

Yes, because we can ask the current AppDomain if it is currently finalizing objects for unload. So the logic is, a handle is valid (using it does not assert) if:

  1. It is being used in the same AppDomain happy interval in which it was created, or
  2. The current AppDomain is finalizing for unload.

In the latter case, we don't assert, but we also treat the reference as if it's null. Effectively, as soon as the DomainUnload event is raised, all managed references held by our native code immediately become null. I think this is a good idea because the purpose of a finalizer is solely to release native resources. Releasing a managed reference is pointless, because all managed objects are by definition dead at this stage. And using a managed reference is worse than pointless - it's dangerous! - because the object may have already been finalized. The .NET doc on finalizers says:

Don't access managed object members from a finalizer. During finalization, managed objects might already be disposed, making them unavailable or in an invalid state. Only access unmanaged resources directly from finalizers.

So with this debugging system in place, I did all kinds of terrible things to trigger AppDomain unloads at the worst times, and watched for assertions. I saw a couple, and those are fixed in this PR as well. Specifically:

  1. Added code to wait for async CesiumIonSession work to complete before allowing the AppDomain unload. Similar to what we were already doing for tilesets and raster overlays.
  2. Explicitly disposed tileset's _updateInEditorCallback so it doesn't need to be finalized.

Also fixed a case where e in generated catch (Exception e) conflicted with the name of a parameter. I renamed it to reinteropManagedException.

This PR is currently draft because this AppDomain tracking probably adds some overhead. We should at least disable it in non-Editor builds. Maybe it should be optional even in Debug/Editor builds?

Also draft because, after all this, I still (rarely) get this error:

Release of invalid GC handle. The handle is from a previous domain. The release operation is skipped.

I can't really think of how this can possibly happen anymore. With the debugging system in place, any error like this should trigger an assertion failure, not a stackless error message like this. Perhaps it's overly optimistic, but I kind of want to say this isn't us! That some other part of Unity is attempting to use a reference across AppDomains.

But maybe there's some subtle way this can still happen, and I just haven't thought of it yet...

@kring
Copy link
Member Author

kring commented Dec 19, 2025

This should be ready now. The AppDomain checking is disabled in non-Editor builds. Editor builds (even release ones) do the AppDomain checking, which makes ObjectHandles 50% bigger than they used to be and requires an extra comparison when dereferencing, but I think we can live with that. I still can't explain why we still sometimes get "invalid GC Handle" errors once in awhile when entering Play mode. As mentioned above, it's possible (but maybe not likely?) that this isn't even a Cesium bug.

@kring kring marked this pull request as ready for review December 19, 2025 02:18
Copy link
Contributor

@j9liu j9liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kring ! I have a few comments but most are questions with potentially no action. Although I still see some invalid GC handle errors (as you acknowledged), they are significantly fewer than before, which is a great improvement!

this->_handle = nullptr;
}
this->_appDomainIndex = _runningAppDomainIndex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the copies use runningAppDomainIndex while the moves use rhs._appDomainIndex -- is this in case the copy happens during an AppDomain reload?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't actually matter... but I think of it like this:

  1. When moving, we're just moving some existing handle to a new place in memory. There's no particular reason to spend time evaluating the validity of that handle when we do that. We'll do that later when we dereference it or free it or whatever.
  2. When copying, we must have a valid handle. So we want to do all our appdomain validation logic. But if rhs._appDomainIndex != runningAppDomainIndex, then rhs.GetRaw() will return nullptr anyway and then the app domain index doesn't matter. So we could have initialized _appDomainIndex(rhs._appDomainIndex) in the copy constructor, too and it wouldn't change any behavior. I think it would be slightly less clear, though.

pThis->_isLoadingProfile = false;
pThis->_profile = std::nullopt;
pThis->broadcastProfileUpdate();
pThis->_profile.emplace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering what the effects of this were, and so I traced a potential issue to the function refreshProfileIfNeeded. But then I discovered that we never call that function in the codebase anyways! Maybe we could delete all those refresh___IfNeeded functions to confirm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, I made this change to avoid a refresh loop:

  1. The profile is loading.
  2. Unity wants to reload the AppDomain, so we cancel all requests, including the profile request.
  3. The canceled request causes this catchInMainThread continuation to be invoked.
  4. The continuation calls BroadcastProfileUpdate, which accesses the profile, which causes it to be requested again.
  5. We block forever waiting for the session to not be busy.

But yeah these refresh-if-needed functions look pretty useless. I'll remove them.

@j9liu
Copy link
Contributor

j9liu commented Dec 22, 2025

Just realized I was testing the build on Release and not Debug, so I tried it out on the latter. The first time I did an AppDomain reload I got one big batch of errors (45), but afterwards I couldn't see them. Maybe because the ion requests were cached?

But I did run into this when I exited mid-play mode in the first sample level, while tiles were still loading (arguably evil behavior 😅):

image

@kring
Copy link
Member Author

kring commented Dec 22, 2025

The first time I did an AppDomain reload I got one big batch of errors

What kind of errors? If it was various flavors of "request canceled because the AppDomain is unloading", that's expected. Maybe we could hide those from the user, but that will take some doing. If you saw other errors, like GCHandle errors for example, that's more concerning.

But I did run into this when I exited mid-play mode in the first sample level, while tiles were still loading (arguably evil behavior 😅)

No that's a valid thing to do. Does it happen consistently for you? If you see it again, can you attach the debugger (VS or VS Code will work fine) to the unity.exe process, click Retry, and tell me what your call stack looks like?

@j9liu
Copy link
Contributor

j9liu commented Dec 22, 2025

What kind of errors? If it was various flavors of "request canceled because the AppDomain is unloading", that's expected. Maybe we could hide those from the user, but that will take some doing. If you saw other errors, like GCHandle errors for example, that's more concerning.

It was definitely a flux of GCHandle errors. I also saw the "request canceled because the AppDomain is unloading" logs but I figured that was fine and intentional!

No that's a valid thing to do. Does it happen consistently for you? If you see it again, can you attach the debugger (VS or VS Code will work fine) to the unity.exe process, click Retry, and tell me what your call stack looks like?

I just tried to reproduce it a couple times but I couldn't 🤔 I don't remember if I did anything specific to trigger it, but maybe it was just a fluke.

I also tried reproducing the 45+ GCHandle errors, but now I'm only getting ~3. So I'll let you know if I ever run into it again, I suppose?

@kring
Copy link
Member Author

kring commented Dec 23, 2025

It was definitely a flux of GCHandle errors.

🤦 And no call stack associated with any of them, I assume?

@kring
Copy link
Member Author

kring commented Dec 23, 2025

I also tried reproducing the 45+ GCHandle errors, but now I'm only getting ~3.

3 is a lot! The most I've ever seen in this branch is one, and only rarely. It happens when entering play mode.

kring and others added 3 commits December 23, 2025 22:56
@kring
Copy link
Member Author

kring commented Dec 23, 2025

I was able to reproduce the assertion when exiting Unity while in Play mode. The problem was that, while finalizing, it was attempting to dispose a UnityWebRequest that had already been disposed. This caused a managed exception to be thrown, and then we asserted when creating an ObjectHandle for the managed exception. Easy fix: don't dispose a request that has already been disposed.

@kring
Copy link
Member Author

kring commented Dec 23, 2025

Thanks for the review @j9liu! I believe I've addressed everything, except that there may still be a lingering GCHandle issue.

Copy link
Contributor

@j9liu j9liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @kring !

@j9liu j9liu merged commit 276f362 into main Jan 1, 2026
8 checks passed
@j9liu j9liu deleted the more-appdomain-fun branch January 1, 2026 18:34
@j9liu j9liu added this to the January 2026 Release milestone Jan 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants