Skip to content

Conversation

@caspark
Copy link
Contributor

@caspark caspark commented Dec 6, 2024

This fixes the case where the following happens:

  1. Sync layer requests roll back to frame N
  2. Sync layer requests advance to frame N+1
  3. Sync layer marks frame N as confirmed
  4. P2pSession's desync detection decides to send a checksum for frame N to peers, even though the checksum for frame N is now known to be incorrect (because ggrs-consuming-code has not had a chance to act on the advance frame request, so the checksum stored in frame N is from before the rollback).
  5. Peers receive the incorrect checksum and flip out thinking there is a desync.

Actually causing this bug to reliably manifest in practice is somewhat tricky; I was able to reliably do it in my game via the following:

  • high latencies (150ms on each client network interface, so 300ms round trip)
  • input delay == 0
  • desync detection interval set to 1
  • start all game clients at once
  • in the first advance, each client initially sends an input which is mispredicted by all other clients (so not the zeroed input).
  • and lastly that first input from each client must modify the resulting checksum in a way that is unique from each other client's input.

The concrete input I am sending in my game is approximately equivalent to "store the client handle in the game state", but when I tried modifying the example game to do similar then it didn't manifest this bug - no idea why. In any case, this patch fixes the issue so I'll leave it at that.

It also changes a < to <= in a seemingly-isolated way; this change is based on the rationale given at
https://github.com/gschup/ggrs/pull/64/files#r1368265924 no longer applying.


cc @johanhelsing since at one point you were also somewhat familiar with this code thanks to #64 - although that was quite a while ago :)

Alternative approaches and why I didn't take them:

  • "we can just skip sending a checksum if there's a rollback queued"
    • this is easy to detect because the first incorrect frame is easily accessible, but clients rely on receiving all checksums eventually so this isn't a good approach.
  • "when we would send a checksum but there's already a rollback queued, instead remember that we need to send a checksum the next time advance_frame() is called"
    • this is functionally the same as what this patch does, except it would require some bookkeeping to keep track of "hey send a checksum next frame". Just moving the desync detection logic to the start of advance_frame() is neater overall.

@caspark
Copy link
Contributor Author

caspark commented Dec 6, 2024

Oh, I forgot to mention in the commit notes: I needed >2 clients to trigger this reliably - even though in theory it ought to be possible with 2 clients.

@gschup
Copy link
Owner

gschup commented Dec 6, 2024

This stuff seems like an endless source of tiny bugs. Good job on catching this one! Now let's see if I can wrap my head around it, too :)

@caspark
Copy link
Contributor Author

caspark commented Dec 6, 2024

If I've learned anything in the last 3-4 weeks, it's that "endless source of tiny bugs" perfectly describes working on game netcode in general :)

Copy link
Owner

@gschup gschup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, it would be great if you could fix the merge conflicts :)

This fixes the case where the following happens:

1. Sync layer requests roll back to frame N
2. Sync layer requests advance to frame N+1
3. Sync layer marks frame N as confirmed
4. P2pSession's desync detection decides to send a checksum for frame N
   to peers, even though the checksum for frame N is now known to be incorrect,
   (because ggrs-consuming-code has not had a chance to act on the
   advance frame request, so the checksum stored in frame N is from
   before the rollback.)
5. Peers receive the incorrect checksum and flip out thinking there is a
   desync.

Actually causing this bug to reliably manifest in practice is somewhat
tricky; I was able to reliably do it in my game via the following:

* high input latencies (150ms on each client network interface, so 300ms
  round trip)
* input delay == 0
* desync detection interval set to 1
* start all game clients at once
* in the first advance, each client initially sends an input which is
  mispredicted by all other clients (so not the zeroed input).
* and lastly that first frame from each client must modify the resulting
  checksum in a way that is unique from each other client.

The concrete input I am sending in my game is approximately equivalent
to "store the client handle in the game state", but but when I tried
modifying the example game to do similar then it didn't manifest this
bug - no idea why. In any case, this patch fixes the issue so I'll leave
it at that.

It also changes a < to <= in a seemingly-isolated way; this change is
based on the rationale given at
https://github.com/gschup/ggrs/pull/64/files#r1368265924 no longer
applying.
@caspark caspark force-pushed the fix-rollback-desync-false-positive branch from 521b12c to 2325935 Compare December 12, 2024 10:56
@caspark
Copy link
Contributor Author

caspark commented Dec 12, 2024

Merge conflicts fixed!

@gschup gschup merged commit b66d18f into gschup:main Dec 12, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants