
Conversation

@Debilski (Member) commented Jun 11, 2025

Still a big WIP. Code needs lots of cleanup (+ fixes) and the network classes can still be simplified. Publishing this only to increase the personal pressure on finishing this. :)

In the old code, make_team creates a zmq.PAIR socket, starts a process with pelita-player and then immediately sends out a set_initial request to the remote player. We rely on zmq magic to keep the message queued while the remote socket is not ready yet, so that it eventually succeeds. In most cases it does, but sometimes this fails, often when a slow file system is involved. Usually, this means a loss for the slow player involved. (Sometimes it also goes unnoticed and then triggers a failure when the game sends the first move request.)

We add the missing logic (which is how it should have been done in the first place, as this is network design 101): once the remote player has started, it sends a status message of type "ok" to the main game to acknowledge its readiness. The team name is included as additional payload in this message, so that we can show this info in the UI as early as possible.
game.py waits for the oks of both remote teams simultaneously (with a timeout longer than 3 seconds) but fails as soon as one of them reports an error. If either of them has an error, the game is simply cancelled with a new failure state. There will be no winner. (Not super important in everyday situations, but this makes things clearer for CI.)
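
To illustrate the idea (a minimal sketch, not the actual pelita code; function name and message layout are made up here), the player side of this handshake boils down to something like:

import json
import zmq

def announce_readiness(address, team_name):
    # Connect to the address that the main game passed to the subprocess
    # and acknowledge readiness, with the team name as payload.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PAIR)
    sock.connect(address)
    sock.send_unicode(json.dumps({"__status__": "ok", "team_name": team_name}))
    return sock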

Ideally, we would also adapt the protocol to include health pings (to ensure we don’t have old bots running for days) but the server-less, dict-only structure of the pelita core doesn’t make this easy.

Last remark: Game phases (init, running, finished, failure) will be introduced to the game state, so we don’t have to do things like if round is None anymore, where no-one knows what it is supposed to mean. This will probably be string-based in the first iteration, but I hope to introduce a more elegant solution that would ideally also be usable with match statements.

Closes #784, closes #778, closes #785, closes #887, closes #908

  • Game phases need to be handled (+ improved in a later stage)
  • We could now get rid of set_initial as an individual step (and give the first move a longer timeout); I am undecided on this
  • RemoteTeam is still incomplete
  • Propagation of errors in user code needs to be better defined
  • Error handling needs to be properly tested
  • Tk should update the team names immediately as they arrive

@Debilski (Member Author)

Ready for review (if anyone wants to), but I will still add a few tweaks before merging (and fix remaining TODOs and docstrings).

A short explanation of what has been done follows.

Network/subprocess logic

All in all, not so many changes. Initially, I wanted to make the RemotePlayerConnection (formerly ZMQConnection in network.py) stateful and allow it to send out move queries to the remote player and await the reply, as well as to accept generic status messages (team name, health checks) from the remote player. I think the stateful design might still be added at some stage, but since RemotePlayerConnection does not run asynchronously, the health messages do not make a lot of sense.

Assuming we run pelita player1 player2, this is what happens:

  1. For each of the team specs player1 and player2, a zmq.PAIR socket is created and bound ($URL1, $URL2).
  2. A subprocess of pelita-player remote-game player1 $URL1 is started.
  3. pelita-player tries to import the player1 spec. (Analogously for player2).
  4. pelita-player connects to $URL1 and sends a message containing the team name of player1 (or an error message).
  5. The main pelita process runs a loop, awaiting the status messages from both subprocesses. In case one of them returns with an error, the loop is aborted early and pelita goes into "FAILURE" mode.
  6. The normal game then runs as usual. Pelita will send out a move request and the player will reply.

In case of a game with a server player (pelita://somehost/player1) steps 1 and 2 change to

  1. A zmq.DEALER socket is created and connects to pelita://somehost/player1.
  2. Pelita sends out a request message on that connection (the server will then start the subprocess and transfer everything transparently)
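
A minimal sketch of step 5 of the first list, i.e. how the main game could await the status messages of both subprocesses and abort early on an error (socket handling, message keys and the timeout value are illustrative, not the exact pelita implementation):

import json
import zmq

def wait_for_status(sockets, timeout_secs=10):
    # sockets: the two zmq.PAIR sockets bound for the subprocesses
    poller = zmq.Poller()
    for sock in sockets:
        poller.register(sock, zmq.POLLIN)

    pending = set(sockets)
    team_names = {}
    while pending:
        ready = dict(poller.poll(timeout=timeout_secs * 1000))  # milliseconds
        if not ready:
            raise RuntimeError("Timed out waiting for the remote players.")
        for sock in list(pending):
            if ready.get(sock) == zmq.POLLIN:
                msg = json.loads(sock.recv_unicode())
                if msg.get("__status__") != "ok":
                    # One subprocess failed to start: abort early, the game goes to "FAILURE"
                    raise RuntimeError(msg.get("error", "Remote player failed to start."))
                team_names[sock] = msg.get("team_name")
                pending.discard(sock)
    return team_names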

Error handling, timeouts

The Team class is responsible for catching any user errors on the Bot side (exceptions, bad return values). It transforms them into an error dict that is handled in bot_turn. Network failures raise an exception and are currently handled in game.py, where they also become a dict. (Or, in case of a timeout, a random move is created before passing the result to bot_turn. Timeouts are also counted separately now, triggering a fatal error when the threshold has been exceeded.)
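
To illustrate the idea (a sketch only, not the actual Team code; the function name and dict layout are made up here), the conversion of user errors into an error dict looks roughly like this:

def safe_get_move(move_fn, bot, game_state):
    # Run the user's move function and turn exceptions or bad return
    # values into an error dict instead of letting them propagate.
    try:
        move = move_fn(bot, game_state)
    except Exception as e:
        return {"error": (type(e).__name__, str(e))}
    if move not in bot.legal_positions:
        return {"error": ("ValueError", f"Illegal move: {move!r}")}
    return {"move": move}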

A new function add_fatal_error has been added that should be used instead of manually appending errors to the fatal_errors list. The function automatically sets the new game phase ("INIT" → "FAILURE", "RUNNING" → "FINISHED") without having to call check_gameover at several points in the game. check_gameover will now only check for wins based on round/food.
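
A rough sketch of the intended behaviour (the actual function in game.py may differ in detail, and the key names used here are illustrative):

def add_fatal_error(game_state, team_idx, error_type, description):
    # Record the error for the offending team ...
    game_state["fatal_errors"][team_idx].append({
        "type": error_type,
        "description": description,
        "round": game_state.get("round"),
    })
    # ... and switch the game phase in one central place instead of
    # calling check_gameover at several points in the game.
    if game_state["game_phase"] == "INIT":
        game_state["game_phase"] = "FAILURE"
    elif game_state["game_phase"] == "RUNNING":
        game_state["game_phase"] = "FINISHED"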

Game phases

The concept of game phases is introduced: When started, Pelita is in "INIT" phase. Once the game runs, it is in "RUNNING" phase, after which it should eventually end up in "FINISHED" phase. Both "INIT" and "RUNNING" can also end up in "FAILURE" phase.

Conceptually, the "FAILURE" phase has no winner, nor is it a draw. We use it, for example, for all fatal errors that are triggered during "INIT". The phase is currently only given as a string, but ideally we would model it like so (this will not be implemented in this PR):

import dataclasses

@dataclasses.dataclass
class InitPhase:
    pass

@dataclasses.dataclass
class RunningPhase:
    bot_turn: int
    round: int

@dataclasses.dataclass
class FinishedPhase:
    bot_turn: int
    round: int
    is_draw: bool
    winning_team: int|None

@dataclasses.dataclass
class FailurePhase:
    reason: str

This would clearly signify that, e.g., the INIT phase is not associated with any concept of a round. And it would, e.g., disallow game_state["winning_team"] while the game is still running. (Not an uncommon source of bugs.)

The usage would be straightforward with match statements:

match phase:
    case RunningPhase(bot_turn, round):
        ... # here we can use bot_turn, round
    case FinishedPhase(bot_turn, round, is_draw, winning_team):
        ... # here we can use winning_team ...

Not sure yet if we have to do it like that (match also works with plain dicts), but either way there should be a clear association between these keys in the game state and the game phase. I.e. when using a game state, one should always check the phase first (unless it is obvious from context).
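
For comparison, a dict-based variant could use mapping patterns, assuming the phase and its associated keys live in the game state dict (key names are illustrative):

match game_state:
    case {"game_phase": "RUNNING", "round": round, "bot_turn": bot_turn}:
        ...  # round and bot_turn are guaranteed to exist here
    case {"game_phase": "FINISHED", "winning_team": winning_team}:
        ...  # only now is it safe to look at winning_team
    case {"game_phase": "FAILURE", "reason": reason}:
        ...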

@Debilski Debilski marked this pull request as ready for review June 23, 2025 11:03
@otizonaizit (Member)

@Debilski is this ready for review? I see you are still pushing to it once in a while, so I am not sure if I should review or wait.

@Debilski (Member Author) commented Jul 3, 2025

It is 95% done (as far as this PR is concerned). I need to remove some code duplication in team.py and decide whether I keep the ConnectionState in network.py (probably not but I need to think about it).

Other than that, I will need to go through the list of TODOs that I introduced in this PR and fix them (or delete them) as well.

I guess you could already look through it for any problems. On the whole, things should not change substantially afterwards; most of what I still need to do is cosmetic. Anything bigger will get a new PR.

@Debilski Debilski force-pushed the feature/new-protocol branch 6 times, most recently from f2311e8 to 7c612ce Compare July 3, 2025 20:45
Debilski added 8 commits July 22, 2025 21:10
A keyboard interrupt during recv (or sigterm from a parent process)
would still cause a send on the socket, leading to a lock.
Timeouts are handled and counted in request_new_position and create a fatal_error when exceeding the limit.

Fatal errors are created with a new function add_fatal_error that also immediately changes the game phase to FINISHED or FAILURE, eliminating a few extra calls to check_gameover.
@Debilski Debilski force-pushed the feature/new-protocol branch from beb1c3a to 046d818 Compare July 30, 2025 10:46
@Debilski Debilski force-pushed the feature/new-protocol branch 2 times, most recently from ce187ed to d751b47 Compare July 30, 2025 13:55
@Debilski Debilski force-pushed the feature/new-protocol branch from d751b47 to 6925dc0 Compare July 30, 2025 14:05
@Debilski (Member Author)

All in all I’d say ready for review. I am still unhappy with how the remote team classes in team.py are written (code duplication, inheritance) but any rewrite there should functionally do the same thing (ideally, we’d find a structure that allows for the fewest number of bugs).

The stateful socket logic in network.py can also be improved upon, but most of the time that would only fix bugs which occur when the system is already failing anyway.

I added a new test for the network protocol in test/test_network.py::test_simpleclient_broken which tries to send garbage data to Pelita main at some points in the session and ensures that Pelita main does not crash. It looks a bit convoluted but feel free to add more special cases there.

@jbdyn (Collaborator) left a comment

Impressive changes! Cool that you put in so much effort.

I wish I could follow you on the changes on the network protocol, but I would need more time to wrap my head around the whole architecture, basically.

Despite that, I pointed out some other things, especially a removed if timeout is None check which causes pelita to crash on a missing timeout.

Hope you find my suggestions useful.

Comment on lines 17 to 18
def default_zmq_context(zmq_context=None):
return zmq.Context()
Collaborator:

Did you mean this?

def default_zmq_context(zmq_context=None):
    return zmq_context or zmq.Context()

Comment on lines +257 to +258
zmq_context = zmq.Context()
zmq_external_publisher = ZMQPublisher(address=viewer_opts, bind=False, zmq_context=zmq_context)
Collaborator:

No need for a manual zmq_context here, since it is created automatically in ZMQPublisher, so this can be shortened to

zmq_external_publisher = ZMQPublisher(address=viewer_opts, bind=False)

Comment on lines +261 to +264
zmq_context = zmq.Context()
zmq_publisher = ZMQPublisher(address='tcp://127.0.0.1', zmq_context=zmq_context)
viewer_state['viewers'].append(zmq_publisher)
viewer_state['controller'] = setup_controller()
viewer_state['controller'] = Controller(zmq_context=zmq_context)
Collaborator:

But here I assume that ZMQPublisher and Controller need to get the exact same zmq_context, right?

Then, this would be alright.

Member Author:

It doesn’t need to be the same, but it also doesn’t do any harm, so I reused it, yes.



@pytest.mark.xfail(reason="TODO: Fails in CI for macOS. Unclear why.")
#@pytest.mark.xfail(reason="TODO: Fails in CI for macOS. Unclear why.")
Collaborator:

Can probably be removed?

_logger = logging.getLogger(__name__)

# Maximum time that a player will wait for a move request
TIMEOUT_SECS = 60 * 60
Collaborator:

Why a timeout of 1 hour? Wouldn't one minute be sufficient?

Member Author:

This would mean that pausing a game for more than 60 seconds would cause a player to time out and exit (and in turn lose the game). Unless we implement health ping messages between the main game and the players, this time needs to be longer than a typical ‘I pause the game because I want to check something’ situation.

Ideally, the main game would exit before this time (so that the match is not counted as a loss for the player but registered as a failed match) but for now we’ll ignore this edge case. ;) (I’ll add a remark.)

Comment on lines +104 to +112
socks = dict(poller.poll(timeout=TIMEOUT_SECS * 1000))
if socks.get(socket) == zmq.POLLIN:
    json_message = socket.recv_unicode()
else:
    # TODO: Would be nice to tell Pelita main that we’re exiting
    _logger.warning(f"No request in {TIMEOUT_SECS} seconds. Exiting player.")

    # returning False breaks the loop
    return False
@jbdyn (Collaborator) commented Aug 23, 2025:

related to #890 (comment) for TIMEOUT_SECS above.

Comment on lines -227 to -235
# special case for no timeout
# just loop until we receive the correct reply
if timeout is None:
    while True:
        msg_id, reply = self._recv()
        if msg_id == expected_id:
            return reply

# normal timeout handling
@jbdyn (Collaborator) commented Aug 23, 2025:

Without this check, pelita crashes when called with pelita --no-timeout ....

To fix this, one could do the following:

if timeout is None:
    timeout = math.inf

and then see below for handling the infinite timeout.

Member Author:

Ah, good question, I don’t remember exactly why I removed it.

However: with the max timeout of a player being 1 hour, there is not really an infinite timeout anymore, so in any case this would never happen. I guess the bug fix would be to set --no-timeout to use something like one hour as well.
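
The fix could be as small as mapping the flag to a large but finite value where the options are parsed (just a sketch; the attribute names and the actual CLI handling in pelita may look different):

# instead of passing timeout=None around when --no-timeout is given:
if args.no_timeout:
    timeout_length = TIMEOUT_SECS  # e.g. one hour, same as the player-side limit
else:
    timeout_length = args.timeout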

Comment on lines 305 to 306

socks = dict(self.pollin.poll(time_left * 1000)) # poll needs milliseconds
@jbdyn (Collaborator) commented Aug 23, 2025:

For handling the infinite timeout (i.e. no timeout), we would then need to write these lines like this:

time_left_msec = None if time_left == math.inf else time_left * 1000
socks = dict(self.pollin.poll(time_left_msec))

Member Author:

Passing timeout=None is not supported anymore (as stated above).

Member:

Passing timeout=None is not supported anymore (as stated above).

OK, but right now the game crashes, so the behavior of --no-timeout needs to be adjusted to set a long timeout (1h is more than enough, I agree) before merging.

Member:

Also, if in the future we indeed implement some form of heartbeat protocol, we will need the infinite timeout again. So I think @jbdyn's implementation should be used.

Member Author:

Heartbeats will need a core redesign (practically not sensible to implement with our approach of passing everything around in a giant dict; going async or GIL-less would be the way to do it). I estimate some further 13 years.

@otizonaizit (Member)

I like the idea of game phases a lot! And also the new way of managing client errors/timeouts. I tried to review the changes, but it is a bit difficult because of the sheer number of changed lines... but thanks for the overviews above. Together with @jbdyn we noticed a couple of things, and a crash when --no-timeout is passed (see the thorough review above). I also could not understand if the CLI tournament still works. The tests are passing, but I have a feeling that the tests are not the whole story. But this is not a blocker. We could still have a hot-fix should something go wrong while we test the tournament by hand.

Thanks @Debilski for working on this for so long. This could be the basis to reorganize the network protocol and the internal state representations, maybe moving from strings to enums or so. But this is a discussion for the future. For the present, I'd like to understand if you'd like to see this in before ASPP in Bulgaria or if you think it should wait.

@Debilski (Member Author)

I think it is relatively uncontroversial and should be merged (after the bugs are found and fixed) but I’ll give some more explanations to the changes in the protocol later this week.

@Debilski (Member Author)

I wish I could follow you on the changes on the network protocol, but I would need more time to wrap my head around the whole architecture, basically.

In the absence of error states, the design is rather simple (as stated above):

The main game opens and binds a socket for each player (the socket is stored in the RemoteTeam objects in the game_state dict). The main game starts a subprocess for each player and passes the socket’s address. The subprocess connects and sends an ‘ok’ message with its team name.
The main game then sends a message with the initial game state (this will be simplified in #851) to each bot and awaits a dummy reply.
Then we enter the loop of the main game sending requests to each bot and awaiting new moves.
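
In sketch form, one iteration of that loop is a single request/reply exchange per bot (the function name and message layout here are illustrative, not the exact pelita wire format):

import json

def request_move(team_socket, game_state_for_bot):
    # Send the current game state and block until the player replies
    # (or the timeout handling in network.py kicks in).
    team_socket.send_unicode(json.dumps({"__action__": "get_move",
                                         "__data__": game_state_for_bot}))
    return json.loads(team_socket.recv_unicode())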

The only thing that has really changed is that the subprocess has to send an ‘ok’ message when it is ready. In the earlier design, the player could not initiate messages on its own. The main game would send a message directly after binding the socket (and likely before the player was connected). This was never a good design, but it worked in 99.5% of the cases because of zmq magic (zmq has an outgoing message queue and will still deliver these messages). But when a client was not starting fast enough, we would get weird error messages (#785, #784).

We used this design because it would allow us to make our main game network agnostic (network handling was strictly done in RemoteTeam objects) but I think we should eventually move away from this (I’ll create an issue at some point) and embrace a big while loop again that manages it all.

I tried to review the changes but it is a bit difficult because of the sheer number of changed lines...

The bigger changes in the PR are some restructuring of how and where exceptions are handled (which is kind of unrelated to the protocol change) and some reorganisation of the Team objects and network. I think these changes are not completely finished yet but I’d like to merge so that we can use the more robust protocol and the game phase.

I also could not understand if the CLI tournament still works.

I think it should work. But ideally, in the future it would make use of the game_phase and only register properly "FINISHED" games.

This could be the basis to reorganize the network protocol and the internal state representations, maybe moving from strings to enums or so.

There are some reorganisations coming in #851 (ideally also before Plovdiv but let’s see). I think Python enums don’t give us any advantage here over strings now that type checkers are good enough to handle these things. (A Rust-style enum would help greatly but we don’t have that unfortunately.)

@Debilski (Member Author)

Any further requests or questions here? Otherwise I would suggest merging, because #851, #889 (and #903) depend on it. I’ll take care of fixing any outstanding bugs, and the docstrings will be improved with or after #851 (there are still going to be a few changes here and I would like to settle the design first).

@otizonaizit otizonaizit merged commit 698c4fb into ASPP:main Aug 28, 2025
53 of 54 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 28, 2025
[WIP] Improved network protocol (more stable on slow clients) 698c4fb
@Debilski Debilski mentioned this pull request Sep 3, 2025
@Debilski Debilski deleted the feature/new-protocol branch September 11, 2025 20:59