
Conversation

@Debilski (Member) commented Jun 11, 2025

Still a big WIP. Code needs lots of cleanup (+ fixes) and the network classes can still be simplified. Publishing this only to increase the personal pressure on finishing this. :)

In the old code, make_team creates a zmq.PAIR socket, starts a process with pelita-player and then immediately sends out a set_initial request to the remote player. We rely on zmq magic to keep the message queued while the remote socket is not ready yet, so that it eventually succeeds. In most cases it does, but sometimes this fails, often when a slow file system is involved. Usually, this means a loss for the slow player involved. (Sometimes it also goes unnoticed and then triggers a failure when the game sends the first move request.)

We add the missing logic (which is how it should have been done in the first place, as this is network design 101): once the remote player has started, it sends a status message of type "ok" to the main game to acknowledge its readiness. The team name is included as additional payload in this message, so that we can show this info in the UI as early as possible.
game.py waits for the oks of both remote teams simultaneously (with a timeout longer than 3 seconds) but fails as soon as one of them reports an error. If either of them has an error, the game is simply cancelled with a new failure state. There will be no winner. (Not super important in everyday situations, but this makes things clearer for CI.)
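
To illustrate the idea (a minimal sketch, not the actual pelita code; function name and message layout are made up here), the player side of this handshake boils down to something like:

import json
import zmq

def announce_readiness(address, team_name):
    # Connect to the address that the main game passed to the subprocess
    # and acknowledge readiness, with the team name as payload.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PAIR)
    sock.connect(address)
    sock.send_unicode(json.dumps({"__status__": "ok", "team_name": team_name}))
    return sock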

Ideally, we would also adapt the protocol to include health pings (to ensure we don’t have old bots running for days) but the server-less, dict-only structure of the pelita core doesn’t make this easy.

Last remark: Game phases (init, running, finished, failure) will be introduced to the game state, so we don’t have to do things like if round is None anymore, where no-one knows what it is supposed to mean. This will probably be string-based in the first iteration, but I hope to introduce a more elegant solution that would ideally also be usable with match statements.

Closes #784, closes #778, closes #785, closes #887, closes #908

  • Game phases need to be handled (+ improved in a later stage)
  • We could now get rid of set_initial as an individual step (and give the first move a longer timeout); I am undecided on this
  • RemoteTeam is still incomplete
  • Propagation of errors in user code needs to be better defined
  • Error handling needs to be properly tested
  • Tk should update the team names immediately as they arrive

@Debilski (Member Author)

Ready for review (if anyone wants to), but I will still add a few tweaks before merging (and fix remaining TODOs and docstrings).

A short explanation of what has been done follows.

Network/subprocess logic

All in all, not so many changes. Initially, I wanted to make the RemotePlayerConnection (formerly ZMQConnection in network.py) stateful and allow it to send out move queries to the remote player and await the reply, as well as to accept generic status messages (team name, health checks) from the remote player. I think the stateful design might still be added at some stage, but since RemotePlayerConnection does not run asynchronously, the health messages do not make a lot of sense.

Assuming we run pelita player1 player2, this is what happens:

  1. For each of the team specs player1 and player2, a zmq.PAIR socket is created and bound ($URL1, $URL2).
  2. A subprocess of pelita-player remote-game player1 $URL1 is started.
  3. pelita-player tries to import the player1 spec. (Analogously for player2).
  4. pelita-player connects to $URL1 and sends a message containing the team name of player1 (or an error message).
  5. The main pelita process runs a loop, awaiting the status messages from both subprocesses. In case one of them returns with an error, the loop is aborted early and pelita goes into "FAILURE" mode.
  6. The normal game then runs as usual. Pelita will send out a move request and the player will reply.

In case of a game with a server player (pelita://somehost/player1) steps 1 and 2 change to

  1. A zmq.DEALER socket is created and connects to pelita://somehost/player1.
  2. Pelita sends out a request message on that connection (the server will then start the subprocess and transfer everything transparently)
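
A minimal sketch of step 5 of the first list, i.e. how the main game could await the status messages of both subprocesses and abort early on an error (socket handling, message keys and the timeout value are illustrative, not the exact pelita implementation):

import json
import zmq

def wait_for_status(sockets, timeout_secs=10):
    # sockets: the two zmq.PAIR sockets bound for the subprocesses
    poller = zmq.Poller()
    for sock in sockets:
        poller.register(sock, zmq.POLLIN)

    pending = set(sockets)
    team_names = {}
    while pending:
        ready = dict(poller.poll(timeout=timeout_secs * 1000))  # milliseconds
        if not ready:
            raise RuntimeError("Timed out waiting for the remote players.")
        for sock in list(pending):
            if ready.get(sock) == zmq.POLLIN:
                msg = json.loads(sock.recv_unicode())
                if msg.get("__status__") != "ok":
                    # One subprocess failed to start: abort early, the game goes to "FAILURE"
                    raise RuntimeError(msg.get("error", "Remote player failed to start."))
                team_names[sock] = msg.get("team_name")
                pending.discard(sock)
    return team_names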

Error handling, timeouts

The Team class is responsible for catching any user errors on the Bot side (exceptions, bad return values). It transforms them into an error dict that is handled in bot_turn. Network failures raise an exception and are currently handled in game.py, where they also become a dict. (Or, in case of a timeout, a random move is created before passing the result to bot_turn. Timeouts are also counted separately now, triggering a fatal error when the threshold has been exceeded.)
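
To illustrate the idea (a sketch only, not the actual Team code; the function name and dict layout are made up here), the conversion of user errors into an error dict looks roughly like this:

def safe_get_move(move_fn, bot, game_state):
    # Run the user's move function and turn exceptions or bad return
    # values into an error dict instead of letting them propagate.
    try:
        move = move_fn(bot, game_state)
    except Exception as e:
        return {"error": (type(e).__name__, str(e))}
    if move not in bot.legal_positions:
        return {"error": ("ValueError", f"Illegal move: {move!r}")}
    return {"move": move}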

A new function add_fatal_error has been added that should be used instead of manually appending errors to the fatal_errors list. The function automatically sets the new game phase ("INIT" → "FAILURE", "RUNNING" → "FINISHED") without having to call check_gameover at several points in the game. check_gameover will now only check for wins based on round/food.
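
A rough sketch of the intended behaviour (the actual function in game.py may differ in detail, and the key names used here are illustrative):

def add_fatal_error(game_state, team_idx, error_type, description):
    # Record the error for the offending team ...
    game_state["fatal_errors"][team_idx].append({
        "type": error_type,
        "description": description,
        "round": game_state.get("round"),
    })
    # ... and switch the game phase in one central place instead of
    # calling check_gameover at several points in the game.
    if game_state["game_phase"] == "INIT":
        game_state["game_phase"] = "FAILURE"
    elif game_state["game_phase"] == "RUNNING":
        game_state["game_phase"] = "FINISHED"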

Game phases

The concept of game phases is introduced: When started, Pelita is in "INIT" phase. Once the game runs, it is in "RUNNING" phase, after which it should eventually end up in "FINISHED" phase. Both "INIT" and "RUNNING" can also end up in "FAILURE" phase.

Conceptually, the "FAILURE" phase has no winner, nor is it a draw. We use it, for example, for all fatal errors that are triggered during "INIT". The phase is currently only given as a string, but ideally we would model it like so (this will not be implemented in this PR):

import dataclasses

@dataclasses.dataclass
class InitPhase:
    pass

@dataclasses.dataclass
class RunningPhase:
    bot_turn: int
    round: int

@dataclasses.dataclass
class FinishedPhase:
    bot_turn: int
    round: int
    is_draw: bool
    winning_team: int|None

@dataclasses.dataclass
class FailurePhase:
    reason: str

This would clearly signify that, e.g., the INIT phase is not associated with any concept of a round. And it would, e.g., disallow game_state["winning_team"] while the game is still running. (Not an uncommon source of bugs.)

The usage would be straightforward with match statements:

match phase:
    case RunningPhase(bot_turn, round):
        ... # here we can use bot_turn, round
    case FinishedPhase(bot_turn, round, is_draw, winning_team):
        ... # here we can use winning_team ...

Not sure yet if we have to do it like that (match also works with plain dicts), but either way there should be a clear association between these keys in the game state and the game phase. I.e. when using a game state, one should always check the phase first (unless it is obvious from context).
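
For comparison, a dict-based variant could use mapping patterns, assuming the phase and its associated keys live in the game state dict (key names are illustrative):

match game_state:
    case {"game_phase": "RUNNING", "round": round, "bot_turn": bot_turn}:
        ...  # round and bot_turn are guaranteed to exist here
    case {"game_phase": "FINISHED", "winning_team": winning_team}:
        ...  # only now is it safe to look at winning_team
    case {"game_phase": "FAILURE", "reason": reason}:
        ...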

@Debilski Debilski marked this pull request as ready for review June 23, 2025 11:03
@otizonaizit (Member)

@Debilski is this ready for review? I see you are still pushing to it once in a while, so I am not sure if I should review or wait.

@Debilski (Member Author) commented Jul 3, 2025

It is 95% done (as far as this PR is concerned). I need to remove some code duplication in team.py and decide whether I keep the ConnectionState in network.py (probably not but I need to think about it).

Other than that, I will need to go through the list of TODOs that I introduced in this PR and fix them (or delete them) as well.

I guess you could already look through it for any problems. On the whole, things should not change substantially afterwards; most of what I still need to do is cosmetic. Anything bigger will get a new PR.

@Debilski Debilski force-pushed the feature/new-protocol branch 6 times, most recently from f2311e8 to 7c612ce Compare July 3, 2025 20:45
Debilski added 8 commits July 22, 2025 21:10
A keyboard interrupt during recv (or sigterm from a parent process)
would still cause a send on the socket, leading to a lock.
Timeouts are handled and counted in request_new_position and create a fatal_error when exceeding the limit.

Fatal errors are created with a new function add_fatal_error that also immediately changes the game phase to FINISHED or FAILURE, eliminating a few extra calls to check_gameover.
@Debilski Debilski force-pushed the feature/new-protocol branch from beb1c3a to 046d818 Compare July 30, 2025 10:46
@Debilski Debilski force-pushed the feature/new-protocol branch 2 times, most recently from ce187ed to d751b47 Compare July 30, 2025 13:55
@Debilski Debilski force-pushed the feature/new-protocol branch from d751b47 to 6925dc0 Compare July 30, 2025 14:05
@Debilski (Member Author)

All in all I’d say ready for review. I am still unhappy with how the remote team classes in team.py are written (code duplication, inheritance) but any rewrite there should functionally do the same thing (ideally, we’d find a structure that allows for the fewest number of bugs).

The stateful socket logic in network.py can also be improved upon, but most of the time that would only fix bugs which occur when the system is already failing anyway.

I added a new test for the network protocol in test/test_network.py::test_simpleclient_broken which tries to send garbage data to Pelita main at some points in the session and ensures that Pelita main does not crash. It looks a bit convoluted but feel free to add more special cases there.

@jbdyn (Collaborator) left a comment

Impressive changes! Cool that you put in so much effort.

I wish I could follow you on the changes on the network protocol, but I would need more time to wrap my head around the whole architecture, basically.

Despite that, I pointed out some other things, especially a removed if timeout is None check which causes pelita to crash on a missing timeout.

Hope you find my suggestions useful.

Comment on lines 17 to 18
def default_zmq_context(zmq_context=None):
return zmq.Context()
Collaborator:

Did you mean this?

def default_zmq_context(zmq_context=None):
    return zmq_context or zmq.Context()

Comment on lines +257 to +258
zmq_context = zmq.Context()
zmq_external_publisher = ZMQPublisher(address=viewer_opts, bind=False, zmq_context=zmq_context)
Collaborator:

No need for a manual zmq_context here, since it is created automatically in ZMQPublisher, so this can be shortened to

zmq_external_publisher = ZMQPublisher(address=viewer_opts, bind=False)

Comment on lines +261 to +264
zmq_context = zmq.Context()
zmq_publisher = ZMQPublisher(address='tcp://127.0.0.1', zmq_context=zmq_context)
viewer_state['viewers'].append(zmq_publisher)
viewer_state['controller'] = setup_controller()
viewer_state['controller'] = Controller(zmq_context=zmq_context)
Collaborator:

But here I assume that ZMQPublisher and Controller need to get the exact same zmq_context, right?

Then, this would be alright.

Member Author:

It doesn’t need to be the same, but it also doesn’t do any harm, so I reused it, yes.



@pytest.mark.xfail(reason="TODO: Fails in CI for macOS. Unclear why.")
#@pytest.mark.xfail(reason="TODO: Fails in CI for macOS. Unclear why.")
Collaborator:

Can probably be removed?

_logger = logging.getLogger(__name__)

# Maximum time that a player will wait for a move request
TIMEOUT_SECS = 60 * 60
Collaborator:

Why a timeout of 1 hour? Wouldn't one minute be sufficient?

Member Author:

This would mean that pausing a game for more than 60 seconds would cause a player to time out and exit (and in turn lose the game). Unless we implement health ping messages between the main game and the players, this time needs to be longer than a typical ‘I pause the game because I want to check something’ situation.

Ideally, the main game would exit before this time (so that the match is not counted as a loss for the player but registered as a failed match) but for now we’ll ignore this edge case. ;) (I’ll add a remark.)

Comment on lines +104 to +112
socks = dict(poller.poll(timeout=TIMEOUT_SECS * 1000))
if socks.get(socket) == zmq.POLLIN:
    json_message = socket.recv_unicode()
else:
    # TODO: Would be nice to tell Pelita main that we’re exiting
    _logger.warning(f"No request in {TIMEOUT_SECS} seconds. Exiting player.")

    # returning False breaks the loop
    return False
@jbdyn (Collaborator) commented Aug 23, 2025:

related to #890 (comment) for TIMEOUT_SECS above.

Comment on lines -227 to -235
# special case for no timeout
# just loop until we receive the correct reply
if timeout is None:
    while True:
        msg_id, reply = self._recv()
        if msg_id == expected_id:
            return reply

# normal timeout handling
@jbdyn (Collaborator) commented Aug 23, 2025:

Without this check, pelita crashes when called with pelita --no-timeout ....

To fix this, one could do the following:

if timeout is None:
    timeout = math.inf

and then see below for handling the infinite timeout.

Member Author:

Ah, good question, I don’t remember exactly why I removed it.

However: with the max timeout of a player being 1 hour, there is not really an infinite timeout anymore, so in any case this would never happen. I guess the bug fix would be to set --no-timeout to use something like one hour as well.
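
The fix could be as small as mapping the flag to a large but finite value where the options are parsed (just a sketch; the attribute names and the actual CLI handling in pelita may look different):

# instead of passing timeout=None around when --no-timeout is given:
if args.no_timeout:
    timeout_length = TIMEOUT_SECS  # e.g. one hour, same as the player-side limit
else:
    timeout_length = args.timeout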

Comment on lines 305 to 306

socks = dict(self.pollin.poll(time_left * 1000)) # poll needs milliseconds
@jbdyn (Collaborator) commented Aug 23, 2025:

For handling the infinite timeout (i.e. no timeout), we would then need to write these lines like this:

time_left_msec = None if time_left == math.inf else time_left * 1000
socks = dict(self.pollin.poll(time_left_msec))

Member Author:

Passing timeout=None is not supported anymore (as stated above).

Member:

Passing timeout=None is not supported anymore (as stated above).

OK, but right now the game crashes, so the behavior of --no-timeout needs to be adjusted to set a long timeout (1h is more than enough, I agree) before merging.

Member:

Also, if in the future we indeed implement some form of heartbeat protocol, we will need the infinite timeout again. So I think @jbdyn's implementation should be used.

Member Author:

Heartbeats will need a core redesign (practically not sensible to implement with our approach of passing everything around in a giant dict; going async or GIL-less would be the way to do it). I estimate some further 13 years.

@otizonaizit (Member)

I like the idea of game phases a lot! And also the new way of managing client errors/timeouts. I tried to review the changes, but it is a bit difficult because of the sheer number of changed lines... but thanks for the overviews above. Together with @jbdyn we noticed a couple of things, and a crash when --no-timeout is passed (see the thorough review above). I also could not understand if the CLI tournament still works. The tests are passing, but I have a feeling that the tests are not the whole story. But this is not a blocker. We could still have a hot-fix should something go wrong while we test the tournament by hand.

Thanks @Debilski for working on this for so long. This could be the basis to reorganize the network protocol and the internal state representations, maybe moving from strings to enums or so. But this is a discussion for the future. For the present, I'd like to understand if you'd like to see this in before ASPP in Bulgaria or if you think it should wait.

@Debilski (Member Author)

I think it is relatively uncontroversial and should be merged (after the bugs are found and fixed) but I’ll give some more explanations to the changes in the protocol later this week.

@Debilski (Member Author)

I wish I could follow you on the changes on the network protocol, but I would need more time to wrap my head around the whole architecture, basically.

In the absence of error states, the design is rather simple (as stated above):

The main game opens and binds a socket for each player (the socket is stored in the RemoteTeam objects in the game_state dict). The main game starts a subprocess for each player and passes the socket’s address. The subprocess connects and sends an ‘ok’ message with its team name.
The main game then sends a message with the initial game state (this will be simplified in #851) to each bot and awaits a dummy reply.
Then we enter the loop of the main game sending requests to each bot and awaiting new moves.
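
In sketch form, one iteration of that loop is a single request/reply exchange per bot (the function name and message layout here are illustrative, not the exact pelita wire format):

import json

def request_move(team_socket, game_state_for_bot):
    # Send the current game state and block until the player replies
    # (or the timeout handling in network.py kicks in).
    team_socket.send_unicode(json.dumps({"__action__": "get_move",
                                         "__data__": game_state_for_bot}))
    return json.loads(team_socket.recv_unicode())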

The only thing that has really changed is that the subprocess has to send an ‘ok’ message when it is ready. In the earlier design, the player could not initiate messages on its own. The main game would send a message directly after binding the socket (and likely before the player was connected). This was never a good design, but it worked in 99.5% of the cases because of zmq magic (zmq has an outgoing message queue and will still deliver these messages). But when a client was not starting fast enough, we would get weird error messages (#785, #784).

We used this design because it would allow us to make our main game network agnostic (network handling was strictly done in RemoteTeam objects) but I think we should eventually move away from this (I’ll create an issue at some point) and embrace a big while loop again that manages it all.

I tried to review the changes but it is a bit difficult because of the sheer number of changed lines...

The bigger changes in the PR are some restructuring of how and where exceptions are handled (which is kind of unrelated to the protocol change) and some reorganisation of the Team objects and network. I think these changes are not completely finished yet but I’d like to merge so that we can use the more robust protocol and the game phase.

I also could not understand if the CLI tournament still works.

I think it should work. But ideally, in the future it would make use of the game_phase and only register properly "FINISHED" games.

This could be the basis to reorganize the network protocol and the internal state representations, maybe moving from strings to enums or so.

There are some reorganisations coming in #851 (ideally also before Plovdiv but let’s see). I think Python enums don’t give us any advantage here over strings now that type checkers are good enough to handle these things. (A Rust-style enum would help greatly but we don’t have that unfortunately.)

@Debilski (Member Author)

Any further requests or questions here? Otherwise I would suggest merging, because #851, #889 (and #903) depend on it. I’ll take care of fixing any outstanding bugs, and the docstrings will be improved with or after #851 (there are still going to be a few changes here and I would like to settle the design first).

@otizonaizit otizonaizit merged commit 698c4fb into ASPP:main Aug 28, 2025
53 of 54 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 28, 2025
[WIP] Improved network protocol (more stable on slow clients) 698c4fb
@Debilski Debilski mentioned this pull request Sep 3, 2025
@Debilski Debilski deleted the feature/new-protocol branch September 11, 2025 20:59