[WIP] Improved network protocol (more stable on slow clients) #890
Conversation
Ready for review (if anyone wants to), but I will still add a few tweaks before merging (and fix the remaining TODOs and docstrings). What follows is a short explanation of what has been done.

**Network/subprocess logic**

All in all, not so many changes. Initially, I wanted to make the … Assuming we run …

In case of a game with a server player (…) …

**Error handling, timeouts**

The … A new function …

**Game phases**

The concept of game phases is introduced: when started, Pelita is in the "INIT" phase. Once the game runs, it is in the "RUNNING" phase, after which it should eventually end up in the "FINISHED" phase. Both "INIT" and "RUNNING" can also end up in the "FAILURE" phase. Conceptually, the "FAILURE" phase has no winner, nor is it a draw. We use it, for example, for all fatal errors that are triggered during "INIT". The phase is currently only given as a string, but ideally we would model it like so (will not be implemented in this PR):

```python
@dataclasses.dataclass
class InitPhase:
    pass

@dataclasses.dataclass
class RunningPhase:
    bot_turn: int
    round: int

@dataclasses.dataclass
class FinishedPhase:
    bot_turn: int
    round: int
    is_draw: bool
    winning_team: int|None

@dataclasses.dataclass
class FailurePhase:
    reason: str
```

This would clearly signify that, e.g., the INIT phase is not associated with any concept of a … The usage would be straightforward with match statements:

```python
match phase:
    case RunningPhase(bot_turn, round):
        ...  # here we can use bot_turn, round
    case FinishedPhase(bot_turn, round, is_draw, winning_team):
        ...  # here we can use winning_team ...
```

Not sure yet if we have to do it like that (…
@Debilski is this ready for review? I see you are still pushing to it once in a while, so I am not sure if I should review or wait.
It is 95% done (as far as this PR is concerned). I need to remove some code duplication in `team.py` and decide whether I keep the …

Other than that, I will need to go through the list of TODOs that I introduced in this PR and fix them (or delete them) as well. I guess on the whole you could already look through it for any problems. On the whole, things should not change substantially afterwards; most of what I still need to do is cosmetic. Anything bigger will get a new PR.
A keyboard interrupt during `recv` (or a SIGTERM from a parent process) would still cause a send on the socket, leading to a lock.

Timeouts are handled and counted in `request_new_position` and create a fatal error when exceeding the limit. Fatal errors are created with a new function `add_fatal_error` that also immediately changes the game phase to FINISHED or FAILURE, eliminating a few extra calls to `check_gameover`.
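For illustration, a minimal sketch of how such a helper could look, assuming a dict-based game state; only the names `add_fatal_error`, `check_gameover` and the phase strings come from this PR, everything else here is made up:

```python
# Illustrative sketch only; the actual implementation in game.py differs.
def add_fatal_error(game_state, team_idx, reason):
    # record the fatal error for the given team
    game_state['fatal_errors'][team_idx].append(reason)
    # decide the phase immediately, so no extra check_gameover call is needed
    if game_state['phase'] == 'INIT':
        # fatal errors during INIT: no winner, not a draw
        game_state['phase'] = 'FAILURE'
    else:
        # fatal errors while running end the game; the opponent wins
        game_state['phase'] = 'FINISHED'
        game_state['whowins'] = 1 - team_idx  # hypothetical key
    return game_state
```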
All in all, I’d say ready for review. I am still unhappy with how the remote team classes in … The stateful socket logic in … I added a new test for the network protocol in …
jbdyn left a comment:
Impressive changes! Cool that you put in so much effort.
I wish I could follow you on the changes to the network protocol, but I would need more time to wrap my head around the whole architecture, basically.
Despite that, I pointed out some other things, especially a removed `if timeout is None` check which causes pelita to crash on a missing timeout.
Hope you find my suggestions useful.
pelita/base_utils.py (outdated):

```python
def default_zmq_context(zmq_context=None):
    return zmq.Context()
```
Did you mean this?

```python
def default_zmq_context(zmq_context=None):
    return zmq_context or zmq.Context()
```

```python
zmq_context = zmq.Context()
zmq_external_publisher = ZMQPublisher(address=viewer_opts, bind=False, zmq_context=zmq_context)
```
No need for a manual `zmq_context` here, since it is created automatically in `ZMQPublisher`, so this can be shortened to

```python
zmq_external_publisher = ZMQPublisher(address=viewer_opts, bind=False)
```

```diff
  zmq_context = zmq.Context()
  zmq_publisher = ZMQPublisher(address='tcp://127.0.0.1', zmq_context=zmq_context)
  viewer_state['viewers'].append(zmq_publisher)
- viewer_state['controller'] = setup_controller()
+ viewer_state['controller'] = Controller(zmq_context=zmq_context)
```
But here I assume that `ZMQPublisher` and `Controller` need to get the exact same `zmq_context`, right?
Then, this would be alright.
Doesn’t need to be the same, but it also doesn’t do any harm, so I reused it, yes.
test/test_remote_game.py (outdated):

```diff
-@pytest.mark.xfail(reason="TODO: Fails in CI for macOS. Unclear why.")
+#@pytest.mark.xfail(reason="TODO: Fails in CI for macOS. Unclear why.")
```
Can probably be removed?
```python
_logger = logging.getLogger(__name__)

# Maximum time that a player will wait for a move request
TIMEOUT_SECS = 60 * 60
```
Why a timeout of 1 hour? Wouldn't one minute be sufficient?
This would mean that pausing a game for more than 60 seconds would cause a player to time out and exit (and in turn lose the game). Unless we implement health ping messages between the main game and the players, this time needs to be longer than a typical ‘I pause the game because I want to check something’ situation.
Ideally, the main game would exit before this time (so that the match is not counted as a loss for the player but registered as a failed match), but for now we’ll ignore this edge case. ;) (I’ll add a remark.)
```python
socks = dict(poller.poll(timeout=TIMEOUT_SECS * 1000))
if socks.get(socket) == zmq.POLLIN:
    json_message = socket.recv_unicode()
else:
    # TODO: Would be nice to tell Pelita main that we’re exiting
    _logger.warning(f"No request in {TIMEOUT_SECS} seconds. Exiting player.")

    # returning False breaks the loop
    return False
```
Related to #890 (comment) for `TIMEOUT_SECS` above.
```python
# special case for no timeout
# just loop until we receive the correct reply
if timeout is None:
    while True:
        msg_id, reply = self._recv()
        if msg_id == expected_id:
            return reply

# normal timeout handling
```
Without this check, pelita crashes when called with `pelita --no-timeout ...`.
To fix this, one could do the following:

```python
if timeout is None:
    timeout = math.inf
```

and then see below for handling the infinite timeout.
Ah, good question, I don’t remember exactly why I removed it.
However: with the max timeout of a player being 1 hour, there is not really an infinite timeout anymore, so in any case this would never happen. I guess the bug fix would be to set `--no-timeout` to use something like one hour as well.
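A minimal sketch of that fix; the `args.no_timeout` flag name and the surrounding code are illustrative, not the actual CLI handling:

```python
# Sketch: map --no-timeout to a long but finite timeout instead of None,
# so the receive logic never has to special-case infinity.
MAX_TIMEOUT_SECS = 60 * 60  # same order of magnitude as TIMEOUT_SECS above

timeout = MAX_TIMEOUT_SECS if args.no_timeout else args.timeout  # hypothetical flags
```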
```python
socks = dict(self.pollin.poll(time_left * 1000))  # poll needs milliseconds
```
For handling the infinite timeout (i.e. no timeout), we would need to write these lines like this:

```python
time_left_msec = None if time_left == math.inf else time_left * 1000
socks = dict(self.pollin.poll(time_left_msec))
```
Passing `timeout=None` is not supported anymore (as stated above).
> Passing `timeout=None` is not supported anymore (as stated above).

OK, but right now the game crashes, so the behavior of `--no-timeout` needs to be adjusted to set a long timeout – 1h is more than enough, I agree – before merging.
Also, if in the future we indeed implement some form of heartbeat protocol, we will need the infinite timeout again. So I think @jbdyn’s implementation should be used.
Heartbeats will need a core redesign (practically not sensible to implement with our approach of passing everything around in a giant dict; going async or GIL-less would be the way to do it). I estimate another 13 years or so.
I like the idea of game phases a lot! And also the new way of managing client errors/timeouts. I tried to review the changes, but it is a bit difficult because of the sheer amount of changed lines... but thanks for the overviews above. Together with @jbdyn we noticed a couple of things, and a crash when …

Thanks @Debilski for working on this for so long. This could be the basis to reorganize the network protocol and the internal state representations, maybe moving from strings to enums or so. But this is a discussion for the future. For the present, I’d like to understand if you’d like to see this in before ASPP in Bulgaria or if you think it should wait.
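For reference, a minimal sketch of what the strings-to-enums move could look like (not part of this PR; the enum name is made up):

```python
import enum

class GamePhase(enum.Enum):
    INIT = "INIT"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILURE = "FAILURE"

# parse the string representation coming over the wire ...
phase = GamePhase("RUNNING")
assert phase is GamePhase.RUNNING
# ... while JSON serialisation can keep using phase.value
```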
I think it is relatively uncontroversial and should be merged (after the bugs are found and fixed), but I’ll give some more explanations of the changes to the protocol later this week.
In the absence of error states, the design is rather simple (as stated above): the main game opens and binds a socket for each player (the socket is stored in the …). The only thing that has really changed is that the subprocess has to send an ‘ok’ message when it is ready.

In the earlier design, the player could not initiate messages on its own. The main game would send a message directly after binding the socket (and likely before the player was connected). This was never a good design, but it worked in 99.5% of the cases because of zmq magic (zmq has an outgoing message queue and will still deliver these messages). But when a client was not starting fast enough, we would get weird error messages (#785, #784). We used this design because it would allow us to make our main game network agnostic (network handling was strictly done in …).
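A minimal sketch of the resulting handshake, assuming `zmq.PAIR` sockets as before; the address handling and message layout here are illustrative, not the actual pelita wire format:

```python
import zmq

ctx = zmq.Context()

# main game: open and bind one PAIR socket per player
game_sock = ctx.socket(zmq.PAIR)
port = game_sock.bind_to_random_port("tcp://127.0.0.1")

# player subprocess: connect and announce readiness before any request arrives
player_sock = ctx.socket(zmq.PAIR)
player_sock.connect(f"tcp://127.0.0.1:{port}")
player_sock.send_json({"ok": True, "team_name": "my team"})

# main game: only start sending requests once the ok has been received
if game_sock.poll(timeout=10_000):  # wait longer than the old 3 seconds
    hello = game_sock.recv_json()
    print("player ready:", hello["team_name"])
else:
    # no ok received: cancel the game with a failure state instead of playing on
    print("player did not come up in time")
```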
The bigger changes in the PR are some restructuring of how and where exceptions are handled (which is kind of unrelated to the protocol change) and some reorganisation of the Team objects and the network code. I think these changes are not completely finished yet, but I’d like to merge so that we can use the more robust protocol and the game phases.
I think it should work. But ideally, in the future it would make use of the `game_phase` and only register properly …
There are some reorganisations coming in #851 (ideally also before Plovdiv, but let’s see). I think Python enums don’t give us …
Any further requests or questions here? Otherwise I would suggest merging, because #851, #889 (and #903) depend on it. I’ll take care of fixing any outstanding bugs, and the docstrings will be improved with or after #851 (there are still going to be a few changes here and I would like to settle the design first).
Commit 698c4fb: [WIP] Improved network protocol (more stable on slow clients)
Still a big WIP. Code needs lots of cleanup (+ fixes) and the network classes can still be simplified. Publishing this only to increase the personal pressure on finishing this. :)
In the old code, we have `make_team` creating a `zmq.PAIR` socket, starting a process with `pelita-player` and then immediately sending out a `set_initial` request to the remote player. We kind of rely on zmq magic that this will keep the message in the pipeline when the remote socket is not ready yet and will eventually succeed. In most cases it does, but sometimes this fails, often when a slow file system is involved. Usually, this means a loss for the slow player involved. (Also, sometimes it goes unnoticed and will then trigger a failure when the game sends the first move request.)

We add additional logic (which is how it should have been done in the first place, as this is network design 101): once the remote player is started, it sends a status-type message `"ok"` to the main game to acknowledge its readiness (we include the team name as additional payload in this message, so that we can show this info in the UI as early as possible). `game.py` will wait for the oks of both remote teams simultaneously (with a timeout longer than 3 seconds) but fail as soon as one of them has an error. If either of them has an error, the game is simply cancelled with a new failure state. There will be no winner. (Not super important in everyday situations, but this makes things clearer for CI.)

Ideally, we would also adapt the protocol to include health pings (to ensure we don’t have old bots running for days), but the server-less, dict-only structure of the pelita core doesn’t make this easy.
Last remark: Game phases (init, running, finished, failure) will be introduced to the game state so that we don’t have to do things like `if round is None` anymore, where no one knows what it is supposed to mean. This will probably be string-based in the first iteration, but I hope to introduce a more elegant solution that would ideally also be usable with match statements (see the sketch below).

Closes #784, closes #778, closes #785, closes #887, closes #908
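To illustrate the remark above, a minimal sketch of the difference; the dict keys are for illustration only:

```python
# Before: state inferred from a sentinel value; the intent is unclear.
if game_state['round'] is None:
    ...  # not started yet? already finished? crashed?

# After: the phase is explicit (string-based in the first iteration).
if game_state['phase'] == 'RUNNING':
    ...
elif game_state['phase'] == 'FAILURE':
    ...
```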
`RemoteTeam` is still incomplete