-
Notifications
You must be signed in to change notification settings - Fork 20
William/0030 emqx linear channel registry #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
zmstone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe name it LSR (the current name is wrong, it's session registration, but not channel).
|
|
||
| - The older channel with an unresponsive connection kicks the latest channel, causing the client to get disconnected unexpectedly. | ||
|
|
||
| - Wrong Session Present flag in MQTT.CONNACK is returned to the client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this happen and how does the new algorithm fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick answer: emqx/emqx#15017
I will update this doc for that.
| - discarded | ||
| - kicked | ||
|
|
||
| 1. node down/Mnesia down event: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add the result trackings.
| READs from `LSR` storage only. WRITES go to both if `emqx_cm_registry` is also enabled (global), but DO NOT use `ekka_locker:trans`. | ||
|
|
||
|
|
||
| ### LSR 'Partially Enabled' within the cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if LSR is to be enabled manually, maybe only support manual migration in a running cluster (not during rolling upgrade).
and add a section to describe the migration steps.
e.g.
- check unsupported features of LSR is acceptable
- check all nodes in the cluster running the same version which supports LSR
- run a command to start migration
- run a command to check migration progress
- run a command to permanently disable old registry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline,
so far we see the possibility to support rolling upgrade where LSR and old Channel Registry could run in parallel in the cluster with the cost of correctness(due to old) and performance.
I will perform more tests to ensure it will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces the Linear Session Registry (LSR) EIP, detailing its motivation, design, data structures, and testing guidance, along with a formal TLA+ specification.
- Defines a versioned session registry to avoid race conditions in MQTT session takeovers
- Adds Erlang record definitions, mermaid workflow diagram, and failure/retry paths
- Provides a complete TLA+ model (
LSR.tla) to formally verify key invariants
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| active/0030-emqx_linear_session_registry.md | New EIP draft with motivation, design, lifecycle, and tests |
| active/0030-assets/LSR.tla | Formal TLA+ specification covering registry operations and invariants |
Comments suppressed due to low confidence (2)
active/0030-emqx_linear_session_registry.md:19
- The field is referred to as
transport_started_athere but later calledtransport_connected_at. Choose one consistent name.
expands the channel prop with version (e.g., `transport_started_at` timestamp when transport connects) so that the
active/0030-assets/LSR.tla:95
- TLA+
LETdefinitions must be separated (e.g., with a comma). It should readLET ch == Chans[c], max_vsn == MaxCVsn(ch.loc) IN.
DoTakeoverSessionTX(c) == LET ch == Chans[c] max_vsn == MaxCVsn(ch.loc) IN
| If handling node doesn't know the existence of the session on the other node, it will open new session and return Session Present = false | ||
| to the client. The client will get confused if it just disconnect from other node and reconnect to the handling node. | ||
|
|
||
| further more it also forks the session into two, whether the next reconnect will takeover this session or the other session |
Copilot
AI
Jul 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The phrase "further more" should be one word: "furthermore".
| further more it also forks the session into two, whether the next reconnect will takeover this session or the other session | |
| furthermore it also forks the session into two, whether the next reconnect will takeover this session or the other session |
|
|
||
| Therefore, we could use the transport connected timestamp (`transport_connected_at`) in ms as the version of the channel. | ||
|
|
||
| @NOTE, we could use other timestamp too such as client provided timestamps embeded in MQTT user props. |
Copilot
AI
Jul 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word "embeded" is misspelled; it should be "embedded".
| @NOTE, we could use other timestamp too such as client provided timestamps embeded in MQTT user props. | |
| @NOTE, we could use other timestamp too such as client provided timestamps embedded in MQTT user props. |
|
|
||
| MaxRVsn(node) == MaxOrNone(RStores[node]) | ||
|
|
||
| MaxCVsn(node) == MaxOrNone(CStores["c1"]) \* @TODO impl core node selection |
Copilot
AI
Jul 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard-coding "c1" ignores all other core nodes; this should be parameterized to use the passed-in node or a selection strategy.
| MaxCVsn(node) == MaxOrNone(CStores["c1"]) \* @TODO impl core node selection | |
| MaxCVsn(node) == MaxOrNone(CStores[node]) \* Use passed-in node for core node selection |
To read a rendered version:
https://github.com/emqx/eip/blob/william/0030-emqx_linear_channel_registry/active/0030-emqx_linear_session_registry.md