Skip to content

Conversation

@qzhuyan
Copy link
Contributor

@qzhuyan qzhuyan commented Apr 24, 2025

@qzhuyan qzhuyan requested a review from Copilot April 25, 2025 13:51

This comment was marked as outdated.

@qzhuyan qzhuyan mentioned this pull request May 8, 2025
4 tasks
Copy link
Member

@zmstone zmstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe name it LSR (the current name is wrong, it's session registration, but not channel).


- The older channel with an unresponsive connection kicks the latest channel, causing the client to get disconnected unexpectedly.

- Wrong Session Present flag in MQTT.CONNACK is returned to the client.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this happen and how does the new algorithm fix it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick answer: emqx/emqx#15017

I will update this doc for that.

- discarded
- kicked

1. node down/Mnesia down event:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add the result trackings.

READs from `LSR` storage only. WRITES go to both if `emqx_cm_registry` is also enabled (global), but DO NOT use `ekka_locker:trans`.


### LSR 'Partially Enabled' within the cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if LSR is to be enabled manually, maybe only support manual migration in a running cluster (not during rolling upgrade).
and add a section to describe the migration steps.
e.g.

  1. check unsupported features of LSR is acceptable
  2. check all nodes in the cluster running the same version which supports LSR
  3. run a command to start migration
  4. run a command to check migration progress
  5. run a command to permanently disable old registry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline,
so far we see the possibility to support rolling upgrade where LSR and old Channel Registry could run in parallel in the cluster with the cost of correctness(due to old) and performance.

I will perform more tests to ensure it will work.

@qzhuyan qzhuyan requested a review from Copilot July 2, 2025 16:55
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces the Linear Session Registry (LSR) EIP, detailing its motivation, design, data structures, and testing guidance, along with a formal TLA+ specification.

  • Defines a versioned session registry to avoid race conditions in MQTT session takeovers
  • Adds Erlang record definitions, mermaid workflow diagram, and failure/retry paths
  • Provides a complete TLA+ model (LSR.tla) to formally verify key invariants

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
active/0030-emqx_linear_session_registry.md New EIP draft with motivation, design, lifecycle, and tests
active/0030-assets/LSR.tla Formal TLA+ specification covering registry operations and invariants
Comments suppressed due to low confidence (2)

active/0030-emqx_linear_session_registry.md:19

  • The field is referred to as transport_started_at here but later called transport_connected_at. Choose one consistent name.
expands the channel prop with version (e.g., `transport_started_at` timestamp when transport connects) so that the 

active/0030-assets/LSR.tla:95

  • TLA+ LET definitions must be separated (e.g., with a comma). It should read LET ch == Chans[c], max_vsn == MaxCVsn(ch.loc) IN.
DoTakeoverSessionTX(c) == LET ch == Chans[c] max_vsn == MaxCVsn(ch.loc) IN

If handling node doesn't know the existence of the session on the other node, it will open new session and return Session Present = false
to the client. The client will get confused if it just disconnect from other node and reconnect to the handling node.

further more it also forks the session into two, whether the next reconnect will takeover this session or the other session
Copy link

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phrase "further more" should be one word: "furthermore".

Suggested change
further more it also forks the session into two, whether the next reconnect will takeover this session or the other session
furthermore it also forks the session into two, whether the next reconnect will takeover this session or the other session

Copilot uses AI. Check for mistakes.

Therefore, we could use the transport connected timestamp (`transport_connected_at`) in ms as the version of the channel.

@NOTE, we could use other timestamp too such as client provided timestamps embeded in MQTT user props.
Copy link

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "embeded" is misspelled; it should be "embedded".

Suggested change
@NOTE, we could use other timestamp too such as client provided timestamps embeded in MQTT user props.
@NOTE, we could use other timestamp too such as client provided timestamps embedded in MQTT user props.

Copilot uses AI. Check for mistakes.

MaxRVsn(node) == MaxOrNone(RStores[node])

MaxCVsn(node) == MaxOrNone(CStores["c1"]) \* @TODO impl core node selection
Copy link

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coding "c1" ignores all other core nodes; this should be parameterized to use the passed-in node or a selection strategy.

Suggested change
MaxCVsn(node) == MaxOrNone(CStores["c1"]) \* @TODO impl core node selection
MaxCVsn(node) == MaxOrNone(CStores[node]) \* Use passed-in node for core node selection

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants