Horizontal pod scaling for relays #44
timdrysdale started this conversation in Ideas
Our current relay implementation is a single instance that combines the access and relay parts in one executable running on a single VM. Autoscaling is desirable (cost, resilience), and relatively achievable because the clients already handle automatic reconnection.
TL;DR: use hash affinity in the nginx ingress, and have all relay pods broadcast to their peers any new connections they receive. Any peer that hears about a new connection duplicating a topic it already serves drops all of its connections to that topic, forcing the clients to reconnect and be routed to the new pod. A similar mechanism is required to enforce cancellation of sessions anyway, so we have two reasons to do this.
Reasons for wanting to autoscale
What do other relays do?
Chat applications tend to scale horizontally by exchanging messages between server instances, so no affinity is required. We would introduce significant CPU overhead if we did this, because our dominant load comes from video ingest, and we want to minimise that. We may also incur data charges for data flowing between different AZs. So those approaches do not work for us.
What should we do?
We need to find a way to enforce affinity, deterministically, without incurring overhead.
Ideas already rejected
Affinity by hash key
Fortunately, nginx ingress offers affinity by hashing of a user-defined key (like the request URI or user identity).
The main issues with affinity by hashing are that participants in the same topic connect with different URLs, and that pod scaling events can remap keys. We address each below.
Different URLs
We can solve this problem by ensuring a consistent base for the URL and routing using a variable fed by a regexp (TODO: check this works)
e.g.
wss://<host>/relay/data/<session_id>/<stream_id>/<r|w>/<user_id>

e.g. for a video stream

experiment:
wss://core.practable.io/relay/data/d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc/w/1b8f05c3-4aa1-496e-85a4-0a658fd0be42

user:
wss://core.practable.io/relay/data/d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc/r/38481c41-97f0-4073-ab54-22107d6b7301

The routing key we would want in this case is

wss://core.practable.io/relay/data/d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc

We could possibly reduce it even further, depending on which parts of the path we can access for forming our key, by noting that the session_id/stream_id pair is unique, i.e. a collision between tenants or with relay/ssh is statistically unlikely (although it is better to be namespaced for certainty):

d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc

There was a now-solved issue in 2018 with double variables like nginx.ingress.kubernetes.io/upstream-hash-by: $request_uri$http_x_user_id. At the time there was a workaround which may be helpful if we want more than one variable to form our hash key (e.g. if doing a regexp to extract the common parts of the route from all participants in a topic).
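One possible shape for this, untested as per the TODO above, is to derive the routing key from the path with an nginx map in the controller's http-snippet and feed it to upstream-hash-by. The ConfigMap name/namespace, service name and port, and the variable name are placeholders, and whether upstream-hash-by can consume a custom map variable is exactly what needs verifying:

```yaml
# Sketch only - untested; names and namespaces are placeholders.
# ConfigMap for ingress-nginx: derive a $relay_topic variable from the path,
# keeping just the <session_id>/<stream_id> part so that all participants
# in a topic hash to the same upstream pod.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  http-snippet: |
    map $request_uri $relay_topic {
        ~^/relay/data/(?<topic>[^/]+/[^/]+) $topic;
        default $request_uri;
    }
---
# Ingress for the relay pods: hash upstream selection on the derived key.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: relay
  annotations:
    nginx.ingress.kubernetes.io/upstream-hash-by: "$relay_topic"
spec:
  ingressClassName: nginx
  rules:
    - host: core.practable.io
      http:
        paths:
          - path: /relay
            pathType: Prefix
            backend:
              service:
                name: relay
                port:
                  number: 8080
```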
Pod scaling events
The hash affinity minimises remapping of keys when the upstream group size changes, but remapping can still happen. If it does, we end up with split affinity: old connections on the original pod, new connections on the new pod. Nginx will not cut the old connections, so we need to do that ourselves.
We just need the old pod to drop the connections, and then let reconnection be routed to the new pod.
Since we already want a cancel feature (#24, #37), we need a way of communicating from the access point to the relay pods, to tell them to cancel a session. We also need a way to share the short-lived access codes we put in the URL. And now we also need a way to broadcast new connections - so plenty of justification for introducing a peer communication link.
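As a sketch of the broadcast-and-drop rule (the bus transport is out of scope here, and the hub, message, and field names below are assumptions for illustration, not the existing relay code):

```go
package relay

import "sync"

// NewConnection is broadcast to all peers whenever a pod accepts a connection.
type NewConnection struct {
	Topic string // e.g. "<session_id>/<stream_id>"
	Pod   string // identity of the pod that accepted the connection
}

// Hub tracks the topics this pod currently serves (hypothetical shape).
type Hub struct {
	mu     sync.Mutex
	podID  string
	topics map[string][]chan struct{} // topic -> per-connection close signals
}

// HandlePeerAnnouncement applies the rule from the TL;DR: if a peer announces
// a new connection for a topic we already serve, drop all of our connections
// to that topic so the clients reconnect and get hashed to the announcing pod.
func (h *Hub) HandlePeerAnnouncement(msg NewConnection) {
	if msg.Pod == h.podID {
		return // our own announcement; nothing to do
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	closers, ok := h.topics[msg.Topic]
	if !ok {
		return // we don't serve this topic; ignore
	}
	for _, done := range closers {
		close(done) // each connection handler selects on its done channel and exits
	}
	delete(h.topics, msg.Topic)
}
```

The same peer link would carry the cancellation messages and the short-lived access codes mentioned above.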
Note: we do not need or want subset hashing, because that would just leave participants randomly distributed across hosts within each subset, able to communicate with only some or none of the other participants, with no scaling or robustness benefit over what we could get another way.
Do we need a persistent code/message store?
We don't need to persist the cancellation or connection messages because they are only relevant to a pod running at the time they are broadcast. We can assume any pod that fails to process them is also failing to relay the video and data, and will soon be noticed as unhealthy, and restarted.
We also do not have to persist the authentication messages, even in the edge case of a slow client. Scenario:
- access provides a link to a user with a one-time authentication code
- the client is slow to open the wss connection
- meanwhile a new pod has been added, the hash routes the connection to it, and that pod never received the code

Three things need to happen for this to occur - slow client, new pod, hash map selects new pod. It is going to happen, but probably not that often. Assume pods scale up for the afternoon rush, then again for the evening rush, for three different main time zones, so maybe six upscaling events each day. That puts us at risk of this edge case for roughly 0.2% of the day, with the chance further reduced by the likelihood of the client being that slow, and the 1/N chance of being assigned to the new pod. Say six pods, 1/100 clients are slow (the real number won't be even close to this pessimistic), 1000 clients an hour connecting at peak, and the issue most likely to happen at peak time.
1000 clients an hour is one per every 3.6 seconds. There is a window of 6 scalings * 30sec per day, or 50 client connections /day that might be affected. Of those, 1/100 might be a slow client, and there might be six replicas, so 1/6 chance of hitting the issue on that slow client connection. That's a chance of 0.083 per day i.e. will happen on average once every 12 days. The cost of failed connection is a second or two to reconnect - admittedly, maybe more if it is a slow client - but then reconnection might help them pick up a different routing that performs better anyway? (clutching at straws on that last one).
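As a quick cross-check of the arithmetic above (all constants are the stated assumptions, not measurements):

```go
package estimate

// Back-of-envelope estimate of how often the slow-client edge case bites.
const (
	scalingsPerDay     = 6.0         // afternoon and evening rush across three main time zones
	windowSeconds      = 30.0        // assumed at-risk window per scaling event
	secondsPerClient   = 3.6         // 1000 clients per hour at peak
	slowClientFraction = 1.0 / 100.0 // pessimistic share of slow clients
	replicaCount       = 6.0         // 1/N chance the slow client hashes to the new pod
)

// IncidentsPerDay returns the expected number of affected connections per day
// (about 0.083, i.e. roughly one every 12 days).
func IncidentsPerDay() float64 {
	atRiskConnections := scalingsPerDay * windowSeconds / secondsPerClient // ~50 connections/day
	return atRiskConnections * slowClientFraction / replicaCount
}
```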
We could cover off this case by adding authentication code persistence, e.g. redis. However, a persistent code store does not by itself give us the peer broadcasting we need, so it is extra infrastructure to cover this one edge case, once every 12 days.
So we just let the reconnection take care of it.
```mermaid
graph TD
    U1[User] --https://access----> A1["Access Replica #1"] --> R[Message Broker]
    U2[Experiment] --https://access-->A2["Access Replica #2"] --> R
    A1--"use: wss://relay?c=..."-->U1
    A2 --"use: wss://relay?c=..."---->U2
    R --> R1[Relay Replica #1]
    R --> R2[Relay Replica #2]
    R --> R3[Relay Replica #3]
    U1 --"wss://relay?c=..."---> IG[Ingress with hashing]
    U2 --"wss://relay?c=..."--> IG
    IG --> R1
    IG --> R1
```

Keeping our access token
It's tempting to think that with the shift to ory we can drop our access token approach. There is a [websocket rules example](https://www.ory.sh/docs/oathkeeper/guides/proxy-websockets) for ory. But there are security implications, e.g. XSS/CSWSH: if your websocket connection accepts cookies, an attacker's website can open connections from any visitor's browser, opening us up to fielding malicious connections generated by users' browsers on attacker websites (unlikely, but possible). So we'd have to implement a CSRF token approach (and mind leaking the tokens in logs, because the wss connection would be made with the CSRF token as a query param). Our approach uses an https:// auth method to obtain a one-time code which is then passed as a query param. It's basically the same approach, differing only in the entropy of the CSRF token versus the uuid we use as the code. So we can keep doing what we are doing for a similar level of security, and avoid code thrash (two unnecessary refactors for perceived consistency: mirroring session cookies and CSRF now to match ory, then moving to another auth approach in future). So we keep the access token approach we have been using in production for the last couple of years.
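For completeness, a minimal client-side sketch of that flow; the access endpoint, response shape, and use of gorilla/websocket are assumptions for illustration, not the actual API:

```go
package client

import (
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/gorilla/websocket"
)

// accessResponse is a hypothetical response from the access server,
// containing the ready-to-use relay URI with the one-time code in it.
type accessResponse struct {
	URI string `json:"uri"` // e.g. "wss://core.practable.io/relay/...?c=<one-time-code>"
}

// Connect exchanges a long-lived bearer token for a one-time code over
// https, then dials the relay over wss with the code as a query param.
func Connect(accessURL, token string) (*websocket.Conn, error) {
	req, err := http.NewRequest(http.MethodPost, accessURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("access request failed: %s", resp.Status)
	}

	var ar accessResponse
	if err := json.NewDecoder(resp.Body).Decode(&ar); err != nil {
		return nil, err
	}

	// The one-time code lives only in the wss URI, never in a cookie,
	// so a third-party site cannot ride an ambient credential (CSWSH).
	conn, _, err := websocket.DefaultDialer.Dial(ar.URI, nil)
	return conn, err
}
```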