Horizontal pod scaling for relays #44
timdrysdale started this conversation in Ideas
Our current relay implementation is a single instance that combines the access and relay parts in one executable running on a single VM. Autoscaling is desirable (cost, resilience), and relatively achievable because the clients already handle automatic reconnection.
TL;DR: use hash affinity in the nginx ingress, and have all relay pods broadcast to their peers any new connections they receive. Any peer that hears about a new connection duplicating a topic it already serves drops all of its connections to that topic, forcing the clients to reconnect and be routed to the new pod. A similar mechanism is required to enforce cancellation of sessions anyway, so we have two reasons to do this.
Reasons for wanting to autoscale
What do other relays do?
Chat applications tend to scale horizontally by exchanging messages between server instances, so no affinity is required. We would introduce significant CPU overhead if we did this, because our dominant load comes from video ingest, and we want to minimise that. We may also incur data charges for data flowing between different AZs. So those approaches do not work for us.
What should we do?
We need to find a way to enforce affinity, deterministically, without incurring overhead.
Ideas already rejected
Affinity by hash key
Fortunately, nginx ingress offers affinity by hashing of a user-defined key (like the request URI or user identity).
The main issues with affinity by hashing are that participants in the same topic connect with different URLs, and that pod scaling events can remap keys. We address each below.
Different URLs
We can solve this problem by ensuring a consistent base for the URL and routing using a variable fed by a regexp (TODO: check this works)
e.g.
wss://<host>/relay/data/<session_id>/<stream_id>/<r|w>/<user_id>

e.g. for a video stream

experiment:
wss://core.practable.io/relay/data/d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc/w/1b8f05c3-4aa1-496e-85a4-0a658fd0be42

user:
wss://core.practable.io/relay/data/d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc/r/38481c41-97f0-4073-ab54-22107d6b7301

The routing key we would want in this case is

wss://core.practable.io/relay/data/d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc

We could possibly reduce it even further, depending on which parts of the path we can access for forming our key, by noting that the session_id/stream_id pair is unique, i.e. a collision between tenants or with relay/ssh is statistically unlikely (although it is better to be namespaced for certainty):

d1df758c-3811-4291-83b2-144bc3db4db6/f9ee8002-a918-4639-b007-c97896a5a3cc

There was a now-solved issue in 2018 with double variables like nginx.ingress.kubernetes.io/upstream-hash-by: $request_uri$http_x_user_id. At the time there was a workaround which may be helpful if we want more than one variable to form our hash key (e.g. if doing a regexp to extract the common parts of the route from all participants in a topic).
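One possible shape for this, untested as per the TODO above, is to derive the routing key from the path with an nginx map in the controller's http-snippet and feed it to upstream-hash-by. The ConfigMap name/namespace, service name and port, and the variable name are placeholders, and whether upstream-hash-by can consume a custom map variable is exactly what needs verifying:

```yaml
# Sketch only - untested; names and namespaces are placeholders.
# ConfigMap for ingress-nginx: derive a $relay_topic variable from the path,
# keeping just the <session_id>/<stream_id> part so that all participants
# in a topic hash to the same upstream pod.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  http-snippet: |
    map $request_uri $relay_topic {
        ~^/relay/data/(?<topic>[^/]+/[^/]+) $topic;
        default $request_uri;
    }
---
# Ingress for the relay pods: hash upstream selection on the derived key.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: relay
  annotations:
    nginx.ingress.kubernetes.io/upstream-hash-by: "$relay_topic"
spec:
  ingressClassName: nginx
  rules:
    - host: core.practable.io
      http:
        paths:
          - path: /relay
            pathType: Prefix
            backend:
              service:
                name: relay
                port:
                  number: 8080
```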
Pod scaling events
The hash affinity minimises remapping of keys when the upstream group size changes, but remapping can still happen. If it does, we end up with split affinity: old connections on the original pod, new connections on the new pod. Nginx will not cut the old connections, so we need to do that ourselves.
We just need the old pod to drop the connections, and then let reconnection be routed to the new pod.
Since we already want a cancel feature (#24, #37), we need a way of communicating from the access point to the relay pods, to tell them to cancel a session. We also need a way to share the short-lived access codes we put in the URL. And now we also need a way to broadcast new connections - so plenty of justification for introducing a peer communication link.
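As a sketch of the broadcast-and-drop rule (the bus transport is out of scope here, and the hub, message, and field names below are assumptions for illustration, not the existing relay code):

```go
package relay

import "sync"

// NewConnection is broadcast to all peers whenever a pod accepts a connection.
type NewConnection struct {
	Topic string // e.g. "<session_id>/<stream_id>"
	Pod   string // identity of the pod that accepted the connection
}

// Hub tracks the topics this pod currently serves (hypothetical shape).
type Hub struct {
	mu     sync.Mutex
	podID  string
	topics map[string][]chan struct{} // topic -> per-connection close signals
}

// HandlePeerAnnouncement applies the rule from the TL;DR: if a peer announces
// a new connection for a topic we already serve, drop all of our connections
// to that topic so the clients reconnect and get hashed to the announcing pod.
func (h *Hub) HandlePeerAnnouncement(msg NewConnection) {
	if msg.Pod == h.podID {
		return // our own announcement; nothing to do
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	closers, ok := h.topics[msg.Topic]
	if !ok {
		return // we don't serve this topic; ignore
	}
	for _, done := range closers {
		close(done) // each connection handler selects on its done channel and exits
	}
	delete(h.topics, msg.Topic)
}
```

The same peer link would carry the cancellation messages and the short-lived access codes mentioned above.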
Note: we do not need or want subset hashing, because that would just leave participants randomly distributed across hosts within each subset, able to communicate with only some or none of the other participants, with no scaling or robustness benefit over what we could get another way.
Do we need a persistent code/message store?
We don't need to persist the cancellation or connection messages because they are only relevant to a pod running at the time they are broadcast. We can assume any pod that fails to process them is also failing to relay the video and data, and will soon be noticed as unhealthy, and restarted.
We also do not have to persist the authentication messages, even in the edge case of a slow client. Scenario:
- access provides a link to a user with a one-time authentication code
- the client is slow to open the wss connection
- meanwhile a new pod has been added, the hash routes the connection to it, and that pod never received the code

Three things need to happen for this to occur - slow client, new pod, hash map selects new pod. It is going to happen, but probably not that often. Assume pods scale up for the afternoon rush, then again for the evening rush, for three different main time zones, so maybe six upscaling events each day. That puts us at risk of this edge case for roughly 0.2% of the day, with the chance further reduced by the likelihood of the client being that slow, and the 1/N chance of being assigned to the new pod. Say six pods, 1/100 clients are slow (the real number won't be even close to this pessimistic), 1000 clients an hour connecting at peak, and the issue most likely to happen at peak time.
1000 clients an hour is one per every 3.6 seconds. There is a window of 6 scalings * 30sec per day, or 50 client connections /day that might be affected. Of those, 1/100 might be a slow client, and there might be six replicas, so 1/6 chance of hitting the issue on that slow client connection. That's a chance of 0.083 per day i.e. will happen on average once every 12 days. The cost of failed connection is a second or two to reconnect - admittedly, maybe more if it is a slow client - but then reconnection might help them pick up a different routing that performs better anyway? (clutching at straws on that last one).
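As a quick cross-check of the arithmetic above (all constants are the stated assumptions, not measurements):

```go
package estimate

// Back-of-envelope estimate of how often the slow-client edge case bites.
const (
	scalingsPerDay     = 6.0         // afternoon and evening rush across three main time zones
	windowSeconds      = 30.0        // assumed at-risk window per scaling event
	secondsPerClient   = 3.6         // 1000 clients per hour at peak
	slowClientFraction = 1.0 / 100.0 // pessimistic share of slow clients
	replicaCount       = 6.0         // 1/N chance the slow client hashes to the new pod
)

// IncidentsPerDay returns the expected number of affected connections per day
// (about 0.083, i.e. roughly one every 12 days).
func IncidentsPerDay() float64 {
	atRiskConnections := scalingsPerDay * windowSeconds / secondsPerClient // ~50 connections/day
	return atRiskConnections * slowClientFraction / replicaCount
}
```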
We could cover off this case by adding authentication code persistence, e.g. redis. However, a persistent code store does not by itself give us the peer broadcasting we need, so it is extra infrastructure to cover this one edge case, once every 12 days.
So we just let the reconnection take care of it.
```mermaid
graph TD
    U1[User] --https://access----> A1["Access Replica #1"] --> R[Message Broker]
    U2[Experiment] --https://access-->A2["Access Replica #2"] --> R
    A1--"use: wss://relay?c=..."-->U1
    A2 --"use: wss://relay?c=..."---->U2
    R --> R1[Relay Replica #1]
    R --> R2[Relay Replica #2]
    R --> R3[Relay Replica #3]
    U1 --"wss://relay?c=..."---> IG[Ingress with hashing]
    U2 --"wss://relay?c=..."--> IG
    IG --> R1
    IG --> R1
```

Keeping our access token
It's tempting to think that with the shift to ory we can drop our access token approach. There is a [websocket rules example](https://www.ory.sh/docs/oathkeeper/guides/proxy-websockets) for ory. But there are security implications, e.g. XSS/CSWSH: if your websocket connection accepts cookies, an attacker's website can open connections from any visitor's browser, opening us up to fielding malicious connections generated by users' browsers on attacker websites (unlikely, but possible). So we'd have to implement a CSRF token approach (and mind leaking the tokens in logs, because the wss connection would be made with the CSRF token as a query param). Our approach uses an https:// auth method to obtain a one-time code which is then passed as a query param. It's basically the same approach, differing only in the entropy of the CSRF token versus the uuid we use as the code. So we can keep doing what we are doing for a similar level of security, and avoid code thrash (two unnecessary refactors for perceived consistency: mirroring session cookies and CSRF now to match ory, then moving to another auth approach in future). So we keep the access token approach we have been using in production for the last couple of years.
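For completeness, a minimal client-side sketch of that flow; the access endpoint, response shape, and use of gorilla/websocket are assumptions for illustration, not the actual API:

```go
package client

import (
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/gorilla/websocket"
)

// accessResponse is a hypothetical response from the access server,
// containing the ready-to-use relay URI with the one-time code in it.
type accessResponse struct {
	URI string `json:"uri"` // e.g. "wss://core.practable.io/relay/...?c=<one-time-code>"
}

// Connect exchanges a long-lived bearer token for a one-time code over
// https, then dials the relay over wss with the code as a query param.
func Connect(accessURL, token string) (*websocket.Conn, error) {
	req, err := http.NewRequest(http.MethodPost, accessURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("access request failed: %s", resp.Status)
	}

	var ar accessResponse
	if err := json.NewDecoder(resp.Body).Decode(&ar); err != nil {
		return nil, err
	}

	// The one-time code lives only in the wss URI, never in a cookie,
	// so a third-party site cannot ride an ambient credential (CSWSH).
	conn, _, err := websocket.DefaultDialer.Dial(ar.URI, nil)
	return conn, err
}
```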