@pgray
Contributor

pgray commented Nov 24, 2025

We're seeing some 5xx traffic when rolling out upgrades.

This should give the service/ingress/etc. time to propagate changes before Typesense gets a SIGTERM.

@akyriako
Owner

Valid point, but putting it to sleep for 15 sec doesn't seem to solve the problem. There is no guarantee that you will not encounter 5XXs after that period. Quick thought: maybe you should additionally check whether the /health endpoint is up, and only let the preStop hook exit once it stops responding.
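To make the idea concrete, here is a minimal Go sketch of "poll /health until it stops responding, with an upper bound". It only illustrates the control flow and is not code from this PR: the helper names `httpHealthy` and `waitUntilDown`, the port 8108, the 25 attempts, and the 1s interval are all assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// httpHealthy returns a probe that reports true while GET url answers 200.
// (Hypothetical helper, not part of the operator.)
func httpHealthy(url string) func() bool {
	client := &http.Client{Timeout: time.Second}
	return func() bool {
		resp, err := client.Get(url)
		if err != nil {
			return false // connection refused, timeout, etc.
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}
}

// waitUntilDown calls probe up to maxTries times, sleeping interval after
// each successful probe, and returns true as soon as probe reports false.
// It returns false if the endpoint was still up after maxTries attempts.
func waitUntilDown(probe func() bool, maxTries int, interval time.Duration) bool {
	for i := 0; i < maxTries; i++ {
		if !probe() {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	// Assumed address and limits, chosen only for illustration.
	down := waitUntilDown(httpHealthy("http://localhost:8108/health"), 25, time.Second)
	fmt.Println("endpoint down before timeout:", down)
}
```

Taking the probe as a function keeps the bounded loop independent of how the check is performed, so the same loop would work with a different kind of probe.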

@pgray
Contributor Author

pgray commented Dec 4, 2025

I tried to add helpful behavior over here:
typesense/typesense#2675

I didn't get amazing feedback from @kishorenc

I consulted the hangops community, and apparently the preStop sleep configuration was added to Kubernetes explicitly for situations like this.

This PR should resolve issues for most environments because currently the sigterm happens almost instantaneously, allowing for milliseconds before the endpoint slice is updated...

see explanation here

Also, it's not 15s but rather:

  1. 10 seconds total (the whole termination grace period)
  2. the first 7 seconds of which are the kubelet sleeping before sending the SIGTERM

which I tried to explain in the code comment.
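As a quick illustration of that arithmetic (a minimal sketch; `shutdownBudgetSeconds` is a hypothetical name, not something in the PR): the preStop sleep runs inside the termination grace period, so whatever it consumes is no longer available to the process after SIGTERM.

```go
package main

import "fmt"

// shutdownBudgetSeconds returns how many seconds the container has left to
// exit after SIGTERM, given that the preStop hook runs inside the same
// termination grace period. Both arguments are in seconds.
func shutdownBudgetSeconds(gracePeriod, preStopSleep int) int {
	if preStopSleep >= gracePeriod {
		return 0 // the hook consumes the whole grace period
	}
	return gracePeriod - preStopSleep
}

func main() {
	// Values from the comment above: 10s grace period, 7s preStop sleep.
	fmt.Printf("time left after SIGTERM: %ds\n", shutdownBudgetSeconds(10, 7))
}
```

With the numbers above this prints `time left after SIGTERM: 3s`.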

@akyriako
Owner

akyriako commented Dec 4, 2025

As I said in my previous response, this approach does not solve the problem deterministically. Additionally, the sleep value (7 sec) is totally arbitrary and will lead to side effects:

  1. Smaller clusters, where termination happens faster, will have to sit through dead time
  2. Bigger workloads might need more than 7 sec
  3. It's passive rather than adaptive, so it can end up wasteful and degrade the performance of both the operator and the ts-cluster itself.

You should go after something in this direction (I haven't tried it, so I am just spitballing here):

Lifecycle: &corev1.Lifecycle{
    PreStop: &corev1.LifecycleHandler{
        Exec: &corev1.ExecAction{
            Command: []string{
                "/bin/bash",
                "-c",
                // Poll the API port through bash's /dev/tcp until it stops
                // accepting connections or MAX_WAIT seconds have passed.
                `MAX_WAIT=25
WAIT_TIME=0
while (echo > /dev/tcp/localhost/8108) > /dev/null 2>&1; do
    if [ "$WAIT_TIME" -ge "$MAX_WAIT" ]; then
        echo "Gave up waiting after ${MAX_WAIT}s"
        break
    fi
    sleep 1
    WAIT_TIME=$((WAIT_TIME + 1))
done`,
            },
        },
    },
},

IMPORTANT> Typesense containers have none of curl, wget, or nc installed, so you have to be creative, e.g. by opening a raw TCP connection with /dev/tcp.
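For readers less familiar with the /dev/tcp trick, the signal it relies on is simply "can a TCP connection be opened?". Below is a small Go sketch of that same check, written only to show the semantics; `portOpen` and `demo` are hypothetical names, and the demo probes a throwaway local listener rather than a real Typesense port.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// portOpen reports whether a TCP connection to addr can be established,
// which is the same signal the bash loop gets from /dev/tcp.
func portOpen(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demo probes a throwaway local listener before and after closing it.
func demo() (before, after bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: let the OS pick one
	if err != nil {
		return false, false
	}
	addr := ln.Addr().String()
	before = portOpen(addr)
	ln.Close()
	after = portOpen(addr)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("open while listening:", before, "open after close:", after)
}
```

Note that the probe only tells you whether something is accepting connections on the port; it says nothing about whether upstream load balancers have stopped routing to the pod.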
