Stuck cleanup phase if pg_repack loses connection during data copy (with --no-kill-backend) #456

@boosterKRD

full_stdout.txt

Summary

If pg_repack loses its connection (e.g. due to an HAProxy timeout) during the initialization stage (specifically, during the data copy phase) while the --no-kill-backend flag is in use, the following serious issue occurs:

  • The PostgreSQL backend process continues running the data copy query (e.g. INSERT INTO repack.table_* SELECT ...) in an active state, even though the client (pg_repack) has disconnected.
    For example, pg_repack reports:
    ERROR: query failed: SSL SYSCALL error: EOF detected
    DETAIL: query was: INSERT INTO repack.table_16451 SELECT 
  • At this point, pg_repack begins the cleanup phase, which requires an ACCESS EXCLUSIVE lock on the target table.
  • However, it cannot acquire the lock because the orphaned backend is still running the data copy and holds a conflicting lock on the table.
  • Since --no-kill-backend is used, pg_repack does not terminate the blocking backend.
  • As a result, all other read/write queries to the affected table queue behind the pending ACCESS EXCLUSIVE request and are blocked, potentially for a very long time.
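
To illustrate why a lock request that is never granted still blocks everyone else, here is a toy model of a per-table lock queue. It is only a sketch: the `TableLock` class and the session names are hypothetical, and it models just two of PostgreSQL's table lock modes. It relies on the documented behaviour that a new request conflicting with an already-waiting request has to queue behind it.

```python
# Toy model of a per-table lock queue (hypothetical, for illustration only).
# Only two of PostgreSQL's table lock modes are modelled.
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": {"ACCESS SHARE", "ACCESS EXCLUSIVE"},
}


def conflicts(requested, other):
    """Return True if the requested mode conflicts with the other mode."""
    return other in CONFLICTS[requested]


class TableLock:
    def __init__(self):
        self.granted = []  # (session, mode) pairs currently holding the lock
        self.waiting = []  # (session, mode) pairs queued in arrival order

    def request(self, session, mode):
        """Grant the lock if nothing conflicts, otherwise queue the request."""
        blocked_by_holder = any(conflicts(mode, m) for _, m in self.granted)
        blocked_by_waiter = any(conflicts(mode, m) for _, m in self.waiting)
        if blocked_by_holder or blocked_by_waiter:
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True


lock = TableLock()
results = [
    # The orphaned data copy still reads the source table.
    lock.request("orphaned INSERT ... SELECT", "ACCESS SHARE"),
    # The cleanup phase waits behind the orphaned backend.
    lock.request("pg_repack cleanup", "ACCESS EXCLUSIVE"),
    # An ordinary read now waits behind the pending ACCESS EXCLUSIVE request.
    lock.request("application SELECT", "ACCESS SHARE"),
]
print(results)  # [True, False, False]
```

In this model the application read would be compatible with the lock held by the orphaned backend, but it still waits because the cleanup request is ahead of it in the queue.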

Reproduction Steps

  1. Start pg_repack on a large table with --no-kill-backend and --wait-timeout=10 (or any value).
  2. Connect through a load balancer (e.g. HAProxy) whose timeout is shorter than the expected duration of the data copy phase, for example:
global
  log stdout format raw local0
  maxconn 4096

defaults
  log     global
  mode    tcp
  option  tcplog
  timeout connect 10s
  timeout client 30s
  timeout server 30s
  retries 3

frontend ha-frontend
  bind *:5432
  default_backend ha-backend

backend ha-backend
  server rds-primary PUT_HERE_YOUR_URI:5432 check
  3. Observe:
  • The backend on PostgreSQL continues running the INSERT ... SELECT in an active state.
  • pg_repack starts the cleanup phase and tries to acquire ACCESS EXCLUSIVE.
  • It fails to acquire the lock, but does not terminate the backend due to --no-kill-backend.
  • The table becomes inaccessible (locked) for all other clients until the backend completes or is manually terminated.
# After starting pg_repack
 pid  | query_duration  |        state        | wait_event_type |                                                                 query
------+-----------------+---------------------+-----------------+---------------------------------------------------------------------------------------------------------------------------------------
 4418 | 00:00:02.106592 | active              | IO              | INSERT INTO repack.table_16451 SELECT id,user_id,status,score,flag,event_time,notes,meta FROM ONLY public.big_data_tiny
 4420 | 00:00:02.133738 | idle in transaction | Client          | LOCK TABLE public.big_data_tiny IN ACCESS SHARE MODE


# After sudden connection loss

 pid  | query_duration  | state  | wait_event_type |                                                                 query
------+-----------------+--------+-----------------+---------------------------------------------------------------------------------------------------------------------------------------
 4273 | 00:01:39.248618 | active | IO              | INSERT INTO repack.table_16451 SELECT id,user_id,status,score,flag,event_time,notes,meta FROM ONLY public.big_data_tiny
 4370 | 00:00:00.009245 | active | Lock            | LOCK TABLE public.big_data_tiny IN ACCESS EXCLUSIVE MODE
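
For anyone trying to confirm the same state, the following is a rough sketch of how the orphaned backend could be identified. It assumes PostgreSQL 9.6 or later (for `pg_blocking_pids()`); the `WHERE` filter, the `orphaned_copy_pids` helper and the `blocked_by` value in the sample rows are assumptions for illustration and not part of the output above. The SQL strings are only defined here, not executed.

```python
import re

# Lists pg_repack-related sessions with columns similar to the listings above.
# pg_stat_activity, pg_blocking_pids() and pg_backend_pid() are standard
# PostgreSQL; the WHERE filter is an assumption for illustration.
ACTIVITY_SQL = """
SELECT pid,
       now() - query_start   AS query_duration,
       state,
       wait_event_type,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND (query LIKE 'INSERT INTO repack.table_%' OR query LIKE 'LOCK TABLE %')
ORDER BY query_start
"""

# Manual termination of the blocking backend; the %s placeholder style
# assumes a DB-API driver such as psycopg2.
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"

COPY_QUERY = re.compile(r"\s*INSERT INTO repack\.table_\d+\s+SELECT", re.IGNORECASE)


def orphaned_copy_pids(rows):
    """Return pids of active data copy backends that block an ACCESS EXCLUSIVE request.

    Each row is a dict with the keys pid, state, blocked_by and query.
    """
    copy_pids = {
        row["pid"]
        for row in rows
        if row["state"] == "active" and COPY_QUERY.match(row["query"])
    }
    orphaned = set()
    for row in rows:
        if "ACCESS EXCLUSIVE" in row["query"].upper():
            orphaned.update(pid for pid in row["blocked_by"] if pid in copy_pids)
    return sorted(orphaned)


# Sample rows modelled on the second listing; blocked_by is assumed.
rows = [
    {"pid": 4273, "state": "active", "blocked_by": [],
     "query": "INSERT INTO repack.table_16451 SELECT id,user_id,status,score,"
              "flag,event_time,notes,meta FROM ONLY public.big_data_tiny"},
    {"pid": 4370, "state": "active", "blocked_by": [4273],
     "query": "LOCK TABLE public.big_data_tiny IN ACCESS EXCLUSIVE MODE"},
]
print(orphaned_copy_pids(rows))  # [4273]
```

Terminating the reported pid with `TERMINATE_SQL` corresponds to the manual termination mentioned in the observations; whether that is acceptable depends on the environment.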
