Skip to content

copy_data fail if table definition changed after copy_data SQL string saved to memory in repack_one_database() #421

@zhanghien

Description

@zhanghien

Hi all.

At first I want to say that this case is just a corner case and I am not trying to stir up trouble. I just report it for robustness of pg_repack.
Maybe you have found that I am always reporting issues of corner cases. The reason it that I transplanted pg_repack to Alibaba Cloud PolarDB-PG and I built a test framework for it to find out the corner cases and improve the stability of this commercial database.

I executed this command with pg_repack 1.5.0:

pg_repack -Utest_user --dbname=ddl_full0 -k --table=normal.t1 --jobs=3

I got the error below:

NOTICE: Setting up workers.conns
INFO: repacking table "normal.t1"
NOTICE: Canceled 1 unsafe queries
ERROR: query failed: ERROR:  column "foo" does not exist
LINE 1: ...,NULL::integer AS "........pg.dropped.18........",foo,col_wi...
                                                                                             ^
DETAIL: query was: INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1

It seems that the table does not have column foo but copy_data command has the column.

The original table and new table both don't have column foo now.

ddl_full0=# \d t1
                                                Table "normal.t1"
      Column      |              Type              | Collation | Nullable |               Default
------------------+--------------------------------+-----------+----------+--------------------------------------
 id               | integer                        |           | not null |
 unique1          | numeric                        |           |          |
 hundred          | numeric                        |           |          |
 tenthous         | numeric                        |           |          |
 a                | text                           |           |          |
 b                | date                           |           |          |
 c                | timestamp(6) without time zone |           |          |
 d                | double precision               |           | not null |
 f_char           | character(256)                 |           |          |
 g_vchar          | character varying(256)         |           |          |
 h_bool           | boolean                        |           |          |
 i_enum           | enum_type                      |           |          |
 j_array_bigint   | bigint[]                       |           |          |
 extra            | jsonb                          |           |          |
 e                | integer                        |           | not null | nextval('t1_e_seq'::regclass)
 col_with_default | integer                        |           |          | nextval('seq_for_default'::regclass)
Indexes:
    "t1_pkey" PRIMARY KEY, btree (id, e)
    "t1_idx_btree_a" btree (a)
    "t1_idx_g_hash" hash (g_vchar)
    "t1_idx_gin_j_array_bigint" gin (j_array_bigint)
    "t1_idx_gist_char" gist (f_char)

ddl_full0=# \d repack.table_29169
                                   Table "repack.table_29169"
            Column             |              Type              | Collation | Nullable | Default
-------------------------------+--------------------------------+-----------+----------+---------
 id                            | integer                        |           |          |
 unique1                       | numeric                        |           |          |
 hundred                       | numeric                        |           |          |
 tenthous                      | numeric                        |           |          |
 a                             | text                           |           |          |
 b                             | date                           |           |          |
 c                             | timestamp(6) without time zone |           |          |
 d                             | double precision               |           |          |
 f_char                        | character(256)                 |           |          |
 g_vchar                       | character varying(256)         |           |          |
 h_bool                        | boolean                        |           |          |
 i_enum                        | enum_type                      |           |          |
 j_array_bigint                | bigint[]                       |           |          |
 extra                         | jsonb                          |           |          |
 e                             | integer                        |           |          |
 ........pg.dropped.17........ | integer                        |           |          |
 ........pg.dropped.18........ | integer                        |           |          |
 ........pg.dropped.19........ | integer                        |           |          |
 col_with_default              | integer                        |           |          |

Since pg_repack held shared lock in the txn, other session would not be able to drop the column concurrently.

2024-09-03 04:32:25.310 UTC 22026 ddl_full0  00000 0 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL SERIALIZABLE
……
2024-09-03 04:32:25.403 UTC 22026 ddl_full0  00000 2592 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE normal.t1 IN ACCESS SHARE MODE
2024-09-03 04:32:25.507 UTC 22026 ddl_full0  00000 2592 pg_repack LOG:  execute <unnamed>: /*pg_catalog, pg_temp, public*/ SELECT repack.create_table('29169', 'pg_default')
2024-09-03 04:32:25.509 UTC 22026 ddl_full0  00000 0 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ ALTER TABLE repack.table_29169 ALTER hundred SET STORAGE EXTENDED
2024-09-03 04:32:25.518 UTC 22026 ddl_full0  42703 0 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1

With the audit log we can see that the column was dropped before copy_data executed:

2024-09-03 04:32:12.694 UTC 20935 ddl_full0  00000 2489 pgbench LOG:  execute <unnamed>: /*normal, public*/ ALTER TABLE IF EXISTS t1 DROP COLUMN IF EXISTS foo;

So it seems impossible that pg_repack's copy_data command had column foo in it. This confused me for a long time. But finally I got the answer by reading the code.

copy_data is defined in repack.tables view and the repack.get_columns_for_create_as() function is executed in repack_one_database(), then it saves the copy_data sql text into memory. At this moment, column foo is still in the table.

'INSERT INTO repack.table_' || R.oid || ' SELECT ' || repack.get_columns_for_create_as(R.oid) || ' FROM ONLY ' || repack.oid2text(R.oid) AS copy_data,
table.copy_data = getstr(res, i , c++);

Before repack_one_database() calls repack_one_table(), a DDL command drops column foo. This is possible because pg_repack holds no lock in repack_one_database().
Then repack_one_database() calls repack_one_table() and executes the copy_data command generated before. Now the table definition and copy_data's table definition mismatch.

How to reproduce it:

  1. Add command("select pg_sleep(20)", 0, NULL); in repack_one_database() after copy_data generated
  2. create table test(a int primary key, b text);
  3. pg_repack --dbname=postgres --table=test
  4. Get the pid of pg_repack client and execute gdb attach pid, pg_repack sleeps for 20s and I think it's enough for us. We can see that copy_data text includes column a and b
(gdb) f    12
#12 0x0000000000405809 in repack_one_database (orderby=0x0, errbuf=0x7ffd8562f6b0 "\240\222]", errsize=256) at pg_repack.c:1007
(gdb) p    table.copy_data
$1 = 0x5f2009 "INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test"
  1. pg_repack hangs and now we connect to db by psql, there's no lock on table test now, so we can drop column b
postgres=# select * from pg_locks where relation = 'test'::regclass;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | m
ode | granted | fastpath | waitstart
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+--
----+---------+----------+-----------
(0 rows)
postgres=# alter table test drop column b;
ALTER TABLE
  1. Execute 'c' in gdb to continue, copy_data is executed and we get the error
pg_repack --dbname=postgres --table=test
INFO: repacking table "public.test"
ERROR: query failed: ERROR:  column "b" does not exist
LINE 1: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.t...
                                               ^
DETAIL: query was: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions