-
Notifications
You must be signed in to change notification settings - Fork 187
Description
Hi all.
At first I want to say that this case is just a corner case and I am not trying to stir up trouble. I just report it for robustness of pg_repack.
Maybe you have found that I am always reporting issues of corner cases. The reason it that I transplanted pg_repack to Alibaba Cloud PolarDB-PG and I built a test framework for it to find out the corner cases and improve the stability of this commercial database.
I executed this command with pg_repack 1.5.0:
pg_repack -Utest_user --dbname=ddl_full0 -k --table=normal.t1 --jobs=3I got the error below:
NOTICE: Setting up workers.conns
INFO: repacking table "normal.t1"
NOTICE: Canceled 1 unsafe queries
ERROR: query failed: ERROR: column "foo" does not exist
LINE 1: ...,NULL::integer AS "........pg.dropped.18........",foo,col_wi...
^
DETAIL: query was: INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1
It seems that the table does not have column foo but copy_data command has the column.
The original table and new table both don't have column foo now.
ddl_full0=# \d t1
Table "normal.t1"
Column | Type | Collation | Nullable | Default
------------------+--------------------------------+-----------+----------+--------------------------------------
id | integer | | not null |
unique1 | numeric | | |
hundred | numeric | | |
tenthous | numeric | | |
a | text | | |
b | date | | |
c | timestamp(6) without time zone | | |
d | double precision | | not null |
f_char | character(256) | | |
g_vchar | character varying(256) | | |
h_bool | boolean | | |
i_enum | enum_type | | |
j_array_bigint | bigint[] | | |
extra | jsonb | | |
e | integer | | not null | nextval('t1_e_seq'::regclass)
col_with_default | integer | | | nextval('seq_for_default'::regclass)
Indexes:
"t1_pkey" PRIMARY KEY, btree (id, e)
"t1_idx_btree_a" btree (a)
"t1_idx_g_hash" hash (g_vchar)
"t1_idx_gin_j_array_bigint" gin (j_array_bigint)
"t1_idx_gist_char" gist (f_char)
ddl_full0=# \d repack.table_29169
Table "repack.table_29169"
Column | Type | Collation | Nullable | Default
-------------------------------+--------------------------------+-----------+----------+---------
id | integer | | |
unique1 | numeric | | |
hundred | numeric | | |
tenthous | numeric | | |
a | text | | |
b | date | | |
c | timestamp(6) without time zone | | |
d | double precision | | |
f_char | character(256) | | |
g_vchar | character varying(256) | | |
h_bool | boolean | | |
i_enum | enum_type | | |
j_array_bigint | bigint[] | | |
extra | jsonb | | |
e | integer | | |
........pg.dropped.17........ | integer | | |
........pg.dropped.18........ | integer | | |
........pg.dropped.19........ | integer | | |
col_with_default | integer | | |Since pg_repack held shared lock in the txn, other session would not be able to drop the column concurrently.
2024-09-03 04:32:25.310 UTC 22026 ddl_full0 00000 0 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL SERIALIZABLE
……
2024-09-03 04:32:25.403 UTC 22026 ddl_full0 00000 2592 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE normal.t1 IN ACCESS SHARE MODE
2024-09-03 04:32:25.507 UTC 22026 ddl_full0 00000 2592 pg_repack LOG: execute <unnamed>: /*pg_catalog, pg_temp, public*/ SELECT repack.create_table('29169', 'pg_default')
2024-09-03 04:32:25.509 UTC 22026 ddl_full0 00000 0 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ ALTER TABLE repack.table_29169 ALTER hundred SET STORAGE EXTENDED
2024-09-03 04:32:25.518 UTC 22026 ddl_full0 42703 0 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1
With the audit log we can see that the column was dropped before copy_data executed:
2024-09-03 04:32:12.694 UTC 20935 ddl_full0 00000 2489 pgbench LOG: execute <unnamed>: /*normal, public*/ ALTER TABLE IF EXISTS t1 DROP COLUMN IF EXISTS foo;So it seems impossible that pg_repack's copy_data command had column foo in it. This confused me for a long time. But finally I got the answer by reading the code.
copy_data is defined in repack.tables view and the repack.get_columns_for_create_as() function is executed in repack_one_database(), then it saves the copy_data sql text into memory. At this moment, column foo is still in the table.
'INSERT INTO repack.table_' || R.oid || ' SELECT ' || repack.get_columns_for_create_as(R.oid) || ' FROM ONLY ' || repack.oid2text(R.oid) AS copy_data,
table.copy_data = getstr(res, i , c++);Before repack_one_database() calls repack_one_table(), a DDL command drops column foo. This is possible because pg_repack holds no lock in repack_one_database().
Then repack_one_database() calls repack_one_table() and executes the copy_data command generated before. Now the table definition and copy_data's table definition mismatch.
How to reproduce it:
- Add
command("select pg_sleep(20)", 0, NULL);in repack_one_database() after copy_data generated - create table test(a int primary key, b text);
- pg_repack --dbname=postgres --table=test
- Get the pid of pg_repack client and execute
gdb attach pid, pg_repack sleeps for 20s and I think it's enough for us. We can see that copy_data text includes column a and b
(gdb) f 12
#12 0x0000000000405809 in repack_one_database (orderby=0x0, errbuf=0x7ffd8562f6b0 "\240\222]", errsize=256) at pg_repack.c:1007
(gdb) p table.copy_data
$1 = 0x5f2009 "INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test"
- pg_repack hangs and now we connect to db by psql, there's no lock on table test now, so we can drop column b
postgres=# select * from pg_locks where relation = 'test'::regclass;
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | m
ode | granted | fastpath | waitstart
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+--
----+---------+----------+-----------
(0 rows)
postgres=# alter table test drop column b;
ALTER TABLE- Execute 'c' in gdb to continue, copy_data is executed and we get the error
pg_repack --dbname=postgres --table=test
INFO: repacking table "public.test"
ERROR: query failed: ERROR: column "b" does not exist
LINE 1: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.t...
^
DETAIL: query was: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test