
Store and maintain the SQL query string in pg_ivm_immv #128

Open
adamguo0 wants to merge 8 commits into sraoss:main from adamguo0:20250326

Conversation


@adamguo0 commented Mar 27, 2025

Hi, I have been looking into how we can smoothly support pg_dump/restore/upgrade with this extension (#118). This PR is part one of my idea. Any feedback is welcome!

As I understand it, there are two fundamental problems.

  • The triggers hard-code the IMMV OID, and this is not preserved by dump/restore.
  • The pg_ivm_immv table stores a serialized Query tree, and this is not preserved by dump/restore or upgrade. Specifically dump/restore changes the referenced OIDs, and upgrade can change the Query tree struct itself.

This means after dump/restore or upgrade, the IMMVs are not usable. In fact, we can't even run refresh_immv to bring them back, because the original view query is lost.

This PR consists of a few changes:

  • Add a querystring text column to pg_ivm_immv. This column stores the fully-qualified SQL query that defines this view.
  • Add a DDL event trigger to update the querystring whenever an object is renamed. This is needed so that the querystring does not become stale if any referenced objects are renamed.
  • Modify RefreshIMMVByOid so that it re-parses the querystring and stores the updated viewdef in pg_ivm_immv.
  • Add a new SQL function recreate_all_immvs that refreshes all IMMVs. This re-populates the data and re-creates all the base table and change control triggers.

With this change, a user can run pg_dump/restore/upgrade, then run pgivm.recreate_all_immvs() in order to restore their IMMVs to working order. They do not need to drop and re-create the IMMVs, and they do not need to input the original queries again.
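As a sketch of the recovery flow this PR enables (based on the PR description; recreate_all_immvs is the function this PR adds):

```sql
-- After pg_restore or pg_upgrade completes, in the restored database:
SELECT pgivm.recreate_all_immvs();

-- For each row of pgivm.pg_ivm_immv, this re-parses the stored querystring,
-- rebuilds the viewdef, repopulates the IMMV's data, and re-creates the
-- base-table and change-control triggers.
```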

I hope to find a way to automate the pgivm.recreate_all_immvs() step somehow. I played with creating executor/processutility hooks to automatically refresh the IMMVs during a restore/upgrade, but couldn't get it to work. I thought I would submit this portion of the PR first and get some feedback. Thanks!

@yugo-n (Collaborator) commented Apr 22, 2025

Thank you for your proposal!

> As I understand it, there are two fundamental problems.
>
> The triggers hard-code the IMMV OID, and this is not preserved by dump/restore.
> The pg_ivm_immv table stores a serialized Query tree, and this is not preserved by dump/restore or upgrade.
> Specifically dump/restore changes the referenced OIDs, and upgrade can change the Query tree struct itself.
>
> This means after dump/restore or upgrade, the IMMVs are not usable. In fact, we can't even run refresh_immv to bring them back, because the original view query is lost.

Yes, you are right. Also, we need to recreate dependencies after restore/upgrade, as you suggested before.

> This PR consists of a few changes:
>
> Add a querystring text column to pg_ivm_immv. This column stores the fully-qualified SQL query that defines this view.
> Add a DDL event trigger to update the querystring whenever an object is renamed. This is needed so that the querystring does not become stale if any referenced objects are renamed.
> Modify RefreshIMMVByOid so that it re-parses the querystring and stores the updated viewdef in pg_ivm_immv.
> Add a new SQL function recreate_all_immvs that refreshes all IMMVs. This re-populates the data and re-creates all the base table and change control triggers.

I am planning to fix the issue with a similar but different approach. I was going to respond with a draft patch in another pull request, but I haven't finished it, so I'll describe my idea here.

I don't like always storing the query string in pg_ivm_immv, because we already have the query_def column, so this introduces redundancy, and it needs additional maintenance, such as handling table renames, which you propose to do with a DDL event trigger.

Instead, I would like to add two user commands or scripts to be run before and after pg_dump/restore or pg_upgrade. The pre-dump/upgrade script creates a table saving the query definition string for every IMMV. The post-restore/upgrade script reads that table and recreates query_def, triggers, dependencies, and so on. For this purpose, adding a new SQL function like recreate_all_immvs (or recreate_immv, reset_immv, ...) would be useful.
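A rough sketch of this idea follows; all names here are invented for illustration (get_immv_def and recreate_immv do not exist in pg_ivm today):

```sql
-- Pre-dump/upgrade script: save a SQL definition string for every IMMV.
CREATE TABLE pgivm.immv_saved_defs AS
SELECT immvrelid::regclass::text AS immv_name,
       pgivm.get_immv_def(immvrelid) AS query_def  -- assumed ruleutils-style helper
FROM pgivm.pg_ivm_immv;

-- Post-restore/upgrade script: read the table back and rebuild each IMMV's
-- query_def, triggers, and dependencies.
SELECT pgivm.recreate_immv(immv_name, query_def)   -- assumed function
FROM pgivm.immv_saved_defs;
```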

What do you think of this approach?

@adamguo0 (Author)

Thank you for looking! I wonder if there's some way we can automate the pre-upgrade and post-upgrade commands. I think users may be hesitant to increase the complexity of database upgrades even further if pg_ivm requires more work for them. They also need to make sure that no DDLs are executed between running the pre-upgrade script and shutting down the database, and between starting the database and running the post-upgrade script, which could be challenging for a live application.

A DDL trigger could maintain this second table automatically so that we can avoid the pre-upgrade script. Then maybe a hook in pg_ivm can apply the post-upgrade logic automatically. I think the maintenance tradeoff may be worth the reduced pain of upgrading. Eager to hear your thoughts, thanks!

@yugo-n (Collaborator) commented Apr 23, 2025

Of course, I understand the importance of automation, and I would also like an automated way to handle pg_dump and pg_upgrade if we could. However, I don't think running a script when executing pg_upgrade or restoring dumped data is a big problem, because servers must be stopped during pg_upgrade, and a restore would usually be performed as part of system maintenance. Also, some extensions require executing scripts for upgrade and restore, for example, ....

It may matter that query strings saved by a script could become inconsistent with the table schemas output by pg_dump, because pg_dump is usually executed while the system is running. This could perhaps be prevented by passing a snapshot exported with pg_export_snapshot() to pg_dump's --snapshot option, though.
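The snapshot trick could look like this (the database name and snapshot id are illustrative):

```sql
-- In one session, keep a transaction open around the pre-dump script:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();        -- returns e.g. '00000003-0000001B-1'
-- ... run the pre-dump script here, inside this transaction ...

-- While the transaction above is still open, run in a shell:
--   pg_dump --snapshot=00000003-0000001B-1 mydb > mydb.dump
-- so the dump sees exactly the same database state as the saved query strings.

COMMIT;
```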

It makes sense that maintaining stored query strings with an event trigger would be a way of getting query definitions that can be used to re-create the view info. However, I am still not confident that it is the best approach, because the redundant information in the pg_ivm_immv catalog bothers me.

Also, do you have a concrete idea on how to automate the post-upgrade logic for re-creating view information without executing some script? I could not find a usable hook for it. We might do it in a trigger when INSERT or COPY is executed on pg_ivm_immv, or at the first incremental view maintenance after restore, though.

@yugo-n (Collaborator) commented Apr 23, 2025

> Also, some extensions require executing scripts for upgrade and restore, for example, ....

I forgot to show examples:

https://docs.timescale.com/self-hosted/latest/backup-and-restore/logical-backup/
https://docs.citusdata.com/en/stable/admin_guide/upgrading_citus.html

@adamguo0 (Author)

> Also, do you have a concrete idea on how to automate the post-upgrade logic for re-creating view information without executing some script?

I tried to find a hook or other way as well and had some trouble. Maybe we can use the pg_ivm _PG_init function to run the post-restore logic on startup?

Thank you for your explanation. I agree that the pre- and post-restore functions require the fewest design changes in the extension. I thought of another potential solution which avoids the need to run post-restore logic, but it still involves the query string + DDL trigger:

  • Create a new sequence-backed primary key column called id in the pg_ivm_immv table.
  • In each maintenance trigger, store the id in the function args. This id should not change between dump/restore/upgrade, so the triggers don't need to be dropped and re-created.
  • Replace the viewdef column with a querystring column, which should not change between dump/restore/upgrade. Unfortunately I can't find a way to maintain the querystring column besides the DDL trigger idea.
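A sketch of this scheme (the trigger and function names here are illustrative, not pg_ivm's actual internal names):

```sql
-- A stable id that survives dump/restore, unlike the IMMV's OID:
ALTER TABLE pgivm.pg_ivm_immv
    ADD COLUMN id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY;

-- The maintenance triggers would pass the id, not the OID, as an argument,
-- so they keep working after a restore assigns new OIDs:
CREATE TRIGGER immv_maintenance_after_ins        -- illustrative name
    AFTER INSERT ON some_base_table
    REFERENCING NEW TABLE AS ivm_newtable
    FOR EACH STATEMENT
    EXECUTE FUNCTION pgivm.immv_maintenance('7');  -- '7' = the IMMV's id
```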

@yugo-n (Collaborator) commented Apr 30, 2025

> I tried to find a hook or other way as well and had some trouble. Maybe we can use the pg_ivm _PG_init function to run the post-restore logic on startup?

That was the only thing I could think of for the post-restore logic too. In this case, we have to detect whether it is the first incremental maintenance or the first execution of any function. One concern is the impact on the performance of the first maintenance.

> Create a new sequence-backed primary key column called id in the pg_ivm_immv table.
> In each maintenance trigger, store the id in the function args. This id should not change between dump/restore/upgrade, so the triggers don't need to be dropped and re-created.
> Replace the viewdef column with a querystring column, which should not change between dump/restore/upgrade. Unfortunately I can't find a way to maintain the querystring column besides the DDL trigger idea.

Another solution I thought of was to make a (normal) view for each IMMV instead of storing the view definition in pg_ivm_immv. We would get the correct view definition after the restore, because pg_dump would dump the view definition. However, I suspect it would still not work well, because this would just shift the problem to "how to save the view name with its namespace"...

My current thought is: first introduce the pre- and post-script approach, which enables support for pg_dump/pg_upgrade with minimal changes to pg_ivm, and then continue to discuss how to automate those steps, since each of the two is definitely an improvement in user experience. For the automation, we might adopt the event trigger approach, or perhaps find some better way.

@michaelpq (Contributor)

The discussion here seems to have drifted between a fully-automated solution vs a solution that would require manual steps. FWIW, I think that both are the same thing; what matters is only the timing where we want to integrate this data and how we should include it in the dump.

Like Nagata-san I am not really a fan of the approach taken by the patch set, for a couple of reasons:

  • Maintaining a copy of the query string in pg_ivm_immv is a non-starter IMO, because it adds an extra level of complexity when changing the Query tree itself and when supporting renames of a parent table that an IMMV depends on. The duplication has a complexity cost.
  • The work of transferring the definition of the IMMV should be done with ruleutils.c at dump time, which would be the pre-dump step.
  • Making pg_ivm_immv directly dumpable is an error IMO.
  • Ordering of the actions during pre-restore. Are you sure that you would be entirely
  • The triggers should not be included in the dumps, as by design they are internal and hidden to the end-user.
  • Your patch set has a total, exact amount of zero tests. There is nothing. I would strongly suggest a set of TAP scripts for both the dump AND upgrade paths. How is one able to validate any of the logic suggested by the patch set? This consumes the time of reviewers and committers.

Again, the proposed patch set on branch 20250520_pr is messy and unorganized, so it is kind of hard for a reviewer to understand what's going on when going through all the changes. I'd recommend a rework if we want to keep going with this design. I can't help but notice that the proposed patch set has an exact number of three TODOs. That does not inspire a lot of confidence.

> The current thought is: first introduce the pre- and post-script approach, which enables support for pg_dump/pg_upgrade with minimal changes to pg_ivm, and then continue to discuss how to automate those steps, since each of the two is definitely an improvement in user experience. For the automation, we might adopt the event trigger approach, or perhaps find some better way.

I would recommend a user-controlled set of pre- and post- scripts, rather than hacks in loading paths to enforce a behavior. That could be surprising for some users if this is enforced at a stage, particularly something as early as a one-time _PG_init().

Question here for @yugo-n (aka Nagata-san). What do you have in mind in terms of contents to get during the pre-script phase? Based on a set of guesses, I would imagine a user experience like:

  • Use a pre-restore SQL function like citus and the like, that internally scans pg_ivm_immv and returns a set of pg_create_immv() calls to run during the post-restore phase, making sure that the objects are created with the same paths by using these queries. I'm afraid that you would need to remove the IMMVs from the original cluster if you don't want these to be copied during a pg_upgrade. You'd want these to be based on the newest version, so that you need to think only about backward-compatibility of the pg_create_immv() calls.
  • As a post-restore step, replay all the contents received from the pre-restore step.
  • Force the original immvs to not be included in the default dumps.
  • If you do that, it seems to me that you may not really need any C code, PL functions could get that sorted out to get the objects in need of re-creation.
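The steps above could be sketched in PL as follows; get_immv_def is an assumed ruleutils-style helper that does not exist in pg_ivm today, and the function name is invented:

```sql
CREATE FUNCTION pgivm.pre_restore_commands()     -- illustrative name
RETURNS SETOF text
LANGUAGE sql AS $$
    SELECT format('SELECT pg_create_immv(%L, %L);',
                  immvrelid::regclass::text,
                  pgivm.get_immv_def(immvrelid)) -- assumed helper
    FROM pgivm.pg_ivm_immv;
$$;

-- Save the output, then replay it as the post-restore step:
--   \copy (SELECT * FROM pgivm.pre_restore_commands()) TO 'post_restore.sql'
--   \i post_restore.sql
```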

All that comes down to how long you want the refreshes to take post-upgrade, but I think there is also an argument for simplicity, as pg_ivm has its own internal complications by mimicking the Postgres code for CTAS and matviews, especially if new internal triggers are added in pg_ivm in the future, even if it means a more costly experience because the IMMVs need to be refreshed from scratch. I think it's simpler to let the parent tables replay first anyway; CREATE MATERIALIZED VIEW + REFRESH is what happens in pg_dump for the core matviews.
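For comparison, this is the shape of what pg_dump emits for a core materialized view (table and column names here are invented): the definition early with no data, and the refresh at the end, after the parent tables' data has been restored.

```sql
CREATE MATERIALIZED VIEW public.sales_summary AS
 SELECT product_id, sum(amount) AS total
   FROM public.sales
  GROUP BY product_id
  WITH NO DATA;

-- ... all table data is restored in between ...

REFRESH MATERIALIZED VIEW public.sales_summary;
```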

@michaelpq (Contributor)

So, considering the fact that pg_ivm may introduce a different function layer when creating its IMMVs in new versions and how the module works correctly with its dependency on OIDs for its internal objects, I would imagine an upgrade or dump/restore flow to work like that:

  • Update the library of pg_ivm on the old cluster.
  • Do the pre-script phase, compatible with the new version. This should include cleanup phases in the old cluster, as well as an extract of the current objects.
  • Upgrade/dump the old cluster.
  • Prepare the new cluster, with pg_upgrade or a restore, which uses the new version of pg_ivm.
  • Do the post-script phase, recreating the IMMV objects in the new cluster, using the new module compatibility. No need to worry about the OID dependency.
