This repository was archived by the owner on Dec 14, 2023. It is now read-only.

Replace home-grown PostgreSQL database migrations with something modern #754

@pypt

Description


Hey James!

So, are you up for fixing up some past mistakes of mine?

As you might have noticed (and as is described in the docs), we use a self-made database migrations system that I came up with. The essence of it is that to come up with a migration, one has to:

  1. Update apps/postgresql-server/schema/mediawords.sql which serves as our reference schema file.
  2. Add a new migration under apps/postgresql-server/schema/migrations/mediawords-XXX0-XXX1.sql that will get imported when the production database actually gets migrated.

This process kinda works, but there are a few issues with it:

  • Migrations don't get automatically tested at any point. When postgresql-server gets built into a Docker image, only the reference schema/mediawords.sql gets imported, and that import path isn't used at all when deploying. Instead, migrations get applied on postgresql-server container start: bin/postgresql_server.sh (via bin/apply_migrations.sh) starts the server on an unpublished port (IIRC 1234), tests whether the schema of the database is up to date, applies the migrations if needed, stops the server and then restarts it on a "proper" port. If a migration fails, the container never really starts. That's okay in the sense that we don't want a database instance with an out-of-date schema, but not okay in that we should somehow learn that a migration is broken before we even try deploying it.
  • Duplicate work. To add a table, one has to edit both schema/mediawords.sql and a migration file in schema/migrations/.
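
For example, adding a table today means writing essentially the same DDL twice (the table and the migration file name below are illustrative, not real ones):

    -- 1. In apps/postgresql-server/schema/mediawords.sql:
    CREATE TABLE example_table (
        example_table_id    BIGSERIAL   PRIMARY KEY,
        name                TEXT        NOT NULL
    );

    -- 2. And again in schema/migrations/mediawords-XXX0-XXX1.sql,
    --    alongside the set_database_schema_version() boilerplate:
    CREATE TABLE example_table (
        example_table_id    BIGSERIAL   PRIMARY KEY,
        name                TEXT        NOT NULL
    );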

So, a new migration system is due. Something like this would be a tremendous improvement IMHO:

  1. We get rid of the mediawords.sql file altogether and manage the schema only via migrations.
  2. At build time, we import all the migrations sequentially to end up with a live schema.
  3. At runtime, we maybe follow a similar approach to what's currently being done - start a "private" instance of PostgreSQL, run a migration tool against it to get it up-to-date, and then restart it to make it public.

Or is there a better way to do migrations these days?
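
Whatever the tool, the runtime part of step 3 might look roughly like what we do now (a sketch only; the port, tool name and flags are illustrative, not any real tool's CLI):

    # Start PostgreSQL on a private, unpublished port:
    pg_ctl -D "$PGDATA" -o "-p 1234" -w start

    # A hypothetical migration tool brings the schema up to date:
    migration-tool migrate --port=1234

    # Restart on the "proper", published port:
    pg_ctl -D "$PGDATA" -w stop
    pg_ctl -D "$PGDATA" -o "-p 5432" -w start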

Vague to-do:

  1. Come up with a few migration tools that you like, describe them in this issue and pick one that you think we should go with (I mentioned Flyway during our meeting but it doesn't have to be Flyway at all, just something that makes the most sense to you).
  2. (Attempt to) rewrite current migrations to said tool.
    • I think essentially you'll have to rename existing migrations somehow, get rid of the duplicate comments at the top of each migration file, and remove the set_database_schema_version() stuff, quite possibly fix a few syntax errors here and there (let me know if you get stuck with those, I might be able to help out).
    • We have quite a few migrations, and redoing them with a new migration tool might be a bit of work (even with the help of regex find-and-replace), but trying out the tool of choice on quite a few migrations might prove useful, as we'd get to find out things like: 1) how fast does it work with a hundred migrations? 2) how does it handle migrations that can't be run in a transaction? 3) how does it deal with errors? etc.
    • With that said, feel free to skip the migrations that are better off being skipped, e.g. the ones that create a table that gets dropped right away in a subsequent migration.
  3. Loosely compare schema/mediawords.sql with whatever schema got imported through those migrations and make sure that they more or less look like the same thing.
    • There might (will!) be discrepancies between what gets imported via the main mediawords.sql schema file and what ends up in the database after going through all of the migrations. If it's just column order or something like that that's different, then that's fine, but I think it's important to not miss a table or two.
    • The simplest way of comparing those would be to import schema/mediawords.sql into one database and all of the migrations into another, pg_dump both databases, and review a diff between them.
  4. Update the docs to describe the new migration process.
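
The comparison in step 3 could be scripted along these lines (database names are illustrative; psql's ON_ERROR_STOP makes broken migrations fail loudly instead of being skipped over):

    # Reference schema into one database:
    createdb schema_ref
    psql -v ON_ERROR_STOP=1 schema_ref < apps/postgresql-server/schema/mediawords.sql

    # All migrations, in order, into another:
    createdb schema_mig
    for migration in apps/postgresql-server/schema/migrations/*.sql; do
        psql -v ON_ERROR_STOP=1 schema_mig < "$migration"
    done

    # Diff the schema-only dumps of both:
    pg_dump --schema-only schema_ref > ref.sql
    pg_dump --schema-only schema_mig > mig.sql
    diff ref.sql mig.sql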

Notes, considerations and a wishlist:

  • While schema/mediawords.sql has to go, it would still be tremendously useful to have a single (auto-generated) file with the currently active schema for our own reference (i.e. something that you could look at while developing things). Maybe the container image build process could import all the migrations and then do a quick schema dump to a separate file that we could then be able to look at? Something like:

    RUN import_migrations.sh
    
    RUN pg_dump > /mediawords.sql

    So that later one could do:

    docker run dockermediacloud/postgresql-server cat /mediawords.sql > mediawords.sql

    to extract the latest schema.

    Or is there some sort of a better way?
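
    One wrinkle with the snippet above: each Dockerfile RUN step executes in a fresh layer, so a server started in one RUN step is no longer running in the next. Starting the server, importing the migrations and dumping the schema would likely all have to share a single RUN step, something like (a sketch; the database name is illustrative, and it assumes import_migrations.sh expects an already-running server):

        RUN pg_ctl -D "$PGDATA" -w start && \
            import_migrations.sh && \
            pg_dump mediacloud > /mediawords.sql && \
            pg_ctl -D "$PGDATA" -w stop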

    • One issue that comes to mind with this approach (pg_dump-generated reference schema files) is that we'd lose the -- comments that we have in the schema. Some of these comments are not particularly useful (e.g. feeds.type column is described as -- Feed type :)) but others are something that we'd like to retain. Maybe the most useful comments could be ported to COMMENT ON statements? Or would this be too much of a hassle?
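
      Porting a comment to a COMMENT ON statement would keep it attached to the column in the system catalog, so it survives a pg_dump and shows up in psql's \d+ output, e.g.:

          COMMENT ON COLUMN feeds.type IS 'Feed type';
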
  • It would be nice to retain some way of applying any given migration manually before deploying the rebuilt postgresql-server service. Sometimes migrations that work fine on an empty / testing database don't work quite as well on a production one (e.g. CREATE INDEX on a table with a billion rows), so it's sometimes useful to apply the migration manually before deploying anything. If the migration tool could print out the SQL that it would run against the live database, instead of insisting on running the SQL itself, that would be pretty great! If not, then oh well, we'll think of something.

  • If it so happens that your migration tool of choice uses Java, consider using (and rebasing postgresql-base on) the java-base app, which is a base container image for apps that use Java.

  • If you don't like it that migrations get applied at deployment time, that's up for a discussion too - it just seemed to make sense to me so I did it that way, but maybe there's a better way (point in time) to apply those migrations.

Years ago, I started doing this in a flyway branch but never finished the work. What might be useful for you in this branch is the pre-migrations full schema file (the first "migration", so to speak) and a bunch of migrations with some fixed-up SQL.

While doing the task, please keep in mind that if something seems to take too much time (e.g. porting existing migrations or rewriting comments to COMMENT ON statements), we could always decide to skip the too-time-consuming parts of this task (e.g. porting existing migrations is just a nice-to-have, and comments on various tables could potentially be fished out some other way). Don't let yourself get scope-creeped!

As always, do let me know if you have any questions and / or need any help.
