This repository was archived by the owner on Dec 14, 2023. It is now read-only.

Replace home-grown PostgreSQL database migrations with something modern #754

@pypt

Description


Hey James!

So, are you up for fixing up some past mistakes of mine?

As you might have noticed (and as is described in the docs), we use a self-made database migrations system that I came up with. The essence of it is that to come up with a migration, one has to:

  1. Update apps/postgresql-server/schema/mediawords.sql which serves as our reference schema file.
  2. Add a new migration under apps/postgresql-server/schema/migrations/mediawords-XXX0-XXX1.sql that will get imported when the production database actually gets migrated.

This process kinda works, but there are a few issues with it:

  • Migrations don't get automatically tested at any point. When postgresql-server gets built into a Docker image, only the reference schema/mediawords.sql gets imported, and that import path isn't used at all when deploying. Instead, migrations get applied on postgresql-server container start: bin/postgresql_server.sh (via bin/apply_migrations.sh) starts the server on an unpublished port (IIRC 1234), tests whether the schema of the database is up to date, applies the migrations if needed, stops the server and then restarts it on a "proper" port. If a migration fails, the container never really starts. That's okay in the sense that we don't want a database instance with an out-of-date schema, but not okay in that we should somehow learn that a migration is broken before we even try deploying it.
  • Duplicate work. To add a table, one has to edit both schema/mediawords.sql and a migration file in schema/migrations/.
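
For example, adding a table today means writing essentially the same DDL twice (the table and the migration file name below are illustrative, not real ones):

    -- 1. In apps/postgresql-server/schema/mediawords.sql:
    CREATE TABLE example_table (
        example_table_id    BIGSERIAL   PRIMARY KEY,
        name                TEXT        NOT NULL
    );

    -- 2. And again in schema/migrations/mediawords-XXX0-XXX1.sql,
    --    alongside the set_database_schema_version() boilerplate:
    CREATE TABLE example_table (
        example_table_id    BIGSERIAL   PRIMARY KEY,
        name                TEXT        NOT NULL
    );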

So, a new migration system is due. Something like this would be a tremendous improvement IMHO:

  1. We get rid of the mediawords.sql file altogether and manage the schema only via migrations.
  2. At build time, we import all the migrations sequentially to end up with a live schema.
  3. At runtime, we maybe follow a similar approach to what's currently being done - start a "private" instance of PostgreSQL, run a migration tool against it to get it up-to-date, and then restart it to make it public.

Or is there a better way to do migrations these days?
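
Whatever the tool, the runtime part of step 3 might look roughly like what we do now (a sketch only; the port, tool name and flags are illustrative, not any real tool's CLI):

    # Start PostgreSQL on a private, unpublished port:
    pg_ctl -D "$PGDATA" -o "-p 1234" -w start

    # A hypothetical migration tool brings the schema up to date:
    migration-tool migrate --port=1234

    # Restart on the "proper", published port:
    pg_ctl -D "$PGDATA" -w stop
    pg_ctl -D "$PGDATA" -o "-p 5432" -w start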

Vague to-do:

  1. Come up with a few migration tools that you like, describe them in this issue and pick one that you think we should go with (I mentioned Flyway during our meeting but it doesn't have to be Flyway at all, just something that makes the most sense to you).
  2. (Attempt to) rewrite current migrations to said tool.
    • I think essentially you'll have to rename existing migrations somehow, get rid of the duplicate comments at the top of each migration file, and remove the set_database_schema_version() stuff, quite possibly fix a few syntax errors here and there (let me know if you get stuck with those, I might be able to help out).
    • We have quite a few migrations, and redoing them with a new migration tool might be a bit of work (even with the help of regex find-and-replace), but trying out the tool of choice on quite a few migrations might prove useful, as we'd get to find out things like: 1) how fast does it work with a hundred migrations? 2) how does it handle migrations that can't be run in a transaction? 3) how does it deal with errors? etc.
    • With that said, feel free to skip the migrations that are better off being skipped, e.g. the ones that create a table that gets dropped right away in a subsequent migration.
  3. Loosely compare schema/mediawords.sql with whatever schema got imported through those migrations and make sure that they more or less look like the same thing.
    • There might (will!) be discrepancies between what gets imported via the main mediawords.sql schema file and what ends up in the database after going through all of the migrations. If it's just column order or something like that that's different, then that's fine, but I think it's important to not miss a table or two.
    • The simplest way of comparing those would be to import schema/mediawords.sql into one database and all of the migrations into another, pg_dump both databases, and review a diff between them.
  4. Update the docs to describe the new migration process.
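
The comparison in step 3 could be scripted along these lines (database names are illustrative; psql's ON_ERROR_STOP makes broken migrations fail loudly instead of being skipped over):

    # Reference schema into one database:
    createdb schema_ref
    psql -v ON_ERROR_STOP=1 schema_ref < apps/postgresql-server/schema/mediawords.sql

    # All migrations, in order, into another:
    createdb schema_mig
    for migration in apps/postgresql-server/schema/migrations/*.sql; do
        psql -v ON_ERROR_STOP=1 schema_mig < "$migration"
    done

    # Diff the schema-only dumps of both:
    pg_dump --schema-only schema_ref > ref.sql
    pg_dump --schema-only schema_mig > mig.sql
    diff ref.sql mig.sql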

Notes, considerations and a wishlist:

  • While schema/mediawords.sql has to go, it would still be tremendously useful to have a single (auto-generated) file with the currently active schema for our own reference (i.e. something that you could look at while developing things). Maybe the container image build process could import all the migrations and then do a quick schema dump to a separate file that we could then be able to look at? Something like:

    RUN import_migrations.sh
    
    RUN pg_dump > /mediawords.sql

    So that later one could do:

    docker run dockermediacloud/postgresql-server cat /mediawords.sql > mediawords.sql

    to extract the latest schema.

    Or is there some sort of a better way?
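
    One wrinkle with the snippet above: each Dockerfile RUN step executes in a fresh layer, so a server started in one RUN step is no longer running in the next. Starting the server, importing the migrations and dumping the schema would likely all have to share a single RUN step, something like (a sketch; the database name is illustrative, and it assumes import_migrations.sh expects an already-running server):

        RUN pg_ctl -D "$PGDATA" -w start && \
            import_migrations.sh && \
            pg_dump mediacloud > /mediawords.sql && \
            pg_ctl -D "$PGDATA" -w stop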

    • One issue that comes to mind with this approach (pg_dump-generated reference schema files) is that we'd lose the -- comments that we have in the schema. Some of these comments are not particularly useful (e.g. feeds.type column is described as -- Feed type :)) but others are something that we'd like to retain. Maybe the most useful comments could be ported to COMMENT ON statements? Or would this be too much of a hassle?
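
      Porting a comment to a COMMENT ON statement would keep it attached to the column in the system catalog, so it survives a pg_dump and shows up in psql's \d+ output, e.g.:

          COMMENT ON COLUMN feeds.type IS 'Feed type';
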
  • It would be nice to retain some way of applying any given migration manually before deploying the rebuilt postgresql-server service. Sometimes migrations that work fine on an empty / testing database don't work quite as well on a production one (e.g. CREATE INDEX on a table with a billion rows), so it's sometimes useful to apply the migration manually before deploying anything. If the migration tool could print out the SQL that it would run against the live database, instead of insisting on running the SQL itself, that would be pretty great! If not, then oh well, we'll think of something.

  • If it so happens that your migration tool of choice uses Java, consider using (and rebasing postgresql-base on) the java-base app, which is a base container image for apps that use Java.

  • If you don't like it that migrations get applied at deployment time, that's up for a discussion too - it just seemed to make sense to me so I did it that way, but maybe there's a better way (point in time) to apply those migrations.

Years ago, I started doing this in a flyway branch but never finished the work. What might be useful for you in this branch is the pre-migrations full schema file (the first "migration", so to speak) and a bunch of migrations with some fixed-up SQL.

While doing the task, please keep in mind that if something seems to take too much time (e.g. porting existing migrations or rewriting comments to COMMENT ON statements), we could always decide to skip the too-time-consuming parts of this task (e.g. porting existing migrations is just a nice-to-have, and comments on various tables could potentially be fished out some other way). Don't let yourself get scope-creeped!

As always, do let me know if you have any questions and / or need any help.
