Replace home-grown PostgreSQL database migrations with something modern #754
Description
Hey James!
So, are you up for fixing up some past mistakes of mine?
As you might have noticed (and as is described in the docs), we use a self-made database migrations system that I came up with. The essence of it is that to create a migration, one has to:

- Update `apps/postgresql-server/schema/mediawords.sql`, which serves as our reference schema file.
- Add a new migration under `apps/postgresql-server/schema/migrations/mediawords-XXX0-XXX1.sql` that will get imported when the production database actually gets migrated.
This process kinda works, but there are a few issues with it:
- Migrations don't get automatically tested at any point. When `postgresql-server` gets built into a Docker image, only the reference `schema/mediawords.sql` gets imported, which is not at all what we do when deploying. Instead, migrations get applied on `postgresql-server` container start, i.e. `bin/postgresql_server.sh` (via `bin/apply_migrations.sh`) starts the server on an unpublished port (IIRC `1234`), tests whether the schema of the database is up to date, applies the migrations if needed, stops the server, and then restarts it on a "proper" port. If a migration fails, the container never really starts - which is okay because we don't want a database instance with an out-of-date schema, but also not okay because we should somehow learn that the migration is broken before we even try deploying it.
- Duplicate work. To add a table, one has to edit both `schema/mediawords.sql` and a migration file in `schema/migrations/`.
So, a new migration system is due. Something like this would be a tremendous improvement IMHO:
- We get rid of the `mediawords.sql` file altogether and manage the schema only via migrations.
- At build time, we import all the migrations sequentially to end up with a live schema.
- At runtime, we maybe follow a similar approach to what's currently being done: start a "private" instance of PostgreSQL, run a migration tool against it to get it up to date, and then restart it to make it public.
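To make the build-time idea a bit more concrete, here's a rough Dockerfile sketch. Everything in it is an assumption, not a decision: Flyway is just one candidate tool, and `start_temporary_postgres.sh` / `stop_temporary_postgres.sh` are hypothetical helpers, not existing scripts.

```dockerfile
# Apply every migration while building the image, so a broken migration
# fails the image build instead of a production deployment.
COPY schema/migrations/ /opt/migrations/

RUN start_temporary_postgres.sh && \
    flyway -url=jdbc:postgresql://localhost:5432/mediacloud \
           -locations=filesystem:/opt/migrations migrate && \
    stop_temporary_postgres.sh
```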
Or is there a better way to do migrations these days?
Vague to-do:
- Come up with a few migration tools that you like, describe them in this issue, and pick the one that you think we should go with (I mentioned Flyway during our meeting, but it doesn't have to be Flyway at all, just something that makes the most sense to you).
- (Attempt to) rewrite the current migrations to said tool.
  - I think you'll essentially have to rename the existing migrations somehow, get rid of the duplicate comments at the top of each migration file, remove the `set_database_schema_version()` stuff, and quite possibly fix a few syntax errors here and there (let me know if you get stuck with those, I might be able to help out).
  - We have quite a few migrations, and redoing them with a new migration tool might be a bit of work (even with the help of regex find-and-replace), but trying out the new tool of choice on quite a few migrations might prove useful, as we'd get to find out things like: 1) how fast does it work with a hundred migrations? 2) how does it handle migrations that can't be run in a transaction? 3) how does it deal with errors? etc.
  - With that said, feel free to skip the migrations that are better off being skipped, e.g. the ones that create a table that gets dropped right away in a subsequent migration.
- Loosely compare `schema/mediawords.sql` with whatever schema got imported through those migrations and make sure that they more or less look like the same thing.
  - There might (will!) be discrepancies between what gets imported via the main `mediawords.sql` schema file and what ends up in the database after going through all of the migrations. If it's just column order or something like that that's different, then that's fine, but I think it's important to not miss a table or two.
  - The simplest way of comparing those would be to import `schema/mediawords.sql` into one database and all of the migrations into another, `pg_dump` both databases, and review a diff between them.
- Update the docs to describe the new migration process.
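For the renaming step above, a quick shell sketch could do most of the bulk work. This assumes Flyway's default `V<version>__<description>.sql` naming convention and uses the target schema version from our `mediawords-<from>-<to>.sql` filenames as the Flyway version; the sample filenames below are made up for illustration.

```shell
# Work in a scratch directory with a couple of sample migration files
# standing in for schema/migrations/.
cd "$(mktemp -d)"
touch mediawords-4480-4481.sql mediawords-4481-4482.sql

for f in mediawords-*-*.sql; do
    versions="${f#mediawords-}"   # "4480-4481.sql"
    versions="${versions%.sql}"   # "4480-4481"
    from="${versions%-*}"         # "4480"
    to="${versions#*-}"           # "4481"
    # Flyway convention: V<version>__<description>.sql
    mv "$f" "V${to}__mediawords_${from}_${to}.sql"
done
```

After the rename, the duplicate header comments and the `set_database_schema_version()` calls would still need to be stripped from each file, but that part is probably best done with a regex pass.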
Notes, considerations and a wishlist:
- While `schema/mediawords.sql` has to go, it would still be tremendously useful to have a single (auto-generated) file with the currently active schema for our own reference (i.e. something that you could look at while developing things). Maybe the container image build process could import all the migrations and then do a quick schema dump to a separate file that we could then be able to look at? Something like:

  ```dockerfile
  RUN import_migrations.sh
  RUN pg_dump > /mediawords.sql
  ```

  So that later one could do:

  ```shell
  docker run dockermediacloud/postgresql-server cat /mediawords.sql > mediawords.sql
  ```

  to extract the latest schema. Or is there some sort of a better way?
  - One issue that comes to mind with this approach (`pg_dump`-generated reference schema files) is that we'd lose the `-- comments` that we have in the schema. Some of these comments are not particularly useful (e.g. the `feeds.type` column is described as `-- Feed type:`) but others are something that we'd like to retain. Maybe the most useful comments could be ported to `COMMENT ON` statements? Or would this be too much of a hassle?
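If we do go the `COMMENT ON` route, porting an inline comment is a one-liner per object. A sketch, with `feeds.type` being the column mentioned above and the comment text being illustrative rather than copied from the schema:

```sql
-- Attach the comment to the column itself instead of keeping it as an
-- inline "-- comment" in the schema file; pg_dump includes these in dumps.
COMMENT ON COLUMN feeds.type IS 'Feed type';

-- Table-level comments work the same way:
COMMENT ON TABLE feeds IS 'Feeds that get fetched periodically';
```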
- It would be nice to have (retain) some way of applying any given migration manually before deploying the rebuilt `postgresql-server` service. Sometimes migrations that work fine on an empty / testing database don't work quite as well on a production one (e.g. `CREATE INDEX` on a table with a billion rows), so sometimes it's useful to apply a migration manually before deploying anything. If the migration tool is able to print out the SQL that it would run on the live database instead of insisting on running the SQL itself, that would be pretty great! If not, then oh well, we'll think of something.
- If it so happens that your migration tool of choice uses Java, consider using (and rebasing `postgresql-base` on) the `java-base` app, which is a base container image for apps that use Java.
- If you don't like that migrations get applied at deployment time, that's up for discussion too - it just seemed to make sense to me, so I did it that way, but maybe there's a better way (point in time) to apply those migrations.
Years ago, I started doing this in a `flyway` branch but never finished the work. What might be useful for you in that branch is the pre-migrations full schema file (the first "migration", so to say) and a bunch of migrations with some fixed-up SQL.
While doing the task, please keep in mind that if something seems to take too much time (e.g. porting existing migrations or rewriting comments into `COMMENT ON` statements), we could always decide to skip the too-time-consuming parts of this task (e.g. porting the existing migrations is just a nice-to-have, and comments on various tables could potentially be fished out some other way). Don't let yourself get scope-creeped!
As always, do let me know if you have any questions and / or need any help.