
Postgres Migration #1190

Draft

dancoates wants to merge 41 commits into dev from postgres-migration

Conversation

@dancoates
Collaborator

No description provided.

@codecov

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 74.70912% with 413 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.40%. Comparing base (782cab8) to head (9932472).
⚠️ Report is 5 commits behind head on dev.

Files with missing lines Patch % Lines
api/graphql/schema.py 48.96% 74 Missing ⚠️
api/graphql/loaders.py 61.25% 62 Missing ⚠️
db/python/tables/sample.py 78.29% 28 Missing ⚠️
db/python/tables/analysis.py 64.47% 27 Missing ⚠️
db/python/tables/participant.py 85.90% 21 Missing ⚠️
api/graphql/mutations/project.py 35.48% 20 Missing ⚠️
api/graphql/mutations/sample.py 20.83% 19 Missing ⚠️
api/graphql/mutations/cohort.py 56.75% 16 Missing ⚠️
api/graphql/mutations/analysis.py 6.25% 15 Missing ⚠️
db/python/tables/sequencing_group.py 87.87% 12 Missing ⚠️
... and 25 more
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1190      +/-   ##
==========================================
- Coverage   83.20%   78.40%   -4.80%     
==========================================
  Files         216      167      -49     
  Lines       18740    14402    -4338     
==========================================
- Hits        15592    11292    -4300     
+ Misses       3148     3110      -38     

☔ View full report in Codecov by Sentry.

@github-actions

github-actions bot commented Feb 6, 2026

PR Checklist ✅

Have you checked the following?

  • Changes do not contain the names of any sensitive CPG projects
  • Tests cover any newly added code
  • There aren't any unnecessary autoformatting changes
  • Pull request description is appropriate for use as a commit message for squash commit

Check all boxes before merging.

…abase (#1183)

As the first step of switching to postgres we need a working postgres
database with the correct schema. This change removes the old liquibase
changelog and replaces it with a folder of dbmate migrations. The
migrations are split into a few steps:

1. The base tables
2. Addition of sys_period columns to allow for temporal tables
3. Addition of history tables based on main tables
4. Addition of triggers to automatically write changes to history tables
5. Addition of all indexes and foreign key constraints - this is done as
a separate step to the table creation to make it easier to bulk import
the initial data without needing to do it in the correct order.

This postgres schema fixes a few longstanding issues with the mariadb
schema:

- Removed `_external_id_unused` columns from the family, participant, and
sample tables
- Added missing audit log id columns to the analysis_outputs and output_file
tables
- Fixed the type of the output_file meta column to be json rather than varchar
- Renamed nullIfInactive to null_if_archived in the
sequencing_group_external_id table, for consistency with other column
names and the `archived` column in the sequencing_group table
- Added the missing primary key to the sequencing_group_external_id table
- Removed the full-text index on the comment table, as it wasn't being used

The Dockerfile in the `/db` directory is based on postgres 18, but
additionally installs dbmate and the postgres temporal tables extension.
The docker-compose file only has a single service at the moment but
allows easy management of the necessary volumes.

To run database migrations, run:

```bash
docker compose exec postgres dbmate up
```

and to run the database, run `docker compose up`

The `/db/schema.sql` file is automatically generated by dbmate, and can
be used to keep track of the effect of new migrations on the overall
schema.

This updates the existing `connect.py` to add an implementation of
postgres connections using psycopg and an async connection pool. The
choice to use psycopg is largely due to its postgres-specific features,
like supporting row_cursors, supporting COPY, and better parameterization
of queries. The `databases` library that we have been using for mariadb
is also now archived, so it is a good time to move to something else.

This also migrates the queries needed to get a list of the user's
projects, as an example of the postgres connection in use. The rest of
the API is broken.

This still uses testcontainers but takes a different approach to the
previous test setup. The tests have been very slow for a while due to
heavy set up and tear down requirements for each test. This update
utilizes the ability of postgres to very quickly create a database based
on another template database. This allows us to run the database
migrations once, and then clone the template database for each test.

In addition, this disables a bunch of durability settings for postgres
and uses a ram disk for data storage, which should help make the
database creation process as quick as possible.

This also commences a move to a more idiomatic pytest test configuration
using fixtures for the database setup. This allows tests to easily
request what resources they need, including what permissions they want
the test user to have.

See the test README.md for more information and examples of tests.

As most tests haven't been ported yet, there is an include list of
working tests specified in the `pytest.toml` file to stop pytest from
trying to run the broken ones. We can extend this as we port tests
across to the new setup, and eventually remove the include list once all
tests are ported.

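The clone step itself is cheap because postgres copies the template at
the storage level. A minimal sketch of building the clone statement
(function and database names here are illustrative, not the actual
fixture code):

```python
def clone_database_sql(template: str, name: str) -> str:
    """Build a statement that clones a template database.

    Postgres copies the template's schema and data directly, which is
    much faster than re-running the migrations for every test.
    """
    def quote_ident(ident: str) -> str:
        # Identifiers can't be bound as query parameters, so
        # double-quote them and escape any embedded quotes.
        return '"' + ident.replace('"', '""') + '"'

    return f'CREATE DATABASE {quote_ident(name)} TEMPLATE {quote_ident(template)}'
```

One constraint worth noting: the template database can't have active
connections while it is being cloned, so the fixture runs the migrations
once, disconnects, and then issues one of these statements per test.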
We use mariadb's JSON_MERGE_PATCH function in several places to merge
entity meta fields. Postgres doesn't have a directly comparable function
that we can use. We could switch to postgres' json concat operator `||`,
which does merge jsonb documents, but it lacks the more advanced
functionality of JSON_MERGE_PATCH, such as updating nested objects.

While we could probably work around not having the ability to update
nested objects, the main feature of JSON_MERGE_PATCH that we use
frequently is the ability to delete top level keys by passing in a null
value as the patch. Postgres' json concat doesn't have a way to do this.

So, to avoid breaking changes, it is best to implement a spec-compliant
version of JSON_MERGE_PATCH that we can use.

The implementation is in the migration, and there are tests of all the
examples from the RFC, as well as some extra edge cases.
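For reference, the RFC 7386 merge-patch semantics the SQL function
follows can be expressed in a few lines of Python (illustrative only;
the real implementation lives in the migration):

```python
def json_merge_patch(target, patch):
    """RFC 7386 merge patch: recursively merge objects, and treat a
    null (None) patch value as deletion of that key."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null deletes the key
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result
```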

There is a significant performance loss in going down this route: the
new json_merge_patch function is about 10 times slower than the built-in
json concat with `||`. However, we only use this method on entity
updates and never do very large batch updates, so this shouldn't be too
much of a problem.


__Benchmark__

```sql
CREATE TEMP TABLE benchmark_json_data AS
SELECT 
    id,
    -- Target: nested JSON with various depths
    jsonb_build_object(
        'id', id,
        'name', 'item_' || id,
        'count', id * 10,
        'active', (id % 2 = 0),
        'metadata', jsonb_build_object(
            'created', now()::text,
            'version', 1,
            'tags', jsonb_build_array('tag1', 'tag2', 'tag3'),
            'nested', jsonb_build_object(
                'level2', jsonb_build_object(
                    'level3', jsonb_build_object(
                        'deep_value', id * 100
                    )
                )
            )
        ),
        'settings', jsonb_build_object(
            'option_a', true,
            'option_b', false,
            'option_c', 'value_' || id
        )
    ) as target,
    -- Patch: updates some values, adds new ones, removes others (null)
    jsonb_build_object(
        'name', 'updated_item_' || id,
        'new_field', 'new_value_' || id,
        'metadata', jsonb_build_object(
            'version', 2,
            'updated', now()::text,
            'nested', jsonb_build_object(
                'level2', jsonb_build_object(
                    'level3', jsonb_build_object(
                        'new_deep', 'added'
                    )
                )
            )
        ),
        'settings', jsonb_build_object(
            'option_a', false,
            'option_d', 'new_option'
        )
    ) as patch
FROM generate_series(1, 100000) as id;
```

```sql

SELECT id, target || patch as merged 
FROM benchmark_json_data;
-- ~350ms
```

```sql
SELECT id, json_merge_patch(target, patch) as merged 
FROM benchmark_json_data;
-- ~3000ms
```

Notable changes:

- The project deletion query is updated to run as multiple queries, as
psycopg doesn't support multiple statements in the same request
- Merging project meta is now done using the custom json_merge_patch
function introduced in the previous commit, to maintain functional
compatibility with the mariadb implementation
- Tests no longer need to do manual manipulation of groups; they can use
the test fixtures introduced recently

dancoates and others added 10 commits February 6, 2026 15:17

This will allow us to use template strings to build sql queries. It
also updates all other python dependencies to improve compatibility with
python 3.14. This was necessary for some dependencies, particularly
those written in rust that use PyO3 for bindings, as older versions of
PyO3 didn't support 3.14 and attempting to build them failed with the
error:

```
the configured Python interpreter version (3.14) is newer than PyO3's maximum supported version (3.13)
```

Other packages are updated as we'll need to do comprehensive testing of
this anyway so now seemed like as good a time as any.

After updating to 3.14 I immediately noticed that the tests were taking
about 10x as long as previously. Investigation led down a bit of a
rabbit hole but ended up at the credential helpers specified in
~/.docker/config.json for gcloud container authentication. For some
reason during tests these credential helpers were being invoked many
times and they are quite slow so were greatly impacting test times. This
slowdown wasn't present when we were using 3.11, I think because the
credential helpers used the active python version rather than their own
one. Something about spinning up a process with a different python
version was very slow.

To work around this, there is a new fixture introduced in the tests that
creates an empty docker config and sets that as the config to use. The
test doesn't require the credential helpers as the container it uses is
not in artifact registry or gcr. This problem wouldn't have existed in
CI, but this fixture won't break anything there either.

This also updates some settings and CI workflows that were pointing to 3.11.

This adds a second ruff config file for the python package, which
specifies 3.10 compatibility. This allows the rest of the codebase to
use 3.14 features.

Storing the active postgres connection on the `Connection` object
simplifies instances where a transaction is initiated in the `layer/`
function but executed in the `table/` function.
Prior to these changes, the connection pool was stored on `Connection`,
which resulted in ambiguity in the `table/` functions about which
connection to use.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>

This also rewrites the tests to actually run the filter queries against
some data in the database.

Some bits of the query building have been removed to reduce complexity;
checks like whether a query would logically result in an overall falsey
WHERE statement aren't necessary, as the query planner is smart enough
to optimize that away.

Filtering by querying json blobs in the db is left out of this change,
as the correct approach is yet to be established.

Adds a migration to add default values to the following tables:
```
assay_type
sample_type
sequencing_type
sequencing_technology
sequencing_platform
analysis_type
```

This takes the existing assay table methods and replaces them with
postgres compatible ones. Also updates the tests to the new pytest
structure.

notable changes:

- Removes the `open_transaction` param from assay methods; this isn't
  needed, as we can check whether we're in a transaction and
  conditionally avoid creating a nested one (nested transactions are
  supported in postgres via savepoints, but there's a performance impact
  and we don't need them, so best to avoid)
- Some dead/unused code has been removed

not implemented:

- Some tests that relate to sequencing groups are skipped
- Some bits of tests that relate to querying meta are commented out
- There's some temporary fixtures added to create the necessary samples
  for assays to be attached to, these should be replaced with ones that
  use python methods once samples/sgs are migrated

This helper on the connection class allows us to conditionally enter a
transaction if not already in one. Psycopg supports nested transactions
via postgres savepoints, but partial rollbacks are not a feature that we
actually need, so in most cases it is fine to just have a single parent
transaction. There's a bit of code involved in checking the current
transaction status, so this wraps it in a method on the connection
model to save having to write it each time.
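The shape of the helper, sketched with a toy connection class (names and
bookkeeping here are hypothetical; the real method checks psycopg's
transaction status rather than a boolean flag):

```python
import asyncio
from contextlib import asynccontextmanager


class Connection:
    """Toy stand-in for the real connection model."""

    def __init__(self):
        self.in_transaction = False
        self.statements: list[str] = []

    @asynccontextmanager
    async def ensure_transaction(self):
        if self.in_transaction:
            # Already inside a transaction: reuse it rather than
            # creating a savepoint-based nested transaction.
            yield self
            return
        self.in_transaction = True
        self.statements.append('BEGIN')
        try:
            yield self
            self.statements.append('COMMIT')
        except Exception:
            self.statements.append('ROLLBACK')
            raise
        finally:
            self.in_transaction = False
```

A `layer/` function and the `table/` function it calls can then both
wrap their work in `ensure_transaction()`, and only one BEGIN/COMMIT
pair is ever issued.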

This PR migrates `family` and `family_participant` entities to
**PostgreSQL**. It also updates the related test classes through which
the `Family` layer is tested.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>
Ignore PLR2004 in tests, as we often need to assert that the length of a
list equals a certain number; in these cases there's no point defining a
constant to use. Also fixes a few other misc errors.

dancoates and others added 10 commits February 18, 2026 11:30

This will enable us to enforce that other linting checks pass while
working on the migration.

This probably isn't as performant as it could be as it is converting the
jsonb columns to text, only to then parse them again in duckdb. We could
probably get them as dicts and then create an arrow table from that, but
would need to manually define the schema for the table which would be a
bit tricky. There's a rough plan to change how these meta tables work in
the future, so we don't want to do much of a refactor for now, as long
as performance isn't too bad when we test in production.

All tests are skipped for now, but should work in theory. I've manually
tested the endpoints and they seem to work fine.

Migrate the participant and participant phenotype queries to use the
postgres connection.

In the process of this migration, an issue was uncovered in the test
`test_graphql_upsert_participant_with_phenotypes`. The project passed
through graphql as a string was not being used in the upsert. The
appropriate data access checks were made, but the value was then not
being used in the upsert_participants method itself.

This PR includes a fix where in `upsert_participants` in
`api/graphql/mutations/participant.py` the project id is fetched when
the project access is checked and that value is passed down all the way
through to the table where `create_participant` is called.

All the tests should be passing excluding `test_query_by_sample` since
the Sample tables have yet to be migrated.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>
Co-authored-by: Piyumi Rameshka <piyumi.amarasinghebaragamage@populationgenomics.org.au>

This PR migrates the audit log table-related SQL queries in `audit_log.py` and `base.py` to PostgreSQL.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>

Pretty straightforward migration. Most tests are skipped at the moment,
but they should in theory work when re-enabled as more things are
migrated.

Migrate the sample queries to postgres

Also adds in a fix to the `to_sql` function in the generic filter to
return `None` if the filters are empty.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>
Co-authored-by: Piyumi Rameshka <piyumi.amarasinghebaragamage@populationgenomics.org.au>

These postgres migration queries were written prior to the update to
Python 3.14. This uplifts them to use the template string feature we are
now relying on in 3.14.

This PR migrates the Enum table queries to PostgreSQL and adds a new test class to verify those queries.

This migrates sequencing group table methods to be compatible with
postgres, as well as converts the corresponding test code to use pytest.
Additionally, unused sequencing group table methods have been removed.

Not implemented:
- Table functions that are not covered by sequencing group tests have
not had tests written. This includes the following:
    - `get_sequencing_groups_by_analysis_ids`
    - `get_samples_create_date_from_sgs`
    - `get_all_sequencing_group_ids_by_sample_ids_by_type`
    - `get_type_numbers_for_project`
    - `recreate_sequencing_group_with_new_assays`
- Sequencing group tests functions that relate to Analyses have been
skipped
- Sequencing group tests that rely on the use of samples across multiple
projects have been skipped
    - This is only until migration of samples is complete

Adds the ability to set the name of the test project that the
`connection_with_project` fixture uses. This can be done using the
`@pytest.mark.project_name(<name>)` decorator.

hrrsheen and others added 16 commits February 24, 2026 15:25

Adds a check for None values to the `SequencingGroupAssayFilter` when
querying graphql for sequencing groups.

This PR migrates the analysis_runner queries.

This PR migrates the `cohort` table to PostgreSQL. Along with the query
changes, `test_cohort.py` and `test_cohort_status.py` files are also updated.

Additionally, `test_mutations.py` also covers parts of the cohort layer.
Since this test class depends on several other entities, it would be
more appropriate to update it once those related migrations are
completed.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>

Migrate the analysis queries to use postgres, and migrate all the tests
to pytest.

Discovered and fixed two minor bugs in `participant.py` and
`sequencing_group.py` also.

---------

Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>
Co-authored-by: Piyumi Rameshka <piyumi.amarasinghebaragamage@populationgenomics.org.au>

This PR migrates the project insights queries to use PostgreSQL syntax.
There aren't many test cases for this work, but the existing two test
cases invoke almost all the DB functions. In addition, all the new
queries were manually tested.

Migrates the web queries to postgres and migrates the web tests to pytest.
Additionally, this identified and fixed some small query bugs in `tables/assay.py`, `tables/participant.py`, and `tables/sample.py`.

This PR unskips the test cases in:

- test/test_external_id.py
- test/test_graphql.py
- test/test_pedigree.py
- test/test_search.py
- test/test_update_participant_family.py

The test cases in these files were skipped until all the dependent
entity queries were migrated to PostgreSQL syntax. While unskipping
them, some test cases were failing and required minor fixes.

**There are still some test cases skipped until _querying JSON keys_ and
_case-insensitive string matching_ work.**

In addition, the test cases in the test/test_cohort_status.py file are
updated to reuse the `connection/db` created in the test class `set_up`
fixture, instead of creating a new test DB both in the set_up and again
in the test case.

Create a filter for json columns in the new land of postgres

There are some notes and a new function called `infer_type_from_filter`.
I'm open to alternatives for inferring the generic `T` type but for
right now we pull it from the type of values in the filters.

There was also a [discussion on
slack](https://centrepopgen.slack.com/archives/C0ADQ18TAES/p1771550184864699)
about whether the exclusion filters `neq` and `nin` should include rows
that either do not have the meta key being filtered on, or have an
empty meta entirely.

Originally, ALL rows with the key missing would be excluded if you
filter on the meta columns. This PR modifies that behaviour and will
include those rows on any `neq` or `nin` filter unless the
`isnull=False` filter is also specified.

The tests have been modified accordingly to test these new behaviours
and the `neq` operation has been changed from `!=` to `IS DISTINCT FROM`
to enable null values to pass through.
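The switch matters because `!=` follows SQL's three-valued logic:
comparing against NULL yields NULL, so rows with a missing key silently
drop out of the result. A sketch of the rendered clause (the helper name
here is hypothetical):

```python
def neq_clause(column: str, placeholder: str) -> str:
    """Render a negated equality that treats NULL as an ordinary value.

    `a != b` evaluates to NULL (and the row is filtered out) when either
    side is NULL; `a IS DISTINCT FROM b` evaluates to TRUE instead, so
    rows missing the meta key are kept by `neq`/`nin` filters.
    """
    return f'{column} IS DISTINCT FROM {placeholder}'
```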

In order to test the functionality as we currently use it, the tests
that cover the new logic are added to `test_sequencing_groups.py` since
this filtering is predominantly used to filter sgs based off of assay
meta data.

---------

Co-authored-by: hrrsheen <harrison.sheen@populationgenomics.org.au>
Co-authored-by: Dan Coates <dan.coates@populationgenomics.org.au>

Reorders the migration files so that rolling back the default data
migration doesn't violate foreign key constraints.

This PR migrates the `comment` query-related test cases.

Fixes `sequencing_group` tests that were previously skipped due to
missing functionality. Also replaces a fixture that relied on manually
inserting into the database.

Migrates the analysis output file queries to be compatible with postgres.
Migrates the corresponding tests to pytest.

---------

Co-authored-by: Yash <nevoodoo@users.noreply.github.com>

As described in the ticket for SET-1003, Mariadb was automatically doing
very flexible, case-insensitive string comparisons.

This update to the GenericFilter makes all string comparisons use
`ILIKE` or `CASEFOLD(value)` to replicate that pre-existing
functionality after the move to Postgres.

The tests have been updated accordingly.

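A sketch of what the rendering might look like (the helper name is
illustrative, not the actual GenericFilter code; `CASEFOLD()` is
available from postgres 18, which the Dockerfile uses):

```python
def string_compare_clause(column: str, placeholder: str, pattern: bool = False) -> str:
    """Render a case-insensitive string comparison.

    Pattern matches use ILIKE; exact matches casefold both sides,
    mirroring mariadb's default case-insensitive collation behaviour.
    """
    if pattern:
        return f'{column} ILIKE {placeholder}'
    return f'CASEFOLD({column}) = CASEFOLD({placeholder})'
```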
…r call (#1238)

The get_connection() function returned by dependable_get_connection_getter
is called repeatedly for related graphql resolvers. Caching the project maps
in dependable_get_connection_getter avoids repeated calls to get the same
information from refresh_projects() and speeds up this hot path.

But don't cache them for other refresh_projects() call sites for which
it can be important to get a fresh list of projects.
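The shape of the fix, sketched as a once-per-getter memoiser (names are
hypothetical; the real cache lives inside dependable_get_connection_getter):

```python
def once(fn):
    """Call fn at most once; later calls return the cached result.

    Resolvers served by the same getter instance then share one set of
    project maps instead of re-fetching them per resolver.
    """
    sentinel = object()
    cached = sentinel

    def wrapper():
        nonlocal cached
        if cached is sentinel:
            cached = fn()
        return cached

    return wrapper
```

Because the wrapper is created per request, a fresh request still sees a
fresh project list, while other refresh_projects() call sites bypass the
cache entirely.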

Migrate billing-related test cases from unittest to pytest.
Remove `unittest.mock.MagicMock` usages and introduce dummy mock classes
(in `test/testbqbase.py`) and pytest `monkeypatch`.