Add three new temporal checks: plausibleAfterBirth, plausibleBeforeDeath, plausibleStartBeforeEnd#516
Add three new temporal checks: plausibleAfterBirth, plausibleBeforeDeath, plausibleStartBeforeEnd#516katy-sadowski merged 12 commits intodevelopfrom
plausibleAfterBirth, plausibleBeforeDeath, plausibleStartBeforeEnd#516Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #516 +/- ##
===========================================
+ Coverage 86.42% 86.48% +0.06%
===========================================
Files 16 16
Lines 884 888 +4
===========================================
+ Hits 764 768 +4
Misses 120 120 ☔ View full report in Codecov by Sentry. |
dimshitc
left a comment
There was a problem hiding this comment.
why do we keep plausibleTemporalAfter and plausibleDuringLife checks?
aren't they covered now by these new 3 checks?
Removing the 'replaced' checks might be a breaking change for some people (e.g. when comparing number of violating records over time). My thinking was to implement a 'soft' disable, making it possible to execute the old checks if desired. e.g. by removing the plausibleDuringLife and plausibleTemportalAfter checks from the |
R/executeDqChecks.R
Outdated
| if (length(checkNames) > 0) { | ||
| checkDescriptionsDf <- checkDescriptionsDf[checkDescriptionsDf$checkName %in% checkNames, ] | ||
| } else { | ||
| # No checkNames specified. Do not run plausibleDuringLife and plausibleTemportalAfter as these are superseded three new checks |
There was a problem hiding this comment.
@dimshitc This disables the old checks, if user did not specify them specifically. We do need to document this behaviour.
There was a problem hiding this comment.
This makes sense to me.
Should we set threshold in all these checks to 0? as these are obvious mistakes and shouldn't be in the data.
There was a problem hiding this comment.
I've been thinking this over. I think we should actually enable all of the checks (new & old) by default, and instruct users how to disable the ones they don't want in the documentation. Otherwise this is essentially a breaking change because users relying on the old behavior will need to take action to adapt to the change. We can also add a warning that raises when the 2 old checks are enabled to alert users that the new checks are preferred.
Does that sound OK to you guys?
There was a problem hiding this comment.
Yes, that makes even more sense. I added the following warnings if one of the old checks is run:
"DEPRECATION WARNING - The plausibleDuringLife check has been reimplemented with the plausibleBeforeDeath check."
DEPRECATION WARNING - The plausibleTemporalAfter check has been reimplemented with the plausibleAfterBirth and plausibleStartBeforeEnd checks.
There was a problem hiding this comment.
Regarding disabling the checks by the user; there is not really a nice solution as executeDqChecks only takes the names of checks to be run, not an argument with checks to exclude. So the user needs to list all checks except these two. Or does one of you see a more elegant solution?
There was a problem hiding this comment.
Warnings look great; thanks!
And I wouldn't call it elegant, but I did add these instructions to the docs earlier this year on how to exclude a list of checks:
DataQualityDashboard/vignettes/DataQualityDashboard.rmd
Lines 114 to 120 in 81230bd
I think for now we can point to this in the release notes. In the future I'd love to improve the workflow for enabling/disabling checks and editing thresholds (a la #253 ).
katy-sadowski
left a comment
There was a problem hiding this comment.
Thanks so much for this @MaximMoinat ! I left a couple suggestions.
R/executeDqChecks.R
Outdated
| if (length(checkNames) > 0) { | ||
| checkDescriptionsDf <- checkDescriptionsDf[checkDescriptionsDf$checkName %in% checkNames, ] | ||
| } else { | ||
| # No checkNames specified. Do not run plausibleDuringLife and plausibleTemportalAfter as these are superseded three new checks |
There was a problem hiding this comment.
I've been thinking this over. I think we should actually enable all of the checks (new & old) by default, and instruct users how to disable the ones they don't want in the documentation. Otherwise this is essentially a breaking change because users relying on the old behavior will need to take action to adapt to the change. We can also add a warning that raises when the 2 old checks are enabled to alert users that the new checks are preferred.
Does that sound OK to you guys?
| p.year_of_birth, '-', | ||
| COALESCE(p.month_of_birth, 1), '-', | ||
| COALESCE(p.day_of_birth, 1) |
There was a problem hiding this comment.
I think we will actually want to use the date format YYYYMMDD here to ensure we're compatible with all SQL dialects (see this thread here where I was discussing this with Martijn).
There was a problem hiding this comment.
Good catch, I actually came across a similar issues in sqlite.
Not a trivial change, but I came up with the following code, which is a bit complex but should be compatible with all Sql dialects:
CAST(CONCAT(
p.year_of_birth,
COALESCE(
RIGHT('00' + CAST(p.month_of_birth AS VARCHAR), 2),
'01'
),
COALESCE(
RIGHT('00' + CAST(p.day_of_birth AS VARCHAR), 2),
'01'
)
) AS DATE)There was a problem hiding this comment.
hah, crafty little solution! only question - do we need 00 or just 0 there? (won't actually impact the result)
There was a problem hiding this comment.
Well spotted, will change it
| AND c.COHORT_DEFINITION_ID = @cohortDefinitionId | ||
| } | ||
| JOIN @cdmDatabaseSchema.person p ON cdmTable.person_id = p.person_id | ||
| WHERE CAST(cdmTable.@cdmFieldName AS DATE) < COALESCE( |
There was a problem hiding this comment.
i think to be safe (here and in other queries) we should also have a condition that the field is non-null.
also what do you think about adding that condition to the denominator as well? i.e. among all records where we have a date, that date must be plausible.
There was a problem hiding this comment.
Agree, added same condition for the other two checks as well
katy-sadowski
left a comment
There was a problem hiding this comment.
OK I promise this is my last review 😅 I did a deep review and left a couple more comments on the SQL comments. I also noticed the following in the threshold files (I only checked 5.4 but assume they're all the same):
- EPISODE start_date and start_datetime are missing a threshold for plausibleStartBeforeEnd
- The plausibleStartBeforeEnd threshold for OBSERVATION_PERIOD is 1. Should this actually be 0, since this is a derived table and folks can control it in ETL (and also would prob cause some pretty weird stuff in analytics if the dates were flipped)?
- plausibleAfterBirth and plausibleBeforeDeath need to be activated for MEASUREMENT.measurement_datetime
|
Thanks @katy-sadowski for this thorough review, I agree with all your comments and adjusted the code accordingly. |
katy-sadowski
left a comment
There was a problem hiding this comment.
Looks great! Thanks so much @MaximMoinat! I will go ahead and merge this into develop.
As discussed in #454: these checks separate out the quality checks in plausibleDuringLife and plausibleTemportalAfter.
plausibleAfterBirthchecks for any events happening before birth (was implicitly checked in plausibleTemportalAfter)plausibleBeforeDeathchecks for any events happening after death (improving upon implementation of plausibleDuringLife)plausibleStartBeforeEndchecks for end dates that are before start date in the same table (specific implementation of plausibleTemportalAfter)TODO:
plausibleDuringLifeandplausibleTemportalAfterby default. These are redundant if the three new checks are run, but kept in for backwards compatibility.fixes #513 and fixes #445