Description
Backing up Redis and having a recovery procedure is an important production-readiness step. The recovery procedure will be tricky: open jobs restored from the backup shouldn't be naively picked back up by the queue, since they could be double-transferred.
Any open jobs post-recovery should be stopped and moved to a new special work stage (or queue) for "recovering". There, each job needs to be checked as follows:
- call `assert_patient_exists_and_can_be_transferred`
  - if false, the job can be marked completed; if true, move to the second step
  - may need more granular checks here. If a patient record still exists but cannot be transferred, this job might have succeeded. It may be worth having a mechanism to check the patient's link record for the relevant transfer completion info
- in this case (the check returned true), the job may have never started, may have tried and been rejected, or may have been interrupted mid-process by the incident that caused the recovery. If it was interrupted, the record might have been written on the inbound end without the job having time to write this stage's completion back to the queue (or it did, but after the latest backup), and the subsequent `setting_patient_post_transfer_metadata` stage may not have finished
  - the best automated way to continue would be to add some ability for the inbound service to answer questions about past transfer requests it has processed (assuming the incident doesn't take out the receiving PT's data too 😅)
  - failing the above, these cases may need to be parked and resolved outside the system. This might mean a separate automated system that uses a second source (like event logs, if those are complete and timely in this instance) to seek confirmation, or it might mean an out-of-band process for confirming the existence of transferred records on the receiver's end (this may or may not be something that can be automated, depending on non-technical differences between individual PTs)
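
The checks above could be sketched roughly as the triage function below. This is only an illustration under assumptions. `Job`, `Outcome`, `triage_recovering_job`, and the `inbound_received` callback are hypothetical names, and `can_transfer` stands in for `assert_patient_exists_and_can_be_transferred`, whose real signature and failure behaviour are not described here. The inbound query does not exist yet, so it is modelled as returning `None` when no answer is available.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    COMPLETED = "completed"              # nothing left to transfer
    RESUME = "resume"                    # receiver confirms it never got the record
    FINISH_METADATA = "finish_metadata"  # received; post-transfer metadata remains
    PARKED = "parked"                    # no confirmation; resolve out of band


@dataclass
class Job:
    # Hypothetical job shape; the real queue payload may differ.
    job_id: str
    patient_id: str


def triage_recovering_job(
    job: Job,
    can_transfer: Callable[[str], bool],
    inbound_received: Callable[[Job], Optional[bool]],
) -> Outcome:
    """Decide what to do with one job sitting in the "recovering" stage."""
    # Step 1: stand-in for assert_patient_exists_and_can_be_transferred.
    # A more granular check (e.g. the patient's link record) could go here.
    if not can_transfer(job.patient_id):
        return Outcome.COMPLETED

    # Step 2: ask the receiving side whether it already processed this
    # transfer. None means the question could not be answered.
    received = inbound_received(job)
    if received is None:
        return Outcome.PARKED
    if received:
        return Outcome.FINISH_METADATA
    return Outcome.RESUME
```

Keeping the two checks as injected callables means the same triage logic can be reused whether the confirmation comes from the inbound service, event logs, or a manual process that fills in answers later.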
Jobs that were started (completed or not) before the incident but after the latest backup also need to be accounted for. They will need to be recreated directly in the "recovering" stage, where the above steps will resolve them. If system logs are intact, the jobs in this category can be recovered from them. Otherwise, upstream processes/systems can be assumed to have their own records/logs, though using data from out-of-scope sources is itself out of scope.
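
As a rough sketch of the log-based case, the function below selects jobs that started after the latest backup and up to the incident, and that are absent from the restored queue. The `JobStartLogEntry` shape and the `jobs_to_recreate` name are assumptions for illustration. The actual log format is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass(frozen=True)
class JobStartLogEntry:
    # Hypothetical parsed log record for a job start event.
    job_id: str
    patient_id: str
    started_at: datetime


def jobs_to_recreate(
    entries: Iterable[JobStartLogEntry],
    backup_time: datetime,
    incident_time: datetime,
    restored_job_ids: Set[str],
) -> List[JobStartLogEntry]:
    """Return log entries for jobs missing from the restored backup."""
    seen = set(restored_job_ids)
    missing = []
    for entry in sorted(entries, key=lambda e: e.started_at):
        # Only jobs started after the backup and no later than the incident.
        in_window = backup_time < entry.started_at <= incident_time
        if in_window and entry.job_id not in seen:
            seen.add(entry.job_id)  # ignore duplicate log lines for one job
            missing.append(entry)
    return missing
```

Each returned entry would then be recreated directly in the "recovering" stage rather than in the normal queue.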