Feat/theodw 2795 prevent spark datetime failures in curated mipins 23 02 26 #2380
Closed
KranthiRayipudi wants to merge 17 commits into main
https://pins-ds.atlassian.net/browse/THEODW-2795
For notebooks with only a few date fields, I added direct WHERE filters to exclude invalid or placeholder dates (before 1900-01-01) while retaining NULLs:
• appeal_event_curated_mipins
Added filters for: eventStartDateTime, eventEndDateTime, NotificationOfSiteVisit, IngestionDate, ValidTo. Also restricted rows to ODTSourceSystem = 'ODT'.
• appeal_event_estimate_curated_mipins
Added filters for: IngestionDate, ValidTO.
• Entraid_curated_mipins
Added filters for: ingestionDate, validTo.
• appeal_service_user_curated_mipins
Added filters for: ingestionDate, validTo.
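The direct filters above all follow one pattern, which can be sketched as a small helper that builds the WHERE predicate from a list of column names. This is only an illustration: the helper name `null_safe_min_date_filter` is hypothetical and is not taken from the notebooks, and it assumes the listed columns are already date or timestamp typed.

```python
MIN_VALID_DATE = "1900-01-01"  # cutoff stated in this PR


def null_safe_min_date_filter(columns, min_date=MIN_VALID_DATE):
    """Build a SQL predicate that keeps NULLs and drops values before min_date.

    Hypothetical helper for illustration; not the notebooks' actual code.
    """
    clauses = [f"({c} IS NULL OR {c} >= '{min_date}')" for c in columns]
    # With no columns there is nothing to restrict, so fall back to TRUE.
    return " AND ".join(clauses) if clauses else "TRUE"


# Columns named in this PR for appeal_event_estimate_curated_mipins.
predicate = null_safe_min_date_filter(["IngestionDate", "ValidTO"])
print(predicate)
```

The resulting string could be placed after WHERE in the notebook's SQL, or passed to `DataFrame.filter`, which accepts a SQL expression string.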
For notebooks with many date columns stored as strings (e.g., appeal_s78_curated_mipins and appeals_has_curated_mipins), I implemented a generic scripted approach:
• Automatically detect all columns whose names contain the word “date”
• Apply a standard filter to each detected column: (column IS NULL OR to_timestamp(column) >= '1900-01-01')
• This avoids manual maintenance and handles string-formatted dates safely.
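A minimal sketch of the generic approach, assuming the column names are available as a plain list (for example from `df.columns`) and that the match on “date” is case-insensitive. The function names and the example column names are hypothetical; only the filter expression itself comes from this PR.

```python
def date_column_names(columns):
    """Return the column names containing 'date' (case-insensitive: an assumption)."""
    return [c for c in columns if "date" in c.lower()]


def string_date_filter(columns, min_date="1900-01-01"):
    """Build the standard filter for every detected string-typed date column."""
    clauses = [
        f"({c} IS NULL OR to_timestamp({c}) >= '{min_date}')"
        for c in date_column_names(columns)
    ]
    # No detected columns means no restriction.
    return " AND ".join(clauses) if clauses else "TRUE"


# Hypothetical column names, for illustration only.
example_columns = ["caseReference", "caseCreatedDate", "validTo"]
print(string_date_filter(example_columns))
```

A name-based match like this would not pick up a column such as validTo, since its name does not contain “date”, so columns like that would still need an explicit filter.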
This applies the same date rule consistently across the curated notebooks listed above.
Execution & Validation
• I executed all updated notebooks end-to-end.
• All notebooks ran successfully without errors.
• Little of the available test data contains dates before 1900, so the filters were exercised only lightly. The logic is now in place to guard against such values in future loads.
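Because real pre-1900 test data is scarce, the intended rule can be checked against a few synthetic values. The sketch below mirrors the filter in plain Python rather than running Spark, so it is only an approximation. It assumes ISO-formatted strings, and it assumes that an unparseable string is dropped, as it would be if `to_timestamp` returned NULL (Spark's behaviour with ANSI mode off). The function name and sample values are hypothetical.

```python
from datetime import datetime

CUTOFF = datetime(1900, 1, 1)  # cutoff stated in this PR


def keeps(value):
    """Approximate (column IS NULL OR to_timestamp(column) >= '1900-01-01')."""
    if value is None:
        return True  # NULLs are retained
    try:
        return datetime.fromisoformat(value) >= CUTOFF
    except ValueError:
        # Assumption: an unparseable string yields NULL in Spark and the row is dropped.
        return False


# Synthetic sample values, for illustration only.
samples = [
    None,
    "0001-01-01",
    "1899-12-31 23:59:59",
    "1900-01-01",
    "2026-02-23 10:15:00",
    "not a date",
]
kept = [s for s in samples if keeps(s)]
print(kept)
```

Loading the same synthetic values into a scratch table and running the updated notebooks against it would confirm whether Spark behaves the same way.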
**PR Template**
Note: Run the correct ADO pipeline for this PR - check the list here:
ODW Repositories
JIRA Ticket Reference :
[ Enter JIRA ticket number and Title here]
Summary of the work :
[ Enter Summary here]
New Source-to-Raw Datasets
New Tables in Standardised Layer
New Tables in Harmonised or Curated Layers
Schema or Column Changes
(Only new columns or columns with changed data types are in scope)
Script Execution in Build
Table Creation and Schema Validation
Deployment and Schema Change Documentation
Archiving Process Review