Feat/theodw 2795 prevent spark datetime failures in curated mipins 23 02 26 #2380

Closed
KranthiRayipudi wants to merge 17 commits into main from Feat/THEODW-2795-prevent-spark-datetime-failures-in-curated-mipins_23_02_26

@KranthiRayipudi (Collaborator) commented Feb 24, 2026

https://pins-ds.atlassian.net/browse/THEODW-2795

I have updated multiple notebooks to ensure date validation is applied consistently across the curated tables.

For notebooks with only a few date fields, I added direct WHERE filters to exclude invalid or placeholder dates (before 1900-01-01) while retaining NULLs:
• appeal_event_curated_mipins
Added filters for: eventStartDateTime, eventEndDateTime, NotificationOfSiteVisit, IngestionDate, ValidTo. Also restricted rows to ODTSourceSystem = 'ODT'.
• appeal_event_estimate_curated_mipins
Added filters for: IngestionDate, ValidTO.
• Entraid_curated_mipins
Added filters for: ingestionDate, validTo.
• appeal_service_user_curated_mipins
Added filters for: ingestionDate, validTo.
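
The reason the direct filters need an explicit NULL check can be shown with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption made purely so the example runs anywhere), and the table name `events` and the sample rows are hypothetical. A bare comparison against the cutoff evaluates to NULL for a NULL date, so the row is dropped; adding `IS NULL OR` keeps it.

```python
import sqlite3

# In-memory SQLite database standing in for a curated table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, eventStartDateTime TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, None),                    # NULL date: should be retained
        (2, "1899-12-30 00:00:00"),   # placeholder date before the cutoff
        (3, "2026-02-24 09:30:00"),   # ordinary valid date
    ],
)

CUTOFF = "1900-01-01"  # cutoff named in the PR description


def kept_ids(where: str) -> list[int]:
    """Return the ids of rows that survive the given WHERE clause."""
    rows = conn.execute(f"SELECT id FROM events WHERE {where} ORDER BY id")
    return [row[0] for row in rows]


# NULL-tolerant filter: keeps the NULL row and the valid row.
with_nulls = kept_ids(
    f"eventStartDateTime IS NULL OR eventStartDateTime >= '{CUTOFF}'"
)

# Bare comparison: NULL >= '1900-01-01' is NULL, so the NULL row is dropped too.
without_nulls = kept_ids(f"eventStartDateTime >= '{CUTOFF}'")

print(with_nulls)     # [1, 3]
print(without_nulls)  # [3]
```

SQLite compares the ISO-formatted strings lexically here, whereas the notebooks operate on Spark columns, so this illustrates only the NULL-handling logic, not Spark's date parsing.
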
For notebooks with many date columns stored as strings (e.g., appeal_s78_curated_mipins and appeals_has_curated_mipins), I implemented a generic scripted approach:
• Automatically detect all columns whose name contains the word “date”
• Apply a standard filter to each: (column IS NULL OR to_timestamp(column) >= '1900-01-01')
• This avoids manual maintenance and handles string-formatted dates safely.
This approach ensures consistent data quality across all curated layers.
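
The generic approach could be sketched roughly as below. This is not the notebook code from the PR: the helper names `date_columns` and `build_date_filter`, the case-insensitive name match, the backtick quoting, and the sample column list are all assumptions for illustration. The sketch only builds the WHERE clause text, one NULL-tolerant predicate per detected column, joined with AND.

```python
from typing import Iterable

MIN_VALID_DATE = "1900-01-01"  # cutoff named in the PR description


def date_columns(columns: Iterable[str]) -> list[str]:
    """Return columns whose name contains 'date' (case-insensitive: an assumption)."""
    return [c for c in columns if "date" in c.lower()]


def build_date_filter(columns: Iterable[str], min_date: str = MIN_VALID_DATE) -> str:
    """Build a WHERE clause with one NULL-tolerant predicate per date column."""
    predicates = [
        f"(`{c}` IS NULL OR to_timestamp(`{c}`) >= '{min_date}')"
        for c in date_columns(columns)
    ]
    # With no date columns, fall back to a predicate that keeps every row.
    return " AND ".join(predicates) if predicates else "1 = 1"


# Hypothetical column list; in a notebook this might come from df.columns.
example_columns = ["caseReference", "caseCreatedDate", "validTo"]
where_clause = build_date_filter(example_columns)
print(where_clause)
```

Note that a name-based match like this picks up columns such as IngestionDate and eventStartDateTime but not validTo, so columns without "date" in their name would still need an explicit filter.
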
Execution & Validation
• I executed all updated notebooks end-to-end.
• All notebooks ran successfully without errors.
• I found little test data with dates before 1900, but the logic is now in place to guard against such values in future.

**PR Template**

Note: Run the correct ADO pipeline for this PR - check the list here:
ODW Repositories

  1. JIRA Ticket Reference :

    [ Enter JIRA ticket number and Title here]

  2. Summary of the work :

    [ Enter Summary here]

  3. New Source-to-Raw Datasets

    • New source data has been added
      • A trigger has been attached at the appropriate frequency
  4. New Tables in Standardised Layer

    • New standardised tables have been created
      • orchestration.json is updated and tested in Dev, and PR is open or merged to main
      • Schema exists in odw-config/standardised-table-definitions or is about to be PRd
  5. New Tables in Harmonised or Curated Layers

    • New harmonised or curated tables have been created
      • Script is configured in the pipeline pln_post_deployments
      • Schema exists in odw-config/harmonised-table-definitions or curated-table-definitions or is about to be PRd
  6. Schema or Column Changes
    (Only new columns or columns with changed data types are in scope)

    • Changes to table structure or columns
      • py_change_table is set to run in pln_post_deployments
      • A script has been created to backfill or populate new column(s) in Test and Prod
        • Avoid dropping and recreating tables unless strictly necessary
  7. Script Execution in Build

    • Scripts have run in isolation in Build
      • Script has been added to pln_post_deployments
      • Script is now part of a scheduled pipeline with correct triggers
    • No scripts have run or no action required in Test/Prod
  8. Table Creation and Schema Validation

    • All required tables have been created
    • Schema has been validated against the requirements
  9. Deployment and Schema Change Documentation

    • Deployment steps and rollback procedures are documented
    • Schema change handling is outlined and tested
  10. Archiving Process Review

    • Automatic archiving logic has been reviewed
    • Archiving schedules and retention policies are validated
