Feat/theodw 2863 nsip document entity refactor#2413

Open
KalyaniNik wants to merge 14 commits into main from feat/THEODW-2863-nsip-document-entity-refactor

Conversation

@KalyaniNik
Contributor

Jira : https://pins-ds.atlassian.net/browse/THEODW-2863

Changes summary:
Refactored the harmonised and curated notebooks into Python classes for the nsip-document entity.

Note: unit and integration tests are yet to be done.

PR Template

Note: Run the correct ADO pipeline for this PR - check the list here:
ODW Repositories

  1. JIRA Ticket Reference:

    [Enter JIRA ticket number and Title here]

  2. Summary of the work:

    [Enter Summary here]

  3. New Source-to-Raw Datasets

    • New source data has been added
      • A trigger has been attached at the appropriate frequency
  4. New Tables in Standardised Layer

    • New standardised tables have been created
      • orchestration.json is updated and tested in Dev, and PR is open or merged to main
      • Schema exists in odw-config/standardised-table-definitions or is about to be PRd
  5. New Tables in Harmonised or Curated Layers

    • New harmonised or curated tables have been created
      • Script is configured in the pipeline pln_post_deployments
      • Schema exists in odw-config/harmonised-table-definitions or curated-table-definitions or is about to be PRd
  6. Schema or Column Changes
    (Only new columns or columns with changed data types are in scope)

    • Changes to table structure or columns
      • py_change_table is set to run in pln_post_deployments
      • A script has been created to backfill or populate new column(s) in Test and Prod
        • Avoid dropping and recreating tables unless strictly necessary
  7. Script Execution in Build

    • Scripts have run in isolation in Build
      • Script has been added to pln_post_deployments
      • Script is now part of a scheduled pipeline with correct triggers
    • No scripts have run or no action required in Test/Prod
  8. Table Creation and Schema Validation

    • All required tables have been created
    • Schema has been validated against the requirements
  9. Deployment and Schema Change Documentation

    • Deployment steps and rollback procedures are documented
    • Schema change handling is outlined and tested
  10. Archiving Process Review

    • Automatic archiving logic has been reviewed
    • Archiving schedules and retention policies are validated

Comment on lines +40 to +44
LoggingUtil().log_info(f"Loading harmonised NSIP Document data from {self.HARMONISED_TABLE}")
harmonised_docs = self.spark.sql(f"SELECT * FROM {self.HARMONISED_TABLE}")

LoggingUtil().log_info(f"Loading curated NSIP Project data from {self.CURATED_PROJECT_TABLE}")
curated_projects = self.spark.sql(f"SELECT * FROM {self.CURATED_PROJECT_TABLE}")
Collaborator

Do you need to load the data here? The nsip_document notebook doesn't seem to do this?

harmonised_docs.createOrReplaceTempView("harmonised_nsip_document")
curated_projects.createOrReplaceTempView("curated_nsip_project")

df = self.spark.sql("""
Collaborator

Could you select the data during load_data, but join here instead? Just to keep the reading/writing separate
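A minimal sketch of that separation, keeping all reads in `load_data` and all transforms in `process`. The class, method, and join-key names here are assumptions based on the comments, not the actual implementation:

```python
class NsipDocumentCurated:
    """Sketch: reads live in load_data, transforms live in process."""

    def __init__(self, spark, harmonised_table, curated_project_table):
        self.spark = spark
        self.harmonised_table = harmonised_table
        self.curated_project_table = curated_project_table

    def load_data(self):
        # All reading happens here; the IsActive filter is pushed
        # into the SQL so only active rows are loaded
        docs = self.spark.sql(
            f"SELECT * FROM {self.harmonised_table} WHERE IsActive = 'Y'"
        )
        projects = self.spark.sql(f"SELECT * FROM {self.curated_project_table}")
        return docs, projects

    def process(self, docs, projects):
        # All transformation happens here; joining via the DataFrame API
        # avoids the temp views entirely ("caseReference" is a guessed key)
        return docs.join(projects, on="caseReference", how="left")
```

With this shape, `process` can be unit-tested by passing in small DataFrames without touching the tables at all.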

Comment on lines +63 to +64
harmonised_docs.createOrReplaceTempView("harmonised_nsip_document")
curated_projects.createOrReplaceTempView("curated_nsip_project")
Collaborator

If you simplify the SELECT queries (and move the logic to load_data), then this could be removed, I think.

]

# Columns used for the final deduplication step
_DEDUP_COLUMNS = [
Collaborator

Typo

Contributor Author

Sorry, I'm unable to find the typo.

Collaborator

Sorry, I think it should be DEDUPE, although this isn't a big issue.

Comment on lines +210 to +216
FROM
{self.HORIZON_TABLE} AS Doc
LEFT JOIN {self.AIE_EXTRACTS_TABLE} AS Aie
ON Doc.dataid = Aie.DocumentId
AND Doc.version = Aie.version
AND Doc.dataSize = Aie.size
WHERE
Collaborator

I think the joins should be moved to process instead

Contributor Author

Join moved to process

Collaborator

@HarrisonBoyleThomas left a comment

You need to add some unit/integration tests
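One low-dependency way to start on the unit tests is to factor pure logic, such as the final deduplication step over `_DEDUP_COLUMNS`, out of the Spark code and test it directly. A sketch, where `dedupe_records` and the column names are hypothetical:

```python
def dedupe_records(rows, key_columns):
    """Keep the first occurrence of each key tuple; mirrors the final
    deduplication step (helper and column names are illustrative)."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def test_dedupe_keeps_first_occurrence():
    rows = [
        {"documentId": "1", "version": 1},
        {"documentId": "1", "version": 1},  # exact duplicate, dropped
        {"documentId": "1", "version": 2},  # new version, kept
    ]
    assert dedupe_records(rows, ["documentId", "version"]) == [rows[0], rows[2]]
```

Integration tests against a local SparkSession can then cover the SQL and join behaviour separately.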

curated_projects: DataFrame = self.load_parameter("curated_projects", source_data)

# Filter to active records
docs = harmonised_docs.filter(F.col("IsActive") == "Y")
Collaborator

I would move this filter into the SQL query in load_data to boost performance; otherwise it'll load all the data and filter afterwards.
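Concretely, the difference looks like this (table name is illustrative):

```python
# Filtering after a full load reads every row first:
#   docs = spark.sql(f"SELECT * FROM {table}")
#   docs = docs.filter(F.col("IsActive") == "Y")
#
# Pushing the predicate into the load query lets Spark prune rows
# (and partitions, where applicable) at read time instead:
def build_load_query(table: str) -> str:
    return f"SELECT * FROM {table} WHERE IsActive = 'Y'"
```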


# Filter to active records and select curated columns
df = (
harmonised_subscriptions.filter(F.col("IsActive") == "Y")
Collaborator

Same here, I'd move the filter to load_data.


3 participants