Feat/theodw 2863 nsip document entity refactor #2413
KalyaniNik wants to merge 14 commits into main from
Conversation
LoggingUtil().log_info(f"Loading harmonised NSIP Document data from {self.HARMONISED_TABLE}")
harmonised_docs = self.spark.sql(f"SELECT * FROM {self.HARMONISED_TABLE}")

LoggingUtil().log_info(f"Loading curated NSIP Project data from {self.CURATED_PROJECT_TABLE}")
curated_projects = self.spark.sql(f"SELECT * FROM {self.CURATED_PROJECT_TABLE}")
Do you need to load the data here? The nsip_document notebook doesn't seem to do this.
harmonised_docs.createOrReplaceTempView("harmonised_nsip_document")
curated_projects.createOrReplaceTempView("curated_nsip_project")

df = self.spark.sql("""
Could you select the data during load_data, but join here instead? Just to keep the reading and writing separate.
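The separation the reviewer is asking for could look something like the sketch below. The class, table names, and join key are assumptions for illustration, not the PR's actual code: reads stay in load_data, and process joins with the DataFrame API so the temp views are no longer needed.

```python
class NsipDocumentCurated:
    """Sketch of the suggested split (names assumed, not from the PR):
    load_data does reads only, process does transforms only."""

    HARMONISED_TABLE = "odw_harmonised_db.nsip_document"    # assumed table name
    CURATED_PROJECT_TABLE = "odw_curated_db.nsip_project"   # assumed table name

    def __init__(self, spark):
        self.spark = spark

    def load_data(self):
        # Reading only: pull both sources (any column selection or
        # filtering could live in these queries), no joins here
        docs = self.spark.sql(f"SELECT * FROM {self.HARMONISED_TABLE}")
        projects = self.spark.sql(f"SELECT * FROM {self.CURATED_PROJECT_TABLE}")
        return docs, projects

    def process(self, docs, projects):
        # Transformation only: join via the DataFrame API instead of
        # temp views + spark.sql, so createOrReplaceTempView goes away
        return docs.join(projects, on="caseReference", how="left")  # join key assumed
```

Keeping reads and transforms in separate methods also makes the process step unit-testable with in-memory DataFrames, without touching the catalog.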
harmonised_docs.createOrReplaceTempView("harmonised_nsip_document")
curated_projects.createOrReplaceTempView("curated_nsip_project")
If you simplify the SELECT queries (and move the logic to load_data), then this could be removed, I think.
]

# Columns used for the final deduplication step
_DEDUP_COLUMNS = [
Sorry, unable to find the typo.
Sorry, I think it should be DEDUPE, although this isn't a big issue.
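For context, a column list like this typically feeds a dropDuplicates call. A plain-Python sketch of what that does on those keys (column names here are hypothetical, not the PR's actual list); note that unlike this sketch, PySpark's dropDuplicates makes no guarantee about which duplicate row survives:

```python
# Hypothetical key columns, standing in for the PR's _DEDUPE_COLUMNS
_DEDUPE_COLUMNS = ["documentId", "version"]

def dedupe(rows, keys):
    """Keep one row per key tuple (here: the first seen), roughly what
    DataFrame.dropDuplicates(keys) does in PySpark."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"documentId": 1, "version": 1, "size": 10},
    {"documentId": 1, "version": 1, "size": 99},  # duplicate key, dropped
    {"documentId": 1, "version": 2, "size": 12},
]
deduped = dedupe(rows, _DEDUPE_COLUMNS)
```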
FROM
    {self.HORIZON_TABLE} AS Doc
LEFT JOIN {self.AIE_EXTRACTS_TABLE} AS Aie
    ON Doc.dataid = Aie.DocumentId
    AND Doc.version = Aie.version
    AND Doc.dataSize = Aie.size
WHERE
I think the joins should be moved to process instead
Join moved to process
HarrisonBoyleThomas
left a comment
You need to add some unit/integration tests
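A minimal shape such a unit test could take, using a mocked Spark session so no cluster is needed. The class here is a self-contained stand-in with an assumed table name, since the PR's real class isn't shown:

```python
import unittest
from unittest.mock import MagicMock

class FakeNsipDocumentCurated:
    """Stand-in with the same shape as the refactored class
    (name and table are assumptions), so this sketch runs alone."""
    HARMONISED_TABLE = "odw_harmonised_db.nsip_document"  # assumed table name

    def __init__(self, spark):
        self.spark = spark

    def load_data(self):
        return self.spark.sql(f"SELECT * FROM {self.HARMONISED_TABLE}")

class LoadDataTest(unittest.TestCase):
    def test_load_data_queries_harmonised_table(self):
        # A MagicMock stands in for the SparkSession, recording calls
        spark = MagicMock()
        entity = FakeNsipDocumentCurated(spark)
        entity.load_data()
        # Assert the query actually targets the harmonised table
        (query,), _ = spark.sql.call_args
        self.assertIn(FakeNsipDocumentCurated.HARMONISED_TABLE, query)
```

Integration tests against real tables would still be needed on top of this, but mock-based tests like the above can pin down the read/transform contract cheaply.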
curated_projects: DataFrame = self.load_parameter("curated_projects", source_data)

# Filter to active records
docs = harmonised_docs.filter(F.col("IsActive") == "Y")
I would move this into the SQL query in load_data to boost performance; otherwise it'll load all the data and filter afterwards.
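The reviewer's suggestion amounts to putting the IsActive predicate in the load query itself; a tiny sketch of what that query construction could look like (table name and helper are illustrative, not the PR's code):

```python
HARMONISED_TABLE = "odw_harmonised_db.nsip_document"  # assumed table name

def build_active_docs_query(table: str) -> str:
    # Filtering in the SQL itself, as the reviewer suggests, means only
    # active records come back from the read, rather than loading the
    # full table and then running .filter(F.col("IsActive") == "Y")
    return f"SELECT * FROM {table} WHERE IsActive = 'Y'"

query = build_active_docs_query(HARMONISED_TABLE)
```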
# Filter to active records and select curated columns
df = (
    harmonised_subscriptions.filter(F.col("IsActive") == "Y")
Same here, I'd move the filter to load_data.
Jira : https://pins-ds.atlassian.net/browse/THEODW-2863
Changes summary :
Refactored the harmonised and curated notebooks into Python classes for the nsip-document entity.
Note: unit and integration tests are yet to be done.
PR Template
Note: Run the correct ADO pipeline for this PR - check the list here:
ODW Repositories
JIRA Ticket Reference :
[ Enter JIRA ticket number and Title here]
Summary of the work :
[ Enter Summary here]
New Source-to-Raw Datasets
New Tables in Standardised Layer
New Tables in Harmonised or Curated Layers
Schema or Column Changes
(Only new columns or columns with changed data types are in scope)
Script Execution in Build
Table Creation and Schema Validation
Deployment and Schema Change Documentation
Archiving Process Review