Conversation
| "database": "table_schema", | ||
| "table": "table_name", | ||
| "column": "column_name", | ||
| "type": "data_type", |
I renamed the columns in the source SQL query, so this rename is no longer needed.
```diff
 self.classification_table.alias("target").merge(
     staged_updates_df.alias("source"),
-    "target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.tag_name = source.tag_name AND target.current = true",
+    "target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.data_type = source.data_type AND target.tag_name = source.tag_name AND target.current = true",
```
Why do we need to match on data_type?
```python
@dataclass
class TaggedColumn:
    name: str
    data_type: str
```
We also need the `full_data_type`, which contains the full definition of composed (nested) columns.
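One way the dataclass could look with the extra field (a sketch only; `full_data_type` mirrors the Unity Catalog `information_schema.columns` column of the same name, but the exact shape is up to the author):

```python
from dataclasses import dataclass


@dataclass
class TaggedColumn:
    name: str
    data_type: str       # base type, e.g. "STRUCT"
    full_data_type: str  # full definition, e.g. "struct<id:string,scores:array<int>>"
```

With this, a nested column keeps both its base type and its complete definition, e.g. `TaggedColumn("payload", "STRUCT", "struct<id:string>")`.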
```python
temp_sql = msql
for tagged_col in tagged_cols:
    temp_sql = temp_sql.replace(f"[{tagged_col.tag}]", tagged_col.name)
# TODO: Can we avoid "replacing strings" for the different types in the future? This is due to the generation of MSQL. Maybe we should rather generate SQL directly from the search method...
```
Yes, we should do that instead.
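A rough sketch of what generating the SQL directly could look like, instead of substituting `[tag]` placeholders into pre-generated MSQL. Everything here (`build_search_sql`, the predicate shape) is hypothetical, not the project's API:

```python
# Hypothetical sketch: build the search SQL straight from the tagged columns,
# so no placeholder replacement pass over an MSQL template is needed.
def build_search_sql(table: str, tagged_col_names: list, search_term: str) -> str:
    # One equality predicate per tagged column, OR-ed together.
    predicates = " OR ".join(f"{name} = '{search_term}'" for name in tagged_col_names)
    return f"SELECT * FROM {table} WHERE {predicates}"
```

For example, `build_search_sql("catalog.schema.tb_1", ["mac", "description"], "x")` yields `SELECT * FROM catalog.schema.tb_1 WHERE mac = 'x' OR description = 'x'`.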
```python
col_name_splitted = col_name.split(".")
return ".".join(["`" + col + "`" for col in col_name_splitted])


def recursive_flatten_complex_type(self, col_name, schema, column_list):
```
Oh, wow! This was more complex than I expected.
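For reference, the recursion idea in miniature — a pure-Python sketch with no Spark dependency, where nested dicts stand in for `StructType` and plain strings for leaf types. This is an illustration of the approach, not the project's implementation:

```python
# Pure-Python model of the recursion: descend into nested structs (dicts),
# record a dotted path at every leaf, and stop descending at anything that
# is not a struct (string/array/map leaves).
def flatten_struct(col_name, schema, out):
    for field_name, dtype in schema.items():
        path = f"{col_name}.{field_name}"
        if isinstance(dtype, dict):   # nested struct: recurse deeper
            flatten_struct(path, dtype, out)
        else:                         # leaf: record the full dotted path
            out.append(path)
    return out
```

For example, `flatten_struct("c", {"a": "string", "b": {"x": "int"}}, [])` produces `["c.a", "c.b.x"]`.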
| {"col_name": self.backtick_col_name(col_name + "." + field.name), "type": "string"} | ||
| ) | ||
| elif type(field.dataType) in self.COMPLEX_TYPES: | ||
| column_list = self.recursive_flatten_complex_type(col_name + "." + field.name, field, column_list) |
I think you should append to `column_list` instead of reassigning it; otherwise you overwrite the string-type entries appended earlier.
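A minimal illustration of the concern (a hypothetical simplification, not the real method): if the recursive call returns a fresh list rather than mutating the one passed in, the caller must `extend` the accumulator, because rebinding it drops entries collected earlier:

```python
def expand(prefix, fields):
    # Builds and returns a fresh list; the caller is responsible for merging it.
    return [f"{prefix}.{f}" for f in fields]

column_list = ["top.plain_string"]                    # entry appended earlier
column_list.extend(expand("top.nested", ["a", "b"]))  # earlier entry survives
# Rebinding instead (column_list = expand(...)) would have discarded
# "top.plain_string" along with everything else accumulated so far.
```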
```csv
,default,tb_1,mac,STRING,
,default,tb_1,description,STRING,
,default,tb_2,active,BOOLEAN,
,default,tb_2,categories,"map<string,string>",
```
Here we need to add a "full_data_type" column to mirror the Unity Catalog (UC) `information_schema` structure.
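The fixture rows might then look something like this (a sketch only — the exact column order, and whether `data_type` holds the base type while `full_data_type` holds the full definition, should follow how the table mirrors `information_schema.columns`):

```csv
,default,tb_2,active,BOOLEAN,boolean,
,default,tb_2,categories,MAP,"map<string,string>",
```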
A first version supporting complex types: structs to any nesting depth, and arrays only up to the first level. Structs are scanned recursively, but as soon as an array (or map) is encountered we stop descending into that column.
Some code refactoring is still needed, but it would be great if you could review the approach.
Steps remaining: