chore(bqetl_artifact_deployment): [DENG-10379] Skip updating query schemas if schema.yaml exists #2306
Conversation
```diff
  generate_sql_cmd_template
+     "script/bqetl query initialize '*' --skip-existing --project-id=moz-fx-data-shared-prod --project-id=moz-fx-data-experiments --project-id=moz-fx-data-marketing-prod --project-id=moz-fx-data-bq-people && "
-     "script/bqetl query schema update '*' --use-cloud-function=false --ignore-dryrun-skip --project-id=moz-fx-data-shared-prod --project-id=moz-fx-data-experiments --project-id=moz-fx-data-marketing-prod --project-id=moz-fx-glam-prod --project-id=moz-fx-data-bq-people && "
+     "script/bqetl query schema update '*' --skip-existing --use-cloud-function=false --ignore-dryrun-skip --project-id=moz-fx-data-shared-prod --project-id=moz-fx-data-experiments --project-id=moz-fx-data-marketing-prod --project-id=moz-fx-glam-prod --project-id=moz-fx-data-bq-people && "
```
issue (blocking): I believe this will cause the subsequent `bqetl query schema deploy` command to fail in cases where the schema.yaml file only contains a subset of the fields currently in the table, which I know is currently the case for some ETLs that use the ALLOW_FIELD_ADDITION schema update option.
This isn't an absolute dealbreaker, but it does mean a BigQuery ETL PR will be needed first to update such ETLs' schema.yaml files. And whenever new fields later get added to such ETLs (e.g. due to new fields being added in upstream data sources and getting passed through), that will cause table deployment failures, so PRs to re-update their schema.yaml files will need to be submitted and merged promptly.
Here's an example of such an ETL: https://github.com/mozilla/bigquery-etl/tree/main/sql/moz-fx-data-shared-prod/firefox_accounts_derived/fxa_gcp_stdout_events_v1
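To make the failure mode concrete, here is a minimal sketch of how ALLOW_FIELD_ADDITION lets a table drift ahead of its checked-in schema.yaml. The project, dataset, table, and query below are made up for illustration; only the google-cloud-bigquery API usage is real, and it is not the actual ETL's code.

```python
from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")

job_config = bigquery.QueryJobConfig(
    # Illustrative destination table, not a real ETL.
    destination="moz-fx-data-shared-prod.example_derived.example_v1",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Any new columns produced by the query are added to the destination table
    # automatically, without a corresponding schema.yaml change in the repo.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

# If an upstream source starts emitting a new field and the query passes it
# through, the deployed table gains that field while schema.yaml stays stale.
client.query(
    "SELECT * FROM `moz-fx-data-shared-prod.example.upstream_v1`",
    job_config=job_config,
).result()
```

With --skip-existing, such a stale schema.yaml would no longer be refreshed before the deploy step, and deploying a schema that is missing fields the table already has is what would make `bqetl query schema deploy` fail.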
Most of these cases should be detected by bqetl_dryrun, which compares the schema.yaml files in the generated-sql branch against the query and table schema. However, some queries are skipped in dryruns, so those don't get caught and might cause the issues you described.
I created mozilla/bigquery-etl#8670 to still update schemas for queries that have ALLOW_FIELD_ADDITION configured in their metadata, even when --skip-existing is used.
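A rough sketch of what that carve-out could look like is below. This is not the actual implementation in mozilla/bigquery-etl#8670; the function name and the metadata key used to detect ALLOW_FIELD_ADDITION are assumptions for illustration.

```python
from pathlib import Path

import yaml


def should_skip_schema_update(query_dir: Path, skip_existing: bool) -> bool:
    """Return True if this query's schema.yaml can be left as-is."""
    if not skip_existing:
        return False
    if not (query_dir / "schema.yaml").exists():
        return False

    metadata = yaml.safe_load((query_dir / "metadata.yaml").read_text()) or {}
    # Assumed metadata shape: queries using ALLOW_FIELD_ADDITION can gain table
    # fields that schema.yaml doesn't list yet, so they are still updated even
    # when --skip-existing is passed.
    if "ALLOW_FIELD_ADDITION" in metadata.get("schema_update_options", []):
        return False

    return True
```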
Looks good. 👍
Description
Depends on mozilla/bigquery-etl#8648
Currently, `bqetl query schema update '*'` is running for all queries before table deploys. This takes 13-15 minutes to complete. We might not need to update schemas for queries with an existing schema.yaml file; if the schemas are out of date, the dryrun task will flag them instead (sketched below). This saves about 7 minutes (when tested locally).

I ran the schema update with the skip option against the `generated-sql` branch and also executed a dryrun with schema validation, and no errors showed up.
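For reference, a simplified sketch of the kind of drift the dryrun schema validation would surface, comparing a query's checked-in schema.yaml against the deployed table's schema. This is illustrative only, not bigquery-etl's actual validation, which also handles nested fields, modes, and other metadata.

```python
from pathlib import Path

import yaml
from google.cloud import bigquery


def schema_yaml_out_of_date(query_dir: Path, table_id: str) -> bool:
    """Compare top-level field names in schema.yaml with the live table schema."""
    declared = {
        field["name"]
        for field in yaml.safe_load((query_dir / "schema.yaml").read_text())["fields"]
    }

    client = bigquery.Client()
    deployed = {field.name for field in client.get_table(table_id).schema}

    # Any mismatch (missing or extra fields) means schema.yaml needs a refresh.
    return declared != deployed
```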
Related Tickets & Documents