Skip to content

fix: join to sam_account_name for google_email changers#3596

Merged
kevinverhoff merged 5 commits intomainfrom
cristinabaldor/fix/survey-completion-join
Apr 8, 2026
Merged

fix: join to sam_account_name for google_email changers#3596
kevinverhoff merged 5 commits intomainfrom
cristinabaldor/fix/survey-completion-join

Conversation

@cristinabaldor
Copy link
Copy Markdown
Contributor

@cristinabaldor cristinabaldor commented Apr 8, 2026

Pull Request

Summary & Motivation

Edit the join to capture teammates whose google_email addresses have changed from @apps.teamschools.org to @kippmiami.org match on sam_account_names instead, as they will be the same before and after.

Self-review

General

  • If this is a same-day request, please flag that in the #data-team Slack
  • Update due date and assignee on the
    TEAMster Asana Project
  • Run trunk fmt on all modified files

Dagster

  • Run uv run dagster definitions validate for any modified code location
  • New assets follow the [code_location, integration, table_name] key
    pattern and use the appropriate IO manager (pickle, avro, or file)
  • New integrations include both a libraries/<integration>/assets.py
    factory function and a config/assets-*.yaml in the code location

dbt

  • Include a corresponding [model name].yml properties file for all models.
    These can be generated by running
    dbt run-operation generate_model_yaml --args '{"model_names": ["MODEL NAME"]}'
    and saving the console output to a YAML file.

    models:
      - name: [model name]
        config:
          contract:
            enforced: false  # required while creating a contracted model, delete before merging
        columns:  # required for contracted models
          - name: ...
            data_type: ...
            data_tests:  # column tests, optional
              - ...
        data_tests:  # model tests, optional
          - ...
          - dbt_utils.unique_combination_of_columns:  # recommended for intermediate models
              arguments:
                combination_of_columns:
                  - ...
              config:
                store_failures: true
    
  • Include (or update) an
    exposure for all
    models that will be consumed by a dashboard, analysis, or application:

    exposures:
      - name: [exposure name, snake_case]
        label: [exposure name, Title Case]
        type: dashboard | notebook | analysis | ml | application
        owner:
          name: Data Team
        depends_on:
          - ref("[model name]")
          - ...
        url: ...  # optional
        config:
          meta:
            dagster:
              kinds:
                - tableau | googlesheets | ...
                - ...
              asset:
                metadata:
                  id: [lsid]  # required for Tableau Server workbooks
                  cron_schedule:  # required for Dagster automation
                    - * * * * *
                    - ...
    

Dagster "kinds" Reference

SQL

  • Use the union_dataset_join_clause() macro for queries that employ models
    that use regional datasets

  • Do not use group by without any aggregations when you mean to use
    distinct

  • All distinct usage must be accompanied by a comment explaining its
    necessity

  • Do not use order by for select statements. That should be done in the
    reporting layer.

  • If you are adding a new external source, before building, run:

    dbt run-operation stage_external_sources --vars "{'ext_full_refresh': 'true'}" --args select: [model name(s)]
    

Troubleshooting

@cristinabaldor cristinabaldor requested review from a team as code owners April 8, 2026 15:19
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @cristinabaldor's task in 4m 47s —— View job


Code Review: fix join to sam_account_name for google_email changers

The change is correct for its stated purpose — recovering Miami staff survey responses after the @apps.teamschools.org@kippmiami.org domain migration. The regex and join logic are sound. A few things worth flagging before merge:


Important

1. Upstream model int_surveys__survey_responses has the same google_email join gap

At src/dbt/kipptaf/models/surveys/intermediate/int_surveys__survey_responses.sql:42:

left join {{ ref("stg_ldap__user_person") }} as ldap
    on fr.respondent_email = ldap.google_email
    and ldap.uac_account_disable = 0

This populates respondent_employee_number, respondent_samaccountname, and respondent_userprincipalname on the intermediate model. For historical responses submitted with the old @apps.teamschools.org address, these columns will be NULL once the LDAP record has been updated. The fix in rpt_tableau__survey_responses.sql sidesteps this by going directly to int_people__staff_roster_history, but any other consumer of int_surveys__survey_responses that relies on those columns will still see NULLs for migrated Miami staff.

2. Two other Tableau extract models have the same join pattern — not fixed

  • rpt_tableau__school_community_diagnostic.sql:79: on sr.respondent_email = srh.google_email — LEFT JOIN, so no hard failure, but Miami staff will have NULL demographic columns from srh.
  • rpt_tableau__operations_pm.sql:109: on responses_pivoted.respondent_email = full_roster.google_email — same pattern.

If Miami staff are included in those surveys, the same data dropout is happening there. Worth a follow-up issue if not already tracked.


Suggestions

3. Add lower() for defensive casing

sam_account_name is stored as lower(ldap.sam_account_name). Google Forms email addresses should be lowercase, but it's not explicitly normalised in the pipeline. A small guard:

on lower(regexp_extract(sr.respondent_email, r'^([^@]+)')) = eh.sam_account_name

4. Plus-addressing edge case

The regex r'^([^@]+)' captures everything before @, including any +tag suffix (e.g., jsmith+survey@kippmiami.orgjsmith+survey), which would not match sam_account_name = 'jsmith'. Unlikely in staff surveys, but worth awareness if responses are ever imported from external sources.


What Looks Good

  • regexp_extract correctly returns NULL on NULL input (no-match, not an error), so NULL emails safely fall out of the INNER JOIN — same behaviour as before.
  • The existing temporal guard (date_submitted BETWEEN effective_date_start_timestamp AND effective_date_end_timestamp) and primary_indicator filter remain intact.
  • SQL formatting follows project conventions (BigQuery dialect, trailing commas, no ORDER BY, 88-char line limit).
  • sam_account_name is LDAP-unique across the network, so the join cardinality assumption is the same as the previous email join.

Summary

The core fix is correct and addresses the described problem. The main recommendation is to file a follow-up issue to apply the same fix to int_surveys__survey_responses.sql:42, rpt_tableau__school_community_diagnostic.sql:79, and rpt_tableau__operations_pm.sql:109 if Miami staff appear in those survey types.

… addresses

Replace two-hop LDAP → staff_roster join in int_surveys__survey_responses with
a date-bounded join to int_people__staff_roster_history on sam_account_name for
both the Google Forms and Alchemer union branches. Resolves NULL respondent
identity columns for staff whose email addresses were migrated from
@apps.teamschools.org, and recovers 174 employees' worth of historical Alchemer
responses (2019–2022) that were silently dropped by a case-sensitive join.

Downstream rpt_tableau__survey_responses and rpt_tableau__school_community_diagnostic
updated to join on respondent_employee_number, removing the repeated email-based
join pattern. rpt_tableau__operations_pm fixed independently with lower() +
sam_account_name extraction since it bypasses the int model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cristinabaldor
Copy link
Copy Markdown
Contributor Author

@claude, made some changes, review again

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @cristinabaldor's task in 1m 55s —— View job


Code Review (updated): fix join to sam_account_name for google_email changers

  • Read the changed file and diff against origin/main
  • Analyze the updated code
  • Post updated review

Note: The branch (17686bef0) reflects the same single commit as the previous review — no new code changes are visible. If you pushed additional changes, they may not have been reflected in the checkout. Reviewing the current state.


Current change (rpt_tableau__survey_responses.sql:46)

-- Before
on sr.respondent_email = eh.google_email

-- After
on regexp_extract(sr.respondent_email, r'^([^@]+)') = eh.sam_account_name

The core fix is correct. sam_account_name is stored as lower(ldap.sam_account_name) in int_people__staff_roster_history, so it's already normalised to lowercase.


Still outstanding from previous review

1. Defensive lower() — suggestion, not a blocker

regexp_extract doesn't guarantee lowercase — it extracts whatever case the email contains. Google Forms addresses are typically lowercase, but a one-liner guards against any edge case:

on lower(regexp_extract(sr.respondent_email, r'^([^@]+)')) = eh.sam_account_name

Since sam_account_name is already lower(...) in the model, the left side should match. This is the only unaddressed suggestion from the previous review.

2. Related models still have the same gap (follow-up scope)

These are outside this PR's scope but worth tracking:

File Line Pattern
int_surveys__survey_responses.sql 42 fr.respondent_email = ldap.google_email
rpt_tableau__school_community_diagnostic.sql 79 sr.respondent_email = srh.google_email
rpt_tableau__operations_pm.sql 109 responses_pivoted.respondent_email = full_roster.google_email

What looks good

  • regexp_extract returns NULL on NULL input — NULL emails safely fall out of the INNER JOIN, preserving previous behaviour.
  • The temporal guard (date_submitted BETWEEN effective_date_start_timestamp AND effective_date_end_timestamp) and primary_indicator filter are intact.
  • sam_account_name is network-unique, so join cardinality is equivalent to the old email join.
  • SQL formatting follows project conventions.

Summary

The fix is correct and ready to merge as-is. Adding lower() around the regexp_extract call would be a small defensive improvement but is not required. If the "changes" you made weren't reflected here, check that the push went through (git push) — the branch tip is still 17686bef0.
| Branch

@cristinabaldor
Copy link
Copy Markdown
Contributor Author

@claude pushed more changes, take a look

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @cristinabaldor's task in 2m 35s —— View job


Code Review (updated): all-files fix for google_email changers

  • Reading changed files and diff
  • Analyze all 4 changed files
  • Post updated review

This is a substantial improvement over the previous commit. All three files flagged in the earlier review are now fixed, and the approach is architecturally cleaner.


What changed

int_surveys__survey_responses.sql — the core fix

Google Forms branch (lines 39–50): The two-hop stg_ldap__user_personint_people__staff_roster join is replaced with a direct join to int_people__staff_roster_history on sam_account_name, with lower(), a timestamp-cast temporal guard, and primary_indicator. Addresses the original identity resolution gap for email-migrated staff.

Alchemer branch (lines 100–105): Previously all identity columns (respondent_employee_number, respondent_samaccountname, etc.) were hard-coded null. This branch now joins int_people__staff_roster_history via ri.respondent_mail using the same pattern — recovering 174 employees' worth of Alchemer responses (2019–2022). term_name is also now populated from rt.name rather than null.

rpt_tableau__survey_responses.sql (line 46) — correctly simplified to sr.respondent_employee_number = eh.employee_number. Identity resolution is now upstream; the downstream join is cleaner.

rpt_tableau__school_community_diagnostic.sql (line 79) — same simplification, from respondent_email = srh.google_email to respondent_employee_number = srh.employee_number.

rpt_tableau__operations_pm.sql (lines 110–111) — fixed independently with lower(regexp_extract(..., r'^([^@]+)')) = full_roster.sam_account_name since this model bypasses the int model.


All previous suggestions addressed

Suggestion Status
lower() on regexp_extract ✅ Applied in all 4 files
int_surveys__survey_responses.sql:42 identity gap ✅ Fixed — both UNION branches now resolve identity
rpt_tableau__school_community_diagnostic.sql:79 email join ✅ Fixed
rpt_tableau__operations_pm.sql:109 email join ✅ Fixed

One minor note

rpt_tableau__school_community_diagnostic.sql:78–81 — the join to srh uses employee_number + temporal guard but no and srh.primary_indicator. This was also absent in the original google_email join, so it's pre-existing. Since employee_number was already resolved through primary_indicator in int_surveys__survey_responses, and the temporal guard is present, this should not fan-out in practice. Not a blocker.


Summary

All items from the previous review are resolved. The architecture is cleaner — identity resolution happens once in the intermediate model and flows cleanly into downstream extracts. Ready to merge.
| Branch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11 staff have google_email local parts that don't match sam_account_name due to
accents, hyphens, dots, or name changes. Add OR respondent_email = google_email
fallback to both union branches in int_surveys__survey_responses to recover
their 2,725 survey responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@kevinverhoff kevinverhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lfg

@kevinverhoff kevinverhoff merged commit ecbc257 into main Apr 8, 2026
7 checks passed
@kevinverhoff kevinverhoff deleted the cristinabaldor/fix/survey-completion-join branch April 8, 2026 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants