Skip to content

feat: collapse NSC semesters into enrollment periods via gaps-and-isl…#3415

Open
anthonygwalters wants to merge 13 commits intomainfrom
claude/salesforce-alumni-tracking-vtKNu
Open

feat: collapse NSC semesters into enrollment periods via gaps-and-isl…#3415
anthonygwalters wants to merge 13 commits intomainfrom
claude/salesforce-alumni-tracking-vtKNu

Conversation

@anthonygwalters
Copy link
Copy Markdown
Member

…ands

Adds int_nsc__enrollments, which collapses semester-level NSC rows into continuous enrollment periods. Consecutive semesters are merged into one enrollment; a gap > 200 days (summer ≈ 90 days, missing semester ≈ 270 days) signals a new enrollment period.

Updates rpt_gsheets__nsc_enrollment_new and rpt_gsheets__nsc_enrollment_updates to reference the intermediate model and match against Salesforce enrollments using ±180-day date proximity rather than calendar year equality, fixing the case where a Fall+Spring enrollment was incorrectly split into two records.

https://claude.ai/code/session_01FcUXjXQDFFHKo4E6d3oG6S

Pull Request

Summary & Motivation

Self-review

General

  • If this is a same-day request, please flag that in the #data-team Slack
  • Update due date and assignee on the
    TEAMster Asana Project
  • Run trunk fmt on all modified files

Dagster

  • Run uv run dagster definitions validate for any modified code location
  • New assets follow the [code_location, integration, table_name] key
    pattern and use the appropriate IO manager (pickle, avro, or file)
  • New integrations include both a libraries/<integration>/assets.py
    factory function and a config/assets-*.yaml in the code location

dbt

  • Include a corresponding [model name].yml properties file for all models.
    These can be generated by running
    dbt run-operation generate_model_yaml --args '{"model_names": ["MODEL NAME"]}'
    and saving the console output to a YAML file.

    models:
      - name: [model name]
        config:
          contract:
            enforced: false  # required while creating a contracted model, delete before merging
        columns:  # required for contracted models
          - name: ...
            data_type: ...
            data_tests:  # column tests, optional
              - ...
        data_tests:  # model tests, optional
          - ...
          - dbt_utils.unique_combination_of_columns:  # recommended for intermediate models
              arguments:
                combination_of_columns:
                  - ...
              config:
                store_failures: true
    
  • Include (or update) an
    exposure for all
    models that will be consumed by a dashboard, analysis, or application:

    exposures:
      - name: [exposure name, snake_case]
        label: [exposure name, Title Case]
        type: dashboard | notebook | analysis | ml | application
        owner:
          name: Data Team
        depends_on:
          - ref("[model name]")
          - ...
        url: ...  # optional
        config:
          meta:
            dagster:
              kinds:
                - tableau | googlesheets | ...
                - ...
              asset:
                metadata:
                  id: [lsid]  # required for Tableau Server workbooks
                  cron_schedule:  # required for Dagster automation
                    - * * * * *
                    - ...
    

Dagster "kinds" Reference

SQL

  • Use the union_dataset_join_clause() macro for queries that employ models
    that use regional datasets

  • Do not use group by without any aggregations when you mean to use
    distinct

  • All distinct usage must be accompanied by a comment explaining its
    necessity

  • Do not use order by for select statements. That should be done in the
    reporting layer.

  • If you are adding a new external source, before building, run:

    dbt run-operation stage_external_sources --vars "{'ext_full_refresh': 'true'}" --args select: [model name(s)]
    

Troubleshooting

@anthonygwalters anthonygwalters requested a review from a team as a code owner March 11, 2026 20:48
@anthonygwalters anthonygwalters requested a review from a team March 11, 2026 20:48
…ands

Adds int_nsc__enrollments, which collapses semester-level NSC rows into
continuous enrollment periods. Consecutive semesters are merged into one
enrollment; a gap > 200 days (summer ≈ 90 days, missing semester ≈ 270 days)
signals a new enrollment period.

Updates rpt_gsheets__nsc_enrollment_new and rpt_gsheets__nsc_enrollment_updates
to reference the intermediate model and match against Salesforce enrollments
using ±180-day date proximity rather than calendar year equality, fixing the
case where a Fall+Spring enrollment was incorrectly split into two records.

https://claude.ai/code/session_01FcUXjXQDFFHKo4E6d3oG6S
@anthonygwalters anthonygwalters force-pushed the claude/salesforce-alumni-tracking-vtKNu branch from 6a80366 to 29871a3 Compare March 11, 2026 20:54
@anthonygwalters
Copy link
Copy Markdown
Member Author

@claude please review this PR

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 13, 2026

Claude finished @anthonygwalters's task in 2m 36s —— View job


PR Review

  • Gather context and understand the request
  • Read changed SQL files
  • Review intermediate model logic
  • Review extract model updates
  • Check dbt YAML properties
  • Post final review

Overall: The gaps-and-islands approach is correct and well-structured. The 200-day threshold is well-documented and the deduplication logic is sound. A few issues to address below.


int_nsc__enrollments.sql

Logic is solid. The three-step gaps-and-islands pattern is clean:

  1. nsc_with_account — deduplicates overlapping NSC rows for the same (contact_id, account_id, enrollment_begin) by keeping the most recent search_date. Good.
  2. semesters_with_lag — computes the gap to the previous semester's end. The lag() is correctly ordered by enrollment_begin within the right partition.
  3. enrollment_groups — uses a running sum() to assign group IDs, which is the standard gaps-and-islands idiom.

The final array_agg(... order by enrollment_begin desc limit 1)[offset(0)] to pick the most recent semester's fields is a valid BigQuery pattern, and is deterministic here because enrollment_begin is unique within each (contact_id, account_id) pair after the deduplication step.

No issues.


rpt_gsheets__nsc_enrollment_updates.sql — potential duplicate id bug

The uniqueness test on this model asserts id is unique. However, it's possible for a single Salesforce enrollment (e.id) to match two NSC enrollment periods via the ±180-day window, producing duplicate rows:

Example scenario:

  • NSC enrollment period 1: enrollment_begin = 2020-01-01, enrollment_end = 2020-05-15
  • NSC enrollment period 2: enrollment_begin = 2020-12-01 (gap = ~200 days → separate group)
  • Salesforce enrollment: start_date = 2020-07-01

Both NSC enrollment_begin values fall within ±180 days of 2020-07-01, so both join to the same e.id. The uniqueness test would fail at run time.

This is a theoretical edge case but the ±180-day window is wide enough to make it realistic. A safer join might add a ROW_NUMBER() in nsc_enrollment_derived to pick the closest NSC period per Salesforce enrollment, or narrow the window. Worth an explicit decision on the right behavior here.

Fix this →


rpt_gsheets__nsc_enrollment_new.sql — double-quoted string literals

Lines 23–24 use double quotes for string literals:

then "Bachelor's (4-year)"
...
then "Associate's (2 year)"

The .sqlfluff config enforces preferred_quoted_literal_style = single_quotes. In BigQuery, double quotes are for identifiers, not literals. These should be escaped single-quote strings:

then 'Bachelor''s (4-year)'
...
then 'Associate''s (2 year)'

Fix this →


rpt_gsheets__nsc_enrollment_new.yml — stale column description

The start_date__c description still says:

"Date the enrollment began (first day of the academic year)."

With the new intermediate model, enrollment_begin is the first day of the earliest semester in the enrollment period — not necessarily the first day of an academic year. The description should be updated to match (e.g., "First day of the earliest NSC semester in this enrollment period.").


int_nsc__enrollments.yml

Well-formed. The dbt_utils.unique_combination_of_columns test on (contact_id, account_id, enrollment_group) is correct and required for intermediate models per project conventions. All columns are documented with accurate descriptions.

Minor: enrollment_group is described as "monotonically increasing" but technically it increments per group boundary detection — not guaranteed to be strictly sequential across all rows. The description is accurate enough, no change needed.


Summary

Severity File Issue
🔴 Medium rpt_gsheets__nsc_enrollment_updates.sql Potential duplicate e.id rows when two NSC periods match the same SF enrollment via ±180-day window
🟡 Low rpt_gsheets__nsc_enrollment_new.sql Double-quoted string literals violate sqlfluff single-quote rule
🟢 Trivial rpt_gsheets__nsc_enrollment_new.yml start_date__c description is stale

The core gaps-and-islands logic in int_nsc__enrollments is correct and the refactor meaningfully simplifies the extract models. The duplicate-ID risk in the updates model is the main thing to resolve before merging.
|

claude and others added 2 commits March 13, 2026 19:23
- Add QUALIFY ROW_NUMBER() to rpt_gsheets__nsc_enrollment_updates to
  prevent duplicate e.id rows when multiple NSC enrollment periods fall
  within ±180 days of the same Salesforce enrollment start date; picks
  the period with enrollment_begin closest to start_date
- Fix double-quoted string literals in rpt_gsheets__nsc_enrollment_new
  to use escaped single quotes per sqlfluff config
- Correct stale start_date__c description in properties YAML

https://claude.ai/code/session_01FcUXjXQDFFHKo4E6d3oG6S
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants