Skip to content

Conversation

@fvankrieken
Copy link
Contributor

@fvankrieken fvankrieken commented Dec 4, 2025

#1836

A lot of the lines added are docs!! I add them with relevant commits, but easiest spot to view the full file is here

latest build

Background

(most of this is just copied from design_doc.md)

The Special Address File (SAF) contains address information for some street segments supplementary to the address information for the same segments contained in the LION file. There are 9 outputs - 5 different let's call them "output formats", 4 of which have 2 different outputs.

SAF records come from

  • altsegmentdata - these are relatively "fully-fledged" and are basically designed to generate these output saf records
  • certain CommonPlace records
  • certain AddressPoint records
  • NamedIntersection records

The different flags are:

Flag Source Corresponds to Meaning Output
A altsegmentdata segment Alternative address ranges ABCEGNPX
B altsegmentdata segment Alternative street names ABCEGNPX
C altsegmentdata segment Special record for Ruby Street on Brooklyn-Queens border ABCEGNPX
D altsegmentdata segment Duplicate or overlapping address ranges on the same street D
E altsegmentdata segment Neighborhood name usable in place of street name in addresses ABCEGNPX
G commonplace segment Non-Addressable place name of complexes ABCEGNPX
I namedintersection intersection Contain intersection names I
N commonplace segment Non-Addressable placename ABCEGNPX
O altsegmentdata segment Contain out-of-sequence addresses and opposite-parity addresse OV
P altsegmentdata segment Addressable placename ABCEGNPX
S addresspoint segment Suffixed house number(s) occurring at an intersection S
V addresspoint segment Contain Vanity Addresses OV
X commonplace segment Non-Addressable Place name of constituent entities of complexes ABCEGNPX

Outputs are really just grouped by what data/fields are expected to be associated with an output, as far as I can tell

Review

Commits are generally split up by output file. Though for ABCEGNPX, since it's the largest and first output I tackled, I split it up into a few commits for clarity. The other outputs just have a single commit for essentially all their logic.

READ THE DESIGN DOC CHANGES IN EACH COMMIT FIRST (for commits that edit design_doc.md). This is the documentation for those transformations, so if that isn't giving you proper context for review, that's an issue that I need to address.

All feedback welcome on the documentation! Making it more readable, expanding on jargon, structuring it better.

@codecov
Copy link

codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.68%. Comparing base (fc5a230) to head (a770170).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fvankrieken fvankrieken force-pushed the fvk-cscl-saf branch 3 times, most recently from d69c9b5 to b763d54 Compare December 9, 2025 21:42
@fvankrieken fvankrieken force-pushed the fvk-cscl-saf branch 26 times, most recently from 7517fd7 to 0a1981c Compare December 23, 2025 18:50
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally was using this to create some QA tables. However, it seems overkill to start actually making qa tables for each file output for these smaller files, I've just been running this ad-hoc in dbeaver.

@fvankrieken fvankrieken changed the title WIP: cscl saf CSCL SAF Dec 30, 2025
@fvankrieken fvankrieken marked this pull request as ready for review December 30, 2025 18:12
@fvankrieken
Copy link
Contributor Author

fvankrieken commented Dec 30, 2025

Have done some QA of the tables with the query included in that macro

WITH combined AS (
    SELECT 
    	'dev' as source,
        *,
        md5(CAST(dev AS text)) AS row_hash
    FROM saf_{abcegnpx_generic}_by_field as dev
    UNION ALL
    SELECT 
    	'prod' as source,
        *,
        md5(CAST(prod AS text)) AS row_hash
    FROM production_outputs.saf_{abcegnpx_generic} as prod
),
counts AS (
    SELECT 
        *,
        COUNT(*) OVER (PARTITION BY row_hash) AS match_count,
        COUNT(CASE WHEN source = 'dev' THEN 1 END) OVER (PARTITION BY row_hash) AS dev_count,
        COUNT(CASE WHEN source = 'prod' THEN 1 END) OVER (PARTITION BY row_hash) AS prod_count
    FROM combined
)
select counts.*
--, gnx.atomicid, gnx.geom, gnx.ap_geom
FROM counts
--left join int__saf_gnx gnx on counts.segmentid::int = gnx.segmentid and counts.source = 'dev'
WHERE dev_count <> prod_count  -- Different counts = unmatched rows
order by boroughcode, face_code, segment_seqnum, source

ABCEGNPX

131 diffs (both generic and roadbed).

  • majority of cases are different APs being joined. I've done a little querying - all records are gnx address points that are essentially on top of segments. However, in all cases, they're slightly off center, and ours seems to be correct. Will send list to GR
  • some "keys" that are in dev but not prod and vice-versa. Going to punt this issue. Seems like some have different face codes but same other fields, which would imply some segments are not being assigned correctly.

D

No diffs between dev and prod

I

No diffs between dev and prod

OV

No diffs between dev and prod

S

32 diffs, some weirdness. Want to punt this to a new issue

  • hn - some seem to have the interesting Edgewater logic applied, but others don't
  • lgc - don't seem to line up. Prod code is a little weird - looks fine, but it's implemented differently from standard segment lgc lookup for some reason

models:
# Models with column per field
- name: saf_abcegnpx_generic_by_field
columns: &safa_columns
Copy link
Contributor Author

@fvankrieken fvankrieken Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started to lose a little faith in this while writing it - I feel like it's a lot of effort to write/maintain tests like these when really, I'd rather just trust the formatting seed and the macro that generates all these columns (and also, these tests take a while to run - for LION, I think it's like a minute). And then, we test that the final text output (the "dat" table with row concatenated as text) has the right length, so that would catch most issues here.

So I stopped part way through and thought I'd open it up to thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea seems like a pain to maintain the column length tests. always doing the final "dat" table testing sounds good to me

Comment on lines +21 to +31
CASE
WHEN street_names.snd_feature_type IN ('E', 'F') AND address_points.house_number_suffix IS NOT NULL
THEN
-- Last character is A, B, etc -> converted to 1, 2, etc
(
10000 * COALESCE(ASCII(RIGHT(address_points.house_number_suffix, 1)) - 64, 0)
+ address_points.house_number::INT
)::TEXT
WHEN address_points.hyphen_type = 'R'
THEN TRIM(SPLIT_PART(address_points.house_number, '-', 1))
ELSE address_points.house_number
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a way to generalize the logic for Edgewater without having to explicitly write a WHEN for just that area?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? WHEN street_names.snd_feature_type IN ('E', 'F') is explicitly for Edgewater

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry I forgot that's how it's flagged in the source data, and that's in the doc!

@fvankrieken fvankrieken merged commit a0072b6 into main Jan 6, 2026
23 checks passed
@fvankrieken fvankrieken deleted the fvk-cscl-saf branch January 6, 2026 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants