@punit-naik-amp punit-naik-amp commented Nov 27, 2025

This commit implements the ability to scan and configure Stitch across multiple Databricks catalogs and schemas in a single operation, while maintaining full backward compatibility with single-location usage.

Key changes:

Core functionality (stitch_tools.py):

  • Add validate_multi_location_access() to pre-validate access permissions
  • Add _helper_prepare_multi_location_stitch_config() for aggregating PII scans across multiple locations
  • Refactor _helper_prepare_stitch_config() to support both single and multi-location modes with unified interface
  • Gracefully handle partial failures when some locations are inaccessible
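The pre-validation step described above might be shaped roughly like this (a minimal sketch: the function name `validate_multi_location_access` comes from the PR, but the body, the `client` object, and its `list_tables` call are illustrative assumptions, not the actual implementation):

```python
from typing import List, Tuple

def validate_multi_location_access(client, targets):
    """Partition (catalog, schema) targets into accessible and inaccessible.

    `client` is assumed to expose some metadata call (here `list_tables`)
    that raises PermissionError when the caller lacks access to a location.
    Inaccessible locations are collected, not fatal, so a run can degrade
    gracefully to the subset of locations the caller can actually read.
    """
    accessible: List[Tuple[str, str]] = []
    inaccessible: List[Tuple[Tuple[str, str], str]] = []
    for catalog, schema in targets:
        try:
            client.list_tables(catalog, schema)  # cheap metadata probe
            accessible.append((catalog, schema))
        except PermissionError as exc:
            inaccessible.append(((catalog, schema), str(exc)))
    return accessible, inaccessible
```

Returning both lists lets the caller scan the accessible locations while still reporting which targets were skipped and why.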

Command interface (setup_stitch.py):

  • Add support for 'targets' parameter accepting list of catalog.schema pairs
  • Add 'output_catalog' parameter to specify where outputs should be stored
  • Enhance _display_config_preview() to show scan results for multiple locations
  • Update command definition with new parameters and usage examples
  • Fix pylint issues: remove unnecessary elif after return statements
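Parsing the new 'targets' parameter might look like this (a sketch only; the helper name `parse_targets` and the exact validation rules are assumptions, not the code in setup_stitch.py):

```python
def parse_targets(raw: str) -> list:
    """Parse a comma-separated '--targets' value into (catalog, schema) pairs.

    Example input: "prod.crm,prod.ecommerce,analytics.customers"
    """
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas/whitespace
        catalog, sep, schema = item.partition(".")
        if not (sep and catalog and schema):
            raise ValueError(f"invalid target {item!r}: expected catalog.schema")
        pairs.append((catalog, schema))
    return pairs
```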

Type safety (url_utils.py):

  • Update normalize_workspace_url() to accept Optional[str]
  • Update detect_cloud_provider() to accept Optional[str]
  • Ensure proper handling of None/empty workspace URLs
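The Optional[str] handling pattern might look like the sketch below (only the `Optional[str]` signature reflects the PR; the specific normalization rules shown are assumptions for illustration):

```python
from typing import Optional

def normalize_workspace_url(url: Optional[str]) -> Optional[str]:
    """Return a canonical https:// workspace URL, or None for missing input.

    Accepting Optional[str] lets callers pass through unset configuration
    values without a separate None check at every call site.
    """
    if not url or not url.strip():
        return None
    url = url.strip().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url
```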

Testing (test_stitch_tools.py):

  • Add 10 new tests in TestMultiCatalogSupport class covering:
    • Access validation for all/partial/missing locations
    • Successful multi-location configuration
    • Partial failure handling with graceful degradation
    • No PII found across locations
    • Backward compatibility verification
  • Fix 12 existing tests to add required schema validation setup
  • All 27 tests passing
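One of the partial-failure tests might be shaped roughly like this (the class name `TestMultiCatalogSupport` comes from the PR; the test body and the `_validate` stand-in are illustrative assumptions so the sketch runs on its own):

```python
import unittest

def _validate(client, targets):
    # Local stand-in for the real validate_multi_location_access() in
    # stitch_tools.py, kept here so the test is self-contained.
    ok, bad = [], []
    for loc in targets:
        try:
            client(loc)
            ok.append(loc)
        except PermissionError as exc:
            bad.append((loc, str(exc)))
    return ok, bad

class TestMultiCatalogSupport(unittest.TestCase):
    def test_partial_failure_degrades_gracefully(self):
        def client(loc):
            if loc == "punit_02.punit":
                raise PermissionError("no USE CATALOG privilege")
        ok, bad = _validate(client, ["punit.punit", "punit_02.punit"])
        self.assertEqual(ok, ["punit.punit"])
        self.assertEqual(bad[0][0], "punit_02.punit")
```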

Usage examples:
Single location (backward compatible): /setup-stitch --catalog_name prod --schema_name crm

Multiple locations: /setup-stitch --targets prod.crm,prod.ecommerce,analytics.customers --output_catalog prod

Demo:
NOTE: The relevant tables' columns were already tagged with PII information.

chuck > I need to run Stitch across different catalog and schema. They're all in different places - punit.punit, punit_02.punit, and punit_local.punit
Thinking...

Preparing Stitch configuration for 3 locations...
  • punit.punit
  • punit_02.punit
  • punit_local.punit
Scanning punit.punit.pos_customers...
Scanning punit_02.punit.loyalty_customers...
Scanning punit_local.punit.ecommerce_customers...

Stitch Configuration Preview:
• Scanned locations: 3
  - punit.punit
  - punit_02.punit
  - punit_local.punit
• Output: punit.stitch_outputs
• Job Name: stitch-multi-2025-11-26_19-43
• Config Path: /Volumes/punit/punit/chuck/stitch-multi-2025-11-26_19-43.json

Scan Results:
  ✓ punit.punit (1 tables, 17 PII columns)
  ✓ punit_02.punit (1 tables, 17 PII columns)
  ✓ punit_local.punit (1 tables, 17 PII columns)

• Tables to process: 3
• Total PII fields: 51

Tables:
  - punit.punit.pos_customers (17 fields)
    • master_id
    • cid
    • name_prefix (title)
    • firstName (given-name)
    • lastName (surname)
    • gender (gender)
    • emailAddress (email)
    • account_status
    • address (address)
    • city (city)
    • state (state)
    • postal_code (postal)
    • dateofbirth (birthdate)
    • employment
    • occupation
    • phone (phone)
    • dtUpdateDate (update-dt)
  - punit_02.punit.loyalty_customers (17 fields)
    • lm_id_uuid
    • master_id
    • lm_id
    • fname (given-name)
    • lname (surname)
    • emailaddress (email)
    • gender (gender)
    • addr1 (address)
    • city (city)
    • state (state)
    • zipcode (postal)
    • birthdate (birthdate)
    • created (create-dt)
    • points
    • current_tier
    • lmProgramName
    • dtUpdateDate (update-dt)
  - punit_local.punit.ecommerce_customers (17 fields)
    • master_id
    • customer_id
    • name_prefix (title)
    • name_first (given-name)
    • name_last (surname)
    • gender (gender)
    • email (email)
    • account_status
    • addr_ln_1_txt (address)
    • city (city)
    • state (state)
    • postal_code (postal)
    • birth_dt (birthdate)
    • employment
    • job_title
    • phone (phone)
    • dtUpdateDate (update-dt)

What would you like to do?
• Type 'launch' or 'yes' to launch the job
• Describe changes (e.g., 'remove table X', 'add email semantic to field Y')
• Type 'cancel' to abort the setup
chuck (interactive) > yes
When you launch Stitch it will create a job in Databricks and a notebook that will show you Stitch results when the job completes.
Stitch will create a schema called stitch_outputs with two new tables called unified_coalesced and unified_scores.
The unified_coalesced table will contain the standardized PII and amperity_ids.
The unified_scores table will contain the links and confidence scores.
Be sure to check out the results in the Stitch Report notebook!

Ready to launch Stitch job. Type 'confirm' to proceed or 'cancel' to abort.
chuck (interactive) > confirm

Launching Stitch job...

Stitch job launched successfully!

Technical Summary:
Stitch setup for punit.punit initiated.
Config: /Volumes/punit/punit/chuck/stitch-multi-2025-11-26_19-43.json
Chuck Job ID: chk-20251126-51224-7g9nVexyAsH
Databricks Job Run ID: 225210058521786

Created Stitch Report notebook:
Notebook Path: /Workspace/Users/v-punit.naik@amperity.com/Stitch Report: punit.punit
Stitch is now running in your Databricks workspace!

Running Stitch creates a job that will take at least a few minutes to complete.

What Stitch will create:
• Schema: punit.stitch_outputs
• Table: punit.stitch_outputs.unified_coalesced (standardized PII and amperity_ids)
• Table: punit.stitch_outputs.unified_scores (links and confidence scores)

A Stitch report showing the results has been created to help you see the results.
The report will not work until Stitch is complete.


What you can do now:
• you can ask me about the status of the Chuck job (job-id: chk-20251126-51224-7g9nVexyAsH)
• you can ask me about the status of the Databricks job run (run-id: 225210058521786)
• Open Databricks job in browser: https://dbc-6e75f43b-0f28.cloud.databricks.com/jobs/802978059061168/runs/225210058521786?o=dbc-6e75f43b-0f28
• Open Stitch Report notebook in browser: https://dbc-6e75f43b-0f28.cloud.databricks.com/?o=dbc-6e75f43b-0f28#workspace/Users/v-punit.naik%40amperity.com/Stitch%20Report%3A%20punit.punit
• Open Databricks workspace: https://dbc-6e75f43b-0f28.cloud.databricks.com

Databricks notebook screenshots after successful multi-catalog multi-schema stitch operation:
[Six screenshots: visualization-01 through visualization-06]

@pragyan-amp left a comment:
looks good..

@punit-naik-amp force-pushed the CHUCK-5-multi-catalog-and-multi-schema-scan-support branch from 39fd697 to 0efacac on November 27, 2025 at 14:31
@punit-naik-amp merged commit 90864f0 into main on Nov 27, 2025
2 checks passed
@punit-naik-amp deleted the CHUCK-5-multi-catalog-and-multi-schema-scan-support branch on November 27, 2025 at 14:34