@punit-naik-amp punit-naik-amp commented Nov 27, 2025

This commit implements the ability to scan and configure Stitch across multiple Databricks catalogs and schemas in a single operation, while maintaining full backward compatibility with single-location usage.

Key changes:

Core functionality (stitch_tools.py):

  • Add validate_multi_location_access() to pre-validate access permissions
  • Add _helper_prepare_multi_location_stitch_config() for aggregating PII scans across multiple locations
  • Refactor _helper_prepare_stitch_config() to support both single and multi-location modes with unified interface
  • Gracefully handle partial failures when some locations are inaccessible
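The pre-validation step described above might be shaped roughly like this (a minimal sketch: the function name `validate_multi_location_access` comes from the PR, but the body, the `client` object, and its `list_tables` call are illustrative assumptions, not the actual implementation):

```python
from typing import List, Tuple

def validate_multi_location_access(client, targets):
    """Partition (catalog, schema) targets into accessible and inaccessible.

    `client` is assumed to expose some metadata call (here `list_tables`)
    that raises PermissionError when the caller lacks access to a location.
    Inaccessible locations are collected, not fatal, so a run can degrade
    gracefully to the subset of locations the caller can actually read.
    """
    accessible: List[Tuple[str, str]] = []
    inaccessible: List[Tuple[Tuple[str, str], str]] = []
    for catalog, schema in targets:
        try:
            client.list_tables(catalog, schema)  # cheap metadata probe
            accessible.append((catalog, schema))
        except PermissionError as exc:
            inaccessible.append(((catalog, schema), str(exc)))
    return accessible, inaccessible
```

Returning both lists lets the caller scan the accessible locations while still reporting which targets were skipped and why.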

Command interface (setup_stitch.py):

  • Add support for 'targets' parameter accepting list of catalog.schema pairs
  • Add 'output_catalog' parameter to specify where outputs should be stored
  • Enhance _display_config_preview() to show scan results for multiple locations
  • Update command definition with new parameters and usage examples
  • Fix pylint issues: remove unnecessary elif after return statements
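Parsing the new 'targets' parameter might look like this (a sketch only; the helper name `parse_targets` and the exact validation rules are assumptions, not the code in setup_stitch.py):

```python
def parse_targets(raw: str) -> list:
    """Parse a comma-separated '--targets' value into (catalog, schema) pairs.

    Example input: "prod.crm,prod.ecommerce,analytics.customers"
    """
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas/whitespace
        catalog, sep, schema = item.partition(".")
        if not (sep and catalog and schema):
            raise ValueError(f"invalid target {item!r}: expected catalog.schema")
        pairs.append((catalog, schema))
    return pairs
```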

Type safety (url_utils.py):

  • Update normalize_workspace_url() to accept Optional[str]
  • Update detect_cloud_provider() to accept Optional[str]
  • Ensure proper handling of None/empty workspace URLs
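The Optional[str] handling pattern might look like the sketch below (only the `Optional[str]` signature reflects the PR; the specific normalization rules shown are assumptions for illustration):

```python
from typing import Optional

def normalize_workspace_url(url: Optional[str]) -> Optional[str]:
    """Return a canonical https:// workspace URL, or None for missing input.

    Accepting Optional[str] lets callers pass through unset configuration
    values without a separate None check at every call site.
    """
    if not url or not url.strip():
        return None
    url = url.strip().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url
```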

Testing (test_stitch_tools.py):

  • Add 10 new tests in TestMultiCatalogSupport class covering:
    • Access validation for all/partial/missing locations
    • Successful multi-location configuration
    • Partial failure handling with graceful degradation
    • No PII found across locations
    • Backward compatibility verification
  • Fix 12 existing tests to add required schema validation setup
  • All 27 tests passing
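One of the partial-failure tests might be shaped roughly like this (the class name `TestMultiCatalogSupport` comes from the PR; the test body and the `_validate` stand-in are illustrative assumptions so the sketch runs on its own):

```python
import unittest

def _validate(client, targets):
    # Local stand-in for the real validate_multi_location_access() in
    # stitch_tools.py, kept here so the test is self-contained.
    ok, bad = [], []
    for loc in targets:
        try:
            client(loc)
            ok.append(loc)
        except PermissionError as exc:
            bad.append((loc, str(exc)))
    return ok, bad

class TestMultiCatalogSupport(unittest.TestCase):
    def test_partial_failure_degrades_gracefully(self):
        def client(loc):
            if loc == "punit_02.punit":
                raise PermissionError("no USE CATALOG privilege")
        ok, bad = _validate(client, ["punit.punit", "punit_02.punit"])
        self.assertEqual(ok, ["punit.punit"])
        self.assertEqual(bad[0][0], "punit_02.punit")
```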

Usage examples:
Single location (backward compatible): /setup-stitch --catalog_name prod --schema_name crm

Multiple locations: /setup-stitch --targets prod.crm,prod.ecommerce,analytics.customers --output_catalog prod

Demo:
NOTE: The relevant tables' columns were already tagged with PII information.

chuck > I need to run Stitch across different catalog and schema. They're all in different places - punit.punit, punit_02.punit, and punit_local.punit
Thinking...

Preparing Stitch configuration for 3 locations...
  • punit.punit
  • punit_02.punit
  • punit_local.punit
Scanning punit.punit.pos_customers...
Scanning punit_02.punit.loyalty_customers...
Scanning punit_local.punit.ecommerce_customers...

Stitch Configuration Preview:
• Scanned locations: 3
  - punit.punit
  - punit_02.punit
  - punit_local.punit
• Output: punit.stitch_outputs
• Job Name: stitch-multi-2025-11-26_19-43
• Config Path: /Volumes/punit/punit/chuck/stitch-multi-2025-11-26_19-43.json

Scan Results:
  ✓ punit.punit (1 tables, 17 PII columns)
  ✓ punit_02.punit (1 tables, 17 PII columns)
  ✓ punit_local.punit (1 tables, 17 PII columns)

• Tables to process: 3
• Total PII fields: 51

Tables:
  - punit.punit.pos_customers (17 fields)
    • master_id
    • cid
    • name_prefix (title)
    • firstName (given-name)
    • lastName (surname)
    • gender (gender)
    • emailAddress (email)
    • account_status
    • address (address)
    • city (city)
    • state (state)
    • postal_code (postal)
    • dateofbirth (birthdate)
    • employment
    • occupation
    • phone (phone)
    • dtUpdateDate (update-dt)
  - punit_02.punit.loyalty_customers (17 fields)
    • lm_id_uuid
    • master_id
    • lm_id
    • fname (given-name)
    • lname (surname)
    • emailaddress (email)
    • gender (gender)
    • addr1 (address)
    • city (city)
    • state (state)
    • zipcode (postal)
    • birthdate (birthdate)
    • created (create-dt)
    • points
    • current_tier
    • lmProgramName
    • dtUpdateDate (update-dt)
  - punit_local.punit.ecommerce_customers (17 fields)
    • master_id
    • customer_id
    • name_prefix (title)
    • name_first (given-name)
    • name_last (surname)
    • gender (gender)
    • email (email)
    • account_status
    • addr_ln_1_txt (address)
    • city (city)
    • state (state)
    • postal_code (postal)
    • birth_dt (birthdate)
    • employment
    • job_title
    • phone (phone)
    • dtUpdateDate (update-dt)

What would you like to do?
• Type 'launch' or 'yes' to launch the job
• Describe changes (e.g., 'remove table X', 'add email semantic to field Y')
• Type 'cancel' to abort the setup
chuck (interactive) > yes
When you launch Stitch it will create a job in Databricks and a notebook that will show you Stitch results when the job completes.
Stitch will create a schema called stitch_outputs with two new tables called unified_coalesced and unified_scores.
The unified_coalesced table will contain the standardized PII and amperity_ids.
The unified_scores table will contain the links and confidence scores.
Be sure to check out the results in the Stitch Report notebook!

Ready to launch Stitch job. Type 'confirm' to proceed or 'cancel' to abort.
chuck (interactive) > confirm

Launching Stitch job...

Stitch job launched successfully!

Technical Summary:
Stitch setup for punit.punit initiated.
Config: /Volumes/punit/punit/chuck/stitch-multi-2025-11-26_19-43.json
Chuck Job ID: chk-20251126-51224-7g9nVexyAsH
Databricks Job Run ID: 225210058521786

Created Stitch Report notebook:
Notebook Path: /Workspace/Users/v-punit.naik@amperity.com/Stitch Report: punit.punit
Stitch is now running in your Databricks workspace!

Running Stitch creates a job that will take at least a few minutes to complete.

What Stitch will create:
• Schema: punit.stitch_outputs
• Table: punit.stitch_outputs.unified_coalesced (standardized PII and amperity_ids)
• Table: punit.stitch_outputs.unified_scores (links and confidence scores)

A Stitch report showing the results has been created to help you see the results.
The report will not work until Stitch is complete.


What you can do now:
• you can ask me about the status of the Chuck job (job-id: chk-20251126-51224-7g9nVexyAsH)
• you can ask me about the status of the Databricks job run (run-id: 225210058521786)
• Open Databricks job in browser: https://dbc-6e75f43b-0f28.cloud.databricks.com/jobs/802978059061168/runs/225210058521786?o=dbc-6e75f43b-0f28
• Open Stitch Report notebook in browser: https://dbc-6e75f43b-0f28.cloud.databricks.com/?o=dbc-6e75f43b-0f28#workspace/Users/v-punit.naik%40amperity.com/Stitch%20Report%3A%20punit.punit
• Open Databricks workspace: https://dbc-6e75f43b-0f28.cloud.databricks.com

Databricks notebook screenshots after successful multi-catalog multi-schema stitch operation:
[Six screenshots: visualization-01 through visualization-06]

@pragyan-amp left a comment:
looks good..

@punit-naik-amp force-pushed the CHUCK-5-multi-catalog-and-multi-schema-scan-support branch from 39fd697 to 0efacac on November 27, 2025 at 14:31
@punit-naik-amp merged commit 90864f0 into main on Nov 27, 2025
2 checks passed
@punit-naik-amp deleted the CHUCK-5-multi-catalog-and-multi-schema-scan-support branch on November 27, 2025 at 14:34