@punit-naik-amp
Contributor

This PR establishes the base provider architecture for accessing data from different platforms and running Stitch jobs on different compute backends.

Changes:

  • Add DataProvider protocol defining the interface for data sources
  • Add DatabricksProviderAdapter stub (implementation in PR 2)
  • Add RedshiftProviderAdapter stub with required AWS credentials, IAM role, and EMR cluster ID (implementation in PR 2)
  • Add DataProviderFactory for creating data providers
  • Add ComputeProvider protocol defining the interface for compute backends
  • Add DatabricksComputeProvider stub (implementation in PR 3)
  • Add EMRComputeProvider stub (implementation in PR 4)
  • Add ProviderFactory with unified interface for both provider types
  • Add comprehensive unit tests (52 tests, all passing)
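The shape of the data-provider side can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `list_tables` method, the `create` classmethod, the registry keys, and every constructor parameter except `redshift_iam_role` are hypothetical names chosen for the example.

```python
from typing import Dict, List, Protocol, Type, runtime_checkable


@runtime_checkable
class DataProvider(Protocol):
    """Interface for a data source. `list_tables` is a hypothetical method."""

    def list_tables(self, schema: str) -> List[str]: ...


class DatabricksProviderAdapter:
    """Stub: the real implementation is deferred to a later PR."""

    def __init__(self, workspace_url: str, token: str) -> None:
        # Parameter names are assumptions made for this sketch.
        self.workspace_url = workspace_url
        self.token = token

    def list_tables(self, schema: str) -> List[str]:
        raise NotImplementedError("DatabricksProviderAdapter is a stub")


class RedshiftProviderAdapter:
    """Stub that requires AWS credentials, an IAM role, and an EMR cluster ID."""

    def __init__(
        self,
        aws_access_key_id: str,      # hypothetical parameter name
        aws_secret_access_key: str,  # hypothetical parameter name
        redshift_iam_role: str,      # named in the PR description
        emr_cluster_id: str,         # hypothetical parameter name
    ) -> None:
        self.aws_access_key_id = aws_access_key_id
        self.aws_secret_access_key = aws_secret_access_key
        self.redshift_iam_role = redshift_iam_role
        self.emr_cluster_id = emr_cluster_id

    def list_tables(self, schema: str) -> List[str]:
        raise NotImplementedError("RedshiftProviderAdapter is a stub")


class DataProviderFactory:
    """Maps a provider name to its adapter class and instantiates it."""

    _registry: Dict[str, Type] = {
        "databricks": DatabricksProviderAdapter,
        "redshift": RedshiftProviderAdapter,
    }

    @classmethod
    def create(cls, provider_type: str, **config) -> DataProvider:
        try:
            adapter_cls = cls._registry[provider_type]
        except KeyError:
            raise ValueError(f"Unknown data provider: {provider_type!r}") from None
        # Missing required config surfaces as a TypeError from the constructor.
        return adapter_cls(**config)
```

Because `DataProvider` is a structural protocol, the adapters do not need to inherit from it; any class with matching methods satisfies the interface, which keeps the stubs free of a shared base class.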

Key design decisions:

  • Data providers handle storage operations (no separate abstraction)
  • EMR uses boto3 credential discovery (aws_profile, IAM roles, env vars)
  • RedshiftProviderAdapter requires AWS credentials and accepts redshift_iam_role for COPY/UNLOAD operations
  • ComputeProvider.prepare_stitch_job() receives data_provider parameter
  • Pure additive changes (no modifications to existing code)
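Two of these decisions can be illustrated with a short sketch: passing the data provider into `prepare_stitch_job()`, and leaving credential resolution to boto3. Everything here beyond the names `ComputeProvider`, `EMRComputeProvider`, `prepare_stitch_job`, `data_provider`, and `aws_profile` is an assumption for illustration, including the `describe` method, the `config` argument, the `region` parameter, and the returned dictionary. In the PR itself the EMR provider is only a stub.

```python
from typing import Any, Dict, Optional, Protocol


class DataProvider(Protocol):
    """Minimal stand-in for the data-source interface (hypothetical method)."""

    def describe(self) -> Dict[str, Any]: ...


class ComputeProvider(Protocol):
    """Interface for a compute backend that runs Stitch jobs."""

    def prepare_stitch_job(
        self, data_provider: DataProvider, config: Dict[str, Any]
    ) -> Dict[str, Any]: ...


def build_session_kwargs(
    aws_profile: Optional[str] = None, region: Optional[str] = None
) -> Dict[str, str]:
    """Build keyword arguments for boto3.Session.

    Only explicitly supplied values are passed on. With an empty dict, boto3
    falls back to its default chain (env vars, shared config, IAM roles).
    """
    kwargs: Dict[str, str] = {}
    if aws_profile:
        kwargs["profile_name"] = aws_profile
    if region:
        kwargs["region_name"] = region
    return kwargs


class EMRComputeProvider:
    """Illustrative EMR backend; not the stub shipped in the PR."""

    def __init__(
        self,
        cluster_id: str,
        aws_profile: Optional[str] = None,
        region: Optional[str] = None,
    ) -> None:
        self.cluster_id = cluster_id
        self.session_kwargs = build_session_kwargs(aws_profile, region)

    def _session(self):
        # Imported lazily so the sketch loads even where boto3 is not installed.
        import boto3

        return boto3.Session(**self.session_kwargs)

    def prepare_stitch_job(
        self, data_provider: DataProvider, config: Dict[str, Any]
    ) -> Dict[str, Any]:
        # The compute backend asks the data provider for source details
        # instead of assuming a particular platform.
        return {
            "cluster_id": self.cluster_id,
            "source": data_provider.describe(),
            "config": config,
        }
```

Receiving `data_provider` as a parameter lets one compute backend prepare jobs against any data source without knowing which platform is behind it.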

Jira: CHUCK-10

This is just the scaffolding: the changes are purely additive and no existing code is modified. I am doing this in stages to make review easier, and will fold in the actual Databricks and Redshift implementations in later PRs.

@punit-naik-amp punit-naik-amp force-pushed the CHUCK-10-provider-infrastructure-foundation branch from b537298 to ec5a0a6 Compare December 12, 2025 16:28
@punit-naik-amp punit-naik-amp changed the base branch from main to CHUCK-10-redshift December 13, 2025 06:57
@punit-naik-amp
Contributor Author

@pragyan-amp I changed the base branch from main to CHUCK-10-redshift so that main stays clean while we merge reviewed and tested PRs into CHUCK-10-redshift in stages. This feature involves too many code changes to review easily in a single huge PR. At the end I will create one final PR from CHUCK-10-redshift to main.

Contributor

@pragyan-amp pragyan-amp left a comment

Looks good. :shipit:

@punit-naik-amp punit-naik-amp merged commit a7ffd33 into CHUCK-10-redshift Dec 15, 2025
2 checks passed
@punit-naik-amp punit-naik-amp deleted the CHUCK-10-provider-infrastructure-foundation branch December 15, 2025 03:50