# Update table display #2
**Status**: Merged
fix asset
## Pull Request Overview
This PR updates the table display functionality, adding support for new fields ("updated_at" and "row_count") and improving timestamp and row count formatting. Key changes include:
- New tests in tests/test_tui_display.py covering field mapping, timestamp formatting, missing timestamps, and row count formatting.
- Updates to the TUI display logic in src/ui/tui.py to correctly format "created_at", "updated_at", and "row_count".
- Minor adjustments in related command, client, and project configuration files.
### Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_tui_display.py | Added tests to verify field mappings, timestamp formats, and row count formatting. |
| tests/test_agent_tool_display_routing.py | Removed unused import and adjusted formatting in a test case. |
| src/ui/tui.py | Updated timestamp field names and added row count formatting logic. |
| src/commands/list_tables.py | Updated field mapping to include "updated_at" and new row count extraction. |
| src/clients/databricks.py | Added logging of the API result (potential debugging artifact). |
| src/chuck_data/chuck_data/main.py | Added noqa comments to maintain import order. |
| src/chuck_data/main.py | Removed unused sys import. |
| pyproject.toml | Updated exclusion patterns for notebook files. |
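A minimal sketch of the timestamp and row count formatting described in this PR, using hypothetical helper names rather than the actual functions in `src/ui/tui.py`:

```python
from datetime import datetime, timezone


def format_timestamp(value: str | None) -> str:
    """Render an ISO-8601 timestamp for display; show a dash when missing."""
    if not value:
        return "-"
    # Treat a trailing "Z" as UTC so fromisoformat() accepts it.
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


def format_row_count(count: int | None) -> str:
    """Render a row count with thousands separators; show a dash when missing."""
    return f"{count:,}" if count is not None else "-"


row = {"name": "customers", "created_at": "2024-12-18T10:30:00Z", "updated_at": None, "row_count": 1234567}
print(format_timestamp(row["created_at"]))  # 2024-12-18 10:30
print(format_timestamp(row["updated_at"]))  # -
print(format_row_count(row["row_count"]))   # 1,234,567
```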
punit-naik-amp added a commit that referenced this pull request on Dec 19, 2025:
# Setup Wizard Provider Selection & Redshift Integration Enhancements
## Overview
This PR introduces comprehensive improvements to the setup wizard,
enabling flexible provider combinations (Databricks + Redshift, Redshift
+ Databricks compute, etc.) and fixing critical bugs in multi-provider
configurations. The changes support the full Redshift + Databricks
integration flow end-to-end.
**Branch**: `CHUCK-10-pr7-setup-wizard-provider-selection`
**Base**: `CHUCK-10-redshift`
**Changes**: 53 files changed, 6,594 insertions(+), 997 deletions(-)
**Commits**: 29
---
## 🎯 Key Features
### 1. **Enhanced Setup Wizard with Provider Selection**
The wizard now supports explicit provider selection for data, compute,
and LLM:
```
Setup Flow:
┌─────────────────────────────────────────────────────────────┐
│ 1. Amperity Auth │
├─────────────────────────────────────────────────────────────┤
│ 2. Data Provider Selection │
│ ├─ Databricks (Unity Catalog) │
│ └─ AWS Redshift │
├─────────────────────────────────────────────────────────────┤
│ 3a. IF Databricks → Workspace URL + Token │
│ 3b. IF Redshift → AWS Profile + Region + Account ID + │
│ Cluster/Workgroup + S3 + IAM Role │
├─────────────────────────────────────────────────────────────┤
│ 4. Compute Provider Selection │
│ └─ Databricks (required for both data providers) │
├─────────────────────────────────────────────────────────────┤
│ 5. LLM Provider Selection │
│ ├─ Databricks │
│ └─ AWS Bedrock │
├─────────────────────────────────────────────────────────────┤
│ 6. Model Selection (based on LLM provider) │
├─────────────────────────────────────────────────────────────┤
│ 7. Usage Tracking Consent │
└─────────────────────────────────────────────────────────────┘
```
**New wizard steps added:**
- `DataProviderSelectionStep` - Choose between Databricks or Redshift
- `ComputeProviderSelectionStep` - Choose compute backend
- `AWSProfileInputStep` - Configure AWS profile for Redshift
- `AWSRegionInputStep` - Configure AWS region
- `AWSAccountIdInputStep` - Configure AWS account ID (required for
Redshift manifests)
- `RedshiftClusterSelectionStep` - Select Redshift cluster or serverless
workgroup
- `S3BucketInputStep` - Configure S3 for Spark-Redshift connector
- `IAMRoleInputStep` - Configure IAM role for Redshift access
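As a rough sketch of how the branching flow above could be sequenced (step names follow the diagram and list above; the ordering helper itself is hypothetical):

```python
def wizard_step_sequence(data_provider: str) -> list[str]:
    """Return the ordered wizard step names for a given data provider (illustrative only)."""
    steps = ["AmperityAuth", "DataProviderSelection"]
    if data_provider == "redshift":
        steps += [
            "AWSProfileInput",
            "AWSRegionInput",
            "AWSAccountIdInput",
            "RedshiftClusterSelection",
            "S3BucketInput",
            "IAMRoleInput",
        ]
    else:  # databricks
        steps += ["WorkspaceUrlInput", "DatabricksTokenInput"]
    steps += ["ComputeProviderSelection", "LLMProviderSelection", "ModelSelection", "UsageTrackingConsent"]
    return steps


print(wizard_step_sequence("redshift"))
```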
### 2. **Provider Abstraction & Dependency Injection**
Introduced clean separation between data providers, compute providers,
and storage providers:
```python
# Before: Tightly coupled, hard-coded Databricks assumptions
client = DatabricksAPIClient(...)
compute = DatabricksComputeProvider(...)
# After: Flexible provider composition with dependency injection
data_provider = ProviderFactory.create_data_provider("redshift", config)
storage_provider = ProviderFactory.create_storage_provider("s3", config)
compute_provider = ProviderFactory.create_compute_provider(
"databricks",
config,
storage_provider=storage_provider # Injected dependency
)
```
**Key abstractions:**
- `IStorageProvider` protocol - Abstract storage (S3, DBFS, Volumes)
- Storage provider injection into compute providers
- Provider detection utilities for automatic routing
- Runtime-checkable protocols for proper type safety
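A minimal sketch of the factory-plus-injection pattern described here, with simplified stand-in provider classes (the real `ProviderFactory` and provider constructors may differ):

```python
class S3StorageProvider:
    def __init__(self, bucket: str):
        self.bucket = bucket


class DatabricksComputeProvider:
    def __init__(self, workspace_url: str, storage_provider=None):
        self.workspace_url = workspace_url
        # Injected storage backend used for uploading manifests and init scripts.
        self.storage_provider = storage_provider


class ProviderFactory:
    @staticmethod
    def create_storage_provider(name: str, config: dict):
        if name == "s3":
            return S3StorageProvider(bucket=config["s3_bucket"])
        raise ValueError(f"Unknown storage provider: {name}")

    @staticmethod
    def create_compute_provider(name: str, config: dict, storage_provider=None):
        if name == "databricks":
            return DatabricksComputeProvider(config["workspace_url"], storage_provider=storage_provider)
        raise ValueError(f"Unknown compute provider: {name}")


config = {"s3_bucket": "my-bucket", "workspace_url": "https://my-workspace.databricks.com"}
storage = ProviderFactory.create_storage_provider("s3", config)
compute = ProviderFactory.create_compute_provider("databricks", config, storage_provider=storage)
```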
### 3. **Redshift-Specific Commands & Configuration**
Added dedicated Redshift commands and configuration management:
**New Commands:**
- `/list-redshift-schemas` - List Redshift schemas with database context
- `/select-redshift-schema` - Select active Redshift database and schema
- `/redshift-status` - Show current Redshift configuration
**Configuration Management:**
- Automatic cleanup of incompatible config on provider switch
- Proper persistence of AWS account ID, region, cluster info
- Redshift-specific config fields (`redshift_workgroup_name`,
`redshift_iam_role`, etc.)
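The cleanup-on-switch behavior might look roughly like this (a sketch; the field set is taken from the config list later in this description, and the helper name is hypothetical):

```python
REDSHIFT_ONLY_FIELDS = {
    "aws_account_id",
    "aws_region",
    "aws_profile",
    "redshift_workgroup_name",
    "redshift_cluster_identifier",
    "redshift_iam_role",
    "redshift_s3_temp_dir",
}


def clean_config_for_provider(config: dict, data_provider: str) -> dict:
    """Drop Redshift-specific keys when switching away from Redshift (illustrative only)."""
    if data_provider != "redshift":
        return {k: v for k, v in config.items() if k not in REDSHIFT_ONLY_FIELDS}
    return dict(config)
```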
### 4. **Enhanced Stitch Integration for Redshift**
Complete end-to-end Stitch support for Redshift data sources:
```
Redshift Stitch Flow:
┌──────────────────────────────────────────────────────────────┐
│ 1. Scan Redshift tables for PII (chuck-data) │
│ - Uses LLM to detect semantic tags │
│ - Stores tags in chuck_metadata.semantic_tags │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. Generate manifest JSON with semantic tags │
│ - Includes redshift_config with all connection details │
│ - Includes aws_account_id for JDBC URL construction │
│ - Uploads to S3 for Databricks job access │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Submit Stitch job to Databricks │
│ - Fetches init script from Amperity API │
│ - Uploads init script to S3 │
│ - Submits job with manifest and init script paths │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. Stitch job processes Redshift data │
│ - Reads manifest from S3 │
│ - Connects to Redshift using Spark-Redshift connector │
│ - Attaches semantic tags to DataFrame metadata │
│ - Runs identity resolution │
│ - Writes results back to Redshift │
└──────────────────────────────────────────────────────────────┘
```
**Manifest generation improvements:**
- Proper `redshift_config` with all required fields
- `aws_account_id` included for JDBC URL construction
- Explicit `data_provider` and `compute_provider` fields
- Support for both provisioned clusters and serverless workgroups
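A compressed sketch of manifest assembly along these lines (the real `chuck_data/storage/manifest.py` is more involved; the settings keys mirror the manifest example later in this description, while the config keys for database and schema are assumptions):

```python
import json


def build_redshift_manifest(config: dict, tables: list[dict]) -> str:
    """Assemble a Stitch manifest for a Redshift source as JSON (illustrative only)."""
    manifest = {
        "settings": {
            "data_provider": "redshift",
            "compute_provider": "databricks",
            "redshift_config": {
                "database": config["redshift_database"],          # assumed config key
                "schema": config["redshift_schema"],              # assumed config key
                "workgroup_name": config.get("redshift_workgroup_name"),
                "region": config["aws_region"],
                "aws_account_id": config["aws_account_id"],       # needed for JDBC URL construction
            },
            "s3_temp_dir": config["redshift_s3_temp_dir"],
            "redshift_iam_role": config["redshift_iam_role"],
        },
        # Each table entry carries the semantic tags detected during the PII scan.
        "tables": tables,
    }
    return json.dumps(manifest, indent=2)
```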
### 5. **Storage Provider Abstraction**
New storage abstraction for managing artifacts across backends:
```python
from typing import Protocol

class IStorageProvider(Protocol):
    """Protocol for storage backends (S3, DBFS, Volumes)."""

    def upload_file(self, local_path: str, remote_path: str) -> bool: ...
    def download_file(self, remote_path: str, local_path: str) -> bool: ...
    def exists(self, remote_path: str) -> bool: ...
    def delete(self, remote_path: str) -> bool: ...
```
**Implementations:**
- `S3StorageProvider` - AWS S3 backend (for Redshift)
- `DBFSStorageProvider` - Databricks DBFS (legacy)
- `VolumesStorageProvider` - Unity Catalog Volumes (preferred for
Databricks)
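As an illustration, an S3-backed implementation of this protocol could look roughly like the following boto3 sketch (not the actual `S3StorageProvider`):

```python
import boto3
from botocore.exceptions import ClientError


class S3StorageProvider:
    """IStorageProvider backed by AWS S3 (illustrative sketch)."""

    def __init__(self, bucket: str, profile_name: str | None = None):
        session = boto3.Session(profile_name=profile_name)
        self._s3 = session.client("s3")
        self._bucket = bucket

    def upload_file(self, local_path: str, remote_path: str) -> bool:
        try:
            self._s3.upload_file(local_path, self._bucket, remote_path)
            return True
        except ClientError:
            return False

    def download_file(self, remote_path: str, local_path: str) -> bool:
        try:
            self._s3.download_file(self._bucket, remote_path, local_path)
            return True
        except ClientError:
            return False

    def exists(self, remote_path: str) -> bool:
        try:
            self._s3.head_object(Bucket=self._bucket, Key=remote_path)
            return True
        except ClientError:
            return False

    def delete(self, remote_path: str) -> bool:
        try:
            self._s3.delete_object(Bucket=self._bucket, Key=remote_path)
            return True
        except ClientError:
            return False
```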
### 6. **Provider-Aware Command Routing**
Commands now automatically detect the active provider and route
appropriately:
```python
# Before: Commands assumed Databricks
def handle_command(client, **kwargs):
catalogs = client.list_catalogs() # Always Databricks
# After: Commands detect provider and route correctly
def handle_command(client, **kwargs):
if is_redshift_client(client):
databases = client.list_databases() # Redshift
else:
catalogs = client.list_catalogs() # Databricks
```
**Provider detection:**
- `is_redshift_client()` - Check if client is RedshiftAPIClient
- `is_databricks_client()` - Check if client is DatabricksAPIClient
- Automatic routing in agent tool executor
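The detection helpers are presumably thin `isinstance` checks along these lines (import paths are assumed from the file list later in this description):

```python
# Import paths are assumed; the real module layout may differ.
from chuck_data.clients.databricks import DatabricksAPIClient
from chuck_data.clients.redshift import RedshiftAPIClient


def is_redshift_client(client) -> bool:
    """True when the active data client targets Redshift."""
    return isinstance(client, RedshiftAPIClient)


def is_databricks_client(client) -> bool:
    """True when the active data client targets Databricks."""
    return isinstance(client, DatabricksAPIClient)
```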
---
## 🐛 Critical Bug Fixes
### Bug #1: LLM Provider Selection with Redshift
**Problem:**
When using Redshift as data provider and Databricks as LLM provider, the
wizard crashed with:
```
AttributeError: 'RedshiftAPIClient' object has no attribute 'list_models'
```
**Root cause:**
The wizard was passing `service.client` (RedshiftAPIClient) to
`DatabricksProvider`, which expected a DatabricksAPIClient or None.
**Fix:**
Added type checking to only use service.client if it's a
DatabricksAPIClient:
```python
# Added in LLMProviderSelectionStep and ModelSelectionStep
service = get_chuck_service()
databricks_client = None
if service and service.client and isinstance(service.client, DatabricksAPIClient):
databricks_client = service.client # Only use if correct type
databricks_provider = DatabricksProvider(
workspace_url=state.workspace_url,
token=state.token,
client=databricks_client # None if data provider is Redshift
)
```
### Bug #2: ConfigManager Not Saving Dynamic Fields
**Problem:**
`aws_account_id` and other dynamic fields were silently dropped when
saving config, even though `ChuckConfig` has `extra="allow"`.
**Root cause:**
`ConfigManager.update()` had a `hasattr()` check that prevented
non-schema fields from being set:
```python
# Old buggy code
for key, value in kwargs.items():
if hasattr(config, key): # Prevented dynamic fields!
setattr(config, key, value)
```
**Fix:**
Removed the `hasattr()` check to allow all fields:
```python
# Fixed code
for key, value in kwargs.items():
setattr(config, key, value) # Now accepts all fields
```
**Impact:**
- `aws_account_id` now properly saved to config
- Included in generated Redshift manifests
- All other dynamic config fields also work correctly
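A small self-contained illustration of the difference, using a plain stand-in for `ChuckConfig`:

```python
class ChuckConfig:
    """Plain stand-in for the real Pydantic model with extra fields allowed."""

    def __init__(self, workspace_url: str | None = None):
        self.workspace_url = workspace_url


updates = {"workspace_url": "https://my-workspace.databricks.com", "aws_account_id": "123456789012"}

# Old behavior: the hasattr() guard silently drops fields not declared on the schema.
config = ChuckConfig()
for key, value in updates.items():
    if hasattr(config, key):
        setattr(config, key, value)
print(getattr(config, "aws_account_id", None))  # None -- the dynamic field was lost

# Fixed behavior: set every field, matching the schema's extra="allow" intent.
config = ChuckConfig()
for key, value in updates.items():
    setattr(config, key, value)
print(config.aws_account_id)  # "123456789012"
```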
---
## 📊 Test Coverage
**New test files:**
- `tests/unit/commands/wizard/test_state.py` - 401 lines, comprehensive
wizard state tests
- `tests/unit/commands/wizard/test_steps.py` - 589 lines, wizard step
validation
- `tests/unit/test_workspace_and_init_scripts.py` - 446 lines, workspace
APIs and protocols
**Updated test coverage:**
- Setup wizard tests updated for new flow
- Stitch integration tests updated for Redshift support
- Compute provider tests updated for dependency injection
- Service tests updated for provider-aware routing
**Test isolation improvements:**
- Tests now use temporary config files to avoid modifying user's
`~/.chuck_config.json`
- Proper cleanup of test artifacts
- Mock providers for unit testing
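The temporary-config isolation presumably looks something like this pytest fixture (a sketch; the fixture name and override mechanism are assumptions):

```python
import json

import pytest


@pytest.fixture
def isolated_config(tmp_path, monkeypatch):
    """Point the app at a throwaway config file so tests never touch ~/.chuck_config.json."""
    config_path = tmp_path / "chuck_config.json"
    config_path.write_text(json.dumps({}))
    # Hypothetical override mechanism; the real one may be an env var, argument, or patch.
    monkeypatch.setenv("CHUCK_CONFIG_PATH", str(config_path))
    return config_path


def test_config_file_is_isolated(isolated_config):
    assert json.loads(isolated_config.read_text()) == {}
```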
---
## 🔄 Migration Impact
### Breaking Changes
None - all changes are backward compatible.
### New Required Fields for Redshift Manifests
Generated manifests now include:
```json
{
"settings": {
"data_provider": "redshift",
"compute_provider": "databricks",
"redshift_config": {
"database": "dev",
"schema": "public",
"workgroup_name": "my-workgroup",
"region": "us-west-2",
"aws_account_id": "123456789012" // NEW - required
},
"s3_temp_dir": "s3://bucket/temp/",
"redshift_iam_role": "arn:aws:iam::123456789012:role/Role"
}
}
```
### Config File Changes
New config fields (all optional, added only when Redshift is selected):
- `aws_account_id` - AWS account ID for Redshift
- `aws_region` - AWS region
- `aws_profile` - AWS profile name
- `redshift_workgroup_name` - Serverless workgroup name
- `redshift_cluster_identifier` - Provisioned cluster identifier
- `redshift_iam_role` - IAM role ARN
- `redshift_s3_temp_dir` - S3 temp directory for Spark-Redshift
- `s3_bucket` - S3 bucket for artifacts
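For example, a Redshift setup might persist these fields with a call like the following (the import path and constructor are assumptions; only `update(**kwargs)` is taken from the bug fix above):

```python
# Import path and constructor are assumptions; update(**kwargs) is taken from the ConfigManager fix.
from chuck_data.config import ConfigManager

config_manager = ConfigManager()
config_manager.update(
    aws_account_id="123456789012",
    aws_region="us-west-2",
    aws_profile="default",
    redshift_workgroup_name="my-workgroup",
    redshift_iam_role="arn:aws:iam::123456789012:role/RedshiftRole",
    redshift_s3_temp_dir="s3://my-bucket/temp/",
    s3_bucket="my-bucket",
)
```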
---
## 📁 Key File Changes
### Core Setup & Configuration (10 files)
- `chuck_data/commands/setup_wizard.py` - Orchestrator for new wizard
flow
- `chuck_data/commands/wizard/steps.py` - All wizard step
implementations (+666 lines)
- `chuck_data/commands/wizard/state.py` - Wizard state management (+139
lines)
- `chuck_data/config.py` - Config manager with dynamic field support
(+101 lines)
- `chuck_data/service.py` - Provider-aware service initialization (+116
lines)
### Provider Abstraction (8 files)
- `chuck_data/provider_factory.py` - Factory for creating providers (+43
lines)
- `chuck_data/compute_providers/databricks.py` - Databricks with storage
injection
- `chuck_data/compute_providers/emr.py` - EMR with storage support
- `chuck_data/data_providers/utils.py` - Provider detection utilities
(NEW, 172 lines)
- `chuck_data/storage/manifest.py` - Manifest generation for Redshift
(NEW, 378 lines)
### Redshift Integration (5 files)
- `chuck_data/clients/redshift.py` - Enhanced Redshift client (+114
lines)
- `chuck_data/commands/list_redshift_schemas.py` - NEW command (118
lines)
- `chuck_data/commands/redshift_schema_selection.py` - NEW command (183
lines)
- `chuck_data/commands/redshift_status.py` - NEW command (98 lines)
- `chuck_data/commands/setup_stitch.py` - Full Redshift support (+1317
lines)
### Client Enhancements (3 files)
- `chuck_data/clients/databricks.py` - Workspace and init script APIs
(+147 lines)
- `chuck_data/clients/amperity.py` - Moved init script fetch here (+55
lines)
- `chuck_data/ui/tui.py` - Provider-aware UI updates (+163 lines)
---
## ✅ Testing Checklist
- [x] Setup wizard completes successfully for Databricks data provider
- [x] Setup wizard completes successfully for Redshift data provider
- [x] Setup wizard properly saves `aws_account_id` to config
- [x] Mixed provider setup works (Redshift data + Databricks compute +
Databricks LLM)
- [x] Mixed provider setup works (Redshift data + Databricks compute +
Bedrock LLM)
- [x] Generated manifests include all required fields for Redshift
- [x] Stitch setup works end-to-end with Redshift
- [x] Provider detection correctly routes commands
- [x] Config cleanup happens when switching providers
- [x] All unit tests pass
- [x] Test isolation prevents modifying user config
---
## 🎬 Demo Flow
### Complete Redshift + Databricks Setup
```bash
# 1. Run setup wizard
chuck> /setup
# Wizard flow:
# ✓ Amperity Auth (browser-based OAuth)
# ✓ Select Data Provider: AWS Redshift
# ✓ Enter AWS Profile: default
# ✓ Enter AWS Region: us-west-2
# ✓ Enter AWS Account ID: 123456789012
# ✓ Enter Redshift Workgroup: my-workgroup
# ✓ Enter S3 Bucket: my-bucket
# ✓ Enter IAM Role: arn:aws:iam::123456789012:role/RedshiftRole
# ✓ Select Compute Provider: Databricks
# ✓ Enter Workspace URL: https://my-workspace.databricks.com
# ✓ Enter Databricks Token: dapi***
# ✓ Select LLM Provider: Databricks
# ✓ Select Model: databricks-meta-llama-3-1-70b-instruct
# ✓ Usage Consent: yes
# 2. Check configuration
chuck> /redshift-status
✓ Data Provider: AWS Redshift
✓ Region: us-west-2
✓ Workgroup: my-workgroup
✓ Account ID: 123456789012
# 3. Select database and schema
chuck> /select-redshift-schema
# Lists databases, then schemas
# 4. Run Stitch setup
chuck> /setup-stitch
# Generates manifest, uploads to S3, submits Databricks job
✓ Manifest: s3://my-bucket/chuck/manifests/redshift_dev_public_20241218.json
✓ Job submitted: run-id 12345
```
---
## 📝 Commit History Summary
**Provider Abstraction & Architecture** (9 commits)
- Add storage provider abstraction
- Integrate storage providers into compute providers
- Make commands provider-aware
- Add provider detection utilities
- Make protocols runtime-checkable
**Redshift Integration** (8 commits)
- Add Redshift-specific commands
- Update setup_stitch for Redshift support
- Add manifest generation with semantic tags
- Flow AWS credentials through wizard
- Add AWS account ID to config and manifests
**Setup Wizard Enhancements** (7 commits)
- Add data provider selection step
- Add AWS configuration steps (profile, region, account ID)
- Add compute provider selection
- Update wizard orchestration
- Add comprehensive wizard tests
**Bug Fixes & Quality** (5 commits)
- Fix ConfigManager dynamic field saving
- Fix LLM provider selection with Redshift
- Fix test isolation issues
- Update all affected tests
- Add explicit provider fields to manifests
---
## 🚀 Next Steps
After merge, the following work can proceed:
1. End-to-end integration testing with real Redshift cluster
2. Performance validation on large Redshift tables
3. User documentation and guides
4. EMR compute provider support (architecture is ready)
---
## 📚 Related Documentation
- [Redshift Integration
Guide](../app/service/stitch/stitch-standalone/doc/redshift-integration.md)
- [Backend
Abstraction](../app/service/stitch/stitch-standalone/src/amperity/stitch_standalone/backend_abstraction.clj)
- [Generic Main Entry
Point](../app/service/stitch/stitch-standalone/src/amperity/stitch_standalone/generic_main.clj)
---
Original PR description: Updates the display of tables to include `updated_at` timestamps and row counts when available.