# Update table display #2
**Status**: Merged
fix asset
## Pull Request Overview
This PR updates the table display functionality, adding support for new fields ("updated_at" and "row_count") and improving timestamp and row count formatting. Key changes include:
- New tests in tests/test_tui_display.py covering field mapping, timestamp formatting, missing timestamps, and row count formatting.
- Updates to the TUI display logic in src/ui/tui.py to correctly format "created_at", "updated_at", and "row_count".
- Minor adjustments in related command, client, and project configuration files.
### Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_tui_display.py | Added tests to verify field mappings, timestamp formats, and row count formatting. |
| tests/test_agent_tool_display_routing.py | Removed unused import and adjusted formatting in a test case. |
| src/ui/tui.py | Updated timestamp field names and added row count formatting logic. |
| src/commands/list_tables.py | Updated field mapping to include "updated_at" and new row count extraction. |
| src/clients/databricks.py | Added logging of the API result (potential debugging artifact). |
| src/chuck_data/chuck_data/main.py | Added noqa comments to maintain import order. |
| src/chuck_data/main.py | Removed unused sys import. |
| pyproject.toml | Updated exclusion patterns for notebook files. |
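A minimal sketch of the timestamp and row count formatting described in this PR, using hypothetical helper names rather than the actual functions in `src/ui/tui.py`:

```python
from datetime import datetime, timezone


def format_timestamp(value: str | None) -> str:
    """Render an ISO-8601 timestamp for display; show a dash when missing."""
    if not value:
        return "-"
    # Treat a trailing "Z" as UTC so fromisoformat() accepts it.
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


def format_row_count(count: int | None) -> str:
    """Render a row count with thousands separators; show a dash when missing."""
    return f"{count:,}" if count is not None else "-"


row = {"name": "customers", "created_at": "2024-12-18T10:30:00Z", "updated_at": None, "row_count": 1234567}
print(format_timestamp(row["created_at"]))  # 2024-12-18 10:30
print(format_timestamp(row["updated_at"]))  # -
print(format_row_count(row["row_count"]))   # 1,234,567
```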
punit-naik-amp added a commit that referenced this pull request on Dec 19, 2025:
# Setup Wizard Provider Selection & Redshift Integration Enhancements
## Overview
This PR introduces comprehensive improvements to the setup wizard,
enabling flexible provider combinations (Databricks + Redshift, Redshift
+ Databricks compute, etc.) and fixing critical bugs in multi-provider
configurations. The changes support the full Redshift + Databricks
integration flow end-to-end.
**Branch**: `CHUCK-10-pr7-setup-wizard-provider-selection`
**Base**: `CHUCK-10-redshift`
**Changes**: 53 files changed, 6,594 insertions(+), 997 deletions(-)
**Commits**: 29
---
## 🎯 Key Features
### 1. **Enhanced Setup Wizard with Provider Selection**
The wizard now supports explicit provider selection for data, compute,
and LLM:
```
Setup Flow:
┌─────────────────────────────────────────────────────────────┐
│ 1. Amperity Auth │
├─────────────────────────────────────────────────────────────┤
│ 2. Data Provider Selection │
│ ├─ Databricks (Unity Catalog) │
│ └─ AWS Redshift │
├─────────────────────────────────────────────────────────────┤
│ 3a. IF Databricks → Workspace URL + Token │
│ 3b. IF Redshift → AWS Profile + Region + Account ID + │
│ Cluster/Workgroup + S3 + IAM Role │
├─────────────────────────────────────────────────────────────┤
│ 4. Compute Provider Selection │
│ └─ Databricks (required for both data providers) │
├─────────────────────────────────────────────────────────────┤
│ 5. LLM Provider Selection │
│ ├─ Databricks │
│ └─ AWS Bedrock │
├─────────────────────────────────────────────────────────────┤
│ 6. Model Selection (based on LLM provider) │
├─────────────────────────────────────────────────────────────┤
│ 7. Usage Tracking Consent │
└─────────────────────────────────────────────────────────────┘
```
**New wizard steps added:**
- `DataProviderSelectionStep` - Choose between Databricks or Redshift
- `ComputeProviderSelectionStep` - Choose compute backend
- `AWSProfileInputStep` - Configure AWS profile for Redshift
- `AWSRegionInputStep` - Configure AWS region
- `AWSAccountIdInputStep` - Configure AWS account ID (required for
Redshift manifests)
- `RedshiftClusterSelectionStep` - Select Redshift cluster or serverless
workgroup
- `S3BucketInputStep` - Configure S3 for Spark-Redshift connector
- `IAMRoleInputStep` - Configure IAM role for Redshift access
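As a rough sketch of how the branching flow above could be sequenced (step names follow the diagram and list above; the ordering helper itself is hypothetical):

```python
def wizard_step_sequence(data_provider: str) -> list[str]:
    """Return the ordered wizard step names for a given data provider (illustrative only)."""
    steps = ["AmperityAuth", "DataProviderSelection"]
    if data_provider == "redshift":
        steps += [
            "AWSProfileInput",
            "AWSRegionInput",
            "AWSAccountIdInput",
            "RedshiftClusterSelection",
            "S3BucketInput",
            "IAMRoleInput",
        ]
    else:  # databricks
        steps += ["WorkspaceUrlInput", "DatabricksTokenInput"]
    steps += ["ComputeProviderSelection", "LLMProviderSelection", "ModelSelection", "UsageTrackingConsent"]
    return steps


print(wizard_step_sequence("redshift"))
```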
### 2. **Provider Abstraction & Dependency Injection**
Introduced clean separation between data providers, compute providers,
and storage providers:
```python
# Before: Tightly coupled, hard-coded Databricks assumptions
client = DatabricksAPIClient(...)
compute = DatabricksComputeProvider(...)
# After: Flexible provider composition with dependency injection
data_provider = ProviderFactory.create_data_provider("redshift", config)
storage_provider = ProviderFactory.create_storage_provider("s3", config)
compute_provider = ProviderFactory.create_compute_provider(
"databricks",
config,
storage_provider=storage_provider # Injected dependency
)
```
**Key abstractions:**
- `IStorageProvider` protocol - Abstract storage (S3, DBFS, Volumes)
- Storage provider injection into compute providers
- Provider detection utilities for automatic routing
- Runtime-checkable protocols for proper type safety
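A minimal sketch of the factory-plus-injection pattern described here, with simplified stand-in provider classes (the real `ProviderFactory` and provider constructors may differ):

```python
class S3StorageProvider:
    def __init__(self, bucket: str):
        self.bucket = bucket


class DatabricksComputeProvider:
    def __init__(self, workspace_url: str, storage_provider=None):
        self.workspace_url = workspace_url
        # Injected storage backend used for uploading manifests and init scripts.
        self.storage_provider = storage_provider


class ProviderFactory:
    @staticmethod
    def create_storage_provider(name: str, config: dict):
        if name == "s3":
            return S3StorageProvider(bucket=config["s3_bucket"])
        raise ValueError(f"Unknown storage provider: {name}")

    @staticmethod
    def create_compute_provider(name: str, config: dict, storage_provider=None):
        if name == "databricks":
            return DatabricksComputeProvider(config["workspace_url"], storage_provider=storage_provider)
        raise ValueError(f"Unknown compute provider: {name}")


config = {"s3_bucket": "my-bucket", "workspace_url": "https://my-workspace.databricks.com"}
storage = ProviderFactory.create_storage_provider("s3", config)
compute = ProviderFactory.create_compute_provider("databricks", config, storage_provider=storage)
```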
### 3. **Redshift-Specific Commands & Configuration**
Added dedicated Redshift commands and configuration management:
**New Commands:**
- `/list-redshift-schemas` - List Redshift schemas with database context
- `/select-redshift-schema` - Select active Redshift database and schema
- `/redshift-status` - Show current Redshift configuration
**Configuration Management:**
- Automatic cleanup of incompatible config on provider switch
- Proper persistence of AWS account ID, region, cluster info
- Redshift-specific config fields (`redshift_workgroup_name`,
`redshift_iam_role`, etc.)
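The cleanup-on-switch behavior might look roughly like this (a sketch; the field set is taken from the config list later in this description, and the helper name is hypothetical):

```python
REDSHIFT_ONLY_FIELDS = {
    "aws_account_id",
    "aws_region",
    "aws_profile",
    "redshift_workgroup_name",
    "redshift_cluster_identifier",
    "redshift_iam_role",
    "redshift_s3_temp_dir",
}


def clean_config_for_provider(config: dict, data_provider: str) -> dict:
    """Drop Redshift-specific keys when switching away from Redshift (illustrative only)."""
    if data_provider != "redshift":
        return {k: v for k, v in config.items() if k not in REDSHIFT_ONLY_FIELDS}
    return dict(config)
```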
### 4. **Enhanced Stitch Integration for Redshift**
Complete end-to-end Stitch support for Redshift data sources:
```
Redshift Stitch Flow:
┌──────────────────────────────────────────────────────────────┐
│ 1. Scan Redshift tables for PII (chuck-data) │
│ - Uses LLM to detect semantic tags │
│ - Stores tags in chuck_metadata.semantic_tags │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. Generate manifest JSON with semantic tags │
│ - Includes redshift_config with all connection details │
│ - Includes aws_account_id for JDBC URL construction │
│ - Uploads to S3 for Databricks job access │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Submit Stitch job to Databricks │
│ - Fetches init script from Amperity API │
│ - Uploads init script to S3 │
│ - Submits job with manifest and init script paths │
└────────────────┬─────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. Stitch job processes Redshift data │
│ - Reads manifest from S3 │
│ - Connects to Redshift using Spark-Redshift connector │
│ - Attaches semantic tags to DataFrame metadata │
│ - Runs identity resolution │
│ - Writes results back to Redshift │
└──────────────────────────────────────────────────────────────┘
```
**Manifest generation improvements:**
- Proper `redshift_config` with all required fields
- `aws_account_id` included for JDBC URL construction
- Explicit `data_provider` and `compute_provider` fields
- Support for both provisioned clusters and serverless workgroups
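A compressed sketch of manifest assembly along these lines (the real `chuck_data/storage/manifest.py` is more involved; the settings keys mirror the manifest example later in this description, while the config keys for database and schema are assumptions):

```python
import json


def build_redshift_manifest(config: dict, tables: list[dict]) -> str:
    """Assemble a Stitch manifest for a Redshift source as JSON (illustrative only)."""
    manifest = {
        "settings": {
            "data_provider": "redshift",
            "compute_provider": "databricks",
            "redshift_config": {
                "database": config["redshift_database"],          # assumed config key
                "schema": config["redshift_schema"],              # assumed config key
                "workgroup_name": config.get("redshift_workgroup_name"),
                "region": config["aws_region"],
                "aws_account_id": config["aws_account_id"],       # needed for JDBC URL construction
            },
            "s3_temp_dir": config["redshift_s3_temp_dir"],
            "redshift_iam_role": config["redshift_iam_role"],
        },
        # Each table entry carries the semantic tags detected during the PII scan.
        "tables": tables,
    }
    return json.dumps(manifest, indent=2)
```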
### 5. **Storage Provider Abstraction**
New storage abstraction for managing artifacts across backends:
```python
from typing import Protocol

class IStorageProvider(Protocol):
    """Protocol for storage backends (S3, DBFS, Volumes)."""

    def upload_file(self, local_path: str, remote_path: str) -> bool: ...
    def download_file(self, remote_path: str, local_path: str) -> bool: ...
    def exists(self, remote_path: str) -> bool: ...
    def delete(self, remote_path: str) -> bool: ...
```
**Implementations:**
- `S3StorageProvider` - AWS S3 backend (for Redshift)
- `DBFSStorageProvider` - Databricks DBFS (legacy)
- `VolumesStorageProvider` - Unity Catalog Volumes (preferred for
Databricks)
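As an illustration, an S3-backed implementation of this protocol could look roughly like the following boto3 sketch (not the actual `S3StorageProvider`):

```python
import boto3
from botocore.exceptions import ClientError


class S3StorageProvider:
    """IStorageProvider backed by AWS S3 (illustrative sketch)."""

    def __init__(self, bucket: str, profile_name: str | None = None):
        session = boto3.Session(profile_name=profile_name)
        self._s3 = session.client("s3")
        self._bucket = bucket

    def upload_file(self, local_path: str, remote_path: str) -> bool:
        try:
            self._s3.upload_file(local_path, self._bucket, remote_path)
            return True
        except ClientError:
            return False

    def download_file(self, remote_path: str, local_path: str) -> bool:
        try:
            self._s3.download_file(self._bucket, remote_path, local_path)
            return True
        except ClientError:
            return False

    def exists(self, remote_path: str) -> bool:
        try:
            self._s3.head_object(Bucket=self._bucket, Key=remote_path)
            return True
        except ClientError:
            return False

    def delete(self, remote_path: str) -> bool:
        try:
            self._s3.delete_object(Bucket=self._bucket, Key=remote_path)
            return True
        except ClientError:
            return False
```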
### 6. **Provider-Aware Command Routing**
Commands now automatically detect the active provider and route
appropriately:
```python
# Before: Commands assumed Databricks
def handle_command(client, **kwargs):
catalogs = client.list_catalogs() # Always Databricks
# After: Commands detect provider and route correctly
def handle_command(client, **kwargs):
if is_redshift_client(client):
databases = client.list_databases() # Redshift
else:
catalogs = client.list_catalogs() # Databricks
```
**Provider detection:**
- `is_redshift_client()` - Check if client is RedshiftAPIClient
- `is_databricks_client()` - Check if client is DatabricksAPIClient
- Automatic routing in agent tool executor
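The detection helpers are presumably thin `isinstance` checks along these lines (import paths are assumed from the file list later in this description):

```python
# Import paths are assumed; the real module layout may differ.
from chuck_data.clients.databricks import DatabricksAPIClient
from chuck_data.clients.redshift import RedshiftAPIClient


def is_redshift_client(client) -> bool:
    """True when the active data client targets Redshift."""
    return isinstance(client, RedshiftAPIClient)


def is_databricks_client(client) -> bool:
    """True when the active data client targets Databricks."""
    return isinstance(client, DatabricksAPIClient)
```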
---
## 🐛 Critical Bug Fixes
### Bug #1: LLM Provider Selection with Redshift
**Problem:**
When using Redshift as data provider and Databricks as LLM provider, the
wizard crashed with:
```
AttributeError: 'RedshiftAPIClient' object has no attribute 'list_models'
```
**Root cause:**
The wizard was passing `service.client` (RedshiftAPIClient) to
`DatabricksProvider`, which expected a DatabricksAPIClient or None.
**Fix:**
Added type checking to only use service.client if it's a
DatabricksAPIClient:
```python
# Added in LLMProviderSelectionStep and ModelSelectionStep
service = get_chuck_service()
databricks_client = None
if service and service.client and isinstance(service.client, DatabricksAPIClient):
databricks_client = service.client # Only use if correct type
databricks_provider = DatabricksProvider(
workspace_url=state.workspace_url,
token=state.token,
client=databricks_client # None if data provider is Redshift
)
```
### Bug #2: ConfigManager Not Saving Dynamic Fields
**Problem:**
`aws_account_id` and other dynamic fields were silently dropped when
saving config, even though `ChuckConfig` has `extra="allow"`.
**Root cause:**
`ConfigManager.update()` had a `hasattr()` check that prevented
non-schema fields from being set:
```python
# Old buggy code
for key, value in kwargs.items():
if hasattr(config, key): # Prevented dynamic fields!
setattr(config, key, value)
```
**Fix:**
Removed the `hasattr()` check to allow all fields:
```python
# Fixed code
for key, value in kwargs.items():
setattr(config, key, value) # Now accepts all fields
```
**Impact:**
- `aws_account_id` now properly saved to config
- Included in generated Redshift manifests
- All other dynamic config fields also work correctly
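A small self-contained illustration of the difference, using a plain stand-in for `ChuckConfig`:

```python
class ChuckConfig:
    """Plain stand-in for the real Pydantic model with extra fields allowed."""

    def __init__(self, workspace_url: str | None = None):
        self.workspace_url = workspace_url


updates = {"workspace_url": "https://my-workspace.databricks.com", "aws_account_id": "123456789012"}

# Old behavior: the hasattr() guard silently drops fields not declared on the schema.
config = ChuckConfig()
for key, value in updates.items():
    if hasattr(config, key):
        setattr(config, key, value)
print(getattr(config, "aws_account_id", None))  # None -- the dynamic field was lost

# Fixed behavior: set every field, matching the schema's extra="allow" intent.
config = ChuckConfig()
for key, value in updates.items():
    setattr(config, key, value)
print(config.aws_account_id)  # "123456789012"
```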
---
## 📊 Test Coverage
**New test files:**
- `tests/unit/commands/wizard/test_state.py` - 401 lines, comprehensive
wizard state tests
- `tests/unit/commands/wizard/test_steps.py` - 589 lines, wizard step
validation
- `tests/unit/test_workspace_and_init_scripts.py` - 446 lines, workspace
APIs and protocols
**Updated test coverage:**
- Setup wizard tests updated for new flow
- Stitch integration tests updated for Redshift support
- Compute provider tests updated for dependency injection
- Service tests updated for provider-aware routing
**Test isolation improvements:**
- Tests now use temporary config files to avoid modifying user's
`~/.chuck_config.json`
- Proper cleanup of test artifacts
- Mock providers for unit testing
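The temporary-config isolation presumably looks something like this pytest fixture (a sketch; the fixture name and override mechanism are assumptions):

```python
import json

import pytest


@pytest.fixture
def isolated_config(tmp_path, monkeypatch):
    """Point the app at a throwaway config file so tests never touch ~/.chuck_config.json."""
    config_path = tmp_path / "chuck_config.json"
    config_path.write_text(json.dumps({}))
    # Hypothetical override mechanism; the real one may be an env var, argument, or patch.
    monkeypatch.setenv("CHUCK_CONFIG_PATH", str(config_path))
    return config_path


def test_config_file_is_isolated(isolated_config):
    assert json.loads(isolated_config.read_text()) == {}
```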
---
## 🔄 Migration Impact
### Breaking Changes
None - all changes are backward compatible.
### New Required Fields for Redshift Manifests
Generated manifests now include:
```json
{
"settings": {
"data_provider": "redshift",
"compute_provider": "databricks",
"redshift_config": {
"database": "dev",
"schema": "public",
"workgroup_name": "my-workgroup",
"region": "us-west-2",
"aws_account_id": "123456789012" // NEW - required
},
"s3_temp_dir": "s3://bucket/temp/",
"redshift_iam_role": "arn:aws:iam::123456789012:role/Role"
}
}
```
### Config File Changes
New config fields (all optional, added only when Redshift is selected):
- `aws_account_id` - AWS account ID for Redshift
- `aws_region` - AWS region
- `aws_profile` - AWS profile name
- `redshift_workgroup_name` - Serverless workgroup name
- `redshift_cluster_identifier` - Provisioned cluster identifier
- `redshift_iam_role` - IAM role ARN
- `redshift_s3_temp_dir` - S3 temp directory for Spark-Redshift
- `s3_bucket` - S3 bucket for artifacts
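For example, a Redshift setup might persist these fields with a call like the following (the import path and constructor are assumptions; only `update(**kwargs)` is taken from the bug fix above):

```python
# Import path and constructor are assumptions; update(**kwargs) is taken from the ConfigManager fix.
from chuck_data.config import ConfigManager

config_manager = ConfigManager()
config_manager.update(
    aws_account_id="123456789012",
    aws_region="us-west-2",
    aws_profile="default",
    redshift_workgroup_name="my-workgroup",
    redshift_iam_role="arn:aws:iam::123456789012:role/RedshiftRole",
    redshift_s3_temp_dir="s3://my-bucket/temp/",
    s3_bucket="my-bucket",
)
```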
---
## 📁 Key File Changes
### Core Setup & Configuration (10 files)
- `chuck_data/commands/setup_wizard.py` - Orchestrator for new wizard
flow
- `chuck_data/commands/wizard/steps.py` - All wizard step
implementations (+666 lines)
- `chuck_data/commands/wizard/state.py` - Wizard state management (+139
lines)
- `chuck_data/config.py` - Config manager with dynamic field support
(+101 lines)
- `chuck_data/service.py` - Provider-aware service initialization (+116
lines)
### Provider Abstraction (8 files)
- `chuck_data/provider_factory.py` - Factory for creating providers (+43
lines)
- `chuck_data/compute_providers/databricks.py` - Databricks with storage
injection
- `chuck_data/compute_providers/emr.py` - EMR with storage support
- `chuck_data/data_providers/utils.py` - Provider detection utilities
(NEW, 172 lines)
- `chuck_data/storage/manifest.py` - Manifest generation for Redshift
(NEW, 378 lines)
### Redshift Integration (5 files)
- `chuck_data/clients/redshift.py` - Enhanced Redshift client (+114
lines)
- `chuck_data/commands/list_redshift_schemas.py` - NEW command (118
lines)
- `chuck_data/commands/redshift_schema_selection.py` - NEW command (183
lines)
- `chuck_data/commands/redshift_status.py` - NEW command (98 lines)
- `chuck_data/commands/setup_stitch.py` - Full Redshift support (+1317
lines)
### Client Enhancements (3 files)
- `chuck_data/clients/databricks.py` - Workspace and init script APIs
(+147 lines)
- `chuck_data/clients/amperity.py` - Moved init script fetch here (+55
lines)
- `chuck_data/ui/tui.py` - Provider-aware UI updates (+163 lines)
---
## ✅ Testing Checklist
- [x] Setup wizard completes successfully for Databricks data provider
- [x] Setup wizard completes successfully for Redshift data provider
- [x] Setup wizard properly saves `aws_account_id` to config
- [x] Mixed provider setup works (Redshift data + Databricks compute +
Databricks LLM)
- [x] Mixed provider setup works (Redshift data + Databricks compute +
Bedrock LLM)
- [x] Generated manifests include all required fields for Redshift
- [x] Stitch setup works end-to-end with Redshift
- [x] Provider detection correctly routes commands
- [x] Config cleanup happens when switching providers
- [x] All unit tests pass
- [x] Test isolation prevents modifying user config
---
## 🎬 Demo Flow
### Complete Redshift + Databricks Setup
```bash
# 1. Run setup wizard
chuck> /setup
# Wizard flow:
# ✓ Amperity Auth (browser-based OAuth)
# ✓ Select Data Provider: AWS Redshift
# ✓ Enter AWS Profile: default
# ✓ Enter AWS Region: us-west-2
# ✓ Enter AWS Account ID: 123456789012
# ✓ Enter Redshift Workgroup: my-workgroup
# ✓ Enter S3 Bucket: my-bucket
# ✓ Enter IAM Role: arn:aws:iam::123456789012:role/RedshiftRole
# ✓ Select Compute Provider: Databricks
# ✓ Enter Workspace URL: https://my-workspace.databricks.com
# ✓ Enter Databricks Token: dapi***
# ✓ Select LLM Provider: Databricks
# ✓ Select Model: databricks-meta-llama-3-1-70b-instruct
# ✓ Usage Consent: yes
# 2. Check configuration
chuck> /redshift-status
✓ Data Provider: AWS Redshift
✓ Region: us-west-2
✓ Workgroup: my-workgroup
✓ Account ID: 123456789012
# 3. Select database and schema
chuck> /select-redshift-schema
# Lists databases, then schemas
# 4. Run Stitch setup
chuck> /setup-stitch
# Generates manifest, uploads to S3, submits Databricks job
✓ Manifest: s3://my-bucket/chuck/manifests/redshift_dev_public_20241218.json
✓ Job submitted: run-id 12345
```
---
## 📝 Commit History Summary
**Provider Abstraction & Architecture** (9 commits)
- Add storage provider abstraction
- Integrate storage providers into compute providers
- Make commands provider-aware
- Add provider detection utilities
- Make protocols runtime-checkable
**Redshift Integration** (8 commits)
- Add Redshift-specific commands
- Update setup_stitch for Redshift support
- Add manifest generation with semantic tags
- Flow AWS credentials through wizard
- Add AWS account ID to config and manifests
**Setup Wizard Enhancements** (7 commits)
- Add data provider selection step
- Add AWS configuration steps (profile, region, account ID)
- Add compute provider selection
- Update wizard orchestration
- Add comprehensive wizard tests
**Bug Fixes & Quality** (5 commits)
- Fix ConfigManager dynamic field saving
- Fix LLM provider selection with Redshift
- Fix test isolation issues
- Update all affected tests
- Add explicit provider fields to manifests
---
## 🚀 Next Steps
After merge, the following work can proceed:
1. End-to-end integration testing with real Redshift cluster
2. Performance validation on large Redshift tables
3. User documentation and guides
4. EMR compute provider support (architecture is ready)
---
## 📚 Related Documentation
- [Redshift Integration
Guide](../app/service/stitch/stitch-standalone/doc/redshift-integration.md)
- [Backend
Abstraction](../app/service/stitch/stitch-standalone/src/amperity/stitch_standalone/backend_abstraction.clj)
- [Generic Main Entry
Point](../app/service/stitch/stitch-standalone/src/amperity/stitch_standalone/generic_main.clj)
---
Original PR description: Updates the display of tables to include `updated_at` timestamps and row counts when available.