@kaikaila
Contributor

@kaikaila kaikaila commented Oct 19, 2025

Summary

This PR adds full PostgreSQL (pgx driver) support to Kubeflow Pipelines backend, enabling users to choose between MySQL and PostgreSQL as the metadata database. The implementation introduces a clean dialect abstraction layer and includes a major query optimization that benefits both database backends.

Key achievements
✅ Complete PostgreSQL integration for API Server and Cache Server, addressing #7512, #9813
✅ All CI tests passing (MySQL + PostgreSQL)
✅ Significant performance improvement for ListRuns queries. This PR is expected to address the root causes behind #10778, #10230, #9780, #9701
✅ Zero breaking changes - backward compatible with existing MySQL deployments

What Changed

  1. Storage Layer Refactoring - Dialect Abstraction (backend/src/apiserver/common/sql/dialect)
  • Problem
    SQL syntax was tightly coupled to MySQL.

  • Solution
    Introduced a DBDialect interface that encapsulates database-specific behavior:
      • Identifier quoting (MySQL backticks vs PostgreSQL double quotes)
      • Placeholder styles (? vs $1, $2, ...)
      • Aggregation functions (GROUP_CONCAT vs string_agg)
      • Concatenation syntax (CONCAT() vs ||)

  • Files

    • Core dialect implementation → backend/src/apiserver/common/sql/dialect/dialect.go
    • Dialect-aware utility functions → backend/src/apiserver/storage/sql_dialect_util.go
    • Reusable filter builders with proper quoting → backend/src/apiserver/storage/list_filters.go

All storage layer code now uses:

q := s.dbDialect.QuoteIdentifier
qb := s.dbDialect.QueryBuilder()

This ensures queries work correctly across MySQL, PostgreSQL, and SQLite (for tests).
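
To make the abstraction concrete, here is a minimal sketch of what such a dialect layer could look like. It is not the PR's actual DBDialect definition: the Dialect interface, its method set, the two implementations, and the selectByID helper are all hypothetical, and the UUID column name is only illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// Dialect is a hypothetical, trimmed-down stand-in for the PR's DBDialect.
type Dialect interface {
	QuoteIdentifier(name string) string
	Placeholder(n int) string
	Concat(parts ...string) string
}

// mysqlDialect quotes with backticks and uses positional "?" placeholders.
type mysqlDialect struct{}

func (mysqlDialect) QuoteIdentifier(name string) string {
	return "`" + strings.ReplaceAll(name, "`", "``") + "`"
}
func (mysqlDialect) Placeholder(int) string { return "?" }
func (mysqlDialect) Concat(parts ...string) string {
	return "CONCAT(" + strings.Join(parts, ", ") + ")"
}

// pgDialect quotes with double quotes and uses numbered "$N" placeholders.
type pgDialect struct{}

func (pgDialect) QuoteIdentifier(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}
func (pgDialect) Placeholder(n int) string { return fmt.Sprintf("$%d", n) }
func (pgDialect) Concat(parts ...string) string {
	return strings.Join(parts, " || ")
}

// selectByID builds the same logical query for whichever dialect is passed in.
func selectByID(d Dialect, table, idCol string) string {
	q := d.QuoteIdentifier
	return fmt.Sprintf("SELECT * FROM %s WHERE %s = %s", q(table), q(idCol), d.Placeholder(1))
}

func main() {
	fmt.Println(selectByID(mysqlDialect{}, "run_details", "UUID"))
	fmt.Println(selectByID(pgDialect{}, "run_details", "UUID"))
}
```

The point of routing every identifier and placeholder through the dialect is that call sites never embed backticks or "?" directly, so supporting another backend only needs a new implementation.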

  2. ListRuns Query Performance Optimization
  • Problem
    The original ListRuns query called addMetricsResourceReferencesAndTasks, which performed a 3-layer LEFT JOIN with GROUP BY on all columns, including LONGTEXT fields such as PipelineSpecManifest and WorkflowSpecManifest. This caused slow response times for large datasets.
  • Solution
    Layers 1-3: LEFT JOIN only on PrimaryKeyUUID + aggregated columns (refs, tasks, metrics)
    Final layer: INNER JOIN back to run_details to fetch LONGTEXT columns
  • Performance impact
    Eliminates GROUP BY on LONGTEXT columns entirely. Expected substantial performance improvements for deployments with large pipeline specifications, though formal load testing has not yet been conducted.
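
The shape of the rewritten query can be sketched roughly as follows. This is an illustration only, not the query the PR generates: it collapses the three aggregation layers into one, and the helper names (buildListRunsSketch, doubleQuote) and every table and column name other than run_details are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// doubleQuote is a hypothetical PostgreSQL-style identifier quoter.
func doubleQuote(s string) string { return `"` + s + `"` }

// buildListRunsSketch aggregates on the key column only, then joins back to
// run_details so that wide text columns never appear in a GROUP BY.
// aggExpr is the dialect-specific aggregate (GROUP_CONCAT or string_agg).
func buildListRunsSketch(q func(string) string, aggExpr string) string {
	key := q("UUID") // assumed primary key column name

	// Aggregation layer: only the key and the aggregated value are selected.
	inner := strings.Join([]string{
		"SELECT r." + key + ", " + aggExpr + " AS refs",
		"FROM " + q("run_details") + " r",
		"LEFT JOIN " + q("resource_references") + " rr ON rr." + q("ResourceUUID") + " = r." + key,
		"GROUP BY r." + key,
	}, " ")

	// Final layer: INNER JOIN back to fetch the remaining (large) columns.
	return strings.Join([]string{
		"SELECT rd.*, agg.refs",
		"FROM (" + inner + ") agg",
		"INNER JOIN " + q("run_details") + " rd ON rd." + key + " = agg." + key,
	}, " ")
}

func main() {
	fmt.Println(buildListRunsSketch(doubleQuote, `string_agg(rr."Payload", ',')`))
}
```

Because the GROUP BY key is a single short column, the database no longer has to compare or sort on the manifest columns while grouping.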

  3. Deployment Configurations
  • Production-ready PostgreSQL kustomization → manifests/kustomize/env/platform-agnostic-postgresql/
  • Local development setup → manifests/kustomize/env/dev-kind-postgresql/
  • PostgreSQL StatefulSet → manifests/kustomize/third-party/postgresql/

Configuration is symmetric to existing MySQL manifests for consistency.

  4. CI Manifest Overlays

Created CI-specific Kustomize overlays to ensure tests use locally built images from the Kind registry instead of pulling official images from ghcr.io:

  • Added PostgreSQL CI overlay → .github/resources/manifests/standalone/postgresql/
  • Added kfp-cache-server image override to .github/resources/manifests/standalone/base/kustomization.yaml

  5. Added two PostgreSQL-specific CI workflows
  • V2 API and integration tests (cache enabled/disabled matrix) → api-server-test-Postgres.yml
  • V1 integration tests → integration-tests-v1-postgres.yml

PostgreSQL tests cover the core cache enabled/disabled matrix.

  6. Local development support
  • make dev-kind-cluster-pg - Provision Kind cluster with PostgreSQL
  • Updated README for PostgreSQL setup and debugging, achieving parity with MySQL documentation.

Testing

Unit Tests

23 test files modified/added
New test coverage: dialect_test.go, list_filters_test.go, sql_dialect_util_test.go
All existing tests updated to use dialect abstraction

Integration Tests

✅ V1 API integration tests (PostgreSQL)
✅ V2 API integration tests (PostgreSQL, cache enabled/disabled)
✅ Existing MySQL tests remain green

Migration Guide

  • For new deployments:
    kubectl apply -k manifests/kustomize/env/platform-agnostic-postgresql
  • For existing MySQL deployments:
    No action required. This PR is fully backward compatible.
  • For local development, set up the Kind cluster with PostgreSQL:
    make -C backend dev-kind-cluster-pg

This PR continues from #12063.

@google-oss-prow

Hi @kaikaila. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions

🚫 This command cannot be processed. Only organization members or owners can use the commands.

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 7 times, most recently from cd1d08b to 85498ed Compare October 22, 2025 05:03
@kaikaila
Contributor Author

Currently, both the MySQL and pgx setups use the DB superuser for all KFP operations, which is why client_manager.go contains a “create database if not exists” step.

From a security standpoint, would it be preferable to:

  1. Move DB creation out of the client manager and into the deployment/init phase (i.e. add a manifests/kustomize/third-party/postgresql/base/pg-init-configmap.yaml) and
  2. Introduce a dedicated restricted user for KFP components, limited to the mlpipeline database?

If the team agrees, I can propose a follow-up PR to refactor accordingly.

@HumairAK
Collaborator

I'm fine with this, I don't think it's great that KFP tries to create a database (or a bucket frankly)

fyi @mprahl / @droctothorpe

@kaikaila
Contributor Author

Thanks, @HumairAK — totally agree on the security point.
Since this PR is already getting quite heavy, would you be okay if I leave the user permission changes for a separate follow-up PR?

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 3 times, most recently from 09fd370 to 1e0caa8 Compare October 23, 2025 07:10
@HumairAK
Collaborator

yes that is fine

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 6 times, most recently from 4d33821 to e6c943c Compare October 24, 2025 02:47
@kaikaila
Contributor Author

Question about the PostgreSQL test workflow organization

Current situation

The V2 integration tests for PostgreSQL logically belong in a "PostgreSQL counterpart" to legacy-v2-api-integration-tests.yml.
However, I didn't want to create a new workflow with "legacy" in the name from day one.
As a temporary solution, I merged them into api-server-test-Postgres.yml.
This creates an asymmetry with api-server-tests.yml, and the workflow now has mixed responsibilities.

Question: What's the recommended workflow organization for PostgreSQL tests?

Should I:

  • a. Create legacy-v2-api-integration-tests-postgres.yml for consistency (even though it's new)?
  • b. Keep current structure and accept the asymmetry?
  • c. Refactor both MySQL and PostgreSQL to a unified structure?

Would love guidance on the long-term vision for test workflow organization, especially from @nsingla

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 5 times, most recently from 606e287 to 7d89e02 Compare December 28, 2025 04:34
@kaikaila kaikaila changed the title from "[wip]feat(backend): postgres integration" to "feat(backend): postgres integration" Dec 28, 2025
@juliusvonkohout
Member

The ListRuns query optimization is really useful; there are installations with over 10^6 runs.

@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 7d89e02 to 800d18a Compare December 29, 2025 22:15
@kaikaila kaikaila requested review from HumairAK and nsingla December 29, 2025 22:17
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 800d18a to 26b5f59 Compare January 3, 2026 08:59
newName: kind-registry:5000/cache-server
newTag: latest

patches:
Contributor

I believe you still need replacements here, just like here

Also, I believe we do support PostgreSQL in multi-user mode as well (which is the default for Kubeflow)

TEST_MANIFESTS="${TEST_MANIFESTS}/cache-disabled"
elif $USE_PROXY; then
TEST_MANIFESTS="${TEST_MANIFESTS}/proxy"
elif [ "${STORAGE_BACKEND}" == "minio" ]; then
Contributor

Please correct me if I am wrong, but isn't it possible to have Postgres as the DB with MinIO as the storage backend? If it's possible, can we add manifests for that and an option to deploy it here?

Contributor

@nsingla nsingla left a comment

You also need to add db_type to the e2e_tests workflow.

Signed-off-by: kaikaila <lyk2772@126.com>
Signed-off-by: kaikaila <lyk2772@126.com>
…iveExperiment

- Extract repeated subquery SQL into resourceReferenceSubquery variable
- Unify code style: consistently use SetMap() throughout
- Add detailed comments explaining PostgreSQL $N placeholder handling
- Simplify error messages

optimization according to sanchesoon's suggestion

Signed-off-by: kaikaila <lyk2772@126.com>
Consolidate all changes made in response to Humair's Oct 31 review,
improving code quality, test coverage, and CI/CD workflows.

- Use fmt.Sprintf for SQL string construction in dialect.go and storage
- Reuse escapeSQLString utility across storage package
- Move QualifyIdentifier logic to storage package with unit tests
- Remove quoteIdentifier from list.Options
- Move table prefix dot appending from model methods to list options
- Rename QuoteFunction for better clarity
- Replace == with errors.Is() for error comparison
- Parameterize timeout as a constant

- Replace t.Errorf with require/assert, t.Fatalf with require.FailNow
- Hardcode test expectations instead of generating them
- Add unit tests for placeholder numbering and qualifyIdentifier
- Use consistent test file naming for database-agnostic sorting
- Remove UUID generation from experiment names in integration tests
- Differentiate report_name to avoid conflicts
- Handle binary files in pipeline spec replacement
- Use first yaml file for smoke tests
- Add pipeline name filtering tests
- Update compile golden files

- Add PostgreSQL CI job with Argo Workflows 3.6.7
- Rename and consolidate PostgreSQL workflow files
- Add database parameter to api-server-test (standalone mode only)
- Update api-server-test workflow to include PostgreSQL
- Use shared test-and-report action for v2 API tests
- Remove unused PostgreSQL environment variables
- Skip upgrade tests for PostgreSQL (to be addressed later)
- Fix branch path in workflow configuration

- Make dev-kind-cluster OS-aware (macOS/Linux support)
- Add DB parameter to dev-kind-cluster make target
- Extract default-allow-same-namespace NetworkPolicy into dedicated files
- Revert dev-kind-cluster bridge to 172.17.0.1 for Linux
- Clean up redundant PostgreSQL config in ConfigMap
- Change database default from empty string to "mysql"
- Change 127.0.0.1 to localhost for better compatibility
- Rename arguments_parameters.zip to arguments-parameters.zip

- Update AGENTS.md and README.md for PostgreSQL support
- Add storage/README.md with detailed documentation
- Add platform-agnostic-postgresql documentation

- Add error handling and retry logic when listing runs during cleanup
- Use ReferenceKey ID for namespace filter when querying pipelines

Signed-off-by: kaikaila <lyk2772@126.com>
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 2dfe321 to 650e522 Compare January 8, 2026 20:52
…d sorting

This commit addresses code review feedback from nsingla (Jan 5) and resolves
the metric-based pagination issue reported in PR#11889.

1. **E2E test matrix**: Add `db_type` matrix parameter to E2E workflows,
   supporting MySQL and PostgreSQL (pgx) configurations. Update job naming,
   environment setup, and test reports to reflect database type.

2. **Kustomization manifests**: Add standalone PostgreSQL + MinIO deployment
   configuration with base resources, KFP component image replacements,
   patches, and DNS configuration.

Resolves sorting/pagination issues when ordering experiments by metrics.

1. **SQL column visibility**: Include metric sort columns in final SELECT
   when sorting by metrics, as MySQL/PostgreSQL require WHERE-referenced
   columns to be in the SELECT list (unlike SQLite).

2. **Type handling for numeric comparisons**: Convert string metric values
   to float64 and embed them directly in SQL to avoid pgx driver type
   interpretation issues where float64 parameters may be treated as text.

3. **Dynamic result scanning**: Adjust scan destinations to handle the
   additional metric column (31 vs 30 base columns) when present.

4. **Placeholder format**: Use Question format for subquery construction
   to ensure correct parameter numbering, then convert to Dollar format
   for PostgreSQL execution.

Also includes comprehensive integration tests for metric-based sorting
with pagination in both ascending and descending order.
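
As a rough illustration of the placeholder handling in item 4, the sketch below numbers "?" placeholders only after the full statement has been assembled, so that subqueries built independently cannot collide on "$1". The function name questionToDollar is hypothetical, and this hand-rolled conversion is an assumption for illustration, not the code or library routine the commit uses. It skips question marks inside single-quoted string literals and does not handle other quoting forms.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// questionToDollar rewrites each "?" outside a single-quoted literal as
// $1, $2, ... in order of appearance across the whole statement.
func questionToDollar(sql string) string {
	var b strings.Builder
	n := 0
	inString := false
	for i := 0; i < len(sql); i++ {
		c := sql[i]
		switch {
		case c == '\'':
			// A doubled quote ('') toggles twice, leaving the state unchanged.
			inString = !inString
			b.WriteByte(c)
		case c == '?' && !inString:
			n++
			b.WriteByte('$')
			b.WriteString(strconv.Itoa(n))
		default:
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	// The subquery was built on its own with "?"; numbering happens last.
	sub := "SELECT x FROM t WHERE y = ?"
	full := "SELECT * FROM runs WHERE a = ? AND b IN (" + sub + ")"
	fmt.Println(questionToDollar(full))
}
```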

Signed-off-by: kaikaila <lyk2772@126.com>
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from a3a2e38 to 4a5d68b Compare January 9, 2026 08:59
8 participants