Skip to content

Migrate CI orchestration from bash scripts to a more robust framework #5527

@renecannao

Description

@renecannao

Problem

The CI test infrastructure orchestration (test/infra/control/*.bash, test/tap/groups/*/pre-proxysql.bash, setup-infras.bash hooks) is built entirely in bash. This has proven unreliable and difficult to debug in practice:

Error suppression patterns

Scripts routinely suppress errors, making failures invisible:

# Actual code that was in run-tests-isolated.bash:
mysql -uradmin -pradmin -hproxysql -P6032 -e 'SELECT ...' 2>/dev/null || echo 'ERROR: Failed to query mysql_servers'

This hides the actual error (connection refused? auth failure? timeout?), prints a generic message, and continues execution as if nothing happened. Tests then fail later with confusing symptoms (e.g., "table doesn't exist" because the entire ProxySQL config was silently broken).

Fragile error handling

  • set -e is a bandaid — it doesn't propagate errors through pipes, subshells, or command substitutions without additional set -o pipefail
  • Functions can't return structured errors — only exit codes (0-255)
  • No distinction between "expected failure" and "bug" — both are just non-zero exit codes
  • Traps (trap ... EXIT) are the only cleanup mechanism, and they don't compose well

No structured logging

  • echo to stdout mixes with command output
  • No log levels, timestamps are inconsistent
  • Debugging requires adding set -x and wading through hundreds of lines of shell expansion

Hardcoded paths and assumptions

  • Scripts assume specific directory structures relative to ${BASH_SOURCE[0]}
  • REPO_ROOT derivation via ../../.. counting is error-prone (we had bugs with 3 vs 4 levels)
  • Environment variable passing between scripts is fragile — export in one script doesn't reach a docker exec or a subprocess reliably

Race conditions and timing

  • sleep 10 as synchronization primitive
  • Health checks in while loops with arbitrary timeouts
  • No proper readiness probes — just "can I connect?" checks
  • Parallel execution (run-multi-group.bash) has no proper coordination

Real bugs found during CI migration (March 2026)

  1. Silent config dump failures: ProxySQL admin queries failed silently due to 2>/dev/null, tests continued with broken config, failed with misleading "table doesn't exist" errors
  2. Hardcoded /home/rene/proxysql paths: 12 scripts had absolute paths that only worked on one machine
  3. Wrong directory traversal: ../../.. instead of ../../../.. — worked locally by accident, failed on CI runners
  4. WITHGCOV defaulting to 1: Caused PROXYSQL GCOV DUMP syntax errors on non-coverage builds, hidden by error suppression
  5. Missing feature flag propagation: Unit test Makefile relied on environment variables that weren't set, causing link failures
  6. Hostgroup monitor conflicts: Setup hooks silently failed to LOAD changes to runtime, monitor overwrote configuration

Proposal

Migrate the orchestration layer to Python, keeping bash only for thin wrappers where shell is genuinely the right tool (e.g., Docker commands).

What to migrate

Current (bash) Purpose Complexity
ensure-infras.bash Start ProxySQL + backends for a TAP group High — Docker orchestration, health checks, hook execution
run-tests-isolated.bash Launch test runner container, env setup, coverage collection High — Docker run, env propagation, coverage traps
run-multi-group.bash Parallel group execution with timeouts High — Process management, log aggregation, coverage combining
start-proxysql-isolated.bash Start ProxySQL in Docker Medium — Docker run, health check
stop-proxysql-isolated.bash Cleanup ProxySQL Low
destroy-infras.bash Cleanup backends Low
docker-compose-init.bash Start Docker Compose infra Medium — Volume prep, health checks, user provisioning
docker-*-post.bash Post-startup provisioning Medium — SQL provisioning, replication setup
env-isolated.bash Environment variable setup Low — but critical for correctness
*/pre-proxysql.bash Group-specific pre-test hooks Medium — cluster setup, config manipulation
*/setup-infras.bash Group-specific post-infra hooks Medium — hostgroup remapping, fixture setup

Benefits of Python

  • Proper exceptions: errors propagate automatically, no silent failures
  • Structured logging: logging module with levels, timestamps, formatters
  • Testable: orchestration logic can have unit tests
  • Type hints: catch configuration errors before runtime
  • subprocess management: subprocess.run(check=True) fails loudly by default
  • Docker SDK: docker Python package instead of parsing CLI output
  • Parallel execution: concurrent.futures instead of bash background jobs

Migration approach

  1. Start with a Python CIOrchestrator class that wraps the current bash scripts
  2. Migrate one script at a time, keeping bash as fallback
  3. proxysql-tester.py already exists as a Python test runner — extend the pattern
  4. Keep bash only for: Docker Compose files, simple one-liner wrappers

What NOT to change

  • Docker Compose YAML files — these are fine as-is
  • proxysql-tester.py — already Python, works well
  • Makefile build system — separate concern
  • GitHub Actions YAML — separate concern

Priority

Medium-term. The immediate bash fixes (removing 2>/dev/null, fixing paths) are done. This issue tracks the longer-term migration to make the infrastructure actually reliable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions