
Support Dynamic Workflow Generation and Execution #558

@brifordwylie

Description

Dynamic Step Functions Workflow Generation for ML Pipelines

Summary

Add an optional Step Functions execution backend to the ML pipeline system. The DAG would be generated dynamically from the WORKBENCH_BATCH configs in the scripts, so there are no static workflow definitions to maintain.

Background

Currently, ML pipelines are executed via an SQS FIFO queue plus AWS Batch `dependsOn`. This works well for dev/sandbox but lacks:

  • Visual execution graph and history
  • Approval gates for production deployments
  • Built-in retry/error handling UI

The dependency graph logic already exists in ml_pipelines_dt/handler.py via build_dependency_graph(). This ticket adds an alternative execution backend that uses the same DAG.

Proposed Changes

1. Lambda handler changes (ml_pipelines_dt/handler.py)

  • Add EXECUTION_MODE environment variable (sqs | step_functions)
  • New function generate_step_functions_definition(scripts_with_config, root_map) that builds state machine JSON from the DAG
  • New function submit_via_step_functions() that creates/updates and executes the state machine
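A minimal sketch of what `generate_step_functions_definition()` could emit. This assumes `root_map` maps each root script to its ordered chain of downstream scripts, and that `scripts_with_config` carries per-script Batch settings under hypothetical `job_queue` / `job_definition` keys; the actual shape produced by `build_dependency_graph()` may differ:

```python
def generate_step_functions_definition(scripts_with_config, root_map):
    """Build an Amazon States Language definition from the DAG.

    Sketch only: assumes `root_map` maps each root script to its ordered
    chain of dependent scripts. Each chain becomes a branch of a single
    top-level Parallel state.
    """
    branches = []
    for root, chain in root_map.items():
        scripts = [root] + list(chain)
        states = {}
        for i, script in enumerate(scripts):
            cfg = scripts_with_config.get(script, {})
            state = {
                "Type": "Task",
                # Step Functions' managed Batch integration; .sync waits
                # for the job to finish before moving to the next state.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": script,
                    "JobQueue": cfg.get("job_queue", "default-queue"),
                    "JobDefinition": cfg.get("job_definition", script),
                },
            }
            if i == len(scripts) - 1:
                state["End"] = True
            else:
                state["Next"] = scripts[i + 1]
            states[script] = state
        branches.append({"StartAt": scripts[0], "States": states})

    return {
        "Comment": "Dynamically generated ML pipeline workflow",
        "StartAt": "RunPipelines",
        "States": {
            "RunPipelines": {"Type": "Parallel", "Branches": branches, "End": True}
        },
    }
```

A failed Task fails its branch, which fails the Parallel state and the execution, giving the "failed step stops downstream" behavior for free.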

2. Infrastructure (CDK)

  • IAM role for Step Functions to invoke Batch
  • IAM permissions for Lambda to create/execute state machines
  • Optional: S3 bucket for state machine definition storage
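The concrete IAM actions involved are an assumption based on how the Step Functions ↔ Batch `.sync` integration works; shown here as plain policy documents (the CDK stack would express these via `iam.Role` / `iam.PolicyStatement` constructs):

```python
# Permissions the Step Functions execution role needs to run Batch jobs.
# Action lists are assumptions; tighten Resource ARNs in practice.
STATES_BATCH_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["batch:SubmitJob", "batch:DescribeJobs", "batch:TerminateJob"],
            "Resource": "*",
        },
        {
            # The batch:submitJob.sync pattern uses EventBridge managed rules
            # to observe job completion.
            "Effect": "Allow",
            "Action": ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
            "Resource": "*",
        },
    ],
}

# Permissions the Lambda handler needs to manage and run state machines.
LAMBDA_SFN_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "states:CreateStateMachine",
                "states:UpdateStateMachine",
                "states:DescribeStateMachine",
                "states:StartExecution",
            ],
            "Resource": "*",
        },
        {
            # Lambda must be able to pass the execution role to Step Functions.
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",  # scope to the Step Functions role ARN in practice
        },
    ],
}
```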

3. State machine structure

Parallel
├── Chain 1: script_a → script_b → script_c
├── Chain 2: script_d → script_e
└── Chain 3: script_f (standalone)

Each step submits an AWS Batch job and waits for it to complete.
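The create/update/execute flow for `submit_via_step_functions()` could look like the sketch below. The `name` and `role_arn` arguments are assumed inputs, and the boto3 client is injectable so the logic can be unit-tested with a stub:

```python
def submit_via_step_functions(definition, name, role_arn, sfn=None):
    """Create or update the state machine for this DAG, then execute it.

    Sketch only: `name` and `role_arn` are assumed inputs; `sfn` lets
    tests pass a stub in place of the real boto3 client.
    """
    import json

    if sfn is None:
        import boto3  # deferred so the function stays stub-testable
        sfn = boto3.client("stepfunctions")

    body = json.dumps(definition)

    # Look for an existing state machine with this name.
    existing = {
        sm["name"]: sm["stateMachineArn"]
        for page in sfn.get_paginator("list_state_machines").paginate()
        for sm in page["stateMachines"]
    }

    if name in existing:
        arn = existing[name]
        sfn.update_state_machine(stateMachineArn=arn, definition=body)
    else:
        arn = sfn.create_state_machine(
            name=name, definition=body, roleArn=role_arn
        )["stateMachineArn"]

    return sfn.start_execution(stateMachineArn=arn)["executionArn"]
```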

Non-goals (for now)

  • Approval gates (future enhancement)
  • Conditional execution logic
  • Cross-account workflows

Trade-offs

| Current (SQS) | Step Functions |
| --- | --- |
| Simpler, fewer AWS resources | More infrastructure to manage |
| ~Free | ~$25/million state transitions |
| Logs scattered in CloudWatch | Unified execution history |
| No visual debugging | Nice graph UI |

Testing

  • Unit tests for generate_step_functions_definition()
  • Integration test: run same pipeline set via both backends, verify identical results
  • Test error handling: verify failed step stops downstream

Estimate

Half a day to a full day, including CDK changes and testing.

Future enhancements

  • Approval gates before production model promotion
  • Slack/email notifications on completion/failure
  • Cost tracking per workflow execution

Metadata

Labels: Python CDK (AWS CDK Python), aws (AWS Specific Work), aws_iam (AWS Identity/IAM/SSO/Roles/etc), ci_cd (Testing and Deployment Tasks)

Status: Backlog
