-
Notifications
You must be signed in to change notification settings - Fork 5
Open
Labels
Python CDKAWS CDK PythonAWS CDK PythonawsAWS Specific WorkAWS Specific Workaws_iamAWS Identity/AIM/SSO/Roles/etcAWS Identity/AIM/SSO/Roles/etcci_cdTesting and Deployment TasksTesting and Deployment Tasks
Milestone
Description
Dynamic Step Functions Workflow Generation for ML Pipelines
Summary
Add optional Step Functions execution backend to the ML pipeline system. The DAG would be dynamically generated from WORKBENCH_BATCH configs in scripts - no static workflow definitions to maintain.
Background
Currently, ML pipelines are executed via SQS FIFO + Batch dependsOn. This works well for dev/sandbox but lacks:
- Visual execution graph and history
- Approval gates for production deployments
- Built-in retry/error handling UI
The dependency graph logic already exists in ml_pipelines_dt/handler.py via build_dependency_graph(). This ticket adds an alternative execution backend that uses the same DAG.
Proposed Changes
1. Lambda handler changes (ml_pipelines_dt/handler.py)
- Add
EXECUTION_MODEenvironment variable (sqs|step_functions) - New function
generate_step_functions_definition(scripts_with_config, root_map)that builds state machine JSON from the DAG - New function
submit_via_step_functions()that creates/updates and executes the state machine
2. Infrastructure (CDK)
- IAM role for Step Functions to invoke Batch
- IAM permissions for Lambda to create/execute state machines
- Optional: S3 bucket for state machine definition storage
3. State machine structure
Parallel
├── Chain 1: script_a → script_b → script_c
├── Chain 2: script_d → script_e
└── Chain 3: script_f (standalone)
Each step invokes Batch job and waits for completion.
Non-goals (for now)
- Approval gates (future enhancement)
- Conditional execution logic
- Cross-account workflows
Trade-offs
| Current (SQS) | Step Functions |
|---|---|
| Simpler, fewer AWS resources | More infrastructure to manage |
| ~Free | ~$25/million state transitions |
| Logs scattered in CloudWatch | Unified execution history |
| No visual debugging | Nice graph UI |
Testing
- Unit tests for
generate_step_functions_definition() - Integration test: run same pipeline set via both backends, verify identical results
- Test error handling: verify failed step stops downstream
Estimate
Half-day to full-day including CDK changes and testing.
Future enhancements
- Approval gates before production model promotion
- Slack/email notifications on completion/failure
- Cost tracking per workflow execution
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
Python CDKAWS CDK PythonAWS CDK PythonawsAWS Specific WorkAWS Specific Workaws_iamAWS Identity/AIM/SSO/Roles/etcAWS Identity/AIM/SSO/Roles/etcci_cdTesting and Deployment TasksTesting and Deployment Tasks
Type
Projects
Status
Backlog