See the official AWS service documentation and key service quotas.
As you grow into the role of an ML Engineer, becoming comfortable with seeking out and reading official documentation is an essential professional skill — not just for AWS, but for whichever cloud platform you work on.
Official cloud provider documentation is actively maintained, updated to reflect new features and service changes, and reviewed by the teams who build those services. Relying on outdated or third-party sources can lead to misunderstandings, because unofficial materials often lag behind current service behaviour.
Whether you primarily work with AWS or GCP, the core patterns you practise here — deploying managed services, observing them with monitoring tools, and tuning orchestration — transfer directly. The specific service names and console layouts differ, but the thinking is the same.
By developing the habit of navigating and interpreting authoritative documentation, you strengthen your ability to troubleshoot effectively, make informed design decisions, and stay aligned with industry best practices.
Going further: If you want to understand how the AWS services used in this workshop map onto their GCP counterparts, the AWS, Azure, and GCP service comparison is a useful reference. Search for each service (Lambda, Step Functions, CloudWatch, CloudFormation) and find its equivalent — you may find there is more than one, which itself tells you something about how the platforms differ in their design philosophy. Reading the equivalent service's documentation is a good way to deepen your understanding of both platforms.
A hands-on workshop where you deploy, stress-test, and troubleshoot a serverless ML pipeline on AWS. You will hit a real scaling wall, diagnose it with evidence, fix it, and prove the fix worked.
- Duration: 5 working hours (plus one hour for lunch)
- Platform: AWS (Lambda, Step Functions, CloudWatch, CloudFormation)
- Sandbox: Pluralsight AWS Cloud Sandbox
In simple terms, a "scaling knob" is a single setting you can turn up or down to make part of the system do more (or less) work at the same time.
For this workshop we use the Step Functions Map state's max_concurrency as the main scaling knob. It's an orchestration-level control that determines how many items inside a Map state run in parallel.
Step Functions is a service that helps different parts of your workflow run in the right order, like a flowchart that coordinates each step.
Why this knob?
- It's easy to change and observe during hands-on exercises.
- It works well inside sandbox environments where account-level limits (like Lambda reserved concurrency) may be restricted.
- It demonstrates the common pattern: increase parallelism at the orchestrator, measure the result, then fix any new bottlenecks that appear.
What `max_concurrency` does:

- If `max_concurrency` is 5, at most five Map iterations run at the same time; new items wait until one finishes.
- Increasing it lets more tasks run at once (faster throughput), but can reveal downstream bottlenecks (databases, external APIs) or increase cost.
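To make this concrete, here is a minimal sketch of what the knob looks like in a state machine definition. In Amazon States Language (which is JSON) the field is spelled `MaxConcurrency`; the state names, paths, and ARN below are illustrative, not the workshop's actual template:

```json
{
  "ProcessTickets": {
    "Type": "Map",
    "ItemsPath": "$.tickets",
    "MaxConcurrency": 5,
    "Iterator": {
      "StartAt": "HandleTicket",
      "States": {
        "HandleTicket": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:...:function:handle-ticket",
          "End": true
        }
      }
    },
    "End": true
  }
}
```

As a back-of-envelope check: with `MaxConcurrency: 5`, a batch of 50 items taking roughly 6 seconds each finishes in about (50 / 5) × 6 ≈ 60 seconds; at 25 the same batch takes around 12 seconds, assuming nothing downstream throttles first.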
Useful links:
The earlier AI6 workshops focused on the upstream phases of CRISP-ML(Q): business and data understanding, data engineering, model training, and model evaluation. This workshop picks up at the end of that journey.
Once a model passes evaluation, it enters the Deployment phase — integration into a live software system — and then the Monitoring and Maintenance phase, where you continuously observe its behaviour in production, detect degradation, and respond to incidents. This workshop gives you hands-on practice with both: you deploy the pipeline, stress-test it, observe it with metrics and logs, and apply structured Root Cause Analysis when things go wrong.
This maps directly to Duty 6 of the Machine Learning Engineer apprenticeship standard: "Deliver responsive technical engineering support services; to mitigate operational impact whilst ensuring business continuity."
The "model step" in this workshop is intentionally simulated so you can focus on the operational patterns rather than model training. If you want to see what those same patterns look like with a real managed inference endpoint, Activity 9 (Going Further) makes that connection concrete.
By the end of this workshop you will be able to:
- Deploy an ML inference pipeline using Infrastructure as Code (CloudFormation)
- Identify the bottleneck step in a multi-stage pipeline under burst load
- Scale the bottleneck using orchestration parallelism and measure the improvement
- Use CloudWatch metrics and Logs Insights to observe system behaviour under load
- Classify production incidents using structured RCA (Root Cause Analysis)
- Apply the Fishbone diagnostic method to separate evidence from hypothesis
- Compile an evidence portfolio demonstrating scaling, monitoring, and decision-making skills
Scaling is the job.
Orchestration is the mechanism.
Root Cause Analysis is the safety net.
| Emoji | Meaning |
|---|---|
| 🎯 | Learning Objective — what you will achieve |
| 📋 | Expected Outputs — end result to aim for |
| 📝 | Task/Step — something to do |
| ⌨️ | Terminal — shell command to run |
| 💻 | Console — AWS Console action to take |
| ✅ | Checkpoint — verify your progress |
| 🤔 | Reflect — think deeply about this |
| 💡 | Tip/Hint — helpful suggestion |
| ⚠️ | Warning — do not miss this |
| 📘 | Explanation — background theory |
| 🚀 | Extension — optional stretch challenge |
| 🎓 | Complete — activity finished |
Read the User Brief first to understand the scenario. You may also wish to look ahead to Activity 8 before starting Task 1 because it requires you to gather screenshots from previous activities. There's no harm in repeating previous activities (and, in fact, some benefit), but you may wish to proceed with your eyes open!
| Activity | Title | Focus |
|---|---|---|
| Activity 1 | Environment Setup & Orientation | Deploy the stack, navigate the console |
| Activity 2 | The Happy Path | Run a single ticket, identify pipeline steps |
| Activity 3 | Hit the Wall | Burst load at low concurrency, find the bottleneck |
| Activity 4 | Scale Up & Compare | Increase parallelism, measure the improvement |
| Activity | Title | Focus |
|---|---|---|
| Activity 5 | Understand Orchestration | Read the logs, query with Logs Insights |
| Activity 6 | Controlled Failure: Bad Input | Trigger a data error, classify with RCA Tree |
| Activity 7 | Controlled Failure: Throttling | Fishbone analysis, apply fix, verify |
| Activity 8 | Evidence Portfolio & Reflection | Compile evidence, reflect, clean up |
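Activity 5 has you query execution logs with CloudWatch Logs Insights. As a preview, a query of the general shape you will write there (the `@timestamp` and `@message` fields are built in; the error pattern is illustrative):

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
```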
| Activity | Title | Focus |
|---|---|---|
| Activity 9 | Replace Embed With SageMaker Endpoint (Isolated) | Managed inference bottlenecks + throttling surface area |
If you have engaged with the previous workshops, you will already meet these prerequisites.
- AWS Cloud Sandbox access (Pluralsight)
- Familiarity with the AWS Console (basic navigation)
- Comfort with running shell commands in a terminal
See the Setup Guide for environment preparation.
In the previous workshop you deployed Azure infrastructure by copying individual `az` CLI commands into the terminal one at a time. That works, but this workshop takes the next step: the AWS CLI commands are bundled into shell scripts in the `scripts/` folder.
Instead of pasting a sequence of commands manually, you run a single script and it handles the sequence for you — setting variables, running the AWS CLI calls in the right order, and printing output so you can see what happened.
This pattern is a core technique in engineering teams, though not the only one — in the previous workshop you used Bicep for this on Azure, and in this workshop CloudFormation plays the same role. Tools like these handle the infrastructure declaration, while scripts handle the invocation. Deployment steps, pipeline invocations, and teardown procedures live in scripts because they are:
- Repeatable — the same script run by any engineer produces the same result
- Auditable — the script is the documentation as well as the automation
- Extensible — a script is the seed of a CI/CD pipeline or runbook
You will not need to write scripts in this workshop, but you are encouraged to open them and read them. The commands inside are real AWS CLI calls, and understanding what they do (not just that they work) is part of the job.
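The pattern such scripts follow can be sketched as below. This is a hypothetical illustration of the shape (variables, ordered AWS CLI calls, printed output), not the actual contents of the workshop's scripts; the `DRY_RUN` wrapper is an addition here so the sketch can be read and run safely without an AWS account:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the scripts/ pattern: set variables, run AWS CLI
# calls in order, and print output so you can see what happened.
set -euo pipefail

STACK_NAME="${STACK_NAME:-ml-pipeline-demo}"
TEMPLATE_FILE="${TEMPLATE_FILE:-template.yaml}"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of executing them

# Wrapper: echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

echo "Deploying stack: $STACK_NAME"
run aws cloudformation deploy \
  --stack-name "$STACK_NAME" \
  --template-file "$TEMPLATE_FILE" \
  --capabilities CAPABILITY_NAMED_IAM

echo "Fetching stack outputs"
run aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query "Stacks[0].Outputs"
```

Run as-is it only prints the commands it would issue; setting `DRY_RUN=0` would execute them for real. The same properties the bullets above describe fall out of this shape: anyone can re-run it, and the script itself documents the deployment procedure.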
Looking up a command in the AWS CLI reference
The AWS CLI reference is structured by service. To look up any command, navigate to the service name and then the subcommand.
For example, the first script, `scripts/01_deploy.sh`, runs:

```shell
aws cloudformation deploy \
  --stack-name "$STACK_NAME" \
  --template-file "$TEMPLATE_FILE" \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides ...
```

In the reference, this lives under `cloudformation` → `deploy`. There you can see that the `--capabilities CAPABILITY_NAMED_IAM` flag is an explicit acknowledgement that the template creates IAM resources with custom names — AWS requires you to opt in to this rather than letting it happen silently. Knowing that turns a flag you might have ignored into a safety design decision you can reason about.
- Glossary — key terminology
- Architecture Diagrams — pipeline and orchestration visuals
- Fishbone Printable — for team RCA exercises
- KSB Mapping — how activities map to the standard (optional)