---
sidebar_position: 2
title: "Quickstart: Clone to First Training Job"
description: Deploy infrastructure and submit your first robotics training job in 9 steps
author: Microsoft Robotics-AI Team
ms.date: 2026-04-15
ms.topic: tutorial
keywords:
---
Deploy the full Azure NVIDIA Robotics stack and submit a training job in ~1.5-2 hours. This guide uses full-public networking and Access Keys authentication for the simplest path.
:::note
This guide expands on the Getting Started hub.
:::
## Prerequisites

| Requirement | Details |
|---|---|
| Azure subscription | Contributor + User Access Administrator roles |
| GPU quota | Standard_NV36ads_A10_v5 (A10 Spot, default) or Standard_NC40ads_H100_v5 (H100) in target region |
| NVIDIA NGC account | Sign up at https://ngc.nvidia.com/ for API key |
| Development environment | Devcontainer (recommended) or local tools |
See Prerequisites for installation commands and version requirements.
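Before starting, it can help to confirm the CLIs this guide relies on are installed. The sketch below checks PATH for each tool; the tool list (`az`, `terraform`, `kubectl`, `helm`, `jq`) is inferred from the commands later in this guide and is an assumption, not an official requirement list:

```shell
# Sketch: report which of the given CLI tools are missing from PATH.
# The tool list passed at the bottom is inferred from this guide's
# commands (an assumption); trim it to match your setup.
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  [ -z "$missing" ] && return 0
  echo "${missing# }"   # print the missing tools on one line
  return 1
}

check_tools az terraform kubectl helm jq \
  || echo "Install the tools listed above before continuing." >&2
```

See the Prerequisites page for the authoritative list and version requirements.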
## Clone the repository

Clone the repository and initialize the development environment:

```bash
git clone https://github.com/microsoft/physical-ai-toolchain.git
cd physical-ai-toolchain
```

Use the devcontainer (recommended) or run the local setup script:

```bash
./setup-dev.sh
```

## Authenticate with Azure

Authenticate with Azure and register the required resource providers:

```bash
source infrastructure/terraform/prerequisites/az-sub-init.sh
bash infrastructure/terraform/prerequisites/register-azure-providers.sh
```

Verify your subscription:

```bash
az account show --query "{name:name, id:id}" -o table
```

## Configure Terraform variables

Create a Terraform variables file for the full-public deployment path. From the repository root:

```bash
cd infrastructure/terraform
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars` with your values:
```hcl
environment     = "dev"
location        = "westus3"
resource_prefix = "yourprefix"
instance        = "001"

// Full-public networking (simplest path for the inner developer loop)
should_enable_private_endpoint      = false
should_enable_private_aks_cluster   = false
should_enable_public_network_access = true

// Single GPU pool (Spot A10)
node_pools = {
  gpu = {
    vm_size                    = "Standard_NV36ads_A10_v5"
    subnet_address_prefixes    = ["10.0.7.0/24"]
    node_taints                = ["nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
    gpu_driver                 = "Install"
    priority                   = "Spot"
    should_enable_auto_scaling = true
    min_count                  = 1
    max_count                  = 1
    zones                      = []
    eviction_policy            = "Delete"
  }
}

// System node pool: enable autoscaling for OSMO workloads
should_enable_system_node_pool_auto_scaling = true
system_node_pool_min_count = 1
system_node_pool_max_count = 3

// OSMO backend services
should_deploy_postgresql = true
should_deploy_redis      = true
```

:::warning
`resource_prefix` must be lowercase, alphanumeric, and short (6-8 characters recommended). It feeds into Key Vault (`kv{prefix}{env}{instance}`) and Storage Account names, which have 24-character limits and must be globally unique.
:::
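The prefix constraint can be checked before running Terraform at all. This is a minimal sketch: the `kv{prefix}{env}{instance}` pattern and the 24-character cap come from the warning above, while the helper name `check_prefix` is illustrative:

```shell
# Sketch: validate resource_prefix against the constraints described above.
# The kv{prefix}{env}{instance} pattern and 24-char cap come from the
# warning; the function itself is an illustrative helper.
check_prefix() {
  prefix="$1"; env_name="$2"; instance="$3"
  kv_name="kv${prefix}${env_name}${instance}"

  # Lowercase alphanumeric only
  case "$prefix" in
    *[!a-z0-9]*|"") echo "prefix must be lowercase alphanumeric" >&2; return 1 ;;
  esac

  # The resulting Key Vault name must fit within 24 characters
  if [ "${#kv_name}" -gt 24 ]; then
    echo "'$kv_name' exceeds 24 characters" >&2
    return 1
  fi
  echo "$kv_name"
}

check_prefix yourprefix dev 001   # prints: kvyourprefixdev001
```

Global uniqueness still has to be checked against Azure itself; this only catches character-set and length violations locally.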
:::tip
For private networking, set `should_enable_private_endpoint = true` and `should_enable_private_aks_cluster = true`, then deploy the VPN from `infrastructure/terraform/vpn/` before running any `kubectl` commands. See the Infrastructure Guide for details.
:::
## Deploy infrastructure

Initialize and apply the Terraform configuration. This step takes roughly 30-40 minutes:

```bash
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```

Verify the deployment:

```bash
terraform output
```

Connect to the AKS cluster:

```bash
az aks get-credentials \
  --resource-group "$(terraform output -json resource_group | jq -r '.name')" \
  --name "$(terraform output -json aks_cluster | jq -r '.name')"
```

## Deploy robotics charts

Export your NVIDIA NGC API key for the OSMO backend deployment. Obtain a key from https://ngc.nvidia.com/:

```bash
export NGC_API_KEY="<your-ngc-api-key>"
```

Deploy GPU Operator, KAI Scheduler, and the AzureML extension. From the repository root:
```bash
cd infrastructure/setup
bash 01-deploy-robotics-charts.sh --config-preview
bash 01-deploy-robotics-charts.sh
bash 02-deploy-azureml-extension.sh --config-preview
bash 02-deploy-azureml-extension.sh
```

:::tip
All setup scripts support `--config-preview`, which prints the resolved configuration and exits without making changes. Run it before each real deployment to verify values.
:::

Verify the GPU Operator pods:

```bash
kubectl get pods -n gpu-operator
```

## Deploy the OSMO control plane

Deploy the OSMO control plane and backend using Access Keys authentication:
```bash
bash 03-deploy-osmo-control-plane.sh --config-preview
bash 03-deploy-osmo-control-plane.sh
```

:::important
When running from a devcontainer or codespace, the OSMO service URL is an internal load-balancer IP reachable only from within the AKS VNet. Set up a port-forward before running the backend script:

```bash
kubectl port-forward svc/osmo-service -n osmo-control-plane 8080:80 &
```

Then pass `--service-url http://localhost:8080` to both scripts 03 and 04. The scripts detect unreachable URLs and print this guidance automatically.
:::

```bash
bash 04-deploy-osmo-backend.sh --use-access-keys --service-url http://localhost:8080 --config-preview
bash 04-deploy-osmo-backend.sh --use-access-keys --service-url http://localhost:8080
```

Verify the OSMO pods:

```bash
kubectl get pods -n osmo-control-plane
kubectl get pods -n osmo-operator
```

## Submit a training job

Submit a training job from the repository root:

```bash
bash training/rl/scripts/submit-osmo-training.sh
```

Scripts auto-detect configuration from Terraform outputs. Override values with CLI arguments or environment variables as needed. See the Scripts Reference for all submission options.
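As one illustration of the override pattern, a small wrapper can translate environment variables into CLI arguments. The flag names `--vm-size` and `--max-steps` below are hypothetical placeholders, not documented options of `submit-osmo-training.sh`; substitute the real flags from the Scripts Reference:

```shell
# Sketch: build the submission command from optional environment overrides.
# --vm-size and --max-steps are HYPOTHETICAL flags used only to show the
# pattern; replace them with real options from the Scripts Reference.
build_submit_cmd() {
  cmd="bash training/rl/scripts/submit-osmo-training.sh"
  [ -n "${VM_SIZE:-}" ] && cmd="$cmd --vm-size $VM_SIZE"
  [ -n "${MAX_STEPS:-}" ] && cmd="$cmd --max-steps $MAX_STEPS"
  echo "$cmd"
}

# With no overrides set, the bare script invocation is printed unchanged.
build_submit_cmd
```

Printing the assembled command before executing it keeps the override behavior inspectable during the inner developer loop.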
## Verify the training job

Confirm the training job is running:

```bash
kubectl get pods -n osmo-workflows --watch
```

Check training status through the OSMO web UI, or query the pod logs:

```bash
kubectl logs -n osmo-workflows -l app=osmo-training --tail=50
```

## Clean up

Remove the OSMO Helm releases before destroying infrastructure to avoid orphaned resources:

```bash
cd infrastructure/setup
helm uninstall osmo-operator -n osmo-operator --ignore-not-found
helm uninstall service router ui -n osmo-control-plane --ignore-not-found
```

Destroy all infrastructure when finished to stop incurring costs. From the repository root:

```bash
cd infrastructure/terraform
terraform destroy
```

See Cost Considerations for detailed pricing.
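To double-check the teardown, you can filter resource-group names for your prefix. A minimal sketch: the assumption that all created groups start with your `resource_prefix` depends on the Terraform naming scheme, so adjust the pattern if yours differs:

```shell
# Sketch: print resource-group names (read from stdin) that still carry
# the given prefix. Pair it with `az group list` as shown below.
leftover_groups() {
  prefix="$1"
  grep -E "^${prefix}" || true   # empty output means nothing is left
}

# Usage against Azure (requires an az login):
#   az group list --query "[].name" -o tsv | leftover_groups "yourprefix"
printf 'yourprefix-rg-dev-001\nunrelated-rg\n' | leftover_groups yourprefix
# prints: yourprefix-rg-dev-001
```

An empty result is the expected outcome after a successful `terraform destroy`.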
## Related resources

| Resource | Description |
|---|---|
| MLflow Integration | Track experiments with MLflow |
| Infrastructure Guide | Full deployment reference and options |
| Contributing Guide | Development workflow and code standards |