This repository was archived by the owner on Jul 28, 2025. It is now read-only.
Matt McLoughlin edited this page Jan 23, 2025 · 6 revisions

Microsoft Genomics Execution Script

The Microsoft Genomics team provides a script that runs the original binaries used in the Microsoft Genomics service. Before running it, create a user-assigned managed identity in its own resource group and grant it the Storage Blob Data Contributor role on your storage account. The script then creates a new VM in a new resource group and uses that identity to access storage. When the work is done, the VM automatically deletes its own resource group.


High-Level Summary of the Script

1. Setup and Initialization

  • Defines variables (location, names, VM size, URLs, etc.).
  • Generates a secure random password for the virtual machine (VM).
  • Sets a cleanup function to delete the resource group if there's an error.
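The password generation and error cleanup described above could look roughly like the following. This is a minimal sketch, not the repository's actual code: the function names `generate_password` and `cleanup_on_error`, the 32-character length, and the fixed `aA1!` suffix (added so the result always contains a lower-case letter, an upper-case letter, a digit, and a symbol) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names, length, and suffix are hypothetical.

generate_password() {
  local random_part
  # 28 random alphanumeric characters read from the kernel's random source.
  random_part=$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 28)
  # Append one character of each class so complexity checks always pass.
  printf '%s%s\n' "$random_part" 'aA1!'
}

cleanup_on_error() {
  echo "Error detected; deleting resource group ${RESOURCE_GROUP}" >&2
  # --no-wait returns immediately; the deletion continues in Azure.
  az group delete --name "$RESOURCE_GROUP" --yes --no-wait
}

# In a real script, register the handler before creating any resources:
#   trap cleanup_on_error ERR
```

Registering the handler with `trap ... ERR` means a failure part-way through does not leave a half-built resource group behind.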

2. Azure Resource Management

  • Checks if the identity resource group exists; creates it if not.
  • Checks if the main resource group exists; creates it if not.
  • Checks if the managed identity exists; creates it if not.
  • Assigns the Contributor role on the main resource group to the managed identity, so that the VM can later delete that resource group.
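The "check, then create if missing" pattern in this step can be sketched with the Azure CLI as follows. The function names are hypothetical and the real script may structure these calls differently; only the `az` subcommands and flags are standard Azure CLI.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

ensure_resource_group() {
  local name=$1 location=$2
  # `az group exists` prints "true" or "false".
  if [ "$(az group exists --name "$name")" != "true" ]; then
    az group create --name "$name" --location "$location" --output none
  fi
}

ensure_identity() {
  local name=$1 group=$2 location=$3
  # `az identity show` exits non-zero when the identity is missing.
  if ! az identity show --name "$name" --resource-group "$group" \
      --output none 2>/dev/null; then
    az identity create --name "$name" --resource-group "$group" \
      --location "$location" --output none
  fi
}

grant_contributor() {
  local identity_name=$1 identity_group=$2 target_group=$3
  local principal_id scope
  principal_id=$(az identity show --name "$identity_name" \
    --resource-group "$identity_group" --query principalId --output tsv)
  # Scope the role to the main resource group only.
  scope=$(az group show --name "$target_group" --query id --output tsv)
  az role assignment create --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role Contributor --scope "$scope" --output none
}
```

Scoping the Contributor role to the single resource group, rather than the subscription, limits what the VM's identity can modify or delete.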

3. VM Creation

  • Fetches the latest image for MicrosoftWindowsServer:WindowsServer:2022-datacenter-azure-edition.
  • Creates a virtual machine (VM) in Azure with the specified configuration, enabling features like secure boot and vTPM.
  • Assigns the managed identity to the VM.
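A VM with these properties could be created along the following lines. This is a sketch under assumptions: the `create_vm` function and the `azureuser` admin name are hypothetical, and the `:latest` version alias is used here in place of the explicit image lookup the script performs.

```shell
#!/usr/bin/env bash
# Sketch only: function name and admin user name are hypothetical.

# The ":latest" alias asks Azure to resolve the newest image version.
IMAGE_URN="MicrosoftWindowsServer:WindowsServer:2022-datacenter-azure-edition:latest"

create_vm() {
  local group=$1 vm_name=$2 location=$3 size=$4 password=$5 identity_id=$6
  az vm create \
    --resource-group "$group" \
    --name "$vm_name" \
    --location "$location" \
    --size "$size" \
    --image "$IMAGE_URN" \
    --admin-username azureuser \
    --admin-password "$password" \
    --security-type TrustedLaunch \
    --enable-secure-boot true \
    --enable-vtpm true \
    --assign-identity "$identity_id" \
    --output none
}
```

Here `identity_id` is the full resource ID of the user-assigned managed identity; passing it to `--assign-identity` attaches the identity at creation time, so no separate assignment call is needed.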

4. Script Execution

  • Constructs a PowerShell command (msgen.ps1) to run on the VM.
  • Installs and configures the Azure VM Custom Script Extension to execute the PowerShell script, passing in relevant input parameters.
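The Custom Script Extension for Windows takes a JSON settings object with `fileUris` and `commandToExecute` fields. A minimal sketch of building that object and installing the extension is shown below; the helper names are hypothetical, and the script URL and the arguments passed to msgen.ps1 are left to the caller because the actual parameter names are not documented here.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

json_escape() {
  # Escape backslashes first, then double quotes, for use inside a JSON string.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

build_extension_settings() {
  local script_url=$1 command=$2
  printf '{"fileUris":["%s"],"commandToExecute":"%s"}\n' \
    "$(json_escape "$script_url")" "$(json_escape "$command")"
}

run_script_on_vm() {
  local group=$1 vm_name=$2 script_url=$3 command=$4
  az vm extension set \
    --resource-group "$group" \
    --vm-name "$vm_name" \
    --publisher Microsoft.Compute \
    --name CustomScriptExtension \
    --settings "$(build_extension_settings "$script_url" "$command")" \
    --output none
}
```

The extension downloads every file listed in `fileUris` to the VM and then runs `commandToExecute`, which is how msgen.ps1 and its input parameters reach the machine.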

Inside the PowerShell Script (msgen.ps1)

Environment Setup
  • Creates necessary directories for processing, like inputs, outputs, logs, and references.
  • Downloads and installs the AzCopy utility to manage file transfers.
File Handling
  • Downloads genomics reference files and other necessary resources.
  • Authenticates with Azure using the managed identity and downloads input files from the provided URLs.
Genomics Processing
  • Downloads and extracts the msgen-oss toolset.
  • Runs the genomics tool to process the input files and generate output and logs.
Output Upload
  • Uploads the generated output files to the specified Azure Blob Storage prefix.
  • Compresses and uploads log files for troubleshooting.
Resource Cleanup (if parameters provided)
  • Installs Azure CLI on the VM.
  • Authenticates with the managed identity to delete the VM's resource group, effectively cleaning up resources.
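The file-transfer steps above map onto a few AzCopy invocations. msgen.ps1 itself is PowerShell; the sketch below uses shell syntax only for consistency with run-on-azure-vm.sh, and the function names and directory layout are hypothetical. The `azcopy` subcommands and flags are the same on either platform.

```shell
#!/usr/bin/env bash
# Sketch only: function names and paths are hypothetical.

login_with_identity() {
  # Authenticate AzCopy as the user-assigned managed identity.
  azcopy login --identity --identity-client-id "$1"
}

download_inputs() {
  local dest_dir=$1
  shift
  mkdir -p "$dest_dir"
  local url
  for url in "$@"; do
    # Skip empty entries, e.g. when no second input file is supplied.
    [ -n "$url" ] || continue
    azcopy copy "$url" "$dest_dir/"
  done
}

upload_outputs() {
  local source_dir=$1 output_prefix=$2
  # Upload everything the genomics run produced under the output prefix.
  azcopy copy "$source_dir/*" "$output_prefix" --recursive
}
```

Because the identity holds the Storage Blob Data Contributor role, no SAS tokens or account keys need to be embedded in the input or output URLs.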

5. Completion and Cleanup

  • Prints the completion status and highlights that the VM will delete itself after the process completes.

Key Functionality

  • Resource Management: Handles Azure resource groups, VMs, and identities dynamically.
  • Automation: Automates genomics workflows using Azure resources and tools.
  • Cleanup: Ensures resource cleanup after task completion to avoid lingering costs.

Instructions

Prerequisites

  1. Pre-create a user-assigned managed identity in a separate resource group so that it can be reused between executions, and grant it the Storage Blob Data Contributor role on your storage account.
  2. Clone this repository.
  3. Execute the run-on-azure-vm.sh script, replacing the script parameters accordingly:
./run-on-azure-vm.sh westus msgen2025 Standard_D64d_v5 msgen2025-vm msgen2025-identity msgen2025-uamidentity https://<YOUR-STORAGE-ACCOUNT>.blob.core.windows.net/inputs/1.fq.gz https://<YOUR-STORAGE-ACCOUNT>.blob.core.windows.net/inputs/2.fq.gz https://<YOUR-STORAGE-ACCOUNT>.blob.core.windows.net/outputs
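The first prerequisite can be done once with the Azure CLI. The following is a sketch, assuming the storage account already exists; the function name and all argument values are placeholders to be replaced with your own.

```shell
#!/usr/bin/env bash
# Sketch only: function name and argument values are hypothetical.

create_storage_identity() {
  local identity_name=$1 identity_group=$2 location=$3
  local storage_account=$4 storage_group=$5
  local principal_id storage_id

  az group create --name "$identity_group" --location "$location" --output none

  # Capture the identity's principal (object) ID for the role assignment.
  principal_id=$(az identity create --name "$identity_name" \
    --resource-group "$identity_group" --location "$location" \
    --query principalId --output tsv)

  storage_id=$(az storage account show --name "$storage_account" \
    --resource-group "$storage_group" --query id --output tsv)

  az role assignment create --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Contributor" --scope "$storage_id" --output none
}
```

A call might look like `create_storage_identity msgen2025-uamidentity msgen2025-identity westus <YOUR-STORAGE-ACCOUNT> <YOUR-STORAGE-RESOURCE-GROUP>`. Role assignments can take a few minutes to propagate before the identity can read and write blobs.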

Script Arguments

| Argument | Description | Default Value | Purpose |
| --- | --- | --- | --- |
| AZURE_LOCATION | Specifies the Azure region/location where resources will be created (e.g., westus, eastus). | westus | Determines the geographic region for the Azure resources, which can impact performance and cost. |
| STEM_NAME | A base name used as a prefix for naming resources. | msgen2025 | Ensures consistent and identifiable naming of all resources created by the script. |
| RESOURCE_GROUP | The name of the Azure resource group where the VM and associated resources will be created. | ${STEM_NAME}-vm | Groups Azure resources (VM, networking, storage, etc.) into a logical container for management. |
| IDENTITY_RESOURCE_GROUP | The name of the Azure resource group where the managed identity is created. | ${STEM_NAME}-identity | Separates identity resources from other resources for organizational or security purposes. |
| IDENTITY_NAME | The name of the Azure managed identity to be created or used. | ${STEM_NAME}-identity | Provides the managed identity for authenticating the VM and performing operations like file transfers. |
| VM_SIZE | Specifies the size of the VM to be created (e.g., CPU cores, memory). | Standard_D64d_v5 | Customizes the VM's capacity based on workload requirements. |
| INPUT_URL1 | URL of the first input file to be downloaded and processed. | None (mandatory) | Provides the location of the primary input dataset. |
| INPUT_URL2 | URL of the second input file (optional). | None (optional) | Allows for a secondary input dataset to be downloaded and processed if available. |
| OUTPUT_URL_PREFIX | The URL prefix for uploading the output files generated by the workflow. | None (mandatory) | Specifies the destination where processed results will be stored (e.g., Azure Blob Storage). |
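The defaults listed above can be expressed with shell parameter expansion. This is a sketch rather than the script's actual parsing code: the `parse_args` function is hypothetical, and the positional order is inferred from the example invocation (location, stem name, VM size, resource group, identity resource group, identity name, then the URLs), which is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: function name and positional order are assumptions.

parse_args() {
  # ${N:-default} falls back to the default when the argument is unset or empty.
  AZURE_LOCATION=${1:-westus}
  STEM_NAME=${2:-msgen2025}
  VM_SIZE=${3:-Standard_D64d_v5}
  RESOURCE_GROUP=${4:-${STEM_NAME}-vm}
  IDENTITY_RESOURCE_GROUP=${5:-${STEM_NAME}-identity}
  IDENTITY_NAME=${6:-${STEM_NAME}-identity}
  INPUT_URL1=${7:-}
  INPUT_URL2=${8:-}
  OUTPUT_URL_PREFIX=${9:-}

  # The first input and the output prefix have no defaults.
  if [ -z "$INPUT_URL1" ] || [ -z "$OUTPUT_URL_PREFIX" ]; then
    echo "INPUT_URL1 and OUTPUT_URL_PREFIX are required" >&2
    return 1
  fi
}
```

With this layout, passing an empty string (`''`) for a positional argument selects its default, and an empty INPUT_URL2 means the run uses a single input file.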