This repo contains all the artifacts and infrastructure code for the PINS Operational Data Warehouse (ODW). It consists of the following:
- Infrastructure - Contains the root Terraform module for deploying the ODW environment
- Pipelines - Contains Azure DevOps Pipeline definitions and steps
- Workspace - Contains development data artifacts ingested into the development Azure Synapse Workspace
- odw - Contains ETL code and utility functions that are installed on Synapse Spark pools
- Azure Data Landscape
- Azure Synapse
- Terraform
The following steps outline how to get up and running with this repo on your own system:
- Environment access
- GitHub access - if you're reading this repo's README you probably already have this
- Azure DevOps access to the operational-data-warehouse Azure DevOps project
- Azure Portal access - additional access is required to the Azure Portal and the corresponding Azure Resources in each environment
- Application Installation - the following desktop applications are optional but make some of the Azure resources easier to work with. PINS Azure auth policy restricts access to PINS devices only, so non-PINS devices will need to be whitelisted to use them
- Install Visual Studio Code or equivalent IDE - for editing and committing code artifacts
- Install Azure Data Studio - for connecting to Azure SQL instances and managing/committing data notebooks
- Install Microsoft Azure Storage Explorer
- Clone Repo
- Create a Personal Access Token in GitHub or use another authentication method e.g. SSH
- Clone the repo in VSCode/Azure Data Studio to a local folder
The ODW environment is deployed to three Azure subscriptions as follows:
| Environment Name | Subscription Name | Subscription ID |
|---|---|---|
| Development | pins-odw-data-dev-sub | ff442a29-fc06-4a13-8e3e-65fd5da513b3 |
| Pre-Production | pins-odw-data-preprod-sub | 6b18ba9d-2399-48b5-a834-e0f267be122d |
| Production | pins-odw-data-prod-sub | a82fd28d-5989-4e06-a0bb-1a5d859f9e0c |
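
When scripting against these environments, it can help to keep the table above in one place rather than pasting IDs around. The following is a minimal sketch only: the `Subscription` type, the `SUBSCRIPTIONS` mapping and the `get_subscription` helper are hypothetical names used for illustration and are not part of this repo.

```python
from typing import NamedTuple


class Subscription(NamedTuple):
    """Hypothetical container for one row of the subscription table."""
    name: str
    subscription_id: str


# Values copied from the table above; the keys are the environment names.
SUBSCRIPTIONS = {
    "Development": Subscription(
        "pins-odw-data-dev-sub", "ff442a29-fc06-4a13-8e3e-65fd5da513b3"
    ),
    "Pre-Production": Subscription(
        "pins-odw-data-preprod-sub", "6b18ba9d-2399-48b5-a834-e0f267be122d"
    ),
    "Production": Subscription(
        "pins-odw-data-prod-sub", "a82fd28d-5989-4e06-a0bb-1a5d859f9e0c"
    ),
}


def get_subscription(environment: str) -> Subscription:
    """Return the subscription for an environment name, or raise ValueError."""
    try:
        return SUBSCRIPTIONS[environment]
    except KeyError:
        known = ", ".join(SUBSCRIPTIONS)
        raise ValueError(
            f"Unknown environment {environment!r}; expected one of: {known}"
        ) from None
```

For example, `get_subscription("Development").subscription_id` gives the ID to pass to whichever tool you use to select the active subscription.
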
Within each subscription, the infrastructure is split into several resource groups, aligned to the data landing zone architecture:
| Resource Group Name | Description |
|---|---|
| pins-rg-data-odw-{env}-{region} | Contains the Data Lake and Synapse Workspace resources |
| pins-rg-data-odw-{env}-{region}-synapse-managed | Managed resource group for the Synapse Workspace |
| pins-rg-datamgmt-odw-{env}-{region} | Contains data management resources such as Purview and Bastion VM(s) |
| pins-rg-datamgmt-odw-{env}-{region}-purview-managed | Managed resource group for the Purview Account |
| pins-rg-devops-odw-{env}-{region} | Contains Azure DevOps agents for deployments into the private network |
| pins-rg-monitoring-odw-{env}-{region} | Contains monitoring resources such as Log Analytics and App Insights |
| pins-rg-network-odw-{env}-global | Contains private DNS zones for private-link-enabled resources |
| pins-rg-network-odw-{env}-{region} | Contains the virtual network, network security groups and private endpoints |
| pins-rg-shir-odw-{env}-{region} | Contains self-hosted integration runtime VM(s) used by the Synapse Workspace |
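
The `{env}` and `{region}` placeholders in the table above can be expanded mechanically. The sketch below shows one way to do that. The helper name `resource_group_names` is hypothetical, and the example values `dev` and `uks` are assumptions made purely for illustration; check the Terraform configuration for the actual environment and region short codes.

```python
# Naming patterns copied from the resource group table above.
RESOURCE_GROUP_TEMPLATES = [
    "pins-rg-data-odw-{env}-{region}",
    "pins-rg-data-odw-{env}-{region}-synapse-managed",
    "pins-rg-datamgmt-odw-{env}-{region}",
    "pins-rg-datamgmt-odw-{env}-{region}-purview-managed",
    "pins-rg-devops-odw-{env}-{region}",
    "pins-rg-monitoring-odw-{env}-{region}",
    "pins-rg-network-odw-{env}-global",
    "pins-rg-network-odw-{env}-{region}",
    "pins-rg-shir-odw-{env}-{region}",
]


def resource_group_names(env: str, region: str) -> list[str]:
    """Expand every naming pattern for one environment and region.

    The global network pattern has no {region} placeholder;
    str.format simply ignores the unused keyword argument.
    """
    return [t.format(env=env, region=region) for t in RESOURCE_GROUP_TEMPLATES]


if __name__ == "__main__":
    # "dev" and "uks" are assumed example values, not confirmed short codes.
    for name in resource_group_names("dev", "uks"):
        print(name)
```
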
Some of the key resources used in the deployment are:
| Resource Name | Description |
|---|---|
| Synapse Workspace | Analytics product for loading, transforming and analysing data using SQL and/or Spark |
| ADLS Storage Account | Storage Account with hierarchical namespace enabled, acting as the data lake |
| Key Vault | Secrets storage for connection strings, passwords, etc. for connected services |
| Log Analytics | Activity and metric diagnostic log storage with querying capabilities using KQL |