diff --git a/.gitignore b/.gitignore index 73ad1bd..b478cac 100644 --- a/.gitignore +++ b/.gitignore @@ -2,6 +2,8 @@ .* __*cache* +*.egg.* + # Exclude build output and content of the virtual environment dbcicd *dist* diff --git a/README.md b/README.md index 87c36f9..7f0bb78 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,614 @@

---
permalink: cicd-for-databricks-with-azure-devops
description: >-
  How to implement a CI/CD pipeline to deploy notebooks and libraries in Azure Databricks using Azure DevOps
author: barthelemy
image: feature/
categories:
- devops
tags:
- git
- release
- gitops

---

# Azure Databricks CI/CD pipeline using Azure DevOps

Throughout the development lifecycle of an application, [CI/CD](https://en.wikipedia.org/wiki/CI/CD) is a [DevOps](/en/tag/devops) process enforcing automation in building, testing and deploying applications. Development and Operations teams can leverage the advantages of CI/CD to deliver releases more frequently and reliably, in a timely manner, while ensuring quick iterations.

CI/CD is becoming a necessary process for data engineering and data science teams to deliver valuable data projects and increase confidence in the quality of the outcomes. With [Azure Databricks](https://azure.microsoft.com/en-gb/services/databricks/) you can use solutions like Azure DevOps, GitLab, GitHub Actions or Jenkins to build a CI/CD pipeline to reliably build, test, and deploy your notebooks and libraries.

In this article, we will guide you step by step through creating an effective CI/CD pipeline using Azure DevOps to deploy a simple notebook and a library to Azure Databricks.
We will show how to manage sensitive data during the process using [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) and how to secure the communication between Azure Databricks and our object storage.

## Description of our pipeline

Our stack:

- Azure Databricks
- Azure DevOps: to build, test and deploy our artifacts (notebook and library) to Azure Databricks
- Azure Data Lake Storage: to store the dataset that will be consumed by Azure Databricks
- Azure Key Vault: to store sensitive data

## Prerequisites

1. An Azure account and an active subscription. You can create a free account [here]().
2. An Azure DevOps organization that will hold a project for our repository and our pipeline assets

Clone the repository

```sh
git clone https://github.com/bngom/azure-databricks-cicd.git && cd azure-databricks-cicd
```

Create a Python virtual environment

```sh
python -m venv dbcicd
```

Activate the virtual environment

```sh
source dbcicd/bin/activate
```

Install requirements

```sh
pip install -r requirements.txt
```

Run lint tests

```sh
python -m pip install flake8
flake8 ./test/ ./friends/
```

Run unit tests:

```sh
python -m pytest test
```

## Setting up Azure CLI

You can use [Azure Cloud Shell](https://azure.microsoft.com/en-us/features/cloud-shell/) from the directory in which you would like to deploy your resources, or you can [install Azure CLI](https://docs.microsoft.com/fr-fr/cli/azure/install-azure-cli-linux?pivots=apt).
```sh
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
```

Configure Azure CLI: your default browser will prompt you to enter your credentials. This will connect you to your default tenant.

```sh
az login
```

Get details about your Azure account.

```sh
az account show
```

![](./assets/account-info.png)

If you wish, you can now connect to another directory by specifying the `tenant_id`.

```sh
az login --tenant 
```

### Prerequisites for Azure

- Resource group: in our selected directory, let us create a resource group

```sh
rg_name="databricks-rg"
location="francecentral"
az group create --name $rg_name --location $location
```

- Test the validity of our Azure Resource Manager (ARM) template: you will find in the folder `./template` a template file `dbx.template.json` that describes the resources we want to deploy on Azure and a parameter file `dbx.parameters.json`.

> For our parameter file `dbx.parameters.json`, we chose not to store sensitive information such as the tenant_id in plain text. We use a templating form like ``. These values will be overridden in Azure DevOps during the release pipeline. But to test the validity of our ARM template, make sure to replace all `<>` values with real ones. This includes: ``, ``, ``, ``, ``. E.g. for `` you can replace it with `dbxbnsa`...
>
> ![](./assets/param-template.PNG)
>
> To get your `objectId`, run `az ad signed-in-user show | jq '.objectId'`

With the command below, we test the deployment of our ARM template against the resource group using the `what-if` option. This will only validate your template; the actual deployment will be done during the release phase in Azure DevOps.
```sh
cd template
az deployment group what-if --name TestDeployment --resource-group $rg_name --template-file dbx.template.json --parameters @dbx.parameters.json
```

## Setting up Azure DevOps

In this section we will create an Azure DevOps organization, create a project and upload our repository. We will also set up an *Azure Resource Manager connection* in Azure DevOps to authorize communication between Azure DevOps and the Azure Resource Manager.

### Configure the DevOps environment

1. Go to [dev.azure.com](https://dev.azure.com)
2. [Create an Azure DevOps Organization](https://docs.microsoft.com/en-us/azure/devops/organizations/accounts/create-organization?view=azure-devops)

- Click on *New organization*, then click on *Continue*

![](./assets/neworg.png)

- Set the name of your Azure DevOps organization, then click on *Continue*.

![](./assets/dbx-cicd-org.png)

3. [Create a Project](https://docs.microsoft.com/en-us/azure/devops/organizations/projects/create-project?view=azure-devops&tabs=preview-page) in your Azure DevOps organization

- In your new organization, set your project name, then click on *Create project*

![](./assets/new-project.png)

4. Import a repository

- Click on your new project

![](./assets/new-repo.png)

- Click on *Repos*

![](./assets/new-repo-1.PNG)

- Then click on *Import repository* and use the URL `https://github.com/bngom/azure-databricks-cicd.git`

![](./assets/import-repo.png)

5. Set an Azure Resource Manager connection

- Project Settings > Pipelines: Service connections > Create service connection

![](./assets/new-service-connections.png)

- Select *Azure Resource Manager*

![](./assets/new-service-connections-1.png)

- Select *Service principal (automatic)*

![](./assets/new-service-connections-2.png)

- Select your subscription and the resource group `databricks-rg` created previously, and save.
![](./assets/new-service-connections-3.png)

### Create a Build Pipeline

We are all set to create a build pipeline. This operation will generate artifacts to be consumed in our release pipeline.

- Pipelines > Pipelines: Click on *New pipeline*

![](./assets/new-pipeline.PNG)

- Select *Azure Repos Git*

![](./assets/new-pipeline-1.PNG)

- Select your repository

![](./assets/new-pipeline-2.PNG)

- Select *Existing Azure Pipelines YAML file* and select the `azure-pipelines.yml` file from your repository.

![](./assets/new-pipeline-3.PNG)

![](./assets/new-pipeline-4.PNG)

The `azure-pipelines.yml` file in your repository is automatically detected. The build pipeline is composed of steps that include tasks and scripts to be executed against different Python versions. The pipeline will first install all requirements and run the unit tests. If they succeed, it will publish the ARM template and notebook artifacts, then build a Python library and publish it. There is also a task evaluating test coverage.

![](./assets/new-pipeline-5.PNG)

- Now you can run your build pipeline. Click on *Run*

The build pipeline executed successfully: artifacts are generated and ready to be consumed in a release pipeline.

![](./assets/build-pipeline-success.png)

![](./assets/summary-build.png)

Before creating a release pipeline, let us do some configuration in Azure DevOps. For security purposes, we put dummy values in our template parameter file `dbx.parameters.json`. We will use a *variable group* to overwrite these dummy values during the release process.
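If you prefer the command line over the portal, a variable group can also be created with the Azure DevOps CLI extension. This is only a sketch under assumptions: the organization URL, project name, group name and variable values below are placeholders you must adapt to your own deployment, and it assumes the `azure-devops` extension is installed and you are logged in.

```shell
# Sketch only: create the variable group from the CLI instead of the portal.
# The organization URL, project name, group name and values are placeholders.
org_url="https://dev.azure.com/<your-organization>"
project="<your-project>"

if command -v az >/dev/null 2>&1; then
  # Requires: az extension add --name azure-devops
  az pipelines variable-group create \
    --organization "$org_url" \
    --project "$project" \
    --name "dbx-variables" \
    --variables \
      rg_name="databricks-rg" \
      location="francecentral" \
      sa_name="<storage-account-name>" \
      container="<container-name>" \
      keyvault="<keyvault-name>" \
      workspace="<workspace-name>" \
    || echo "az command failed (expected outside a logged-in session)"
else
  echo "Azure CLI not found; create the variable group in the portal instead."
fi
```

The variable names mirror those referenced later in the override parameters (`$(rg_name)`, `$(location)`, `$(sa_name)`, ...); values like `object_id` and `tenant_id` are better marked as secret or pulled from Key Vault.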
Create a variable group:

- Pipelines > Library > + Variable group

![](./assets/new-variable-grp.png)

- Create the following variables, and save

![](./assets/variable-grp.png)

### Create a Release Pipeline

***PHASE 1***

In this section, we will create a release to deploy resources (Databricks workspace, Key Vault, storage account and container for blob storage) on Microsoft Azure.

- Pipelines > Releases: New release pipeline

![](./assets/new-release-0.png)

- On the right blade, for *Select a template*, click on *Empty job*

![](./assets/empty-job.png)

- Update the stage name to *Development*

![](./assets/stage-dev.png)

- Click on *Add an artifact* and select the source build pipeline `demo-cicd`. Then click on *Add*

![](./assets/add-artifacts.png)

- Click on *Variables* and link our variable group to the *Development* stage

![](./assets/link-variable-grp.png)

![](./assets/link-variable-grp-2.png)

- Now, click on *Tasks*:

  - Check the Agent job setup: make sure the Agent Specification is set to `ubuntu-20.04`
  - Click on the `+` sign near Agent job

  ![](./assets/plus-task.png)

  - Add an *ARM template deployment* task and configure it, providing the template and parameter files and overriding the variables:
  ![](./assets/arm-template-deployment-0.png)
    - `Deployment scope`: Resource Group
    - `Resource manager connection`: Select your Azure Resource Manager connection and click on *Authorize*
    - `Subscription`: Select your subscription
    - `Action`: Create or update resource group
    - `Resource group`: Select the resource group created previously, or use the variable $(rg_name)
    - `Location`: same as above, or $(location)
    - `Template`: Click on `more` and select `dbx.template.json` in the template artifacts
    - `Template parameters`: Select `dbx.parameters.json` in the template artifacts
    - `Override template parameters`: Update the values of the parameters with the variables we created in our variable group.
  ![](./assets/template-parameters.png)

  Or you can paste the following line:

  ```sh
  -objectId $(object_id) -keyvaultName $(keyvault) -location $(location) -storageAccountName $(sa_name) -containerName $(container) -workspaceName $(workspace) -workspaceLocation $(location) -tier "premium" -sku "Standard" -tenant $(tenant_id) -networkAcls {"defaultAction":"Allow","bypass":"AzureServices","virtualNetworkRules":[],"ipRules":[]}
  ```

  - Deployment mode: Incremental
  - Now, save your configuration

  ![](./assets/arm-template-deployment.png)

We are now ready to create a first release.

- Click on the *Create release* button

![](./assets/create-release-btn.png)

- For *Stages for a trigger change from automated to manual*, select *Development*

![](./assets/new-release.png)

- Click on *Create*
- Go to Pipelines > Releases
- Select our release pipeline

![](./assets/select-release.png)

- Click on the newly created release, then click on *Deploy*

![](./assets/deploy-release.png)

Our pipeline executed successfully.

![](./assets/release-pipeline-success.png)

And our resources are deployed in Azure. Go to the [Azure Portal](https://portal.azure.com/) and check the content of our resource group. You will find in it a key vault, a Databricks workspace, a storage account and a container in the storage account.

![](./assets/check-az-resources.png)

***PHASE 2***

We will complete our release pipeline, but first, for security reasons, let us secure the communication between Databricks and the other Azure resources.

**Generate a Databricks token**

Log into the Databricks workspace, go to User Settings (icon in the top right corner) and select “Generate New Token”. Choose a descriptive name and copy the token to a notebook or clipboard. The token is displayed just once.
![](./assets/dbx-token.png)

Make sure that for Git integration, the Git provider is set to `Azure DevOps Services`

![](./assets/git-integration.png)

The following commands show how to save the new token into your key vault.
> At the same time, we will save the URI of our Databricks service; you can find it in the Azure portal.
>
> ![](./assets/dbx-service.png)

```sh
dbxtoken="your databricks token"
keyvault_name="your keyvault name"
dbxuri="https://adb-..azuredatabricks.net"
az keyvault secret set --vault-name $keyvault_name --name "dbxtoken" --value $dbxtoken
az keyvault secret set --vault-name $keyvault_name --name "dbxuri" --value $dbxuri
```

**Storage account SAS token**

Generate a shared access signature (SAS) token for the storage account. This will secure the communication between Azure Databricks and the object storage.

```sh
sa_name="your storage account"
end=`date -u -d "10080 minutes" '+%Y-%m-%dT%H:%MZ'`
az storage account generate-sas \
    --permissions lruwap \
    --account-name $sa_name \
    --services b \
    --resource-types sco \
    --expiry $end \
    -o json
```

Your SAS token is generated; copy it, along with the storage account and container names, into the key vault.

```sh
sastoken="your sas token"
container="your container"
az keyvault secret set --vault-name $keyvault_name --name "storagerw" --value $sastoken
az keyvault secret set --vault-name $keyvault_name --name "storageaccount" --value $sa_name
az keyvault secret set --vault-name $keyvault_name --name "container" --value $container
```

List your secrets

```sh
az keyvault secret list --vault-name $keyvault_name --output table
```

**Configure the Databricks CLI**

```sh
pip install databricks-cli
```

Configure your CLI to interact with Databricks.
You will need to enter the URI and the token generated earlier.

```sh
databricks configure --token
# https://adb-..azuredatabricks.net
# your-databricks-token
```

**Create a cluster in Databricks**

Once the connection is established, we can create a cluster in Databricks

```sh
rm -f create-cluster.json
cat <<'EOF' > create-cluster.json
{
  "num_workers": null,
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "cluster_name": "dbx-cluster",
  "spark_version": "8.2.x-scala2.12",
  "spark_conf": {},
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "ON_DEMAND_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_DS3_v2",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 30,
  "cluster_source": "UI",
  "init_scripts": []
}
EOF

databricks clusters create --json-file create-cluster.json
```

**Create an Azure Key Vault-backed secret scope in Databricks**

Get the resource ID and the DNS name from your key vault properties and create a secret scope in Azure Databricks.

```sh
keyvault_name="dbx-bn-keyvault"
vaultUri=$(az keyvault show --name $keyvault_name | jq -r '.properties.vaultUri')
vaultId=$(az keyvault show --name $keyvault_name | jq -r '.id')
databricks secrets create-scope --scope demo --scope-backend-type AZURE_KEYVAULT --resource-id $vaultId --dns-name $vaultUri
# List the scope(s)
databricks secrets list-scopes
```

Oops! The above command raised an error: `Scope with Azure KeyVault must have userAADToken defined!`. It seems to be a bug, but no worries: you can still create the scope from the Databricks workspace using the following URL `https://adb-..azuredatabricks.net/?o=#secrets/createScope`.

![](./assets/create-scope.png)

And you can test whether your secrets are accessible from your workspace.

![](./assets/secret-scope.png)

Great!
Let's test whether, from our workspace, we can mount the container in our storage account and read a dataset uploaded inside.

```sh
az storage blob upload \
    --name friends.csv \
    --account-name $sa_name \
    --container-name $container \
    --file ./data/friends.csv
```

Upload the following notebook into your workspace and test whether you can securely access the dataset uploaded in your storage account container.

```sh
cat <<'EOF' > demo.py
# Databricks notebook source
dbutils.secrets.listScopes()

# COMMAND ----------

dbutils.secrets.list("demo")

# COMMAND ----------

print(dbutils.secrets.get(scope="demo", key="storagerw"))

# COMMAND ----------

# Unmount directory if previously mounted.
MOUNTPOINT = "/mnt/commonfiles"
if MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
    dbutils.fs.unmount(MOUNTPOINT)

# Add the Storage Account, Container, and reference the secret to pass the SAS Token
STORAGE_ACCOUNT = dbutils.secrets.get(scope="demo", key="storageaccount")
CONTAINER = dbutils.secrets.get(scope="demo", key="container")
SASTOKEN = dbutils.secrets.get(scope="demo", key="storagerw")
SOURCE = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
URI = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)

try:
    dbutils.fs.mount(
        source=SOURCE,
        mount_point=MOUNTPOINT,
        extra_configs={URI: SASTOKEN})
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e

display(dbutils.fs.ls(MOUNTPOINT))

# COMMAND ----------

friendsDF = (spark.read
             .option("header", True)
             .option("inferSchema", True)
             .csv(MOUNTPOINT + "/friends.csv"))

display(friendsDF)
EOF
```

We can successfully access data stored in the storage account's container.

![](./assets/test-sa.png)
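To sanity-check what the notebook assembles before running it, here is the same `wasbs` source URL and SAS configuration key construction reproduced in plain shell. The names are illustrative only (`dbxbnsa` and `data` are placeholders, not values from this deployment):

```shell
# Illustrative only: rebuild the wasbs source URL and the SAS config key
# exactly as the notebook does, using dummy names.
STORAGE_ACCOUNT="dbxbnsa"   # placeholder storage account name
CONTAINER="data"            # placeholder container name

SOURCE="wasbs://${CONTAINER}@${STORAGE_ACCOUNT}.blob.core.windows.net/"
URI="fs.azure.sas.${CONTAINER}.${STORAGE_ACCOUNT}.blob.core.windows.net"

echo "$SOURCE"   # wasbs://data@dbxbnsa.blob.core.windows.net/
echo "$URI"      # fs.azure.sas.data.dbxbnsa.blob.core.windows.net
```

If the mount later fails with a 403, the SAS token stored under `storagerw` has probably expired; it was generated above with a 10080-minute lifetime.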
**Release Pipeline: continued**

We can now finish the configuration of our release pipeline. Let's first create a new variable group and link it with our Azure key vault.

- Go to Pipelines > Library
- Click on *+ Variable group* and add a new variable group
- Toggle `Link secrets from an Azure key vault as variables`
- Select your subscription
- Select the key vault name
- Add variables from the key vault

![](./assets/dbx-variable-grp-secrets.png)

Add a third variable group to save the name of our notebook and the folder where we will deploy it.

![](./assets/link-variable-grp-notebook.png)

Edit our release pipeline and link the variable groups we just created to the `Release` scope

- Go to Pipelines > Releases: select our pipeline and click on *Edit*.
- Link the variable groups to the Release scope

![](./assets/link-azure-secrets.png)

![](./assets/link-variable-grp-all.png)

Now we are ready to update our tasks

- Go to Tasks
- Add a `UsePythonVersion` task. **Put it above the `AzureResourceManagerTemplateDeployment` task**

  ![](./assets/use-python-task.png)

- After the `AzureResourceManagerTemplateDeployment` task, add a Bash script task

![](./assets/dbx-cli-task.png)

- Add a Bash task; rename it to `Databricks configure` and add the following code in the inline script.

```sh
databricks configure --token <<EOF
https://adb-..azuredatabricks.net
$(dbxtoken)
EOF
```

- Add a Bash task; rename it to `Import notebook into databricks` and add the following code in the inline script.

```sh
databricks workspace mkdirs /$(folder)
databricks workspace import --language PYTHON --format SOURCE --overwrite _demo-cicd/notebook/$(notebook-name)-$(Build.SourceVersion).py /$(folder)/$(notebook-name)-$(Build.SourceVersion).py
```

- Add a Bash task; rename it to `Import library into databricks` and add the following code in the inline script.
```sh
# Create a new directory
databricks fs mkdirs dbfs:/dbx-library
# Import the module
# databricks fs rm dbfs:/dbx-library/friends-0.0.1-py2.py3-none-any.whl
databricks fs cp _demo-cicd/wheel/friends-0.0.1-py2.py3-none-any.whl dbfs:/dbx-library/
```

- Add a Bash task; rename it to `Install library and attach it to the cluster` and add the following code in the inline script.

```sh
cluster_id=$(databricks clusters list --output JSON | jq '[ .clusters[] | { name: .cluster_name, id: .cluster_id, state: .state } ]' | jq '.[] | select(.name=="dbx-cluster")' | jq -r '.id')
# The above query has to be adapted if there is more than one cluster with the same name; they will be in different states.

echo "Cluster id: $cluster_id"

# Install the library
databricks libraries install --cluster-id $cluster_id --whl dbfs:/dbx-library/friends-0.0.1-py2.py3-none-any.whl
```

Our tasks end up looking like this:

![](./assets/final-tasks.png)

- Save the tasks and create a new release.
- Make sure your Databricks cluster is in the `RUNNING` state
- Then, deploy the release

![](./assets/release-pipeline-success-2.png)

Our notebook is deployed.

![](./assets/notebook-deployed.png)

Our library is imported and installed on our cluster.

![](./assets/library-installed.png)

![](./assets/pip-list.PNG)

We are now ready to play in our workspace.

## Conclusion

In this article, we deployed a Databricks notebook and a library using Azure DevOps to manage a Continuous Integration and Continuous Delivery pipeline. We saw that Azure Databricks is tightly integrated with other Microsoft Azure resources and services such as Key Vault, Storage Accounts and Azure Active Directory. For more complex workloads, you could consider integrating ETL tools like Azure Data Factory and its built-in linked service.
diff --git a/assets/add-artifacts.png b/assets/add-artifacts.png new file mode 100644 index 0000000..e6bba0e Binary files /dev/null and b/assets/add-artifacts.png differ diff --git a/assets/arm-template-deployment-0.png b/assets/arm-template-deployment-0.png new file mode 100644 index 0000000..347273a Binary files /dev/null and b/assets/arm-template-deployment-0.png differ diff --git a/assets/arm-template-deployment.png b/assets/arm-template-deployment.png new file mode 100644 index 0000000..f8bfc1f Binary files /dev/null and b/assets/arm-template-deployment.png differ diff --git a/assets/check-az-resources.png b/assets/check-az-resources.png new file mode 100644 index 0000000..b4a1454 Binary files /dev/null and b/assets/check-az-resources.png differ diff --git a/assets/create-release-btn.png b/assets/create-release-btn.png new file mode 100644 index 0000000..0dd81b8 Binary files /dev/null and b/assets/create-release-btn.png differ diff --git a/assets/create-scope.png b/assets/create-scope.png new file mode 100644 index 0000000..728be50 Binary files /dev/null and b/assets/create-scope.png differ diff --git a/assets/dbx-cli-task.png b/assets/dbx-cli-task.png new file mode 100644 index 0000000..9215dc1 Binary files /dev/null and b/assets/dbx-cli-task.png differ diff --git a/assets/dbx-service.png b/assets/dbx-service.png new file mode 100644 index 0000000..4bdfdbf Binary files /dev/null and b/assets/dbx-service.png differ diff --git a/assets/dbx-token.png b/assets/dbx-token.png new file mode 100644 index 0000000..219d13d Binary files /dev/null and b/assets/dbx-token.png differ diff --git a/assets/dbx-uri.png b/assets/dbx-uri.png new file mode 100644 index 0000000..6624c67 Binary files /dev/null and b/assets/dbx-uri.png differ diff --git a/assets/dbx-variable-grp-secrets.png b/assets/dbx-variable-grp-secrets.png new file mode 100644 index 0000000..2692c0d Binary files /dev/null and b/assets/dbx-variable-grp-secrets.png differ diff --git a/assets/deploy-release.png 
b/assets/deploy-release.png new file mode 100644 index 0000000..d299a0c Binary files /dev/null and b/assets/deploy-release.png differ diff --git a/assets/empty-job.png b/assets/empty-job.png new file mode 100644 index 0000000..4bb779a Binary files /dev/null and b/assets/empty-job.png differ diff --git a/assets/final-tasks.png b/assets/final-tasks.png new file mode 100644 index 0000000..7157111 Binary files /dev/null and b/assets/final-tasks.png differ diff --git a/assets/git-integration.png b/assets/git-integration.png new file mode 100644 index 0000000..b57659c Binary files /dev/null and b/assets/git-integration.png differ diff --git a/assets/import-repo.png b/assets/import-repo.png index 4251be7..0232da4 100644 Binary files a/assets/import-repo.png and b/assets/import-repo.png differ diff --git a/assets/library-installed.png b/assets/library-installed.png new file mode 100644 index 0000000..0274a84 Binary files /dev/null and b/assets/library-installed.png differ diff --git a/assets/lik-variable-grp-2.png b/assets/lik-variable-grp-2.png new file mode 100644 index 0000000..58db9d1 Binary files /dev/null and b/assets/lik-variable-grp-2.png differ diff --git a/assets/link-azure-secrets.png b/assets/link-azure-secrets.png new file mode 100644 index 0000000..26eac24 Binary files /dev/null and b/assets/link-azure-secrets.png differ diff --git a/assets/link-variable-grp-2.png b/assets/link-variable-grp-2.png new file mode 100644 index 0000000..d082dbc Binary files /dev/null and b/assets/link-variable-grp-2.png differ diff --git a/assets/link-variable-grp-all.png b/assets/link-variable-grp-all.png new file mode 100644 index 0000000..a5a04bc Binary files /dev/null and b/assets/link-variable-grp-all.png differ diff --git a/assets/link-variable-grp-notebook.png b/assets/link-variable-grp-notebook.png new file mode 100644 index 0000000..985a256 Binary files /dev/null and b/assets/link-variable-grp-notebook.png differ diff --git a/assets/link-variable-grp.png 
b/assets/link-variable-grp.png new file mode 100644 index 0000000..004d772 Binary files /dev/null and b/assets/link-variable-grp.png differ diff --git a/assets/new-pipeline-1.PNG b/assets/new-pipeline-1.PNG new file mode 100644 index 0000000..91f7aec Binary files /dev/null and b/assets/new-pipeline-1.PNG differ diff --git a/assets/new-pipeline-2.PNG b/assets/new-pipeline-2.PNG new file mode 100644 index 0000000..9228bc0 Binary files /dev/null and b/assets/new-pipeline-2.PNG differ diff --git a/assets/new-pipeline-3.PNG b/assets/new-pipeline-3.PNG new file mode 100644 index 0000000..844e546 Binary files /dev/null and b/assets/new-pipeline-3.PNG differ diff --git a/assets/new-pipeline-4.PNG b/assets/new-pipeline-4.PNG new file mode 100644 index 0000000..f3b9042 Binary files /dev/null and b/assets/new-pipeline-4.PNG differ diff --git a/assets/new-pipeline-5.PNG b/assets/new-pipeline-5.PNG new file mode 100644 index 0000000..1bc6f3e Binary files /dev/null and b/assets/new-pipeline-5.PNG differ diff --git a/assets/new-pipeline.PNG b/assets/new-pipeline.PNG new file mode 100644 index 0000000..942da35 Binary files /dev/null and b/assets/new-pipeline.PNG differ diff --git a/assets/new-project.png b/assets/new-project.png index 32d7d27..317a26b 100644 Binary files a/assets/new-project.png and b/assets/new-project.png differ diff --git a/assets/new-release-0.PNG b/assets/new-release-0.PNG new file mode 100644 index 0000000..2bda215 Binary files /dev/null and b/assets/new-release-0.PNG differ diff --git a/assets/new-release.png b/assets/new-release.png new file mode 100644 index 0000000..315fd0a Binary files /dev/null and b/assets/new-release.png differ diff --git a/assets/new-repo-1.png b/assets/new-repo-1.png new file mode 100644 index 0000000..8f85533 Binary files /dev/null and b/assets/new-repo-1.png differ diff --git a/assets/new-repo.PNG b/assets/new-repo.PNG new file mode 100644 index 0000000..b0cf57b Binary files /dev/null and b/assets/new-repo.PNG differ diff --git 
a/assets/new-service-connections-1.png b/assets/new-service-connections-1.png new file mode 100644 index 0000000..19468da Binary files /dev/null and b/assets/new-service-connections-1.png differ diff --git a/assets/new-service-connections-2.png b/assets/new-service-connections-2.png index b91f026..1494850 100644 Binary files a/assets/new-service-connections-2.png and b/assets/new-service-connections-2.png differ diff --git a/assets/new-service-connections-3.png b/assets/new-service-connections-3.png index 2178ea3..835ed5f 100644 Binary files a/assets/new-service-connections-3.png and b/assets/new-service-connections-3.png differ diff --git a/assets/new-service-connections.png b/assets/new-service-connections.png index 2b164b3..681ac4f 100644 Binary files a/assets/new-service-connections.png and b/assets/new-service-connections.png differ diff --git a/assets/new-variable-grp.png b/assets/new-variable-grp.png index fb0e9ae..9ec59a0 100644 Binary files a/assets/new-variable-grp.png and b/assets/new-variable-grp.png differ diff --git a/assets/notebook-deployed.png b/assets/notebook-deployed.png new file mode 100644 index 0000000..231c349 Binary files /dev/null and b/assets/notebook-deployed.png differ diff --git a/assets/param-template.PNG b/assets/param-template.PNG new file mode 100644 index 0000000..fb26798 Binary files /dev/null and b/assets/param-template.PNG differ diff --git a/assets/pip-list.PNG b/assets/pip-list.PNG new file mode 100644 index 0000000..a4f02a1 Binary files /dev/null and b/assets/pip-list.PNG differ diff --git a/assets/plus-task.png b/assets/plus-task.png new file mode 100644 index 0000000..55742e8 Binary files /dev/null and b/assets/plus-task.png differ diff --git a/assets/project-setting.PNG b/assets/project-setting.PNG new file mode 100644 index 0000000..c42025f Binary files /dev/null and b/assets/project-setting.PNG differ diff --git a/assets/release-pipeline-success-2.png b/assets/release-pipeline-success-2.png new file mode 100644 index 
0000000..b12a3b7 Binary files /dev/null and b/assets/release-pipeline-success-2.png differ diff --git a/assets/release-pipeline-success.png b/assets/release-pipeline-success.png new file mode 100644 index 0000000..4cecf74 Binary files /dev/null and b/assets/release-pipeline-success.png differ diff --git a/assets/run-pipeline.png b/assets/run-pipeline.png new file mode 100644 index 0000000..33d95d4 Binary files /dev/null and b/assets/run-pipeline.png differ diff --git a/assets/secret-scope.png b/assets/secret-scope.png new file mode 100644 index 0000000..145b702 Binary files /dev/null and b/assets/secret-scope.png differ diff --git a/assets/select-release.png b/assets/select-release.png new file mode 100644 index 0000000..570b14b Binary files /dev/null and b/assets/select-release.png differ diff --git a/assets/stage-dev.png b/assets/stage-dev.png new file mode 100644 index 0000000..b4643a0 Binary files /dev/null and b/assets/stage-dev.png differ diff --git a/assets/task-1.png b/assets/task-1.png new file mode 100644 index 0000000..7c85b32 Binary files /dev/null and b/assets/task-1.png differ diff --git a/assets/template-parameters.png b/assets/template-parameters.png new file mode 100644 index 0000000..7db09f8 Binary files /dev/null and b/assets/template-parameters.png differ diff --git a/assets/test-sa.PNG b/assets/test-sa.PNG new file mode 100644 index 0000000..8b64bb5 Binary files /dev/null and b/assets/test-sa.PNG differ diff --git a/assets/use-python-task.png b/assets/use-python-task.png new file mode 100644 index 0000000..d7757fd Binary files /dev/null and b/assets/use-python-task.png differ diff --git a/assets/variable-grp.png b/assets/variable-grp.png index 096ac8b..6ed6cf8 100644 Binary files a/assets/variable-grp.png and b/assets/variable-grp.png differ diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 166aa57..ea9b808 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -5,12 +5,8 @@ pool: vmImage: ubuntu-latest strategy: matrix: - 
Python36: - python.version: '3.6' Python37: python.version: '3.7' - Python38: - python.version: '3.8' variables: - name: notebook-name @@ -27,11 +23,11 @@ steps: displayName: 'Install dependencies' - script: | python -m pip install flake8 - flake8 ./src/ + flake8 ./friends/ displayName: 'Run lint tests' - script: | python -m pip install pytest pytest-azurepipeline - pytest test --doctest-modules --junitxml=junit/test-results.xml --cov=./src --cov-report=xml --cov-report=html + pytest test --doctest-modules --junitxml=junit/test-results.xml --cov=./friends --cov-report=xml --cov-report=html displayName: 'Test with pytest' - task: PublishTestResults@2 condition: succeededOrFailed() @@ -63,10 +59,10 @@ steps: artifactName: notebook - script: | mkdir -p "$(Build.ArtifactStagingDirectory)/wheel" - python3 -m pip install --upgrade build - python3 -m build - cp dist/friends-0.0.1-py3-none-any.whl "$(Build.ArtifactStagingDirectory)/wheel/friends-0.0.1-py3-none-any.whl" - displayName: 'Copy wheel artifact' + pip install wheel setuptools + python setup.py bdist_wheel --universal + cp dist/friends-0.0.1-py2.py3-none-any.whl "$(Build.ArtifactStagingDirectory)/wheel/friends-0.0.1-py2.py3-none-any.whl" + displayName: 'Publish build wheel artifact' - task: PublishBuildArtifacts@1 displayName: Publish Wheel Build Artifacts inputs: diff --git a/src/__init__.py b/build/lib/friends/__init__.py similarity index 100% rename from src/__init__.py rename to build/lib/friends/__init__.py diff --git a/src/friends.py b/build/lib/friends/friends.py similarity index 100% rename from src/friends.py rename to build/lib/friends/friends.py diff --git a/build/lib/src/__init__.py b/build/lib/src/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/build/lib/src/friends.py b/build/lib/src/friends.py new file mode 100644 index 0000000..ca87464 --- /dev/null +++ b/build/lib/src/friends.py @@ -0,0 +1,33 @@ +from pyspark.sql.dataframe import DataFrame +from pyspark.sql.session import 
SparkSession +from pyspark.sql.types import StructType, StringType, IntegerType, StructField + + +class Friends: + + def __init__(self, spark: SparkSession, file_path: str): + self.spark = spark + self.file_path = file_path + + def mount_dataset(self, path): + return 1 + + def load(self): + friendSchema = StructType([ + StructField('id', IntegerType()), + StructField('name', StringType()), + StructField('age', IntegerType()), + StructField('friends', StringType()) + ]) + return (self.spark.read + .format("csv") + .option("header", True) + .schema(friendSchema) + .load(self.file_path)) + + def save_as_parquet(self, df: DataFrame, file_name: str): + df.write.parquet(file_name) + + def create_table(self, df: DataFrame, table_name: str, file_name: str): + parquetFile = self.spark.read.parquet(file_name) + parquetFile.createOrReplaceTempView(table_name) diff --git a/cheat cheet.md b/cheat cheet.md deleted file mode 100644 index c2e10dd..0000000 --- a/cheat cheet.md +++ /dev/null @@ -1,317 +0,0 @@ -# Cheat cheet - -This is a step by Step walkthrough you can use to reproduce this experiment. - -Clone the repository - -```sh -git clone https://github.com/bngom/azure-databricks-cicd.git && cd azure-databricks-cicd -``` - -Create a python environment and install dependencies - -```sh -python -m venv dbcicd -``` - -Activate the virtual environment - -```sh -source dbcicd/bin/activate -``` - -Install requirements - -```sh -pip install -r requirement.txt -``` - -Run lint test - -```sh -python -m pip install flake8 -flake8 ./test/ ./src/ -``` - -Run unit test: - -```sh -python -m test -``` -> Note: Do this in the build pipeline phase and copy artifacts for the release step -Generating distribution archives for our library. - -``` -python3 -m pip install --upgrade build -python3 -m build -``` - -## Microsoft Azure - -Here we are using [Azure Cloud Shell]() from the directry we want to deploy our resources. You can install [Azure CLI]() to perform the same task. 
- - -Configure Azure CLI, your difault browser will prompt for you to enter your credentials. This will connect you to your default tenant. - -```sh -az login -``` - -Get details about your account. - -```sh -az account show -``` - -![](./assets/account-info.PNG) - -You can now if you wish connect to a specific directory. - -```sh -az login --tenant -``` - -### Prerequisites - -In our selected directory let us Create a resource Group - -```sh -az group create --name datatabricks-rg --location francecentral -``` - -Test the deployment against the resource group: this will only validate your template. The deployment will be done during the build phase in azure devops. - -```sh -cd template -az deployment group what-if --name TestDeployment --resource-group databricks-rg --template-file dbx.template.json --parameters @dbx.parameters.json -``` - -## Setting up Azure DevOps - -0. Go to dev.azure.com - -1. [Create an Azure DevOps Organization](https://docs.microsoft.com/en-us/azure/devops/organizations/accounts/create-organization?view=azure-devops) - -- Click on *New organization*, then click on Continue - -![](./assets/neworg.png) - -- Set the name of your Azure DevOps organization, then click on Continue. - -![](./assets/dbx-cicd-org.png) - - -2. [Create a Project in your Azure DevOps Organisation](https://docs.microsoft.com/en-us/azure/devops/organizations/projects/create-project?view=azure-devops&tabs=preview-page) - -- In your new organization, set your project name then click on *Create project* - -![](./assets/new-project.png) - -3. Import a repository - -- Click on *Repo*, then on *Import repository* - -![](./assets/import-repo.png) - -4. 
Set an Azure Resource Manager connection - -- Project Settings > Pipeline : Service connections > Create service connections - -![](./assets/create-service-connection.png) - -- Select Azure Resource Manager - -![](./assets/new-service-connections.png) - -- Select *Service principal (automatique)* - -![](./assets/new-service-connections-2.png) - -- Select your subscription and the resource group `databricks-rg` created previously and save. - -![](./assets/new-service-connections-3.png) - - -### Create a Build Pipeline - -Follow these steps to create a build pipeline. This operation will generate artifacts for our release pipeline. - -- Pipeline > Pipeline: CLick on Create pipeline -- Select *Azure GIt Repo* -- Select your repository -- Select *Python package* - -The `azure-pipelines.yml` file in your repository is automatically detected - -- Run your pipeline - -![](./assets/build-pipeline-success.png) - -Our build pipeline executed successfully and our artifacts are generated and ready to be consumed in a release pipelie. - -![](./assets/summary-build.png) - - -Before Creating our release pipeline let us do some configuration in azure devops. For security purpose we have dummy varibale in our template parameter file. We will use our variable group to overwrite them in our release pipeline. - -Create Variable Group: - -- Pipeline > Library > +Variable Group - -![](./assets/new-variable-grp.png) - -- Create the following variable - -![](./assets/variable-grp.png) - - - -### Create a Release Pipeline - -**Phase 1** - -Create a release to deploy resources (Databricks workspace, KeyVault, Storage Account and Container for blob storage) on Microsoft Azure. - -- Pipeline > Release: +New Piepline -- Update the Stage name to *Development* -- Add artifacts -- In variable: Link Variable Group for release and and stage -- In Task: Add ARM Resource Deployment and configure it accordingly giving the template and parameter files. 
And overwriting the variables: - - Deployment scope: Resource Group - - Template: Click on `more` and select dbx.template.json in the template artifacts - - Template parameters: select dbx.parameters.json in the template artifacts - -``` - -```sh -# Override template parameters --objectId "$(object_id)" -keyvaultName "$(keyvault)" -location "$(location)" -storageAccountName "$(sa_name)" -workspaceName "$(dbx_name)" -workspaceLocation "$(location)" -tier "premium" -sku "Standard" -tenant "$(tenant_id)" -networkAcls {"defaultAction":"Allow","bypass":"AzureServices","virtualNetworkRules":[],"ipRules":[]} -``` - -Generate a shared access signature token for the storage account. This will secure the communication between Azure Dtabricks and the object storage - -```sh -end=`date -u -d "10080 minutes" '+%Y-%m-%dT%H:%MZ'` -az storage account generate-sas \ - --permissions lruwap \ - --account-name dbxbnsa \ - --services b \ - --resource-types sco \ - --expiry $end \ - -o tsv -``` - -Copy the SAS token generated and copy it into the key vault - -```sh -az keyvault secret set --vault-name $keyvault_name --name "storagerw" --value "YOUR-SASTOKEN" -``` - -Let us few secrets in our the key vault - -```sh -#Let us gather some variable -tenantId=$(az account list | jq '.[].tenantId') -userId=$(az ad user list --filter "startswith(displayName, 'Barthelemy Diomaye NGOM')" | jq '.[].objectId') -subscriptionId=$(az account subscription list | jq '.[].subscriptionId') -``` - -```sh -# Example: az keyvault secret set --vault-name $keyvault_name --name "ExampleSecret" --value "dummyValues" -# Set my tenant id as a secret -az keyvault secret set --vault-name $keyvault_name --name "tenantId" --value $tenantId - -# Set my User id (Object ID) as a secret -az keyvault secret set --vault-name $keyvault_name --name "userId" --value $userId - -# Set my Subscription id as a secret -az keyvault secret set --vault-name $keyvault_name --name "subscriptionId" --value $subscriptionId - -az keyvault 
secret set --vault-name $keyvault_name --name "storageAccountName" --value "$storage_account" - -az keyvault secret set --vault-name $keyvault_name --name "storageAccountContainer" --value "$container" - -``` - -List your secrets - -```sh -az keyvault secret list --vault-name $keyvault_name --output table -``` - - - -Create a cluster in databricks - -```sh -cat <> create-cluster.json -{ - "num_workers": null, - "autoscale": { - "min_workers": 1, - "max_workers": 2 - }, - "cluster_name": "dbx-cluster", - "spark_version": "8.2.x-scala2.12", - "spark_conf": {}, - "azure_attributes": { - "first_on_demand": 1, - "availability": "ON_DEMAND_AZURE", - "spot_bid_max_price": -1 - }, - "node_type_id": "Standard_DS3_v2", - "ssh_public_keys": [], - "custom_tags": {}, - "spark_env_vars": { - "PYSPARK_PYTHON": "/databricks/python3/bin/python3" - }, - "autotermination_minutes": 120, - "cluster_source": "UI", - "init_scripts": [] -} -EOF - -databricks clusters create --json-file create-cluster.json -``` - - -Get my key vault id to be used as reference in my arm template parameter's file - -`az keyvault list | jq '.[].id'` - - -## Databricks CLI - -Install Databricks cli [here](https://docs.databricks.com/dev-tools/cli/index.html) - -Generate a databricks token [here](https://docs.databricks.com/dev-tools/api/latest/authentication.html#generate-a-personal-access-token) - -Configure your cli to interact with databricks - -```sh -databricks configure --token -``` - -Get the resource ID and the DNS name from your key vault properties - -```sh -vaultUri=$(az keyvault show --name $keyvault_name | jq '.properties.vaultUri') -vaultId=$(az keyvault show --name $keyvault_name | jq '.id') -``` - -## Create an Azure Key Vault-backed secret scope - -```sh -databricks secrets create-scope --scope demo-cicd --scope-backend-type AZURE_KEYVAULT --resource-id $vaultId --dns-name $vaultUri -``` - -## Resources - -https://docs.python.org/fr/3/distributing/index.html 
-https://github.com/Azure-Samples/azure-sdk-for-python-storage-blob-upload-download -https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python-legacy -https://stackoverflow.com/questions/53217767/py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-getencryptionen -https://docs.microsoft.com/fr-fr/azure/azure-resource-manager/templates/template-tutorial-use-key-vault -https://build5nines.com/azure-cli-2-0-generate-sas-token-for-blob-in-azure-storage/ -https://docs.microsoft.com/fr-fr/azure/databricks/scenarios/store-secrets-azure-key-vault - diff --git a/create-cluster.json b/create-cluster.json index d0f201e..bc5724d 100644 --- a/create-cluster.json +++ b/create-cluster.json @@ -1,24 +1,24 @@ { - "num_workers": null, - "autoscale": { - "min_workers": 2, - "max_workers": 8 - }, - "cluster_name": "dbx-cluster", - "spark_version": "8.2.x-scala2.12", - "spark_conf": {}, - "azure_attributes": { - "first_on_demand": 1, - "availability": "ON_DEMAND_AZURE", - "spot_bid_max_price": -1 - }, - "node_type_id": "Standard_DS3_v2", - "ssh_public_keys": [], - "custom_tags": {}, - "spark_env_vars": { - "PYSPARK_PYTHON": "/databricks/python3/bin/python3" - }, - "autotermination_minutes": 120, - "cluster_source": "UI", - "init_scripts": [] + "num_workers": null, + "autoscale": { + "min_workers": 2, + "max_workers": 8 + }, + "cluster_name": "dbx-cluster", + "spark_version": "8.2.x-scala2.12", + "spark_conf": {}, + "azure_attributes": { + "first_on_demand": 1, + "availability": "ON_DEMAND_AZURE", + "spot_bid_max_price": -1 + }, + "node_type_id": "Standard_DS3_v2", + "ssh_public_keys": [], + "custom_tags": {}, + "spark_env_vars": { + "PYSPARK_PYTHON": "/databricks/python3/bin/python3" + }, + "autotermination_minutes": 30, + "cluster_source": "UI", + "init_scripts": [] } diff --git a/demo.py b/demo.py new file mode 100644 index 0000000..933bace --- /dev/null +++ b/demo.py @@ -0,0 +1,92 @@ +# Databricks notebook source +dbutils.secrets.listScopes() + 
+# COMMAND ---------- + +dbutils.secrets.list("demo") + +# COMMAND ---------- + +print(dbutils.secrets.get(scope="demo", key="storagerw")) + +# COMMAND ---------- + +# Unmount directory if previously mounted. +MOUNTPOINT = "/mnt/commonfiles" +if MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]: + dbutils.fs.unmount(MOUNTPOINT) + +# Add the Storage Account, Container, and reference the secret to pass the SAS Token +STORAGE_ACCOUNT = dbutils.secrets.get(scope="demo", key="storageaccount") +CONTAINER = dbutils.secrets.get(scope="demo", key="container") +SASTOKEN = dbutils.secrets.get(scope="demo", key="storagerw") +SOURCE = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT) +URI = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT) + +try: + dbutils.fs.mount( + source=SOURCE, + mount_point=MOUNTPOINT, + extra_configs={URI:SASTOKEN}) +except Exception as e: + if "Directory already mounted" in str(e): + pass # Ignore error if already mounted. + else: + raise e + +display(dbutils.fs.ls(MOUNTPOINT)) + +# COMMAND ---------- + +friendsDF = (spark.read + .option("header", True) + .option("inferSchema", True) + .csv(MOUNTPOINT + "/friends.csv")) + +display(friendsDF) +# Databricks notebook source +dbutils.secrets.listScopes() + +# COMMAND ---------- + +dbutils.secrets.list("demo") + +# COMMAND ---------- + +print(dbutils.secrets.get(scope="demo", key="storagerw")) + +# COMMAND ---------- + +# Unmount directory if previously mounted. 
+MOUNTPOINT = "/mnt/commonfiles" +if MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]: + dbutils.fs.unmount(MOUNTPOINT) + +# Add the Storage Account, Container, and reference the secret to pass the SAS Token +STORAGE_ACCOUNT = dbutils.secrets.get(scope="demo", key="storageaccount") +CONTAINER = dbutils.secrets.get(scope="demo", key="container") +SASTOKEN = dbutils.secrets.get(scope="demo", key="storagerw") +SOURCE = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT) +URI = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT) + +try: + dbutils.fs.mount( + source=SOURCE, + mount_point=MOUNTPOINT, + extra_configs={URI:SASTOKEN}) +except Exception as e: + if "Directory already mounted" in str(e): + pass # Ignore error if already mounted. + else: + raise e + +display(dbutils.fs.ls(MOUNTPOINT)) + +# COMMAND ---------- + +friendsDF = (spark.read + .option("header", True) + .option("inferSchema", True) + .csv(MOUNTPOINT + "/friends.csv")) + +display(friendsDF) diff --git a/friends.egg-info/PKG-INFO b/friends.egg-info/PKG-INFO new file mode 100644 index 0000000..9af6398 --- /dev/null +++ b/friends.egg-info/PKG-INFO @@ -0,0 +1,629 @@ +Metadata-Version: 2.1 +Name: friends +Version: 0.0.1 +Summary: A small example package +Home-page: https://github.com/bngom/azure-databricks-cicd +Author: barthelemy +Author-email: barthelemy@adaltas.com +License: UNKNOWN +Project-URL: Bug Tracker, https://github.com/bngom/azure-databricks-cicd/issues +Description: --- + permalink: cicd-for-databricks-with-azure-devops + description: >- + How to implement a cicd pipeline to deploy notebooks and libraries in Azure Databricks using Azure DevOps + author: barthelemy + image: feature/ + categories: + - devops + tags: + - git + - release + - gitops + + --- + + # Azure Databricks CI/CD pipeline using Azure DevOps + + Throughout the Development lifecycle 
of an application, [CI/CD](https://en.wikipedia.org/wiki/CI/CD) is a [DevOps](/en/tag/devops) process enforcing automation in building, testing and deploying applications. Development and Operation teams can leverage the advantages of CI/CD to deliver releases more frequently and reliably, in a timely manner, while ensuring quick iterations. + + CI/CD is becoming a necessary process for data engineering and data science teams to deliver valuable data projects and increase confidence in the quality of the outcomes. With [Azure Databricks](https://azure.microsoft.com/en-gb/services/databricks/) you can use solutions like Azure DevOps, GitLab, GitHub Actions or Jenkins to build a CI/CD pipeline to reliably build, test, and deploy your notebooks and libraries. + + In this article we will guide you step by step to create an effective CI/CD pipeline using Azure DevOps to deploy a simple notebook and a library to Azure Databricks. We will show how to manage sensitive data during the process using [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) and how to secure the communication between Azure Databricks and our object storage. + + ## Description of our pipeline + + Our stack: + + - Azure Databricks + - Azure DevOps: to build, test and deploy our artifacts (notebook and library) to Azure Databricks + - Azure Data Lake Storage: to store the dataset that will be consumed by Azure Databricks + - Azure Key Vault: to store sensitive data + + ## Prerequisites + + 1. An Azure account and an active subscription. You can create a free account [here](). + 2.
An Azure DevOps organization that will hold a project for our repository and our pipeline assets + + Clone the repository + + ```sh + git clone https://github.com/bngom/azure-databricks-cicd.git && cd azure-databricks-cicd + ``` + + Create a Python virtual environment + + ```sh + python -m venv dbcicd + ``` + + Activate the virtual environment + + ```sh + source dbcicd/bin/activate + ``` + + Install the requirements + + ```sh + pip install -r requirements.txt + ``` + + Run the lint tests + + ```sh + python -m pip install flake8 + flake8 ./test/ ./src/ + ``` + + Run the unit tests: + + ```sh + python -m test + ``` + + ## Setting up Azure CLI + + You can use [Azure Cloud Shell](https://azure.microsoft.com/en-us/features/cloud-shell/) from the directory in which you would like to deploy your resources, or you can [install Azure CLI](https://docs.microsoft.com/fr-fr/cli/azure/install-azure-cli-linux?pivots=apt). + + ```sh + curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash + ``` + + Configure Azure CLI; your default browser will prompt you to enter your credentials. This will connect you to your default tenant. + + ```sh + az login + ``` + + Get details about your Azure account. + + ```sh + az account show + ``` + + ![](./assets/account-info.png) + + If you wish, you can now connect to another directory by specifying the `tenant_id`. + + ```sh + az login --tenant + ``` + + ### Prerequisites for Azure + + - Resource group: in our selected directory, let us create a resource group + + ```sh + rg_name="databricks-rg" + location="francecentral" + az group create --name $rg_name --location $location + ``` + + - Test the validity of our Azure Resource Manager (ARM) template: you will find in the folder `./template` a template file `dbx.template.json` that describes the resources we want to deploy on Azure and a parameter file `dbx.parameters.json`.
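Before running the what-if validation below, the placeholder values in the parameter file can be filled in locally. A hypothetical sketch using `sed`; the placeholder names and the minimal stand-in parameter file are illustrative, not the actual file contents:

```sh
# Hypothetical sketch: substitute <placeholder> values in the parameter file
# before local validation. Placeholder names below are assumptions.
tenant_id="00000000-0000-0000-0000-000000000000"
sa_name="dbxbnsa"
# For the sketch only: a minimal stand-in parameter file.
printf '{ "tenant": "<tenant_id>", "storageAccountName": "<storage_account_name>" }\n' > dbx.parameters.json
# Replace the placeholders and write a local copy, keeping the original intact.
sed -e "s/<tenant_id>/$tenant_id/" \
    -e "s/<storage_account_name>/$sa_name/" \
    dbx.parameters.json > dbx.parameters.local.json
cat dbx.parameters.local.json
```

Keeping the substituted copy out of version control avoids committing real identifiers.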
+ + > For our parameter file `dbx.parameters.json` we chose not to store sensitive information, such as the tenant ID, in plain text. We use a templating form like `<tenant_id>`. These values will be overridden in Azure DevOps during the build pipeline. But to test the validity of our ARM template, make sure to replace all `<placeholder>` values with valid ones. This includes: `<object_id>`, `<keyvault_name>`, `<storage_account_name>`, `<workspace_name>`, `<tenant_id>`. E.g., for `<storage_account_name>` you can replace it with `dbxbnsa`... + > + > ![](./assets/param-template.PNG) + > + > To get your `objectId`, run `az ad signed-in-user show | jq -r '.objectId'` + + With the command below, we test the deployment of our ARM template against the resource group using the `what-if` option. This will only validate your template. The deployment will be done during the build phase in Azure DevOps. + + ```sh + cd template + az deployment group what-if --name TestDeployment --resource-group $rg_name --template-file dbx.template.json --parameters @dbx.parameters.json + ``` + + ## Setting up Azure DevOps + + In this section we will create an Azure DevOps organization, create a project and upload our repository. We will also set up an *Azure Resource Manager connection* in Azure DevOps to authorize communication between Azure DevOps and the Azure Resource Manager. + + ### Configure the DevOps environment + + 1. Go to [dev.azure.com](https://dev.azure.com) + 2. [Create an Azure DevOps Organization](https://docs.microsoft.com/en-us/azure/devops/organizations/accounts/create-organization?view=azure-devops) + + - Click on *New organization*, then click on *Continue* + + ![](./assets/neworg.png) + + - Set the name of your Azure DevOps organization, then click on *Continue*. + + ![](./assets/dbx-cicd-org.png) + + 3. [Create a Project](https://docs.microsoft.com/en-us/azure/devops/organizations/projects/create-project?view=azure-devops&tabs=preview-page) in your Azure DevOps organization + + - In your new organization, set your project name, then click on *Create project* + + ![](./assets/new-project.png) + + 4.
Import a repository + + - Click on your new project + + ![](./assets/new-repo.png) + + - Click on *Repo* + + ![](./assets/new-repo-1.PNG) + + - then on *Import repository* and use the URL `https://github.com/bngom/azure-databricks-cicd.git` + + ![](./assets/import-repo.png) + + 5. Set an Azure Resource Manager connection + + - Project Settings > Pipeline : Service connections > Create service connections + + ![](./assets/new-service-connections.png) + + - Select Azure Resource Manager + + ![](./assets/new-service-connections-1.png) + + - Select *Service principal (automatic)* + + ![](./assets/new-service-connections-2.png) + + - Select your subscription and the resource group `databricks-rg` created previously and save. + + ![](./assets/new-service-connections-3.png) + + ### Create a Build Pipeline + + We are all set to create a build pipeline. This operation will generate artifacts to be consumed in our release pipeline. + + - Pipelines > Pipelines: Click on *New pipeline* + + ![](./assets/new-pipeline.PNG) + + - Select *Azure Repos Git* + + ![](./assets/new-pipeline-1.PNG) + + - Select your repository + + ![](./assets/new-pipeline-2.PNG) + + - Select *Existing Azure Pipelines YAML file* and select the `azure-pipelines.yml` file from your repository. + + ![](./assets/new-pipeline-3.PNG) + + ![](./assets/new-pipeline-4.PNG) + + The `azure-pipelines.yml` file in your repository is automatically detected. The build pipeline is composed of steps that include tasks and scripts to be executed against the Python versions configured in the matrix. The pipeline first installs all requirements and runs the unit tests. If they succeed, it publishes the ARM template and notebook artifacts, then builds and publishes a Python library. There is also a task that evaluates test coverage. + + ![](./assets/new-pipeline-5.PNG) + + - Now you can run your build pipeline. Click on *Run* + + The build pipeline executed successfully and artifacts are generated and ready to be consumed in a release pipeline.
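As an alternative to the *Run* button, the build can also be triggered from a terminal. This is a sketch only: it assumes the Azure DevOps CLI extension is installed and logged in, and the organization and project names are placeholders you must replace:

```sh
# Hypothetical: trigger the build pipeline from the CLI.
# Assumes the azure-devops CLI extension; <your-organization> and
# <your-project> are placeholders, not real names from this article.
az extension add --name azure-devops
az pipelines run \
  --organization https://dev.azure.com/<your-organization> \
  --project <your-project> \
  --name demo-cicd
```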
+ + ![](./assets/build-pipeline-success.png) + + ![](./assets/summary-build.png) + + Before creating a release pipeline, let us do some configuration in Azure DevOps. For security purposes, we have dummy variables in our template parameter file `dbx.parameters.json`. We will use a *variable group* to overwrite these dummy variables during the release process. + + Create a variable group: + + - Pipeline > Library > +Variable Group + + ![](./assets/new-variable-grp.png) + + - Create the following variables, and save + + ![](./assets/variable-grp.png) + + ### Create a Release Pipeline + + ***PHASE 1*** + + In this section, we will create a release to deploy resources (Databricks workspace, key vault, storage account and container for blob storage) on Microsoft Azure. + + - Pipelines > Release: New release pipeline + + ![](./assets/new-release-0.png) + + - On the right blade, for *Select a template* click on *Empty Job* + + ![](./assets/empty-job.png) + + - Update the stage name to *Development* + + ![](./assets/stage-dev.png) + + - Click on *Add artifacts* and select the source build pipeline `demo-cicd`, then click on *Add* + + ![](./assets/add-artifacts.png) + + - Click on *Variables* and link our variable group to the *Development* stage + + ![](./assets/link-variable-grp.png) + + ![](./assets/link-variable-grp-2.png) + + - Now, click on *Tasks*: + + - Check the agent job setup: make sure the Agent Specification is set to `ubuntu-20.04` + - Click on the `+` sign near the agent job + + ![](./assets/plus-task.png) + + - Add an ARM Template Deployment task and configure it accordingly, giving the template and parameter files.
and overriding the variables: + ![](./assets/arm-template-deployment-0.png) + - `Deployment scope`: Resource Group + - `Resource manager connection`: Select your Azure Resource Manager connection and click on *Authorize* + - `Subscription`: Select your subscription + - `Action`: Create or update resource group + - `Resource group`: Select the resource group created previously or use the variable $(rg_name) + - `Location`: likewise, or use $(location) + - `Template`: Click on `more` and select `dbx.template.json` in the template artifacts + - `Template parameters`: select `dbx.parameters.json` in the template artifacts + - `Override template parameters`: Update the values of the parameters with the variables we created in our variable group. + + ![](./assets/template-parameters.png) + + Or you can paste the following line: + + ```sh + -objectId $(object_id) -keyvaultName $(keyvault) -location $(location) -storageAccountName $(sa_name) -containerName $(container) -workspaceName $(workspace) -workspaceLocation $(location) -tier "premium" -sku "Standard" -tenant $(tenant_id) -networkAcls {"defaultAction":"Allow","bypass":"AzureServices","virtualNetworkRules":[],"ipRules":[]} + ``` + + - Deployment mode: Incremental + - Now, save your configuration + + ![](./assets/arm-template-deployment.png) + + We are now ready to create a first release + + - Click on the button: *Create release* + + ![](./assets/create-release-btn.png) + + - Stages for a trigger change from automated to manual: select *Development* + + ![](./assets/new-release.png) + + - Click on *Create* + - Go to Pipelines > Release + - Select our release pipeline + + ![](./assets/select-release.png) + + - Click on the newly created release and click on *Deploy* + + ![](./assets/deploy-release.png) + + Our pipeline executed successfully + + ![](./assets/release-pipeline-success.png) + + And our resources are deployed in Azure.
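The deployment can also be verified from the CLI instead of the portal; a quick check, assuming the resource group name used earlier:

```sh
# List the resources the release deployed into the resource group.
# The key vault, Databricks workspace and storage account should appear.
az resource list --resource-group databricks-rg --output table
```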
Go to [Azure Portal](https://portal.azure.com/) and check the content of our resource group. You will find in the resource group a key vault, a Databricks workspace, a storage account and a container in the storage account. + + ![](./assets/check-az-resources.png) + + ***PHASE 2*** + + Before completing the deployment of our release pipeline, let us, for security concerns, secure the communication between Databricks and the other Azure resources. + + **Generate Databricks Token** + + Log into the Databricks workspace, go to User settings (icon in the top right corner) and select “Generate New Token”. Choose a descriptive name and copy the token to a notebook or clipboard. The token is displayed just once. + + ![](./assets/dbx-token.png) + + Make sure that for Git integration, the Git provider is set to `Azure DevOps Services` + + ![](./assets/git-integration.png) + + The following commands show you how to copy the new token and save it into your key vault. + > At the same time we will save the URI of our Databricks service; you can find it in the Azure portal. + > + > ![](./assets/dbx-service.png) + + ```sh + dbxtoken="your databricks token" + keyvault_name="your keyvault name" + dbxuri="https://adb-<workspace-id>.<random-number>.azuredatabricks.net" + az keyvault secret set --vault-name $keyvault_name --name "dbxtoken" --value $dbxtoken + az keyvault secret set --vault-name $keyvault_name --name "dbxuri" --value $dbxuri + ``` + + **Storage account SAS token** + + Generate a shared access signature token for the storage account. This will secure the communication between Azure Databricks and the object storage. + + ```sh + sa_name="your storage account" + end=`date -u -d "10080 minutes" '+%Y-%m-%dT%H:%MZ'` + az storage account generate-sas \ + --permissions lruwap \ + --account-name $sa_name \ + --services b \ + --resource-types sco \ + --expiry $end \ + -o json + ``` + + Your SAS token is generated; copy it, along with the storage account and container names, into the key vault.
+ + ```sh + sastoken="your sas token" + container="your container" + az keyvault secret set --vault-name $keyvault_name --name "storagerw" --value $sastoken + az keyvault secret set --vault-name $keyvault_name --name "storageaccount" --value $sa_name + az keyvault secret set --vault-name $keyvault_name --name "container" --value $container + ``` + + List your secrets + + ```sh + az keyvault secret list --vault-name $keyvault_name --output table + ``` + + **Configure Databricks CLI** + + ```sh + pip install databricks-cli + ``` + + Configure your CLI to interact with Databricks. You will need to enter the URI and the token generated earlier. + + ```sh + databricks configure --token + # https://adb-<workspace-id>.<random-number>.azuredatabricks.net + # your-databricks-token + + ``` + + **Create a cluster in Databricks** + + Once the connection is established, we can create a cluster in Databricks + + ```sh + rm -f create-cluster.json + cat <<EOF >> create-cluster.json + { + "num_workers": null, + "autoscale": { + "min_workers": 2, + "max_workers": 8 + }, + "cluster_name": "dbx-cluster", + "spark_version": "8.2.x-scala2.12", + "spark_conf": {}, + "azure_attributes": { + "first_on_demand": 1, + "availability": "ON_DEMAND_AZURE", + "spot_bid_max_price": -1 + }, + "node_type_id": "Standard_DS3_v2", + "ssh_public_keys": [], + "custom_tags": {}, + "spark_env_vars": { + "PYSPARK_PYTHON": "/databricks/python3/bin/python3" + }, + "autotermination_minutes": 30, + "cluster_source": "UI", + "init_scripts": [] + } + EOF + + databricks clusters create --json-file create-cluster.json + ``` + + **Create an Azure Key Vault-backed secret scope in Databricks** + + Get the resource ID and the DNS name from your key vault properties and create a secret scope in Azure Databricks.
+ + ```sh + keyvault_name="dbx-bn-keyvault" + vaultUri=$(az keyvault show --name $keyvault_name | jq -r '.properties.vaultUri') + vaultId=$(az keyvault show --name $keyvault_name | jq -r '.id') + databricks secrets create-scope --scope demo --scope-backend-type AZURE_KEYVAULT --resource-id $vaultId --dns-name $vaultUri + # List the scope(s) + databricks secrets list-scopes + ``` + + Oops! The above command raised an error: `Scope with Azure KeyVault must have userAADToken defined!`. It seems to be a bug. But no worries, you can still create the scope from the Databricks workspace using the following URL: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net/?o=<workspace-id>#secrets/createScope`. + + ![](./assets/create-scope.png) + + And you can test if your secrets are accessible from your workspace. + + ![](./assets/secret-scope.png) + + Great! Let's test whether, from our workspace, we can mount the container of our storage account and read a dataset we uploaded inside. + + ```sh + az storage blob upload \ + --name friends.csv \ + --account-name $sa_name \ + --container-name $container \ + --file ./data/friends.csv + ``` + + Upload the following notebook into your workspace and test if you can securely access the dataset uploaded in your storage account container. + + ```sh + cat <<EOF >> demo.py + # Databricks notebook source + dbutils.secrets.listScopes() + + # COMMAND ---------- + + dbutils.secrets.list("demo") + + # COMMAND ---------- + + print(dbutils.secrets.get(scope="demo", key="storagerw")) + + # COMMAND ---------- + + # Unmount directory if previously mounted.
+MOUNTPOINT = "/mnt/commonfiles"
+if MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
+    dbutils.fs.unmount(MOUNTPOINT)
+
+# Add the Storage Account, Container, and reference the secret to pass the SAS Token
+STORAGE_ACCOUNT = dbutils.secrets.get(scope="demo", key="storageaccount")
+CONTAINER = dbutils.secrets.get(scope="demo", key="container")
+SASTOKEN = dbutils.secrets.get(scope="demo", key="storagerw")
+SOURCE = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
+URI = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
+
+try:
+    dbutils.fs.mount(
+        source=SOURCE,
+        mount_point=MOUNTPOINT,
+        extra_configs={URI: SASTOKEN})
+except Exception as e:
+    if "Directory already mounted" in str(e):
+        pass  # Ignore error if already mounted.
+    else:
+        raise e
+
+display(dbutils.fs.ls(MOUNTPOINT))
+
+# COMMAND ----------
+
+friendsDF = (spark.read
+    .option("header", True)
+    .option("inferSchema", True)
+    .csv(MOUNTPOINT + "/friends.csv"))
+
+display(friendsDF)
+EOF
+```
+
+We can successfully access data stored in the storage account's container.
+
+![](./assets/test-sa.png)
+
+**Release Pipeline: continued**
+
+We can now finish the configuration of our release pipeline. Let's first create a new variable group and link it to our Azure Key Vault.
+
+- Go to Pipelines > Library
+- Click on *Variables* and add a new variable group
+- Toggle `Link secrets from an Azure vault as variables`
+- Select your subscription
+- Select the key vault name
+- Add variables from the key vault
+
+![](./assets/dbx-variable-grp-secrets.png)
+
+Add a third variable group to save the name of our notebook and the folder where we will deploy it.
+
+![](./assets/link-variable-grp-notebook.png)
+
+Edit our release pipeline and link the variable groups we just created to the `Release` scope:
+
+- Go to Pipelines > Releases, select our pipeline, and click *Edit*.
+- Link the variable groups to the Release scope
+
+![](./assets/link-azure-secrets.png)
+
+![](./assets/link-variable-grp-all.png)
+
+Now we are ready to update our tasks:
+
+- Go to Tasks
+- Add a `UsePythonVersion` task. **Put it above the `AzureResourceManagerTemplateDeployment` task**
+
+![](./assets/use-python-task.png)
+
+- After the `AzureResourceManagerTemplateDeployment` task, add a Bash script
+
+![](./assets/dbx-cli-task.png)
+
+- Add a Bash task; rename it to `Databricks configure` and add the following code in the Inline Script:
+
+```sh
+databricks configure --token <<EOF
+https://adb-..azuredatabricks.net
+$(dbxtoken)
+EOF
+```
+
+- Add a Bash task; rename it to `Import notebook into databricks` and add the following code in the Inline Script:
+
+```sh
+databricks workspace mkdirs /$(folder)
+databricks workspace import --language PYTHON --format SOURCE --overwrite _demo-cicd/notebook/$(notebook-name)-$(Build.SourceVersion).py /$(folder)/$(notebook-name)-$(Build.SourceVersion).py
+```
+
+- Add a Bash task; rename it to `Import library into databricks` and add the following code in the Inline Script:
+
+```sh
+# create a new directory
+databricks fs mkdirs dbfs:/dbx-library
+# Import the module
+# databricks fs rm _demo-cicd//wheel/friends-0.0.1-py3-none-any.whl
+databricks fs cp _demo-cicd/wheel/friends-0.0.1-py3-none-any.whl dbfs:/dbx-library/
+```
+
+- Add a Bash task; rename it to `Install Library and attach it to the cluster` and add the following code in the Inline Script:
+
+```sh
+cluster_id=$(databricks clusters list --output JSON | jq '[ .clusters[] | { name: .cluster_name, id: .cluster_id, state: .state } ]' | jq '.[] | select(.name=="dbx-cluster")' | jq -r '.id')
+# The above query has to be adapted if there is more than one cluster with the same name (they will be in different states).
+
+echo "Cluster id: $cluster_id"
+
+# Install library
+databricks libraries install --cluster-id $cluster_id --whl dbfs:/dbx-library/friends-0.0.1-py3-none-any.whl
+```
+
+Our tasks end up looking like this:
+
+![](./assets/final-tasks.png)
+
+- Save the tasks and create a release pipeline.
+- Make sure your Databricks cluster is in the `RUNNING` state
+- Then, deploy the release
+
+![](./assets/release-pipeline-success-2.png)
+
+Our notebook is deployed.
+
+![](./assets/notebook-deployed.png)
+
+Our library is imported and installed on our cluster.
+
+![](./assets/library-installed.png)
+
+![](./assets/pip-list.PNG)
+
+We are now ready to play in our workspace.
+
+## Conclusion
+
+In this article, we deployed a Databricks notebook and a library using Azure DevOps to manage a Continuous Integration and Continuous Delivery pipeline. We saw that Azure Databricks is tightly integrated with other Microsoft Azure resources and services such as Key Vault, Storage Account, and Azure Active Directory. For more complex workloads, you could consider integrating ETL tools like Azure Data Factory and its built-in linked services.
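As a side note, the jq one-liner in the `Install Library and attach it to the cluster` task assumes the cluster name `dbx-cluster` is unique. The sketch below (a hypothetical helper, not part of the pipeline above) shows one way to make that lookup robust in plain Python, preferring a `RUNNING` cluster when several share the same name:

```python
from typing import Optional


def pick_cluster_id(clusters: list, name: str) -> Optional[str]:
    """Return the id of the cluster called `name`.

    If several clusters share the name (e.g. a recreated cluster),
    prefer one in the RUNNING state, mirroring the caveat noted
    next to the jq query above.
    """
    matches = [c for c in clusters if c.get("cluster_name") == name]
    if not matches:
        return None
    # Sort key is False (first) for RUNNING clusters, True otherwise,
    # so a RUNNING cluster wins over TERMINATED/PENDING ones.
    matches.sort(key=lambda c: c.get("state") != "RUNNING")
    return matches[0]["cluster_id"]


# Example payload shaped like `databricks clusters list --output JSON`.
clusters = [
    {"cluster_name": "dbx-cluster", "cluster_id": "old-1", "state": "TERMINATED"},
    {"cluster_name": "dbx-cluster", "cluster_id": "new-2", "state": "RUNNING"},
]
print(pick_cluster_id(clusters, "dbx-cluster"))  # → new-2
```

The same logic could replace the jq pipeline in the Bash task if you are comfortable shelling out to Python on the build agent.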
+ +Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: License :: OSI Approved :: MIT License +Classifier: Operating System :: OS Independent +Description-Content-Type: text/markdown diff --git a/friends.egg-info/SOURCES.txt b/friends.egg-info/SOURCES.txt new file mode 100644 index 0000000..c071ecb --- /dev/null +++ b/friends.egg-info/SOURCES.txt @@ -0,0 +1,10 @@ +README.md +pyproject.toml +setup.py +friends/__init__.py +friends/friends.py +friends.egg-info/PKG-INFO +friends.egg-info/SOURCES.txt +friends.egg-info/dependency_links.txt +friends.egg-info/top_level.txt +test/test_friends.py \ No newline at end of file diff --git a/src/friend_pkg_barthelemy.egg-info/dependency_links.txt b/friends.egg-info/dependency_links.txt similarity index 100% rename from src/friend_pkg_barthelemy.egg-info/dependency_links.txt rename to friends.egg-info/dependency_links.txt diff --git a/src/friend_pkg_barthelemy.egg-info/top_level.txt b/friends.egg-info/top_level.txt similarity index 100% rename from src/friend_pkg_barthelemy.egg-info/top_level.txt rename to friends.egg-info/top_level.txt diff --git a/friends/__init__.py b/friends/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/friends/friends.py b/friends/friends.py new file mode 100644 index 0000000..ca87464 --- /dev/null +++ b/friends/friends.py @@ -0,0 +1,33 @@ +from pyspark.sql.dataframe import DataFrame +from pyspark.sql.session import SparkSession +from pyspark.sql.types import StructType, StringType, IntegerType, StructField + + +class Friends: + + def __init__(self, spark: SparkSession, file_path: str): + self.spark = spark + self.file_path = file_path + + def mount_dataset(self, path): + return 1 + + def load(self): + friendSchema = StructType([ + StructField('id', IntegerType()), + StructField('name', StringType()), + StructField('age', IntegerType()), + StructField('friends', StringType()) + ]) + return (self.spark.read + .format("csv") + .option("header", True) + 
.schema(friendSchema) + .load(self.file_path)) + + def save_as_parquet(self, df: DataFrame, file_name: str): + df.write.parquet(file_name) + + def create_table(self, df: DataFrame, table_name: str, file_name: str): + parquetFile = self.spark.read.parquet(file_name) + parquetFile.createOrReplaceTempView(table_name) diff --git a/notebook/friends-notebook.py b/notebook/friends-notebook.py index 7309bee..e9648dc 100644 --- a/notebook/friends-notebook.py +++ b/notebook/friends-notebook.py @@ -1,16 +1,23 @@ +# Databricks notebook source +# MAGIC %md +# MAGIC # Demo CICD with Databricks and Azure DevOps + # COMMAND ---------- -%md -# Demo CICD with Databricks and Azure DevOps + +# MAGIC %pip list # COMMAND ---------- + # Import our library -import friends as f +from friends import friends as f # COMMAND ---------- -%md -## Mount the Azure Storage Account Container + +# MAGIC %md +# MAGIC ## Mount the Azure Storage Account Container # COMMAND ---------- + # Mount Azure Blob # Unmount directory if previously mounted. 
MOUNTPOINT = "/mnt/adaltas" @@ -18,9 +25,9 @@ dbutils.fs.unmount(MOUNTPOINT) # Add the Storage Account, Container, and reference the secret to pass the SAS Token -STORAGE_ACCOUNT = dbutils.secrets.get(scope="demo-cicd", key="sabndatabricks") -CONTAINER = "sabnblob" -SASTOKEN = dbutils.secrets.get(scope="demo-cicd", key="storagerw") +STORAGE_ACCOUNT = dbutils.secrets.get(scope="demo", key="storageaccount") +CONTAINER = dbutils.secrets.get(scope="demo", key="container") +SASTOKEN = dbutils.secrets.get(scope="demo", key="storagerw") # Do not change these values SOURCE = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT) @@ -37,12 +44,16 @@ else: raise e -display(dbutils.fs.ls(MOUNTPOINT)) +# display(dbutils.fs.ls()) + + +# COMMAND ---------- +display(dbutils.fs.ls("/mnt/adaltas/")) # COMMAND ---------- -f_obj = f.Friends(spark, "/mnt/adaltas") +f_obj = f.Friends(spark=spark, file_path="/mnt/adaltas") # COMMAND ---------- @@ -54,8 +65,19 @@ # COMMAND ---------- -f_obj.create_table(df, "friends") +file_name="/tmp/friends.parquet" +dbutils.fs.rm(file_name, True) +f_obj.save_as_parquet(df=df, file_name=file_name) # COMMAND ---------- -%sql -SELECT * FROM friends LIMIT 10 + +dbutils.fs.ls(file_name) + +# COMMAND ---------- + +f_obj.create_table(df=df, table_name="friends", file_name=file_name) + +# COMMAND ---------- + +# MAGIC %sql +# MAGIC SELECT * FROM friends LIMIT 10 diff --git a/release-pipeline.yml b/release-pipeline.yml new file mode 100644 index 0000000..c86335f --- /dev/null +++ b/release-pipeline.yml @@ -0,0 +1,43 @@ +steps: +- task: AzureResourceManagerTemplateDeployment@3 + displayName: 'ARM Template deployment: Resource Group scope' + inputs: + azureResourceManagerConnection: 'Developer plan (9ba4c535-94b4-4eff-98cd-02d64b80c335)' + subscriptionId: '9ba4c535-94b4-4eff-98cd-02d64b80c335' + resourceGroupName: '$(rg_name)' + location: '$(location)' + csmFile: 
'$(System.DefaultWorkingDirectory)/_demo-cicd/template/dbx.template.json'
+    csmParametersFile: '$(System.DefaultWorkingDirectory)/_demo-cicd/template/dbx.parameters.json'
+    overrideParameters: '-objectId $(object_id) -keyvaultName $(keyvault) -location $(location) -storageAccountName $(sa_name) -containerName $(container) -workspaceName $(workspace) -workspaceLocation $(location) -tier "premium" -sku "Standard" -tenant $(tenant_id) -networkAcls {"defaultAction":"Allow","bypass":"AzureServices","virtualNetworkRules":[],"ipRules":[]}'
+
+- bash: 'python -m pip install --upgrade pip databricks-cli'
+  displayName: 'Install requirements'
+
+- bash: |
+    databricks configure --token <<EOF
+    https://adb-..azuredatabricks.net
+    $(dbxtoken)
+    EOF
+  displayName: 'Databricks configure'
+
+- bash: |
+    databricks workspace mkdirs /$(folder)
+    databricks workspace import --language PYTHON --format SOURCE --overwrite _demo-cicd/notebook/$(notebook-name)-$(Build.SourceVersion).py /$(folder)/$(notebook-name)-$(Build.SourceVersion).py
+  displayName: 'Databricks import notebook into workspace'
+
+- bash: |
+    # create a new directory
+    databricks fs mkdirs dbfs:/dbx-library
+    # Import the module
+    databricks fs cp $(System.DefaultWorkingDirectory)/_dbx-cicd.git/wheel/friends-0.0.1-py2.py3-none-any.whl dbfs:/dbx-library/ --overwrite
+  displayName: 'Import Library into Databricks'
+
+- bash: |
+    # Install library
+    cluster_id=$(databricks clusters list --output JSON | jq '[ .clusters[] | { name: .cluster_name, id: .cluster_id, state: .state } ]' | jq '.[] | select(.name=="dbx-cluster")' | jq -r '.id')
+
+    echo "Cluster id: $cluster_id"
+
+    databricks libraries install --cluster-id $cluster_id --whl dbfs:/dbx-library/friends-0.0.1-py2.py3-none-any.whl
+  displayName: 'Install Library and attach it to the cluster'
+
diff --git a/requirement.txt b/requirements.txt
similarity index 100%
rename from requirement.txt
rename to requirements.txt
diff --git a/setup.py b/setup.py
index 0dc9c9d..50d4cb2 100644
---
a/setup.py +++ b/setup.py @@ -20,7 +20,8 @@ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", ], - package_dir={"": "src"}, - packages=setuptools.find_packages(where="src"), - python_requires=">=3.6", + packages=['.friends'], + #package_dir={"": "src"}, + #packages=setuptools.find_packages(where="src"), + #python_requires=">=3.6", ) \ No newline at end of file diff --git a/src/friend_pkg_barthelemy.egg-info/PKG-INFO b/src/friend_pkg_barthelemy.egg-info/PKG-INFO deleted file mode 100644 index 280101b..0000000 --- a/src/friend_pkg_barthelemy.egg-info/PKG-INFO +++ /dev/null @@ -1,54 +0,0 @@ -Metadata-Version: 2.1 -Name: friend-pkg-barthelemy -Version: 0.0.1 -Summary: A small example package -Home-page: https://github.com/bngom/azure-databricks-cicd -Author: barthelemy -Author-email: barthelemy@adaltas.com -License: UNKNOWN -Project-URL: Bug Tracker, https://github.com/bngom/azure-databricks-cicd/issues -Platform: UNKNOWN -Classifier: Programming Language :: Python :: 3 -Classifier: License :: OSI Approved :: MIT License -Classifier: Operating System :: OS Independent -Requires-Python: >=3.6 -Description-Content-Type: text/markdown -License-File: LICENSE - -# Azure Databricks CI/CD pipeline using Azure DevOps - -Throughout the Development lifecycle of an application, [CI/CD] is a [DevOps] process enforcing automation in building, testing and desploying applications. Development and Operation teams can leverage the advantages of CI/CD to deliver more frequently and reliably releases in a timly manner while ensuring quick iterations. - -CI/CD is becoming an increasingly necessary process for data engineering and data science teams to deliver valuable data project and increase confidence in the quality of the outcomes. With [Azure Databricks](https://azure.microsoft.com/en-gb/services/databricks/) you use solutions like Azure DevOps or Jenkins to build a CI/CD pipeline to reliably build, test, and deploy your notebooks and libraries. 
- -In this article we will walk you trhought a development process ... - -## Prerequisites - -1. An Azure Account. You can create a free account [here]() - -- Repos: Azure Devops Organization that will hold a project for our repository and our pipeline assets -- Azure Storage Account to store our dataset inside a blob container that will be used further -- SAS Token to authorize read and write to and from our blob container -- Azure Key Vault to store secrets(SAS Token, SA name) -- Azure Databrick token to allow CLI commands -- Setup secrets on Azure databricks to use them - -## Description of our ci/cd pipeline - -- Continuous Integration -- Continuous Delivery - -## Define the Build pipeline - -- **Set up a build agent** -- `azure-pipelines.yml` - -## Define the Release pipeline - -## Conclusion - -## Cheat Sheet - - - diff --git a/src/friend_pkg_barthelemy.egg-info/SOURCES.txt b/src/friend_pkg_barthelemy.egg-info/SOURCES.txt deleted file mode 100644 index 4b093d3..0000000 --- a/src/friend_pkg_barthelemy.egg-info/SOURCES.txt +++ /dev/null @@ -1,9 +0,0 @@ -LICENSE -README.md -pyproject.toml -setup.py -src/friend_pkg_barthelemy.egg-info/PKG-INFO -src/friend_pkg_barthelemy.egg-info/SOURCES.txt -src/friend_pkg_barthelemy.egg-info/dependency_links.txt -src/friend_pkg_barthelemy.egg-info/top_level.txt -test/test_friends.py \ No newline at end of file diff --git a/src/friends.egg-info/PKG-INFO b/src/friends.egg-info/PKG-INFO deleted file mode 100644 index 3d42f5e..0000000 --- a/src/friends.egg-info/PKG-INFO +++ /dev/null @@ -1,40 +0,0 @@ -Metadata-Version: 2.1 -Name: friends -Version: 0.0.1 -Summary: A small example package -Home-page: https://github.com/bngom/azure-databricks-cicd -Author: barthelemy -Author-email: barthelemy@adaltas.com -License: UNKNOWN -Project-URL: Bug Tracker, https://github.com/bngom/azure-databricks-cicd/issues -Platform: UNKNOWN -Classifier: Programming Language :: Python :: 3 -Classifier: License :: OSI Approved :: MIT License -Classifier: 
Operating System :: OS Independent -Requires-Python: >=3.6 -Description-Content-Type: text/markdown -License-File: LICENSE - -# Azure Databricks CI/CD pipeline using Azure DevOps - -Throughout the Development lifecycle of an application, [CI/CD] is a [DevOps] process enforcing automation in building, testing and desploying applications. Development and Operation teams can leverage the advantages of CI/CD to deliver more frequently and reliably releases in a timly manner while ensuring quick iterations. - -CI/CD is becoming an increasingly necessary process for data engineering and data science teams to deliver valuable data project and increase confidence in the quality of the outcomes. With [Azure Databricks](https://azure.microsoft.com/en-gb/services/databricks/) you use solutions like Azure DevOps or Jenkins to build a CI/CD pipeline to reliably build, test, and deploy your notebooks and libraries. - -## Prerequisites - -1. An Azure Account and an active subscription. You can create a free account [here](). -2. 
Azure Devops Organization that will hold a project for our repository and our pipeline assets - -## Description of our ci/cd pipeline - -## Define the Build pipeline - -## Define the Release pipeline - -## Check our results - -## Conclusion - -## Cheat Sheet - diff --git a/src/friends.egg-info/SOURCES.txt b/src/friends.egg-info/SOURCES.txt deleted file mode 100644 index 16a3599..0000000 --- a/src/friends.egg-info/SOURCES.txt +++ /dev/null @@ -1,9 +0,0 @@ -LICENSE -README.md -pyproject.toml -setup.py -src/friends.egg-info/PKG-INFO -src/friends.egg-info/SOURCES.txt -src/friends.egg-info/dependency_links.txt -src/friends.egg-info/top_level.txt -test/test_friends.py \ No newline at end of file diff --git a/src/friends.egg-info/dependency_links.txt b/src/friends.egg-info/dependency_links.txt deleted file mode 100644 index 8b13789..0000000 --- a/src/friends.egg-info/dependency_links.txt +++ /dev/null @@ -1 +0,0 @@ - diff --git a/src/friends.egg-info/top_level.txt b/src/friends.egg-info/top_level.txt deleted file mode 100644 index 8b13789..0000000 --- a/src/friends.egg-info/top_level.txt +++ /dev/null @@ -1 +0,0 @@ - diff --git a/template/dbx.parameters.json b/template/dbx.parameters.json index cfe1d3f..e8d497f 100644 --- a/template/dbx.parameters.json +++ b/template/dbx.parameters.json @@ -3,25 +3,29 @@ "contentVersion": "1.0.0.0", "parameters": { "objectId": { - "value": "object Id" + "value": "" }, "storageAccountName":{ - "value": "storage account" + "value": "" }, "containerName":{ - "value": "container" + "value": "" }, "workspaceName": { - "value": "workspace" + "value": "" }, "workspaceLocation": { - "value": "location" + "value": "francecentral" }, "keyvaultName": { - "value": "keyvault" + "value": "" }, "location": { - "value": "location" + "value": "francecentral" + }, + + "tenant": { + "value": "" }, "tier": { "value": "premium" @@ -29,9 +33,6 @@ "sku": { "value": "Standard" }, - "tenant": { - "value": "tenantId" - }, "networkAcls": { "value": { 
"defaultAction": "Allow", diff --git a/test/__main__.py b/test/__main__.py index 72ca339..a74b5a7 100644 --- a/test/__main__.py +++ b/test/__main__.py @@ -2,7 +2,7 @@ import sys import unittest -sys.path.append(os.path.abspath(os.path.join(os.path.dirname('__file__'), 'src'))) +sys.path.append(os.path.abspath(os.path.join(os.path.dirname('__file__'), 'friends'))) loader = unittest.TestLoader() testSuite = loader.discover('test') diff --git a/test/test_friends.py b/test/test_friends.py index 81a841f..b15b018 100644 --- a/test/test_friends.py +++ b/test/test_friends.py @@ -1,7 +1,7 @@ import unittest # import pyspark import logging -import friends as f +from friends import friends as f from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StringType, IntegerType, StructField