Commit 7c9ba16

New pipelines init command (#3114)
## Changes

New `pipelines init` command, with a new `cli-pipelines` template based on the `lakeflow-pipelines` template. While most files are kept the same, the differences between the `lakeflow-pipelines` and `cli-pipelines` templates can be viewed with `diff -r libs/template/templates/cli-pipelines libs/template/templates/lakeflow-pipelines`.

Differences:
- Removed the `{{.project_name}}.job.yml.tmpl` file (it only applies to jobs, which are being removed, not pipelines).
- Removed mentions of bundles.
- Changed the directory `{{.project_name}}/resources/{{.project_name}}_pipeline/` to just `{{.project_name}}/{{.project_name}}_pipeline/`.
- Renamed the default project folder from `my_lakeflow_project` to `my_project`.
- Removed previous formatting that caused issues with the linter.

Added a `root` package to `pipelines`, copying functionality over from `cmd/root` so that it also applies to the new `pipelines` CLI.

Modified the linter to exclude template files.

## Why

The `pipelines` CLI needs a way to initialize templates specifically for Databricks pipelines.

## Tests

Added acceptance tests for `pipelines init`: 3 folders, each with an output folder showing a sample template generated from the script with various `input.json` inputs:
- python
- sql
- error-cases
1 parent ce1c8ab commit 7c9ba16

79 files changed

Lines changed: 2016 additions & 18 deletions

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@

=== Install pipelines CLI
>>> errcode [CLI] install-pipelines-cli -d ./subdir
pipelines successfully installed in directory "./subdir"

=== Test with missing config file
>>> errcode ./subdir/pipelines init --output-dir output

Welcome to the template for pipelines!


Your new project has been created in the 'my_project' directory!

Refer to the README.md file for "getting started" instructions!

=== Test with invalid project name (contains uppercase letters)
>>> errcode ./subdir/pipelines init --config-file ./invalid_input.json --output-dir invalid-output
Error: failed to load config from file ./invalid_input.json: invalid value for project_name: "InvalidProjectName". Name must consist of lower case letters, numbers, and underscores.

Exit code: 1

=== Test with non-existent config file
>>> errcode ./subdir/pipelines init --config-file ./nonexistent.json --output-dir invalid-output-2
Error: failed to load config from file ./nonexistent.json: open ./nonexistent.json: no such file or directory

Exit code: 1
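The error message in the transcript above implies a validation rule for `project_name`. As an illustrative sketch only (the CLI itself is written in Go, and its actual check is not part of this diff), the rule could be expressed as a regex:

```python
import re

# Hedged sketch of the rule the error message states:
# "Name must consist of lower case letters, numbers, and underscores."
# The regex and helper below are illustrative, not the CLI's implementation.
PROJECT_NAME_RE = re.compile(r"^[a-z0-9_]+$")


def is_valid_project_name(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits, and underscores."""
    return PROJECT_NAME_RE.fullmatch(name) is not None
```

Under this rule, `my_project` passes while `InvalidProjectName` fails, matching the transcript.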
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
{
    "recommendations": [
        "databricks.databricks",
        "ms-python.vscode-pylance",
        "redhat.vscode-yaml"
    ]
}
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
    "python.analysis.stubPath": ".vscode",
    "databricks.python.envFile": "${workspaceFolder}/.env",
    "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
    "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
    "python.testing.pytestArgs": [
        "."
    ],
    "python.testing.unittestEnabled": false,
    "python.testing.pytestEnabled": true,
    "python.analysis.extraPaths": ["resources/my_project_pipeline"],
    "files.exclude": {
        "**/*.egg-info": true,
        "**/__pycache__": true,
        ".pytest_cache": true,
    },
    "[python]": {
        "editor.defaultFormatter": "ms-python.black-formatter",
        "editor.formatOnSave": true,
    },
}
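The `cellMarker.codeRegex` setting above is the most opaque entry in this file. The doubled backslashes are JSON escaping; as a quick illustration, the underlying pattern can be exercised in Python (VS Code evaluates it as a JavaScript regex, but this particular pattern is accepted unchanged by Python's `re`):

```python
import re

# The cell-marker pattern from the settings file, with JSON escaping undone.
# It recognizes Databricks cell separators as well as common "percent cell"
# and IPython-export markers.
cell_marker = re.compile(
    r"^# COMMAND ----------|^# Databricks notebook source"
    r"|^(#\s*%%|#\s*\<codecell\>|#\s*In\[\d*?\]|#\s*In\[ \])"
)
```

Lines such as `# COMMAND ----------` and `#%%` match; ordinary code lines do not.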
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# my_project

The 'my_project' project was generated by using the CLI Pipelines template.

## Setup

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

2. Install the Pipelines CLI:
   ```
   $ databricks install-pipelines-cli
   ```

3. Authenticate to your Databricks workspace, if you have not done so already:
   ```
   $ databricks auth login
   ```

4. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
   https://docs.databricks.com/dev-tools/vscode-ext.html. Or the PyCharm plugin from
   https://www.databricks.com/blog/announcing-pycharm-integration-databricks.


## Deploying pipelines

1. To deploy a development copy of this project, type:
   ```
   $ pipelines deploy --target dev
   ```
   (Note that "dev" is the default target, so the `--target` parameter
   is optional here.)

2. Similarly, to deploy a production copy, type:
   ```
   $ pipelines deploy --target prod
   ```

3. To run a pipeline, use the "run" command:
   ```
   $ pipelines run
   ```
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# This is a Databricks pipelines definition for my_project.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: my_project
  uuid: [UUID]

include:
  - resources/*.yml
  - resources/*/*.yml
  - ./*.yml

# Variable declarations. These variables are assigned in the dev/prod targets below.
variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use
  notifications:
    description: The email addresses to use for failure notifications

targets:
  dev:
    # The default target uses 'mode: development' to create a development copy.
    # - Deployed pipelines get prefixed with '[dev my_user_name]'
    mode: development
    default: true
    workspace:
      host: [DATABRICKS_URL]
    variables:
      catalog: hive_metastore
      schema: ${workspace.current_user.short_name}
      notifications: []

  prod:
    mode: production
    workspace:
      host: [DATABRICKS_URL]
      # We explicitly deploy to /Workspace/Users/[USERNAME] to make sure we only have a single copy.
      root_path: /Workspace/Users/[USERNAME]/.bundle/${bundle.name}/${bundle.target}
    permissions:
      - user_name: [USERNAME]
        level: CAN_MANAGE
    variables:
      catalog: hive_metastore
      schema: default
      notifications: [[USERNAME]]
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# my_project_pipeline

This folder defines all source code for the my_project_pipeline pipeline:

- `explorations`: Ad-hoc notebooks used to explore the data processed by this pipeline.
- `transformations`: All dataset definitions and transformations.
- `utilities` (optional): Utility functions and Python modules used in this pipeline.
- `data_sources` (optional): View definitions describing the source data for this pipeline.

## Getting Started

To get started, go to the `transformations` folder -- most of the relevant source code lives there:

* By convention, every dataset under `transformations` is in a separate file.
* Take a look at the sample under "sample_trips_my_project.py" to get familiar with the syntax.
  Read more about the syntax at https://docs.databricks.com/dlt/python-ref.html.
* Use `Run file` to run and preview a single transformation.
* Use `Run pipeline` to run _all_ transformations in the entire pipeline.
* Use `+ Add` in the file browser to add a new data set definition.
* Use `Schedule` to run the pipeline on a schedule!

For more tutorials and reference material, see https://docs.databricks.com/dlt.
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {},
          "inputWidgets": {},
          "nuid": "[UUID]",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        }
      },
      "source": [
        "### Example Exploratory Notebook\n",
        "\n",
        "Use this notebook to explore the data generated by the pipeline in your preferred programming language.\n",
        "\n",
        "**Note**: This notebook is not executed as part of the pipeline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {},
          "inputWidgets": {},
          "nuid": "[UUID]",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        }
      },
      "outputs": [],
      "source": [
        "# !!! Before performing any data analysis, make sure to run the pipeline to materialize the sample datasets. The tables referenced in this notebook depend on that step.\n",
        "\n",
        "display(spark.sql(\"SELECT * FROM hive_metastore.[USERNAME].my_project\"))"
      ]
    }
  ],
  "metadata": {
    "application/vnd.databricks.v1+notebook": {
      "computePreferences": null,
      "dashboards": [],
      "environmentMetadata": null,
      "inputWidgetPreferences": null,
      "language": "python",
      "notebookMetadata": {
        "pythonIndentUnit": 2
      },
      "notebookName": "sample_exploration",
      "widgets": {}
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
resources:
  pipelines:
    my_project_pipeline:
      name: my_project_pipeline
      serverless: true
      channel: "PREVIEW"
      catalog: ${var.catalog}
      schema: ${var.schema}
      root_path: "."
      libraries:
        - glob:
            include: transformations/**
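The `glob` library entry above pulls every file under `transformations/` into the pipeline. As a rough analogy only (this is not the CLI's actual matcher), Python's `pathlib` recursive glob behaves similarly for a project tree like the one this template generates:

```python
import tempfile
from pathlib import Path

# Build a tiny stand-in for the generated project layout. The file names
# mirror the template's samples; the directory itself is throwaway.
root = Path(tempfile.mkdtemp())
(root / "transformations").mkdir()
(root / "transformations" / "sample_trips_my_project.py").write_text("# dataset\n")
(root / "explorations").mkdir()
(root / "explorations" / "sample_exploration.ipynb").write_text("{}\n")

# "transformations/**/*" matches files at any depth under transformations/,
# and nothing outside it -- analogous to the include pattern above.
included = sorted(
    p.relative_to(root).as_posix()
    for p in root.glob("transformations/**/*")
    if p.is_file()
)
```

Only the transformation file is selected; the exploration notebook stays out of the pipeline's sources.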
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
import dlt
from pyspark.sql.functions import col
from utilities import utils


# This file defines a sample transformation.
# Edit the sample below or add new transformations
# using "+ Add" in the file browser.


@dlt.table
def sample_trips_my_project():
    return spark.read.table("samples.nyctaxi.trips").withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
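The transformation above imports `utils.distance_km` from the template's `utilities` package, whose body is not part of this excerpt. A minimal sketch of what such a helper might look like, assuming `trip_distance` in the NYC taxi sample data is in miles (both the conversion factor and the docstring are assumptions, not the template's actual code):

```python
# Hypothetical sketch of utilities/utils.py; the real module is not shown
# in this diff.
MILES_TO_KM = 1.60934  # assumed conversion factor


def distance_km(distance_miles):
    """Convert a distance in miles to kilometers.

    Works on plain numbers as well as pyspark Columns, since both support
    multiplication by a scalar.
    """
    return distance_miles * MILES_TO_KM
```

Because pyspark Column objects overload `*`, the same function serves both unit tests on floats and the `withColumn` call in the transformation.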
