bash install.sh
python3.12 -m venv venv
- Install DuckDB
- Create Python virtual environment
- Install necessary Python packages
- Run scripts
- Run dbt
We encourage everyone to work inside the Linux-based Codespace and use the following commands. Each line needs to be entered separately.
wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /bin
winget install DuckDB.cli
brew install duckdb
duckdb
./duckdb
python -m venv venv
curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3.12 get-pip.py
source venv/bin/activate
deactivate
But activate it again, because you need to be in the Python venv for the rest of the setup instructions, including running dbt:
source venv/bin/activate
python -m pip install -r requirements.txt
python scaffold_folders.py
We then need to get the Michigan Campaign Finance Reports data with the command below. We'll use Georgia voter file data in the course, but this is just for getting set up:
python get_mi_cfr_data.py
cd mi_cfr_duckdb_dbt
dbt run
This is additional info not required for setup. Feel free to explore the dbt file structure and the additional features and docs below, but don't worry about it too much. We'll go into much more depth on this in the course with Georgia voter file data:
We want to run these in the dbt folder, so change directories if you're not already there:
cd mi_cfr_duckdb_dbt
> models
  > mart
  > ml
  > staging
NEEDED_FOLDERS = [
    './data/mart',
    './data/ml_models',
    './data/ml',
    './data/raw',
    './data/staging',
    ...
]
This is the folder where we interact with the raw sources. Generally this is where we rename columns to nicer names and do the base level of data munging.
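As a hedged sketch, a staging model here might do nothing more than select from the raw Parquet files and rename columns; the model name and column names below are hypothetical:

-- Hypothetical models/staging/stg_receipts.sql; the raw column names are made up for illustration
SELECT
    doc_seq_no AS document_id,
    com_legal_name AS committee_name,
    amount AS contribution_amount
FROM read_parquet('../data/raw/*_mi_cfr_receipts.parquet', union_by_name=True)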
This is a special folder where I give a proof of concept of running Python ML models directly in dbt.
One can write SQL files in these folders and dbt will run the SQL. dbt also uses Jinja templating to enhance SQL functionality. By default it stores each model as a view named after the file (among other configurations). This project runs DuckDB in memory, so we don't create any persistent data objects without specifying an external materialization:
{{ config(materialized='external', location='../data/mart/state_party_contributions.parquet', format='parquet') }}
By default the materialized config is 'view'. Because we are using DuckDB as the engine, models configured as 'external' will materialize as Parquet files on disk.
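For instance, after a dbt run you can open the DuckDB CLI and query that Parquet file directly (the path comes from the config line above):

-- Query the externally materialized model straight from the DuckDB CLI
SELECT * FROM read_parquet('../data/mart/state_party_contributions.parquet') LIMIT 10;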
FROM {{ ref('stg_expenditures') }}
This uses dbt's templating to reference another model, and you can query it just like a table or a view.
read_parquet('../data/raw/*_mi_cfr_receipts.parquet', union_by_name=True)
This is very different from traditional SQL! DuckDB can work with CSV, Parquet, and even JSON files. With some additional setup, DuckDB can also query remote files or other databases.
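For example, a model (or an ad hoc query) could read a CSV or JSON file the same way; the file paths below are hypothetical:

-- Hypothetical file paths; DuckDB infers the schema from the files themselves
SELECT * FROM read_csv_auto('../data/raw/some_lookup_table.csv');
SELECT * FROM read_json_auto('../data/raw/some_api_response.json');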
classify_expenditure(list_concat_expenditure, 0.75)
This requires additional setup in the plugins folder, but it is very useful if you have a Python library that you want to call directly in your models.
Tests are very similar to models, but a test fails if its query returns any rows:
dbt test
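For example, a singular test is just a SQL file in the tests folder that selects the rows that would violate your expectation. A hedged sketch (the amount column is hypothetical):

-- Hypothetical tests/assert_no_negative_expenditures.sql; fails if any rows come back
SELECT *
FROM {{ ref('stg_expenditures') }}
WHERE amount < 0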
This folder does not come standard with dbt; it adds functionality that allows us to run arbitrary Python functions in our dbt models.
New functions should be implemented in the functions folder, and create_function_plugin.py needs to be updated:
class Plugin(BasePlugin):
    def configure_connection(self, conn: DuckDBPyConnection):
        conn.create_function("return_hello_array", return_hello_array)
        conn.create_function("get_substring", get_substring)
        conn.create_function("return_numpy_one", return_numpy_one)
        conn.create_function("pydantic_example", pydantic_example)
        conn.create_function("classify_expenditure", classify_expenditure)
        ....
        conn.create_function("function", function)
within the configure_connection function.
Find yourself writing a lot of boilerplate, or want to functionalize your SQL? Macros are dbt's version of that idea:
{% macro choose_nonnull_value(first_column_name, second_column_name, out_column_name) %}
CASE WHEN {{ first_column_name }} IS NULL THEN {{ second_column_name }} ELSE {{ first_column_name }} END AS {{ out_column_name }}
{% endmacro %}
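You can then call the macro from any model; at compile time it expands into the CASE expression above (the column names here are hypothetical):

SELECT
    {{ choose_nonnull_value('employer_name', 'occupation_desc', 'employer_or_occupation') }}
FROM {{ ref('stg_expenditures') }}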
If you'd like to learn more, I recommend the dbt guide here.
dbt docs generate
dbt docs serve --port <port number>
This provides a workflow for analyzing the funds raised by state caucus committees, using the external materializations from our dbt runs.
This generates a pickle file that lets us do the ML proof of concept. It requires installing the ml_requirements in your Python environment and is much more computationally expensive.
This shows an alternative way to run Python processes in the middle of your modeling by using the dbt CLI to automate runs of subsets of models.
This gives more explanation of calling Python functions within DuckDB. It's super useful for hacking Python or API calls into the middle of your dbt run.