
HannaStselmashok/f1_dbt

Dashboards for Ferrari team: 2023 season results

Summary

  • Create virtual env
  • Read data from PostgreSQL to Pandas dataframe
  • Transfer data from PostgreSQL to Snowflake
  • Connect dbt and Snowflake
  • Build dbt data pipelines
  • Establish data quality tests

Create virtual env

pip install virtualenv
virtualenv f1dbt
f1dbt\Scripts\activate

Read data from PostgreSQL to Pandas dataframe

I have two tables in PostgreSQL. To read them into Pandas dataframes, execute step 1 of migration.ipynb.
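Step 1 can be sketched as a small helper around `pandas.read_sql`. The connection URL, database name, and table names below are assumptions; adjust them to your own PostgreSQL setup:

```python
import pandas as pd
from sqlalchemy import create_engine

def read_table(engine, table_name: str) -> pd.DataFrame:
    """Load a whole table into a dataframe; works with any SQLAlchemy engine."""
    return pd.read_sql(f"select * from {table_name}", engine)

# Assumed PostgreSQL URL -- replace host/credentials/database with your own:
# engine = create_engine("postgresql://user:password@localhost:5432/f1")
# rresults = read_table(engine, "rresults")
# qresults = read_table(engine, "qresults")
```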

(screenshot: race results in a Pandas dataframe)

Transfer data from PostgreSQL to Snowflake

First of all, I created the database, schema, and dbt user in Snowflake:

-- Use an admin role
USE ROLE ACCOUNTADMIN;

-- Create the `transform` role
CREATE ROLE IF NOT EXISTS transform;
GRANT ROLE TRANSFORM TO ROLE ACCOUNTADMIN;

-- Create the default warehouse if necessary
CREATE WAREHOUSE IF NOT EXISTS COMPUTE_WH;
GRANT OPERATE ON WAREHOUSE COMPUTE_WH TO ROLE TRANSFORM;

-- Create the `dbt` user and assign to role
CREATE USER IF NOT EXISTS dbt
  PASSWORD='f1dbt'
  LOGIN_NAME='hannaf1'
  MUST_CHANGE_PASSWORD=FALSE
  DEFAULT_WAREHOUSE='COMPUTE_WH'
  DEFAULT_ROLE='transform'
  DEFAULT_NAMESPACE='F1.RAW'
  COMMENT='DBT user used for data transformation';
GRANT ROLE transform to USER dbt;

-- Create our database and schemas
CREATE DATABASE IF NOT EXISTS F1;
CREATE SCHEMA IF NOT EXISTS F1.RAW;

-- Set up permissions to role `transform`
GRANT ALL ON WAREHOUSE COMPUTE_WH TO ROLE transform;
GRANT ALL ON DATABASE F1 to ROLE transform;
GRANT ALL ON ALL SCHEMAS IN DATABASE F1 to ROLE transform;
GRANT ALL ON FUTURE SCHEMAS IN DATABASE F1 to ROLE transform;
GRANT ALL ON ALL TABLES IN SCHEMA F1.RAW to ROLE transform;
GRANT ALL ON FUTURE TABLES IN SCHEMA F1.RAW to ROLE transform;

To migrate the datasets from PostgreSQL to Snowflake, complete step 2 of migration.ipynb.
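One way to sketch step 2 is a generic table copy between two SQLAlchemy engines; the actual notebook may use a different mechanism, and the snowflake-sqlalchemy URL below (account, credentials, warehouse) is an assumption based on the values elsewhere in this README:

```python
import pandas as pd
from sqlalchemy import create_engine

def copy_table(src_engine, dst_engine, table_name: str) -> int:
    """Read a full table from the source engine and recreate it on the
    destination engine. Returns the number of rows copied."""
    df = pd.read_sql(f"select * from {table_name}", src_engine)
    df.to_sql(table_name, dst_engine, if_exists="replace", index=False)
    return len(df)

# Assumed URLs -- requires the snowflake-sqlalchemy package for the destination:
# src = create_engine("postgresql://user:password@localhost:5432/f1")
# dst = create_engine(
#     "snowflake://hannaf1:f1dbt@lkhpmcc-cn69015/F1/RAW"
#     "?warehouse=COMPUTE_WH&role=transform"
# )
# for table in ("rresults", "qresults"):
#     copy_table(src, dst, table)
```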

Connect dbt and Snowflake

pip install dbt-snowflake==1.7.2
dbt init dbt_f1
  • account: lkhpmcc-cn69015
  • password: f1dbt
  • role: transform
  • warehouse: COMPUTE_WH
  • database: F1
  • schema: RAW
  • threads: 1
  • user: hannaf1 (in profiles.yml)
cd dbt_f1
dbt debug
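With those answers, `dbt init` writes a profile to `~/.dbt/profiles.yml` that looks roughly like this (values taken from the list above; the exact layout of the generated file may differ):

```yaml
dbt_f1:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: lkhpmcc-cn69015
      user: hannaf1
      password: f1dbt
      role: transform
      warehouse: COMPUTE_WH
      database: F1
      schema: RAW
      threads: 1
```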

Build dbt data pipelines

Created separate folders for models to represent the logical layers:

mkdir models\staging
mkdir models\transform
mkdir models\mart

Modified the dbt_project.yml file to reflect the model structure. The staging layer uses view materialization, while transform and mart use tables.

models:
  dbt_f1:
      # Applies to all files under models/staging/
      staging:
          +schema: staging
          +materialized: view
      # Applies to all files under models/transform/
      transform:
          +schema: transform
          +materialized: table
      # Applies to all files under models/mart/
      mart:
          +schema: mart
          +materialized: table
- Staging layer. In this layer I slightly modified rresults and qresults, making the column names more meaningful. Executed dbt run.
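A staging model here is essentially a rename-only SELECT over a raw table. A sketch, with assumed source column names:

```sql
-- models/staging/raw_rresults.sql (sketch; raw column names are assumptions)
select
    track  as track_name,
    driver as driver_name,
    no     as driver_number,
    pos    as finish_position,
    pts    as points,
    time   as race_time
from F1.RAW.rresults
```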

(screenshot: raw_qresults)

(screenshot: raw_rresults)

- Transformed layer. In this step I changed column data types and replaced the letter designations for drivers who didn't start or finish the race: with 20 in the points-related columns and NULL in the time columns. See the queries for cleansed_rresults and cleansed_qresults. Executed dbt run.
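A sketch of the cleansing logic, using Snowflake's TRY_CAST to turn non-numeric letter codes into NULLs (the column names and the exact set of letter codes are assumptions):

```sql
-- models/transform/cleansed_rresults.sql (sketch)
select
    track_name,
    driver_name,
    cast(driver_number as integer) as driver_number,
    -- drivers who didn't start/finish get position 20
    case
        when finish_position in ('DNF', 'DNS', 'DSQ') then 20
        else cast(finish_position as integer)
    end as finish_position,
    -- letter codes in the time column become NULL
    try_cast(race_time as time) as race_time
from {{ ref('raw_rresults') }}
```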

- Seeds. Uploaded the race order to the seeds folder. Seeds are usually small datasets in CSV format (e.g. catalogs). Executed dbt seed.
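A seed is just a CSV file in the seeds folder; `dbt seed` loads it as a table. A hypothetical sketch of the race-order seed (file name and columns assumed):

```csv
round,track_name
1,Bahrain
2,Saudi Arabia
3,Australia
```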

- Mart layer. This layer will be accessible to BI tools. I created two files, mart_rresults and mart_qresults, which join the race order with the transformed tables.
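The mart models can be sketched as a join between the cleansed table and the race-order seed (seed and column names are assumptions):

```sql
-- models/mart/mart_rresults.sql (sketch)
select
    r.*,
    o.round as race_round
from {{ ref('cleansed_rresults') }} r
join {{ ref('race_order') }} o
    on r.track_name = o.track_name
```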

Running tests

- Generic tests. In schema.yml, added the generic not_null test for the track country and driver number columns (cleansed_qresults and cleansed_rresults).
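The generic tests are declared in schema.yml; a sketch, with assumed column names:

```yaml
# models/transform/schema.yml (sketch)
version: 2
models:
  - name: cleansed_rresults
    columns:
      - name: track_country
        tests:
          - not_null
      - name: driver_number
        tests:
          - not_null
  - name: cleansed_qresults
    columns:
      - name: driver_number
        tests:
          - not_null
```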

- Singular tests. In the tests folder, created files that check the position columns for valid values (from 1 to 20). Executed dbt test.
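A singular test is a SELECT that passes when it returns zero rows; a sketch of the position check (file and column names assumed):

```sql
-- tests/assert_position_in_range.sql (sketch)
select *
from {{ ref('cleansed_rresults') }}
where finish_position < 1 or finish_position > 20
```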

Analyses

In the analyses folder, created SQL files to answer the following questions:

  • Average position of every driver during the championship; total points and the share of points contributed by the team's lead driver - average_in_championship

(screenshot: average_points results)

  • Driver's championship points for each race; a driver's position in the championship for each race (Ferrari team) - driver_points

(screenshot: driver_points results)
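The first analysis can be sketched as an aggregation over the mart layer (column names assumed):

```sql
-- analyses/average_in_championship.sql (sketch)
select
    driver_name,
    avg(finish_position) as avg_position,
    sum(points)          as total_points
from {{ ref('mart_rresults') }}
group by driver_name
order by total_points desc
```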

Visualization

Connect to Power BI

  • server: lkhpmcc-cn69015.snowflakecomputing.com
  • warehouse: COMPUTE_WH
  • user name: hannaf1
  • password: f1dbt
