TOmelchenko/LifeExpectancyProject

Project overview

Dataset

For this learning project, the Health dataset from World Bank Open Data was chosen. It contains many metrics, from which the following have been selected to build the final dashboard:

  • Population, total
  • Population, male
  • Population, female
  • Life expectancy at birth, male (years)
  • Life expectancy at birth, total (years)
  • Life expectancy at birth, female (years)

The dataset is broken down by country and year, starting from 1960. For the final dashboard, the data for the last 10 years (2013 - 2022) has been taken. The data file can be downloaded directly from the site; for this project, the downloaded files have been saved in GitHub, from where they are processed.
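
World Bank bulk CSVs come in wide format, with one column per year, so the raw file has to be filtered and reshaped before loading. A minimal sketch of that step with pandas is shown below; the file name is a placeholder, and the skiprows/column layout reflect the usual World Bank CSV structure rather than anything specific to this repository:

import pandas as pd

# Indicators selected for the dashboard (as listed above).
INDICATORS = [
    "Population, total",
    "Population, male",
    "Population, female",
    "Life expectancy at birth, male (years)",
    "Life expectancy at birth, total (years)",
    "Life expectancy at birth, female (years)",
]

# World Bank bulk CSVs carry a few metadata rows before the header;
# skiprows=4 is the usual layout, but verify it against the download.
df = pd.read_csv("worldbank_health.csv", skiprows=4)  # placeholder file name
df = df[df["Indicator Name"].isin(INDICATORS)]

# Reshape from wide (one column per year) to long (one row per
# country/indicator/year), keeping only 2013-2022.
years = [str(y) for y in range(2013, 2023)]
long_df = df.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    value_vars=years,
    var_name="year",
    value_name="value",
)
print(long_df.head())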

Technologies

  • Cloud: GCP
  • Infrastructure as code (IaC): Terraform - for provisioning the infrastructure.
  • Workflow orchestration: Mage
    • a pipeline that processes the selected dataset and loads it into a data lake (GCS);
    • a pipeline that moves the data from the lake to a data warehouse (BigQuery).
  • Data warehouse: BigQuery.
  • Batch processing: dbt - for transforming the data in the warehouse and preparing it for visualization.
  • Visualization: Looker - for building a dashboard.

Dashboard

The dashboard shows the population and life expectancy by country and year. It is accessible via the link.

Set up the project

Preconditions

To reproduce the project, the user should have:

  1. A Google Cloud account. (A 90-day trial account can be created.)
  2. Google Cloud SDK installed. (The installation instructions can be found here.)
  3. A dbt account. (A free account can be opened using this link.)
  4. Terraform installed. (The installation instructions can be found here.)
  5. Docker installed.

Run the project

  1. Clone the project repository (https://github.com/TOmelchenko/LifeExpectancyProject.git or git@github.com:TOmelchenko/LifeExpectancyProject.git) and go to the project folder:
cd LifeExpectancyProject
  2. In Google Cloud, create a project: DataEngProject.
  3. Create a service account: DataEngProject_DataUser.
  4. Grant this service account the following roles: BigQuery Admin, Storage Admin, Storage Object Admin, Viewer.
  5. Create a JSON key (via the actions menu of the service account) and save it on your computer.
  6. Copy the contents of your JSON key file to the file ~/LifeExpectancyProject/mage/dataengproject.json.
  7. Point the environment variable GOOGLE_APPLICATION_CREDENTIALS to this key:
export GOOGLE_APPLICATION_CREDENTIALS='<full path>/LifeExpectancyProject/mage/dataengproject.json'

  8. Authenticate - run the commands:
gcloud auth application-default login

gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
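
To confirm that the credentials work before provisioning anything, you can run a quick check like the one below (a minimal sketch using the google-cloud-storage Python client, which picks up GOOGLE_APPLICATION_CREDENTIALS automatically; the check itself is not part of this repository):

from google.cloud import storage

# The client reads GOOGLE_APPLICATION_CREDENTIALS from the environment.
client = storage.Client()

# Listing buckets fails fast if the key or project is misconfigured.
for bucket in client.list_buckets():
    print(bucket.name)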

  9. Create the GCS bucket and BigQuery dataset using Terraform:
cd terraform
terraform init
terraform plan   ## Enter your project ID
terraform apply  ## Enter your project ID
  10. Load the data using Mage:
  • Run these commands:
cd mage
cp dev.env .env && rm dev.env
  • Check that PROJECT_NAME=life_expectancy_project is set in the .env file.

  • Pull the latest image and run the Docker container:

docker pull mageai/mageai:latest
docker compose up -d
  • Mage is accessible via the link: http://localhost:6789/. You should see the life_expectancy_project with 2 pipelines.
  • You have to set the variables for each pipeline, which are used in the data exporter/data loader scripts. This can be done in 2 ways:
  1. First:
  • open the file ~/LifeExpectancyProject/mage/life_expectancy_project/pipelines/life_expectancy_to_gcs_load/metadata.yaml and in the variables section at the very end set:
  BUCKET_NAME: <your bucket name>  ## life_expectancy_dataengproject-417719 - my bucket name
  PROJECT_ID: <your project id>  ## dataengproject-417719 - my project id

  • open the file ~/LifeExpectancyProject/mage/life_expectancy_project/pipelines/life_expectancy_to_bigquery_load/metadata.yaml and in the variables section at the very end set:
  BUCKET_NAME: <your bucket name>  ## life_expectancy_dataengproject-417719 - my bucket name
  DB_NAME: <db name in BigQuery>  ## dataengproject-417719 - my db name, which equals the project id
  2. Second:
  • open the Mage interface and in the variables section for each pipeline update the variables as described above.
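
For orientation, Mage passes these pipeline variables into block code through the block function's keyword arguments. Below is a minimal sketch of a data exporter block reading them (the function body is illustrative and not copied from this repository's blocks; the real blocks may use Mage's GCS IO class instead):

import pandas as pd

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_gcs(df: pd.DataFrame, **kwargs) -> None:
    # Pipeline variables defined in metadata.yaml (or in the Mage UI)
    # arrive in kwargs under the same names.
    bucket_name = kwargs['BUCKET_NAME']

    # Illustrative export; writing to a gs:// path via pandas requires
    # the gcsfs package to be installed in the Mage container.
    df.to_parquet(f'gs://{bucket_name}/life_expectancy_raw_data-0.parquet')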

Run the pipelines manually. Go to the Triggers page:

  • First, click on life_expectancy_to_gcs_load and press the Run once button.
  • Second, click on life_expectancy_to_bigquery_load and press the Run once button.

After running these two pipelines you should have 3 parquet files in the GCS bucket:
dim_country-0.parquet
dim_indicator-0.parquet
life_expectancy_raw_data-0.parquet

and 3 tables in BigQuery dataset:

dim_country_stg
dim_indicator_stg
life_expectancy_stg
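
A quick way to verify both outputs is sketched below, using the google-cloud-storage and google-cloud-bigquery Python clients (the bucket and dataset identifiers are placeholders to replace with your own; this check is not part of the repository):

from google.cloud import bigquery, storage

BUCKET_NAME = "<your bucket name>"                 # placeholder
DATASET_ID = "<your project id>.life_expectancy"   # placeholder

# Parquet files produced by life_expectancy_to_gcs_load.
for blob in storage.Client().list_blobs(BUCKET_NAME):
    print("gcs:", blob.name)

# Staging tables produced by life_expectancy_to_bigquery_load.
for table in bigquery.Client().list_tables(DATASET_ID):
    print("bq:", table.table_id)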

  11. Prepare the final dataset with dbt:
  • Set up the BigQuery connection - load the JSON key created in step 5.

  • On the Credentials page, update the Dataset field to life_expectancy.

  • In the file schema.yaml, update the old database name to the name you have in BigQuery.

  • Run the model manually with the dbt run command.

  • After dbt run completes, the datamart_life_expectancy table should be created in BigQuery.
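
To spot-check the result, the datamart table can be queried directly (a minimal sketch; the project and dataset identifiers are placeholders, and nothing here assumes the datamart's column names):

from google.cloud import bigquery

client = bigquery.Client()

# Replace <your project id> with your project; the dataset name follows
# the dbt Credentials setting above.
query = """
    SELECT *
    FROM `<your project id>.life_expectancy.datamart_life_expectancy`
    LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))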

  12. The final dashboard is accessible via the link.

TODO

  1. Tests.
  2. Load the dataset file directly from the site instead of copying it to GitHub (see the sketch after this list).
  3. Add clustering to the datamart table.
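
For the second item, the file could be fetched from the World Bank API at pipeline run time. A minimal sketch, assuming the standard bulk-download endpoint and the SP.DYN.LE00.IN indicator code ("Life expectancy at birth, total (years)"); neither is part of this repository yet:

import io
import zipfile

import requests

# World Bank bulk download for a single indicator; the endpoint
# returns a zip containing the data CSV plus metadata CSVs.
URL = "https://api.worldbank.org/v2/en/indicator/SP.DYN.LE00.IN?downloadformat=csv"

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    for name in zf.namelist():
        print(name)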

Remark: To avoid additional charges from Google Cloud after the trial period, it's better to remove the created GCS bucket and BigQuery dataset. This can be done manually or with the terraform command:

terraform destroy

About

Learning project for pipeline creation
