TOmelchenko/LifeExpectancyProject

Project overview

Dataset

For this learning project, the Health dataset from World Bank Open Data was chosen. It contains many metrics, from which the following have been selected to build the final dashboard:

  • Population, total
  • Population, male
  • Population, female
  • Life expectancy at birth, male (years)
  • Life expectancy at birth, total (years)
  • Life expectancy at birth, female (years)

The dataset is broken down by country and year, starting from 1960. For the final dashboard, the data for the last 10 years (2013 - 2022) has been taken. The data file can be downloaded directly from the site; for this project, the downloaded files have been saved in GitHub, from where they are processed.
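
World Bank bulk CSVs come in wide format, with one column per year, so the raw file has to be filtered and reshaped before loading. A minimal sketch of that step with pandas is shown below; the file name is a placeholder, and the skiprows/column layout reflect the usual World Bank CSV structure rather than anything specific to this repository:

import pandas as pd

# Indicators selected for the dashboard (as listed above).
INDICATORS = [
    "Population, total",
    "Population, male",
    "Population, female",
    "Life expectancy at birth, male (years)",
    "Life expectancy at birth, total (years)",
    "Life expectancy at birth, female (years)",
]

# World Bank bulk CSVs carry a few metadata rows before the header;
# skiprows=4 is the usual layout, but verify it against the download.
df = pd.read_csv("worldbank_health.csv", skiprows=4)  # placeholder file name
df = df[df["Indicator Name"].isin(INDICATORS)]

# Reshape from wide (one column per year) to long (one row per
# country/indicator/year), keeping only 2013-2022.
years = [str(y) for y in range(2013, 2023)]
long_df = df.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    value_vars=years,
    var_name="year",
    value_name="value",
)
print(long_df.head())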

Technologies

  • Cloud: GCP
  • Infrastructure as code (IaC): Terraform - for provisioning the infrastructure.
  • Workflow orchestration: Mage
    • a pipeline that processes the selected dataset and loads it into a data lake (GCS);
    • a pipeline that moves the data from the lake to a data warehouse (BigQuery).
  • Data warehouse: BigQuery.
  • Batch processing: dbt - for transforming the data in the warehouse and preparing it for visualization.
  • Visualization: Looker - for building a dashboard.

Dashboard

The dashboard shows the population and life expectancy by country and year. It is accessible via the link.

Set up the project

Preconditions

To reproduce the project, the user should have:

  1. A Google Cloud account. (A 90-day trial account can be created.)
  2. Google Cloud SDK installed. (The installation instructions can be found here.)
  3. A dbt account. (A free account can be opened using this link.)
  4. Terraform installed. (The installation instructions can be found here.)
  5. Docker installed.

Run the project

  1. Clone the project repository (https://github.com/TOmelchenko/LifeExpectancyProject.git or git@github.com:TOmelchenko/LifeExpectancyProject.git) and go to the project folder:
cd LifeExpectancyProject
  2. In Google Cloud, create a project: DataEngProject.
  3. Create a service account: DataEngProject_DataUser.
  4. Grant this service account the following roles: BigQuery Admin, Storage Admin, Storage Object Admin, Viewer.
  5. Create a JSON key (via the actions menu of the service account) and save it on your computer.
  6. Copy the contents of your JSON key file to the file ~/LifeExpectancyProject/mage/dataengproject.json.
  7. Point the environment variable GOOGLE_APPLICATION_CREDENTIALS to this key:
export GOOGLE_APPLICATION_CREDENTIALS='<full path>/LifeExpectancyProject/mage/dataengproject.json'

  8. Authenticate - run the commands:
gcloud auth application-default login

gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
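
To confirm that the credentials work before provisioning anything, you can run a quick check like the one below (a minimal sketch using the google-cloud-storage Python client, which picks up GOOGLE_APPLICATION_CREDENTIALS automatically; the check itself is not part of this repository):

from google.cloud import storage

# The client reads GOOGLE_APPLICATION_CREDENTIALS from the environment.
client = storage.Client()

# Listing buckets fails fast if the key or project is misconfigured.
for bucket in client.list_buckets():
    print(bucket.name)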

  9. Create the GCS bucket and BigQuery dataset using Terraform:
cd terraform
terraform init
terraform plan   ## Enter your project ID
terraform apply  ## Enter your project ID
  10. Load the data using Mage:
  • Run these commands:
cd mage
cp dev.env .env && rm dev.env
  • Check that PROJECT_NAME=life_expectancy_project is set in the .env file.

  • Pull the latest image and run the Docker container:

docker pull mageai/mageai:latest
docker compose up -d
  • Mage is accessible via the link: http://localhost:6789/. You should see the life_expectancy_project with 2 pipelines.
  • You have to set the variables for each pipeline, which are used in the data exporter/data loader scripts. This can be done in 2 ways:
  1. First:
  • open the file ~/LifeExpectancyProject/mage/life_expectancy_project/pipelines/life_expectancy_to_gcs_load/metadata.yaml and in the variables section at the very end set:
  BUCKET_NAME: <your bucket name>  ## life_expectancy_dataengproject-417719 - my bucket name
  PROJECT_ID: <your project id>  ## dataengproject-417719 - my project id

  • open the file ~/LifeExpectancyProject/mage/life_expectancy_project/pipelines/life_expectancy_to_bigquery_load/metadata.yaml and in the variables section at the very end set:
  BUCKET_NAME: <your bucket name>  ## life_expectancy_dataengproject-417719 - my bucket name
  DB_NAME: <db name in BigQuery>  ## dataengproject-417719 - my db name, which equals the project id
  2. Second:
  • open the Mage interface and in the variables section for each pipeline update the variables as described above.
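
For orientation, Mage passes these pipeline variables into block code through the block function's keyword arguments. Below is a minimal sketch of a data exporter block reading them (the function body is illustrative and not copied from this repository's blocks; the real blocks may use Mage's GCS IO class instead):

import pandas as pd

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_gcs(df: pd.DataFrame, **kwargs) -> None:
    # Pipeline variables defined in metadata.yaml (or in the Mage UI)
    # arrive in kwargs under the same names.
    bucket_name = kwargs['BUCKET_NAME']

    # Illustrative export; writing to a gs:// path via pandas requires
    # the gcsfs package to be installed in the Mage container.
    df.to_parquet(f'gs://{bucket_name}/life_expectancy_raw_data-0.parquet')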

Run the pipelines manually. Go to the Triggers page:

  • First, click on life_expectancy_to_gcs_load and press the Run once button.
  • Second, click on life_expectancy_to_bigquery_load and press the Run once button.

After running these two pipelines you should have 3 parquet files in the GCS bucket:
dim_country-0.parquet
dim_indicator-0.parquet
life_expectancy_raw_data-0.parquet

and 3 tables in BigQuery dataset:

dim_country_stg
dim_indicator_stg
life_expectancy_stg
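
A quick way to verify both outputs is sketched below, using the google-cloud-storage and google-cloud-bigquery Python clients (the bucket and dataset identifiers are placeholders to replace with your own; this check is not part of the repository):

from google.cloud import bigquery, storage

BUCKET_NAME = "<your bucket name>"                 # placeholder
DATASET_ID = "<your project id>.life_expectancy"   # placeholder

# Parquet files produced by life_expectancy_to_gcs_load.
for blob in storage.Client().list_blobs(BUCKET_NAME):
    print("gcs:", blob.name)

# Staging tables produced by life_expectancy_to_bigquery_load.
for table in bigquery.Client().list_tables(DATASET_ID):
    print("bq:", table.table_id)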

  11. Prepare the final dataset with dbt:
  • Set up the BigQuery connection - load the JSON key created in step 5.

  • On the Credentials page, update the Dataset field to life_expectancy.

  • In the file schema.yaml, update the old database name to the name you have in BigQuery.

  • Run the model manually with the dbt run command.

  • After dbt run completes, the datamart_life_expectancy table should be created in BigQuery.
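
To spot-check the result, the datamart table can be queried directly (a minimal sketch; the project and dataset identifiers are placeholders, and nothing here assumes the datamart's column names):

from google.cloud import bigquery

client = bigquery.Client()

# Replace <your project id> with your project; the dataset name follows
# the dbt Credentials setting above.
query = """
    SELECT *
    FROM `<your project id>.life_expectancy.datamart_life_expectancy`
    LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))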

  12. The final dashboard is accessible via the link.

TODO

  1. Tests.
  2. Load the dataset file directly from the site instead of copying it to GitHub (see the sketch after this list).
  3. Add clustering to the datamart table.
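
For the second item, the file could be fetched from the World Bank API at pipeline run time. A minimal sketch, assuming the standard bulk-download endpoint and the SP.DYN.LE00.IN indicator code ("Life expectancy at birth, total (years)"); neither is part of this repository yet:

import io
import zipfile

import requests

# World Bank bulk download for a single indicator; the endpoint
# returns a zip containing the data CSV plus metadata CSVs.
URL = "https://api.worldbank.org/v2/en/indicator/SP.DYN.LE00.IN?downloadformat=csv"

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    for name in zf.namelist():
        print(name)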

Remark: To avoid additional charges from Google Cloud after the trial period, it's better to remove the created GCS bucket and BigQuery dataset. This can be done manually or with the terraform command:

terraform destroy

About

Learning project for pipeline creation
