For this learning project, the Health dataset from World Bank Open Data has been chosen. It contains many indicators, of which the following have been selected to build the final dashboard:
- Population, total
- Population, male
- Population, female
- Life expectancy at birth, male (years)
- Life expectancy at birth, total (years)
- Life expectancy at birth, female (years)
The dataset is broken down by country and year, starting from 1960. For the final dashboard, the data for the last 10 years (2013-2022) has been taken. The data file can be downloaded directly from the World Bank site; for this project the downloaded files have been saved in GitHub, from where they are processed.
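For illustration, here is a minimal sketch of the kind of filtering applied to the raw World Bank export. It assumes the standard wide World Bank CSV layout; the file name is hypothetical and this is not the repository's actual pipeline code.

```python
import pandas as pd

# Hypothetical local copy of the World Bank health export.
RAW_FILE = "world_bank_health_indicators.csv"

# Standard World Bank codes for the selected indicators.
SELECTED_INDICATORS = [
    "SP.POP.TOTL",        # Population, total
    "SP.POP.TOTL.MA.IN",  # Population, male
    "SP.POP.TOTL.FE.IN",  # Population, female
    "SP.DYN.LE00.MA.IN",  # Life expectancy at birth, male (years)
    "SP.DYN.LE00.IN",     # Life expectancy at birth, total (years)
    "SP.DYN.LE00.FE.IN",  # Life expectancy at birth, female (years)
]
YEARS = [str(y) for y in range(2013, 2023)]  # the last 10 years: 2013-2022

# World Bank CSVs are wide (one column per year) and start with 4 metadata rows.
df = pd.read_csv(RAW_FILE, skiprows=4)
df = df[df["Indicator Code"].isin(SELECTED_INDICATORS)]

# Melt to a long country/indicator/year shape, which is easier to load and query.
df_long = df.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    value_vars=YEARS,
    var_name="year",
    value_name="value",
)
```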
- Cloud: GCP
- Infrastructure as code (IaC): Terraform - for setting up the infrastructure.
- Workflow orchestration: Mage
  - a pipeline that processes the selected dataset and loads it into a data lake (GCS);
  - a pipeline that moves the data from the lake to a data warehouse (BigQuery).
- Data warehouse: BigQuery.
- Batch processing: dbt - for transforming the data in the warehouse and preparing it for visualization.
- Visualization: Looker - for building a dashboard.
The dashboard shows the population and life expectancy by country and year. It can be accessed via the link

To reproduce the project the user should have:
- An account in Google Cloud. (A 90-day trial account can be created.)
- Google Cloud SDK. (The installation guide can be found here.)
- An account in dbt. (A free account can be opened using this link.)
- Terraform installed. (The installation instructions can be found here.)
- Docker installed.
- Clone the project repository:
https://github.com/TOmelchenko/LifeExpectancyProject.git or git@github.com:TOmelchenko/LifeExpectancyProject.git
and go to the project folder:
cd LifeExpectancyProject
- In Google Cloud create a project: DataEngProject.
- Create a service account: DataEngProject_DataUser
- Grant this service account the following roles: BigQuery Admin, Storage Admin, Storage Object Admin, Viewer.
- Create a JSON key (via the Actions menu of the service account) and save it on your computer.
- Copy the contents of your JSON key file into the file ~/LifeExpectancyProject/mage/dataengproject.json
- Point the environment variable GOOGLE_APPLICATION_CREDENTIALS to this key:
export GOOGLE_APPLICATION_CREDENTIALS='<full path>/LifeExpectancyProject/mage/dataengproject.json'
- Authorize and log in - run the commands:
gcloud auth application-default login
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
- Create the GCS bucket and BigQuery dataset using Terraform:
cd terraform
terraform init
terraform plan ## Enter your project ID
terraform apply ## Enter your project ID
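To confirm that Terraform actually created the resources, you can run a quick check like the sketch below (it assumes the google-cloud-storage and google-cloud-bigquery Python packages; the bucket and dataset names are placeholders for the values you used in the Terraform variables):

```python
from google.cloud import bigquery, storage

# Placeholders - substitute the names defined in your Terraform variables.
BUCKET_NAME = "<your bucket name>"
DATASET_ID = "<your project id>.<your BigQuery dataset>"

# Both clients pick up the key from GOOGLE_APPLICATION_CREDENTIALS set earlier.
print("Bucket exists:", storage.Client().bucket(BUCKET_NAME).exists())
print("Dataset found:", bigquery.Client().get_dataset(DATASET_ID).dataset_id)
```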
- Load data using Mage:
- From the project folder, run these commands:
cd mage
cp dev.env .env && rm dev.env
- Check that PROJECT_NAME=life_expectancy_project is set in the .env file.
- Run the Docker image and pull the latest Mage version:
docker compose up -d
docker pull mageai/mageai:latest
- Mage can be accessed via the link: http://localhost:6789/. You should see the life_expectancy_project with 2 pipelines: life_expectancy_to_gcs_load and life_expectancy_to_bigquery_load.
- You have to set variables for each pipeline, which are used in the data exporter/data loader scripts (a hedged sketch of how these variables reach the pipeline code follows this list). This can be done in 2 ways:
  - First:
    - open the file ~/LifeExpectancyProject/mage/life_expectancy_project/pipelines/life_expectancy_to_gcs_load/metadata.yaml and in the variables section at the very end set:
      BUCKET_NAME: <your bucket name> ## life_expectancy_dataengproject-417719 - my bucket name
      PROJECT_ID: <your project id> ## dataengproject-417719 - my project id
    - open the file ~/LifeExpectancyProject/mage/life_expectancy_project/pipelines/life_expectancy_to_bigquery_load/metadata.yaml and in the variables section at the very end set:
      BUCKET_NAME: <your bucket name> ## life_expectancy_dataengproject-417719 - my bucket name
      DB_NAME: <db name in BigQuery> ## dataengproject-417719 - my db name, which equals the project id
  - Second:
    - open the Mage interface and, in the Variables section for each pipeline, update the variables as described above.
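For context, variables set in metadata.yaml (or in the UI) are passed into the pipeline's block functions through kwargs. The sketch below is a rough, hypothetical example of a Mage data exporter using BUCKET_NAME; it is not the repository's actual block code (the function name and object key are illustrative).

```python
import pandas as pd

# Standard Mage template guard for the decorator.
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_gcs(df: pd.DataFrame, **kwargs) -> None:
    # Pipeline variables from metadata.yaml (or the Variables UI) arrive in kwargs.
    bucket_name = kwargs['BUCKET_NAME']
    object_key = 'life_expectancy_raw_data-0.parquet'  # illustrative object key

    # Write the block's DataFrame to GCS as parquet (needs gcsfs and pyarrow installed).
    df.to_parquet(f'gs://{bucket_name}/{object_key}')
```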
- Run the pipelines manually. Go to the Triggers page:
  - First, click on life_expectancy_to_gcs_load and press the Run once button.
  - Second, click on life_expectancy_to_bigquery_load and press the Run once button.

After running these two pipelines you should have 3 parquet files in the GCS bucket (a hedged verification snippet follows these lists):
dim_country-0.parquet
dim_indicator-0.parquet
life_expectancy_raw_data-0.parquet
and 3 tables in BigQuery dataset:
dim_country_stg
dim_indicator_stg
life_expectancy_stg
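A hedged way to double-check the load from Python (bucket and dataset names are placeholders; the snippet assumes the same credentials as before):

```python
from google.cloud import bigquery, storage

BUCKET_NAME = "<your bucket name>"               # placeholder
DATASET_ID = "<your project id>.<your dataset>"  # placeholder

# Parquet files written by the life_expectancy_to_gcs_load pipeline.
for blob in storage.Client().list_blobs(BUCKET_NAME):
    print("GCS object:", blob.name)

# Staging tables created by the life_expectancy_to_bigquery_load pipeline.
for table in bigquery.Client().list_tables(DATASET_ID):
    print("BigQuery table:", table.table_id)
```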
- Prepare the final dataset with dbt:
  - Set up the BigQuery connection - upload the JSON key created in step 5.
  - In the schema.yaml file, change the old database name to the name you have in BigQuery.
  - Run the model manually with the dbt run command.
  - After dbt run completes, the datamart_life_expectancy table should be created in BigQuery.
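As a quick sanity check of the dbt output, you can preview a few rows of the new table; the fully qualified table name below is a placeholder:

```python
from google.cloud import bigquery

# Placeholder - use the project and dataset configured in your dbt target.
TABLE = "<your project id>.<your dataset>.datamart_life_expectancy"

client = bigquery.Client()
for row in client.query(f"SELECT * FROM `{TABLE}` LIMIT 5").result():
    print(dict(row))
```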
- The final dashboard can be accessed via the link
TODO
- Tests
- Load the dataset file directly from the site instead of copying it to GitHub.
- Add clustering to the datamart table.
Remark: To avoid additional charges from Google Cloud after the trial period, it's better to remove the created GCS bucket and BigQuery dataset. This can be done manually or with the Terraform command:
terraform destroy

