
week_03

Rafoolin edited this page Nov 21, 2023 · 1 revision

Create a pipeline

We have three main data sources. Each of them may need a side dataset that explains, for example, special units or geographic location abbreviations.

Extract

We first download the data from the data providers.
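The download step could be sketched as follows, using only the Python standard library. The function name `download`, its parameters, and the choice to skip files that already exist are assumptions made for illustration; the project's actual extract code may look different.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download(url: str, data_dir: Path, filename: str) -> Path:
    """Save the content at `url` as data_dir/filename and return that path.

    Hypothetical helper: an existing file is reused instead of downloaded again.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    if not target.exists():
        # Stream the response to disk so large files are not held in memory.
        with urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

Calling `download(url, Path("data"), "source.csv")` with a provider URL would store the file under a local data directory; the file name here is a placeholder.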

Transform

Then we drop the rows with NA values and rename columns that share a name but have different meanings. We can also change row values, for example by replacing a unit's abbreviation with its actual numerical value.
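A minimal sketch of these transform steps, written with plain Python dictionaries so that it needs no extra packages. The unit abbreviations in `UNIT_FACTORS`, the column names, and the function `transform` are all hypothetical; the real mapping would come from the side dataset.

```python
# Hypothetical abbreviations and factors; the real side dataset may differ.
UNIT_FACTORS = {"THS": 1_000, "MIO": 1_000_000}


def transform(rows, rename, unit_column="unit"):
    """Drop incomplete rows, replace unit abbreviations, and rename columns."""
    cleaned = []
    for row in rows:
        # Skip rows that contain a missing value.
        if any(value is None or value == "" for value in row.values()):
            continue
        # Replace a known unit abbreviation with its numeric factor.
        if unit_column in row:
            unit = row[unit_column]
            row = {**row, unit_column: UNIT_FACTORS.get(unit, unit)}
        # Rename ambiguous columns; all other columns keep their names.
        cleaned.append({rename.get(key, key): value for key, value in row.items()})
    return cleaned
```

For example, `transform(rows, {"value": "emissions"})` would rename a generic `value` column so that it cannot be confused with a `value` column from another source.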

Load

Then we merge the datasets on their shared columns and drop the rows with NA values. Finally, we save the result in a new SQLite database.
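The merge and save steps might look like the sketch below, again using only the standard library. The helper names `merge_on` and `save_to_sqlite`, the inner-join behaviour, and the decision to replace an existing table are assumptions, not a description of the project's code.

```python
import sqlite3


def merge_on(left, right, keys):
    """Inner-join two lists of row dicts on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    merged = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**match, **row})
    return merged


def save_to_sqlite(rows, db_path, table):
    """Write row dicts to a SQLite table, replacing the table if it exists."""
    if not rows:
        return
    columns = list(rows[0])
    quoted = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(f'CREATE TABLE "{table}" ({quoted})')
            conn.executemany(
                f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )
    finally:
        conn.close()
```

Rows with missing values can be filtered out between the two calls, for instance with `[r for r in merged if None not in r.values()]`.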

Pipeline

A bash script in the project directory creates a virtual environment, installs the packages listed in requirements.txt, and runs the pipeline.

Note

All data and downloaded files are stored in the \data directory.
