This project automates the extraction, comparison, and storage of updated/inserted OECD dataset information, leveraging APIs and structured workflows.
- Data Fetching: Downloads dataflows from the OECD API.
- Data Comparison: Compares new datasets with previous instances to detect changes (new inserts, deletions).
- Data Management: Saves dataset and metadata for future processing.
- Change Archival: Archives previous datasets and records changes for future reference.
- Error Logging: Logs detailed information about the execution process.
`Functions/`: Contains core utility scripts:
- `data_fetcher.py`: Fetches dataflows and saves them to CSV. Designed for the first-time dataset download as well as subsequent downloads (`all_dataflows_new.csv`, `all_dataflows_previous.csv`).
- `data_comparator.py`: Compares the old dataset (`all_dataflows_previous.csv`) with the new one (`all_dataflows_new.csv`) and writes the identified changes to `data_changes.csv`.
- `api_downloader.py`: Downloads datasets and metadata for new entries.
- `logger.py`: Configures logging for the project.
`base_run.py`: Initializes and manages the data-fetching workflow on the first run to create the base dataset.
`main.py`: Invokes the regular workflow, including data fetching, comparison, and metadata updates. This script is meant to be scheduled and invoked at regular intervals.
`config.yaml`: Configuration file with API endpoints, file names, and file paths.
`requirements.txt`: Lists Python dependencies.
`linux_cron_setup.txt`: Scheduling setup that runs the `main.py` job every two weeks, on Tuesday at 7 am, on a Linux server (sketched below).
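Standard cron has no native fortnightly interval, so a common workaround, and presumably what `linux_cron_setup.txt` uses, is a weekly Tuesday 07:00 trigger guarded by a week-parity test. A minimal sketch, with a placeholder project path:

```
# Fires every Tuesday at 07:00; the guard skips every other week
# (epoch-week parity), giving a once-every-two-weeks cadence.
# Note: % must be escaped as \% inside a crontab entry.
0 7 * * 2 [ $(( $(date +\%s) / 604800 \% 2 )) -eq 0 ] && cd /path/to/project && python main.py >> logs/cron.log 2>&1
```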
Ensure the following folders exist in the project directory before running the scripts (a snippet to create them follows this list):
- `logs/`: Used for storing log files generated by the scripts.
- `output/`: Used for saving downloaded datasets and metadata files.
- `data/`: Used for storing the main dataset files (e.g., `all_dataflows_new.csv`, `all_dataflows_previous.csv`, `data_changes.csv`).
- `data/archive/`: Folder for archiving old datasets or backups.
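A one-off snippet to create these folders from Python, if you'd rather not make them by hand (folder names taken from the list above):

```python
import os

# Create the folders the scripts expect; exist_ok makes this safe to re-run.
for folder in ("logs", "output", "data", "data/archive"):
    os.makedirs(folder, exist_ok=True)
```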
Initial Setup:
- Run `base_run.py` to fetch and save the baseline dataset: `python base_run.py`
- This needs to run only once; it creates the first version of the dataset (`all_dataflows_previous.csv`) and sets up the workspace for subsequent runs. A sketch of the underlying fetch follows this list.
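For orientation, here is a rough sketch of what the baseline fetch in `data_fetcher.py` might look like. The endpoint URL, response layout, and column names are assumptions for illustration; the real values come from `config.yaml`.

```python
import pandas as pd
import requests

# Hypothetical OECD SDMX endpoint; the real URL is configured in config.yaml.
URL = "https://sdmx.oecd.org/public/rest/dataflow/all"

resp = requests.get(
    URL,
    headers={"Accept": "application/vnd.sdmx.structure+json"},
    timeout=60,
)
resp.raise_for_status()

# Assumed SDMX-JSON structure-message layout: data -> dataflows.
flows = resp.json()["data"]["dataflows"]
df = pd.DataFrame(
    [{"id": f.get("id"), "name": f.get("name"), "version": f.get("version")} for f in flows]
)
df.to_csv("data/all_dataflows_previous.csv", index=False)
```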
Regular Workflow:
- Run `main.py` for periodic execution: `python main.py`
- This will:
- Run every two weeks, on Tuesday at 7 am, once it has been moved to the Linux server and scheduled via `linux_cron_setup.txt`.
- Fetch the latest datasets from the OECD API.
- Compare the new dataset with the existing one to identify changes.
- Save detected changes to `data_changes.csv` (the diff is sketched after this list).
- Download additional data and metadata for new records.
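The diff itself can be quite small. A minimal sketch of the kind of comparison `data_comparator.py` performs, assuming the dataflows are keyed by an `id` column (the real column name may differ):

```python
import pandas as pd

KEY = "id"  # assumed key column; the actual CSVs define the real identifier

previous = pd.read_csv("data/all_dataflows_previous.csv")
new = pd.read_csv("data/all_dataflows_new.csv")

# Rows only in the new file are inserts; rows only in the old file are deletions.
inserts = new[~new[KEY].isin(previous[KEY])].assign(change_type="insert")
deletions = previous[~previous[KEY].isin(new[KEY])].assign(change_type="delete")

pd.concat([inserts, deletions]).to_csv("data/data_changes.csv", index=False)
```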
Output:
- Logs: Found in the `logs/` folder (the logging setup is sketched below).
- Change Summary: Saved as `data_changes.csv` in the `data/` directory.
- Downloaded Data and Metadata: Saved in the `output/` directory.
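The files under `logs/` are produced by whatever configuration `logger.py` applies; a minimal sketch of such a setup, with an assumed file name and format:

```python
import logging
from pathlib import Path

def get_logger(name: str) -> logging.Logger:
    """Return a logger writing timestamped records to logs/run.log
    (file name and format are assumptions for illustration)."""
    Path("logs").mkdir(exist_ok=True)
    logging.basicConfig(
        filename="logs/run.log",
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )
    return logging.getLogger(name)
```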
For a detailed explanation of each script, its role, and how they work together, refer to the Confluence page.