Case Study: Data Engineering


⚠️ PLEASE DO NOT FORK THIS REPO, AS OTHERS MAY SEE YOUR CODE. INSTEAD, USE THE "USE THIS TEMPLATE" BUTTON TO CREATE YOUR OWN REPOSITORY.


Targeted Workflow

Extract → Store raw → Preprocess → Load to PostgreSQL → Transform with dbt → Analyze

Getting Started

Install the required packages using pip. It is recommended to use a virtual environment to avoid conflicts with other projects.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Start the Postgres database using Docker:

docker-compose up -d

This will start a Postgres database on port 5432. You can access the database using any Postgres client.
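As a quick connectivity check, here is a minimal sketch using psycopg2. The database name, user, and password below are placeholders, not values defined by this repo; use whatever is configured in your docker-compose.yml.

# Minimal connectivity check for the local Postgres instance.
# NOTE: dbname, user, and password are placeholders -- match them to
# the values in your docker-compose.yml.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",    # assumed database name
    user="postgres",      # assumed user
    password="postgres",  # assumed password
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()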

Issues with CrossRef API

If you run into issues accessing the CrossRef API, you can add &mailto=your@email to the request URL. CrossRef then assigns your requests to a prioritized ("polite") pool. Use a real email address.
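A minimal sketch of such a request is shown below, using the requests package against the CrossRef /works endpoint. The query, rows, offset, and mailto parameters are standard CrossRef parameters, but the search term and email address are placeholders; the page loop is just one way to fetch more than a single page (see the todos below).

# Fetch a few pages of results from the CrossRef /works endpoint.
# The search term and email address are placeholders.
import requests

BASE_URL = "https://api.crossref.org/works"
EMAIL = "your@email"   # use a real address to be routed to the prioritized pool
ROWS = 100             # results per page

all_items = []
for page in range(3):  # loop a few pages
    params = {
        "query": "data engineering",  # placeholder search term
        "rows": ROWS,
        "offset": page * ROWS,
        "mailto": EMAIL,
    }
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    all_items.extend(resp.json()["message"]["items"])

print(f"Fetched {len(all_items)} records")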

Suggested Project Structure

.
├── data                         # Local storage for ingested data
│   ├── raw                     # Raw dumps from API
│   └── processed               # Cleaned/preprocessed files (if needed)
│
├── src                         # All Python source code
│   ├── extract                 # Code to call APIs and fetch raw data
│   │   ├── __init__.py
│   │   └── extractor.py
│   │
│   ├── preprocess              # Normalize / clean / deduplicate raw data
│   │   ├── __init__.py
│   │   └── normalize.py
│   │
│   ├── load                    # Load preprocessed data into Postgres
│   │   ├── __init__.py
│   │   └── loader.py
│   │
│   ├── utils                   # Config, logging, etc.
│   │   ├── __init__.py
│   │   ├── config.py
│   │   └── logger.py
│   │
│   └── pipeline.py             # Orchestrates all the steps end-to-end
│
├── dbt                         # dbt project directory
│   ├── models
│   │   ├── staging             # Raw to cleaned staging models
│   │   └── marts               # Final models / business logic
│   ├── seeds
│   └── snapshots
│
├── docker-compose.yml          # Docker Compose file to run Postgres
├── main.py                     # Entrypoint that runs the pipeline
├── README.md
└── requirements.txt
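To illustrate how pipeline.py could tie these modules together, here is a minimal orchestration sketch. The helper functions (fetch_raw, normalize_records, load_to_postgres) are assumed names for illustration only; they do not exist in this repository.

# pipeline.py -- minimal orchestration sketch.
# The imported helpers are illustrative names, not existing code in this repo.
import json
from pathlib import Path

from src.extract.extractor import fetch_raw              # assumed helper
from src.preprocess.normalize import normalize_records   # assumed helper
from src.load.loader import load_to_postgres             # assumed helper

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")


def run() -> None:
    # 1. Extract: call the API and dump the raw response to data/raw
    raw_records = fetch_raw()
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    (RAW_DIR / "crossref_raw.json").write_text(json.dumps(raw_records))

    # 2. Preprocess: clean and deduplicate, store in data/processed
    clean_records = normalize_records(raw_records)
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    (PROCESSED_DIR / "crossref_clean.json").write_text(json.dumps(clean_records))

    # 3. Load: write the cleaned records into Postgres (dbt transforms follow)
    load_to_postgres(clean_records)


if __name__ == "__main__":
    run()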

Some Ideas for Todos

  • improve the fetching of raw data: loop over a few pages of the API responses (see the CrossRef sketch above)
  • improve the deduplication logic: some items with different DOIs may belong together (e.g., a preprint and the journal-article version of the same work); a rough sketch follows this list
  • start the dbt schema and SQL models to run some analytics on the data, e.g., sum citations per year, per journal, or per publisher
  • decompose the main.py entrypoint into distinct pipelines and use an orchestrator such as Airflow or Prefect
  • Dockerize the Python app
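For the deduplication idea above, one possible heuristic is to group records by a normalized title, so that a preprint and its journal version land in the same candidate group even though their DOIs differ. The sketch below assumes CrossRef-style records (where "title" is a list of strings); the function names and the normalization rule are illustrative, not a definitive matching algorithm.

# Group records that likely describe the same work even if their DOIs differ
# (e.g., a preprint and the published journal article).
import re
from collections import defaultdict


def normalize_title(title: str) -> str:
    # Lowercase and collapse punctuation/whitespace so small formatting
    # differences do not break the match.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_candidates(records: list[dict]) -> dict[str, list[dict]]:
    groups: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        titles = rec.get("title") or []   # CrossRef stores the title as a list
        if titles:
            groups[normalize_title(titles[0])].append(rec)
    # Only keys with more than one record are duplicate candidates.
    return {key: recs for key, recs in groups.items() if len(recs) > 1}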

dbt

If you are not familiar with dbt, you can check their sandbox project on GitHub to get started. You can also check the dbt documentation for more information.

About

This is a case study for Data Engineers (Scilit).
