Extract → Store raw → Preprocess → Load to PostgreSQL → Transform with dbt → Analyze
Install the required packages using pip. It is recommended to use a virtual environment to avoid conflicts with other projects.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Start the Postgres database using Docker:
docker-compose up -d
This will start a Postgres database on port 5432. You can access the database using any Postgres client.
If you run into issues accessing the CrossRef API, add the query parameter mailto=your@email
to the request URL (append it as &mailto=your@email when the URL already has a query string).
CrossRef then assigns your API request to a prioritized ("polite") pool. Use a real email address.
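As a rough illustration of the mailto tip, the sketch below appends the parameter to a request URL with the standard library only. The helper name `with_mailto` is hypothetical and not part of this repository, and the `https://api.crossref.org/works` URL is assumed here as an example of the endpoint being called; adapt both to whatever `extractor.py` actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_mailto(url: str, email: str) -> str:
    """Return `url` with a single `mailto` query parameter set to `email`.

    Hypothetical helper for illustration; it is not part of this project.
    """
    parts = urlsplit(url)
    # Keep the existing parameters, dropping any earlier mailto value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "mailto"
    ]
    query.append(("mailto", email))
    # safe="@" keeps the address readable instead of encoding it as %40.
    return urlunsplit(parts._replace(query=urlencode(query, safe="@")))


if __name__ == "__main__":
    # Assumed example URL; replace the address with a real one.
    print(with_mailto("https://api.crossref.org/works?query=dbt&rows=5", "me@example.org"))
```

Building the URL this way avoids having to decide between `?` and `&` by hand, and it keeps the parameter from being added twice if the function is called on an already tagged URL.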
.
├── data # Local storage for ingested data
│ ├── raw # Raw dumps from API
│ └── processed # Cleaned/preprocessed files (if needed)
│
├── src # All Python source code
│ ├── extract # Code to call APIs and fetch raw data
│ │ ├── __init__.py
│ │ └── extractor.py
│ │
│ ├── preprocess # Normalize / clean / deduplicate raw data
│ │ ├── __init__.py
│ │ └── normalize.py
│ │
│ ├── load # Load preprocessed data into Postgres
│ │ ├── __init__.py
│ │ └── loader.py
│ │
│ ├── utils # Config, logging, etc.
│ │ ├── __init__.py
│ │ ├── config.py
│ │ └── logger.py
│ │
│ └── pipeline.py # Orchestrates all the steps end-to-end
│
├── dbt # dbt project directory
│ ├── models
│ │ ├── staging # Raw to cleaned staging models
│ │ └── marts # Final models / business logic
│ ├── seeds
│ └── snapshots
│
├── docker-compose.yml # Docker Compose file to run Postgres
├── main.py # Entrypoint that runs the pipeline
├── README.md
└── requirements.txt
- improve the fetching of raw data: loop over several pages of API responses
- improve the deduplicator logic: some items with different DOIs may belong together (e.g., a preprint and a journal article version of the same work)
- start the dbt schema and SQL models to run some analytics on the data, e.g., sum citations per year, or per journal, or per publisher
- decompose the main.py entrypoints into distinct pipelines and use an orchestrator such as Airflow or Prefect
- Dockerize the Python app
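One possible starting point for the deduplicator item above is to group records by a normalized title, so that a preprint and its journal version end up in the same group despite having different DOIs. This is only a sketch under assumptions: the record shape (a dict with flat `doi` and `title` strings), the function names, and the sample DOIs are all hypothetical, and title matching alone will produce both false positives and misses.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Lowercase the title and collapse punctuation/whitespace runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_by_title(records):
    """Map each normalized title to the DOIs sharing it (only groups of 2 or more).

    Assumes each record is a dict with string fields "doi" and "title";
    this shape is an assumption, not the project's actual schema.
    """
    groups = defaultdict(list)
    for record in records:
        groups[title_key(record["title"])].append(record["doi"])
    return {key: dois for key, dois in groups.items() if len(dois) > 1}


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        {"doi": "10.1234/preprint.1", "title": "Deep Learning for Citations"},
        {"doi": "10.5678/journal.9", "title": "Deep learning for citations."},
        {"doi": "10.9999/other.3", "title": "Another work"},
    ]
    print(group_by_title(sample))
```

The candidate groups could then be reviewed or refined with further signals such as author lists or publication year before any records are merged.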
If you are not familiar with dbt, you can check their sandbox project on GitHub to get started. You can also check the dbt documentation for more information.