This repository contains the outcomes of a 2-day hackathon at the CrossRef Metadata Sprint 2025 in Madrid, Spain. Due to the obvious time constraints, the artefacts are very much prototypical.
Goal: Connect the Retraction Watch (RW) data set with metadata from CrossRef, ROR, OpenAlex, etc., in order to, for example, analyze the DOI prefixes citing retracted papers pre- and post-retraction; analyze the evolution of the number of retractions per prefix / publisher / journal / institute / funder / country / region; and so on.
Have a local copy of the ROR API running via Docker: https://github.com/ror-community/ror-api#readme
- Clone the repository
- Activate the venv environment:
source .venv/bin/activate
- Install the Python dependencies:
pip install -r requirements.txt
- Run the pipeline to ETL the RW data set:
python src/pipeline_rw.py
- Sample the RW data set (optional):
python src/pipeline_sample.py
- Run the pipeline to match ROR IDs for affiliations:
python src/pipeline_ror.py
- Merge back ROR data into the RW data set:
python src/pipeline_rw_ror.py
- Fetch CrossRef data for the RW data set:
python src/pipeline_cr.py
- Run the web app for development:
fastapi dev src/app.py
- Run with parallel workers:
python src/app.py
Then open the browser at http://localhost:8000/ to see the prototype analysis UI.
We use the first item returned by the ROR API for affiliation matching, which the ROR documentation strongly advises against. We should instead use a proper machine learning model to match affiliations to ROR IDs, such as the one used by OpenAlex.
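For reference, the current naive look-up is roughly the following (a minimal sketch, not the exact pipeline code; the response fields follow the public ROR v1 affiliation route, so verify them against your instance):

```python
import requests

# Public ROR API; swap in the URL of the local Docker instance from the
# prerequisite above if you run one.
ROR_API = "https://api.ror.org/organizations"

def match_affiliation_naive(affiliation: str) -> str | None:
    """Return the ROR ID of the FIRST item the affiliation query returns.

    ROR's documentation advises against this: the first item is not
    guaranteed to be a correct match unless it is flagged as "chosen".
    """
    resp = requests.get(ROR_API, params={"affiliation": affiliation}, timeout=10)
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        return None
    # Naive: take items[0] regardless of score or the "chosen" flag.
    return items[0]["organization"]["id"]
```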
- Affiliation matching to ROR: use a machine learning model such as those employed by OpenAlex
- Funder data from CrossRef: we use the first item returned by the CrossRef works API, but do not follow the funder hierarchy. Ideally we should use the hierarchy data so that retractions can be reconciled at higher funder levels (such as a national funder level); see the funder-hierarchy sketch after this list.
- Solution performance and scaling:
- We do lots of API-based look-ups, which may fail with non-200 HTTP codes for various reasons (neither robust nor performant); see the retry sketch after this list. Possible remedies:
- Matching affiliations locally instead of via the ROR API, to scale the solution to larger data sets
- Matching CrossRef metadata locally from the public dump file instead of via the CrossRef API
- Load cited-by data from CrossRef (or alternatively OpenAlex) to create a second enriched dataset of “papers citing retractions”.
- Idea: create a weighted citation factor per paper / author / journal based only on citations to retracted papers (the further away the retracted paper, the more discounted: direct citation, paper → retracted paper; discounted citation, paper → paper → retracted paper; etc.); a toy scoring sketch follows this list.
- Fix some bugs:
- At some point in one of the pipelines, dumping the dataframe to parquet converts the retraction reasons from an array to a string. The array format should be retained for subsequent analysis.
- Some retractions’ original paper DOIs are not registered with CrossRef but with other registration agencies such as DataCite or mEDRA (example).
- Load the dataset into an Elasticsearch or Solr index for better query / facet-based refinement and analysis based on user input.
- Write proper pipelines in proper Python 😅
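For the funder-hierarchy item above: a minimal sketch of pulling a Funder Registry record via api.crossref.org. The hierarchy-names field appears in the public funders route, but treat the exact response shape as an assumption to verify:

```python
import requests

def funder_hierarchy_names(funder_id: str) -> dict:
    """Fetch a Funder Registry record and return its hierarchy-names map,
    which lists the funder IDs in the funder's hierarchy with their names.
    funder_id is a Funder Registry ID such as "100000001"."""
    resp = requests.get(f"https://api.crossref.org/funders/{funder_id}", timeout=10)
    resp.raise_for_status()
    message = resp.json()["message"]
    return message.get("hierarchy-names", {})
```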
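For the robustness item: a standard remedy for transient non-200 responses is a requests session with urllib3 retries and exponential backoff (a sketch of the technique, not what the pipelines currently do):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=5,
        backoff_factor=1.0,  # waits ~1s, 2s, 4s, ... between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET",),
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```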
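And for the weighted citation factor idea: one way to formalize "the further away, the more discounted" is a geometric discount d**k for a citation path of length k ending at a retracted paper. A toy sketch under that assumption (the discount factor, depth cap, and traversal are all illustrative design choices, not a settled design):

```python
def retraction_citation_factor(paper, citations, retracted, d=0.5, max_depth=3):
    """Score a paper by its citation paths into retracted papers.

    A direct citation to a retracted paper contributes d**1, a citation
    to a paper that cites a retracted paper contributes d**2, and so on.
    `citations` maps a DOI to the DOIs it cites; `retracted` is a set of
    retracted-paper DOIs. Purely illustrative: no cycle handling beyond
    the depth cap, and every distinct path counts.
    """
    def walk(doi, depth):
        if depth > max_depth:
            return 0.0
        score = 0.0
        for cited in citations.get(doi, ()):
            if cited in retracted:
                score += d ** depth
            score += walk(cited, depth + 1)
        return score

    return walk(paper, 1)

# Example: A cites R (retracted) and B; B cites R.
# factor(A) = 0.5 (direct) + 0.25 (via B) = 0.75
citations = {"A": ["R", "B"], "B": ["R"]}
print(retraction_citation_factor("A", citations, retracted={"R"}))  # 0.75
```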
General note: articles in RW may not have a DOI that is registered with CrossRef, so we need to check each DOI's registration agency before querying the CrossRef API.
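A lightweight check is the doi.org registration-agency service, which reports the RA for a given DOI (a sketch; the response shape follows the documented list-of-objects form, but verify it):

```python
import requests

def is_crossref_doi(doi: str) -> bool:
    """Check whether a DOI is registered with Crossref via the doi.org
    registration-agency lookup (https://doi.org/ra/<doi>)."""
    resp = requests.get(f"https://doi.org/ra/{doi}", timeout=10)
    resp.raise_for_status()
    items = resp.json()
    # Response is a list like [{"DOI": "...", "RA": "Crossref"}];
    # unknown DOIs carry a "status" field instead of "RA".
    return bool(items) and items[0].get("RA") == "Crossref"
```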
Data fields in Retraction Watch (RW) data set:
- subject: this field does not seem easily usable; we may need a subject classification API
- institution: this piece of data might not be present in CrossRef, so we can take it from the RW data set. We need to match it against the ROR API to get the institution's main ROR ID.
- journal and publisher: we should not rely on RW data for these but take them from CrossRef.
- country: we ignore this piece of information as we will infer it from the institution's ROR record.
- author: we can take this from the CrossRef data set.
- urls: needs preprocessing as it may contain multiple URLs. Contains the URL to the Retraction Watch blog post, if any.
- articletype: the type of the article, which is any of:
['Clinical Study', 'Supplementary Materials', 'Auto/Biography', 'Interview/Q&A', 'Expression of Concern', 'Book Chapter/Reference Work', 'Case Report', 'Trade Magazines', 'Conference Abstract/Paper', 'Correction/Erratum/Corrigendum', 'Dissertation/Thesis', 'Preprint', 'Meta-Analysis', 'Commentary/Editorial', 'Research Article', 'Review Article', 'Article in Press', 'Other', 'Legal Case/Analysis', 'Retraction Notice', 'Technical Report/White Paper', 'Guideline', 'Government Publication', 'Letter', 'Retracted Article', 'Revision']
We will use this to filter out the articles we are interested in (see the filtering sketch after this field list).
- retractiondate: we keep this info; we need to check whether any cited-by publication was published after this date. Key data point.
- retractiondoi: this is the DOI of the RETRACTION NOTICE.
- originalpaperdoi: this is the DOI of the article that has been RETRACTED. Key data point.
- retractionnature: type of the retraction, which is any of:
['Retraction', 'Correction', 'Expression of concern', 'Reinstatement']
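As referenced above, a sketch of how these fields could drive the filtering with pandas (the column names follow the RW field names listed above; the parquet path is a hypothetical placeholder for the pipeline_rw.py output):

```python
import pandas as pd

# Hypothetical path; the actual output location of pipeline_rw.py may differ.
rw = pd.read_parquet("data/rw.parquet")

# Keep genuine retractions of research articles with a usable DOI.
# Note: articletype may hold several ";"-joined values in the raw CSV,
# so a substring check is safer than an exact-match isin().
mask = (
    rw["retractionnature"].eq("Retraction")
    & rw["articletype"].str.contains("Research Article", na=False)
    & rw["originalpaperdoi"].notna()
)
retracted = rw.loc[mask].copy()

# retractiondate is the key cut-off: any citation dated after it counts
# as a post-retraction citation.
retracted["retractiondate"] = pd.to_datetime(
    retracted["retractiondate"], errors="coerce"
)
```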
Data fields of interest from the ROR dataset:
- ror_id: this is the ROR ID of the institution.
- name: this is the name of the institution.
- country: this is the country of the institution.
- region: this is the country subdivision of the institution.
Data fields of interest from the CrossRef dataset:
- doi: this is the DOI of the article.
- container: this is the title of the journal / conference / book series.
- publisher: this is the publisher of the journal / conference / book series.
- type: the publication type (journal-article, preprint, book-chapter, etc.)

