This repository contains the code used to generate the NER corpus for metadata enrichment as part of the FAIRagro pilot use case.

fairagro/pilot-uc-textmining-metadata

Increasing FAIRness of FAIRagro Data Through AI-Supported Metadata Enrichment

A FAIRagro Pilot Use Case

Hugging Face Dataset FRL Dataset

Supported by the FAIRagro consortium, the pilot use case
“Increasing FAIRness of FAIRagro data through AI-supported metadata enrichment”
develops a manually annotated agricultural text corpus to support the training and evaluation of Named Entity Recognition (NER) models.

These NER models enable automated extraction of key metadata—such as crop species, soil properties, geographic information, and temporal statements—from dataset titles and abstracts. This supports metadata enrichment at scale and contributes to FAIR-aligned research data practices.

This work is a collaboration between:

  • ZB MED – Information Centre for Life Sciences
  • Julius Kühn-Institute (JKI)
  • Leibniz Centre for Agricultural Landscape Research (ZALF)

This repository contains all code and processing pipelines used to construct and evaluate the NER dataset.

Pipeline overview

Figure 1: Overview of the use case workflow and processing pipeline.

Folder Structure

The repository is organized as follows:

├── code
│   ├── OpenAgrar
│   ├── corpus_creation
│   ├── pre_annotations
│   ├── generate_annotations
│   └── Bonares
├── data
│   ├── OpenAgrar
│   ├── Bonares
│   └── Corpus
├── documents
│   └── proposal.pdf
├── Tutorials
├── images
├── requirements.txt
└── README.md

How to use the software

Each sub-directory of the code directory contains the code base for the component it is named after. The directories named after data sources (OpenAgrar, Bonares) contain the code used to retrieve the datasets from each source. The pre_annotations directory contains the code that converts the text data from the sources into the format used by the INCEpTION annotation software. Finally, the corpus_creation directory contains the code used to generate the final annotated corpus. To use a component, please refer to the instructions inside its directory.

1. Clone the repository

git clone https://github.com/fairagro/pilot-uc-textmining-metadata.git
cd pilot-uc-textmining-metadata

2. Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate

3. Install the dependencies

Make sure you are in the project root (where requirements.txt is located), then run:

pip install -r requirements.txt
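
If it is unclear whether the virtual environment from step 2 is actually active before installing, a small check can help. The following is a minimal sketch, not part of the repository; the function name in_virtualenv is an assumption chosen for illustration. It relies only on the standard library: inside a venv, sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run 'source venv/bin/activate' first.")
```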

Hands-on

Please refer to the Tutorials directory for a showcase of how to use the dataset for fine-tuning models. The showcase follows the token classification tutorial from Hugging Face.
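
A central step in token-classification fine-tuning is aligning word-level NER labels with the subword tokens produced by the tokenizer. The sketch below illustrates that step in plain Python, under stated assumptions: the label names (B-CROP, I-CROP), the example sentence, and the word_ids list are hypothetical and are not taken from the corpus, and the function name align_labels is chosen for illustration. In practice, word_ids would come from a fast Hugging Face tokenizer rather than being written by hand.

```python
from typing import List, Optional

# Label id that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Map word-level labels onto subword tokens.

    The first subword of each word keeps the word's label; special tokens
    (word id None) and the remaining subwords get IGNORE_INDEX.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword of a new word
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword of the same word
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hypothetical label scheme and sentence, for illustration only.
label_names = ["O", "B-CROP", "I-CROP"]
words = ["Winter", "wheat", "yields"]
word_labels = [1, 2, 0]  # B-CROP, I-CROP, O

# Hand-written stand-in for tokenizer output: "wheat" is split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]

print(align_labels(word_labels, word_ids))  # [-100, 1, 2, -100, 0, -100]
```

Labeling only the first subword keeps one prediction per word during training. Propagating the label to every subword is an alternative design; which one the tutorials in this repository use should be checked there.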

Datasets station

The data directory holds a copy of the corpus. It also serves as a convenient output location for the different software components.

Contributors

Cite as

@dataset{abdelmalak_fairagro_ner_2025,
  author       = {Abdelmalak, Abanoub and Schneider, Gabriel and Riegler, Heike and Meier, Kristin and Specka, Xenia and Svoboda, Nikolai and Husain, Murtuza and Fluck, Juliane},
  title        = {{A Manually Annotated Agricultural Dataset for AI-Based NER and FAIR Metadata Enrichment}},
  year         = {2025},
  publisher    = {Fachrepositorium Lebenswissenschaften (FRL)},
  doi          = {10.4126/FRL01-6526458},
  url          = {https://doi.org/10.4126/FRL01-6526458},
  note         = {Version 1.0}
}

Licensing

The FAIRagro Metadata Enrichment NER Dataset is released under the:

Creative Commons Attribution 4.0 International (CC BY 4.0) License

License URL:

https://creativecommons.org/licenses/by/4.0/
