Supported by the FAIRagro consortium, the pilot use case
“Increasing FAIRness of FAIRagro data through AI-supported metadata enrichment”
develops a manually annotated agricultural text corpus to support the training and
evaluation of Named Entity Recognition (NER) models.
These NER models enable automated extraction of key metadata—such as crop species, soil properties, geographic information, and temporal statements—from dataset titles and abstracts. This supports metadata enrichment at scale and contributes to FAIR-aligned research data practices.
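To make the extraction target concrete, the snippet below is a minimal sketch of how entity annotations on a dataset title can be expressed as per-token BIO tags, the usual input format for NER training. The label names (`CROP`, `LOCATION`, `TIME`), the helper `spans_to_bio`, and the example title are illustrative assumptions, not the corpus's actual annotation scheme or data.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); label names are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert token-level entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Invented example title, not taken from the corpus.
tokens = ["Winter", "wheat", "yield", "in", "Brandenburg", "2015"]
spans = [(0, 2, "CROP"), (4, 5, "LOCATION"), (5, 6, "TIME")]

for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

Tokens outside any entity receive the tag `O`, so every title or abstract maps to a tag sequence of the same length as its token sequence.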
This work is a collaboration between:
- ZB MED – Information Centre for Life Sciences
- Julius Kühn-Institute (JKI)
- Leibniz Centre for Agricultural Landscape Research (ZALF)
This repository contains all code and processing pipelines used to construct and evaluate the NER dataset.
Figure 1: Overview of the use case workflow and processing pipeline.
The repository is organized as follows:
├── code
│   ├── OpenAgrar
│   ├── corpus_creation
│   ├── pre_annotations
│   ├── generate_annotations
│   └── Bonares
├── data
│   ├── OpenAgrar
│   ├── Bonares
│   └── Corpus
├── documents
│   └── proposal.pdf
├── Tutorials
├── images
├── requirements.txt
└── README.md
Each subdirectory of the code directory contains the code base for the component it is named after. The directories named after data sources (OpenAgrar, Bonares) contain the code used to retrieve the datasets from each source. The pre_annotations directory contains the code that converts the source text into the input data for the INCEpTION annotation software. Finally, corpus_creation contains the code used to generate the final annotated corpus. To use a component, please refer to the instructions inside its directory.
git clone https://github.com/fairagro/pilot-uc-textmining-metadata.git
cd pilot-uc-textmining-metadata
python3 -m venv venv
source venv/bin/activate
Make sure you are in the project root (where requirements.txt is located), then run:
pip install -r requirements.txt
Please refer to the tutorials for a showcase of how to use the dataset for fine-tuning models. They follow the token-classification tutorial from Hugging Face.
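A central step in that kind of token-classification fine-tuning is aligning word-level labels with the subword tokens produced by the tokenizer. The following is a minimal, dependency-free sketch of that step, not the code from the tutorials in this repository. It assumes `word_ids` is the list a Hugging Face fast tokenizer returns from `word_ids()` (one entry per subword token, `None` for special tokens); the function name `align_labels` and the label ids are illustrative.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to one label id per subword token.

    Only the first subword of each word keeps the word's label; special
    tokens and continuation subwords get IGNORE_INDEX so the loss skips them.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special token, e.g. [CLS]
        elif word_id != previous:
            aligned.append(word_labels[word_id])     # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)             # continuation subword
        previous = word_id
    return aligned


# Three words, the second split into two subwords, wrapped in special tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [1, 2, 0]  # hypothetical label ids, one per word
print(align_labels(example_word_ids, example_labels))
```

Labelling only the first subword is one common convention; an alternative is to copy the word's label to every subword, which changes how the loss weights long words.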
The data directory holds a copy of the corpus. It can also serve as an output location for the different software components.
- Abanoub Abdelmalak
- Email: abdelmalak@zbmed.de
- Gabriel Schneider
- Email: schneiderg@zbmed.de
- Murtuza Husain
- Email: husain@zbmed.de
@dataset{abdelmalak_fairagro_ner_2025,
author = {Abdelmalak, Abanoub and Schneider, Gabriel and Riegler, Heike and Meier, Kristin and Specka, Xenia and Svoboda, Nikolai and Husain, Murtuza and Fluck, Juliane},
title = {{A Manually Annotated Agricultural Dataset for AI-Based NER and FAIR Metadata Enrichment}},
year = {2025},
publisher = {Fachrepositorium Lebenswissenschaften (FRL)},
doi = {10.4126/FRL01-6526458},
url = {https://doi.org/10.4126/FRL01-6526458},
note = {Version 1.0}
}
The FAIRagro Metadata Enrichment NER Dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
License URL:
