Increasing FAIRness of FAIRagro Data Through AI-Supported Metadata Enrichment

A FAIRagro Pilot Use Case

Supported by the FAIRagro consortium, the pilot use case
“Increasing FAIRness of FAIRagro data through AI-supported metadata enrichment”
develops a manually annotated agricultural text corpus to support the training and evaluation of Named Entity Recognition (NER) models.

These NER models enable automated extraction of key metadata—such as crop species, soil properties, geographic information, and temporal statements—from dataset titles and abstracts. This supports metadata enrichment at scale and contributes to FAIR-aligned research data practices.
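As a rough illustration of how token-level NER output can become structured metadata, the sketch below groups BIO-tagged tokens into entity strings keyed by type. The function name `bio_to_metadata`, the label names (`CROP`, `SOIL`, `LOCATION`, `TIME`), and the example title are assumptions made for this illustration; they are not taken from the corpus or the repository code, whose label set and format may differ.

```python
from collections import defaultdict


def bio_to_metadata(tokens, tags):
    """Group BIO-tagged tokens into entity strings keyed by entity type.

    Hypothetical helper for illustration only; not part of this repository.
    """
    metadata = defaultdict(list)
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far (if any) and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            metadata[current_type].append(" ".join(current_tokens))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the current entity of the same type.
            current_tokens.append(token)
        else:
            # "O" tags and stray I- tags close the current entity.
            flush()
    flush()
    return dict(metadata)


# Hand-written example; the labels are assumed, not the corpus label set.
tokens = ["Winter", "wheat", "yield", "on", "sandy", "loam", "in", "Brandenburg", "2015"]
tags = ["B-CROP", "I-CROP", "O", "O", "B-SOIL", "I-SOIL", "O", "B-LOCATION", "B-TIME"]
print(bio_to_metadata(tokens, tags))
```

Each key of the returned dictionary could then be mapped to a field of the target metadata schema.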

This work is a collaboration between:

ZB MED – Information Centre for Life Sciences
Julius Kühn-Institute (JKI)
Leibniz Centre for Agricultural Landscape Research (ZALF)

This repository contains all code and processing pipelines used to construct and evaluate the NER dataset.

Figure 1: Overview of the use case workflow and processing pipeline.

Folder Structure

The repository is organized as follows:

├── code
│   ├── OpenAgrar
│   ├── corpus_creation
│   ├── pre_annotations
│   ├── generate_annotations
│   └── Bonares
├── data
│   ├── OpenAgrar
│   ├── Bonares
│   └── Corpus
├── documents
│   └── proposal.pdf
├── tutorials
├── images
├── requirements.txt
└── README.md

How to use the software

Each subdirectory of the code directory contains the code base for the component it is named after. The directories named after data sources (OpenAgrar, Bonares) contain the code used to retrieve the datasets from each source. The pre_annotations directory contains the code that converts the text data from the sources into data that can be used in the INCEpTION software. Finally, corpus_creation contains the code used to generate the final annotated corpus. To use a component, please refer to the instructions inside its directory.

1. Clone the repository

git clone https://github.com/fairagro/pilot-uc-textmining-metadata.git
cd pilot-uc-textmining-metadata

2. Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate

3. Install the dependencies

Make sure you are in the project root (where requirements.txt is located), then run:

pip install -r requirements.txt

Hands-on

Please refer to the tutorials for a showcase of how to use the dataset for model fine-tuning. The showcase follows the token-classification tutorial from Hugging Face.
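A central step in token-classification fine-tuning is aligning word-level labels with the subword tokens produced by the tokenizer. The sketch below shows one common strategy: label only the first subword of each word and mask the rest. It is a minimal illustration, not the code from the tutorials directory. The function name `align_labels` is hypothetical, and the `word_ids` list is hand-written here; in practice it would come from the `word_ids()` method of a fast Hugging Face tokenizer's output.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Expand word-level label ids to subword level.

    word_ids:    one entry per subword token; None marks special tokens.
    word_labels: one label id per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] or [SEP] carry no label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word receives the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining subwords of the same word are masked out.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hand-written example: three words split into 2, 1, and 3 subwords.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [1, 2, 0]
print(align_labels(word_ids, word_labels))
# → [-100, 1, -100, 2, 0, -100, -100, -100]
```

Masking the continuation subwords keeps the loss from counting a multi-token word several times; labeling every subword instead is an equally valid design choice.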

Datasets station

The data directory holds a copy of the corpus. It also serves as a convenient output location for the different software components.

Contributors

Abanoub Abdelmalak
- Email: abdelmalak@zbmed.de
Gabriel Schneider
- Email: schneiderg@zbmed.de
Murtuza Husain
- Email: husain@zbmed.de

Cite as

@dataset{abdelmalak_fairagro_ner_2025,
  author       = {Abdelmalak, Abanoub and Schneider, Gabriel and Riegler, Heike and Meier, Kristin and Specka, Xenia and Svoboda, Nikolai and Husain, Murtuza and Fluck, Juliane},
  title        = {{A Manually Annotated Agricultural Dataset for AI-Based NER and FAIR Metadata Enrichment}},
  year         = {2025},
  publisher    = {Fachrepositorium Lebenswissenschaften (FRL)},
  doi          = {10.4126/FRL01-6526458},
  url          = {https://doi.org/10.4126/FRL01-6526458},
  note         = {Version 1.0}
}

Licensing

The FAIRagro Metadata Enrichment NER Dataset is released under the:

Creative Commons Attribution 4.0 International (CC BY 4.0) License

License URL:

https://creativecommons.org/licenses/by/4.0/
