This repository contains the code, documentation, and evaluation framework for my Master’s Thesis in Applied Computer Science at the University of Bamberg.
- Author: Yannick Lang
- Supervisor: Dr. Sean Papay (Bamberg NLP-Group)
- Submission Date: December 6, 2025
- Defense Date: January 15, 2026
Traditional linear-chain Conditional Random Fields (CRFs) excel at local sequence labeling but struggle to enforce global structural constraints. This thesis presents a system that automatically discovers global patterns, such as role ordering and co-occurrence, and encodes them into a Regular-Language-Constrained CRF (RegCCRF).
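
To make "global structural constraint" concrete, here is a minimal sketch that is not taken from the thesis code. It writes one ordering and co-occurrence constraint as a regular expression over BIO tags and checks whether a tag sequence belongs to that language. The label names (`Source`, `Cue`) and the helper `satisfies` are assumptions chosen for illustration. The sketch only tests membership after the fact, whereas the system described above encodes such constraints into the RegCCRF itself.

```python
import re

# Hypothetical constraint (illustrative labels, not from the thesis):
# every sequence has exactly one Cue span, and an optional Source span
# may only appear before it.
CONSTRAINT = re.compile(
    r"(O )*"                            # leading tokens outside any span
    r"(B-Source (I-Source )*(O )*)?"    # optional Source span before the cue
    r"B-Cue (I-Cue )*"                  # exactly one Cue span
    r"(O )*"                            # trailing tokens outside any span
)


def satisfies(tags):
    """Return True if the tag sequence is in the constraint's regular language."""
    # Each tag is followed by one space so that every tag is a complete token.
    encoded = "".join(tag + " " for tag in tags)
    return CONSTRAINT.fullmatch(encoded) is not None


valid = ["O", "B-Source", "I-Source", "O", "B-Cue", "O"]
wrong_order = ["B-Cue", "O", "B-Source"]
print(satisfies(valid), satisfies(wrong_order))
```

A purely local model scores each tag transition in isolation, so it has no way to rule out the second sequence, where the `Source` span follows the `Cue`.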
The system was evaluated on four diverse relation extraction and semantic role labeling tasks and one NER task:
- GePaDeSpkAtt: German parliamentary debate events.
- Genia: Biomedical event extraction + Named Entity Recognition.
- RiQuA: Speech events in English literature.
- OntoNotes 5.0: Large-scale general-domain semantic role labeling.
These datasets are not included in this repository.
| Folder | Description |
|---|---|
| 01-proposal/ | LaTeX source for the initial thesis proposal. |
| 02-implementation/ | Core Logic: Constraint discovery, selection algorithms, and automaton generation. |
| 03-paper/ | LaTeX source for the final thesis, including raw evaluation results and plotting scripts. |
| 04-sources/ | Archived PDFs (omitted due to copyright). |
| 05-presentation/ | Defense slides (PDF) and speaker notes. |
Prerequisites
- Python 3.9
- PyTorch (CUDA)
- Specific dependencies listed in 02-implementation/environment.yml
- Some model configurations require up to 16 GB of VRAM.
Data Note: Due to licensing, datasets are not included.
- Automated Discovery: Identifies recurring patterns in relation structures.
- Constraint Selection: Algorithms to filter noise and retain high-impact regular constraints.
- Automaton Integration: Converts induced rules into finite-state automata (FSAs) that interface with the model.
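
The three steps above can be pictured with a small toy example. This is a simplified sketch under stated assumptions, not the implementation in 02-implementation/. It assumes each training example is reduced to the order in which its roles first appear, discovers "A precedes B" patterns, keeps only those with enough support and no counterexample, and turns one retained rule into a three-state automaton. All function names, the `min_support` parameter, and the role labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def discover_precedence(examples, min_support=2):
    """Find pairs (a, b) where a precedes b in every example containing both.

    A pair is kept only if at least `min_support` examples support it
    (a stand-in for the selection step, which filters out noisy patterns).
    """
    before = Counter()    # (a, b) -> examples where a first appears before b
    together = Counter()  # {a, b} -> examples containing both roles
    for roles in examples:
        first = {}
        for position, role in enumerate(roles):
            first.setdefault(role, position)  # remember the first occurrence only
        for a, b in combinations(sorted(first), 2):
            together[frozenset((a, b))] += 1
            if first[a] < first[b]:
                before[(a, b)] += 1
            else:
                before[(b, a)] += 1
    return sorted(
        pair for pair, count in before.items()
        if count >= min_support and count == together[frozenset(pair)]
    )


def precedence_automaton(a, b):
    """Build an acceptor that rejects sequences in which `a` occurs after `b`."""
    # States: 0 = b not seen yet, 1 = b seen, 2 = dead state (rule violated).
    def accepts(labels):
        state = 0
        for label in labels:
            if state == 0 and label == b:
                state = 1
            elif state == 1 and label == a:
                state = 2
        return state != 2
    return accepts


# Toy data with illustrative role names.
examples = [
    ["Source", "Cue", "Message"],
    ["Source", "Cue"],
    ["Cue", "Message", "Addressee"],
    ["Message", "Source", "Cue"],
]
constraints = discover_precedence(examples)
accepts = precedence_automaton(*constraints[0])
print(constraints, accepts(["Cue", "Source"]))
```

In this toy data, the pattern "Cue precedes Message" is dropped because the fourth example contradicts it, while patterns seen only once fall below the support threshold.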