The code provided here extracts names and husband-wife relationships from notarial acts written in Dutch.
Using the code is easy: download the complete package and run the main file 'nerd_main.py' with Python:
python nerd_main.py
nerd_main.py contains the main class NERD(text) that can be used as
nerd = Nerd(a_piece_of_text)
Once an instance nerd is made, the references can be extracted by
nerd.get_references()
and the relations can be extracted by
nerd.get_relations()
Also, a highlighted HTML text can be exported by using the following code
nerd.get_highlighted_text()
module_preprocess.py contains the code for preprocessing the text and removing or correcting bad text patterns
module_names.py contains the code for tagging words
module_refs contains the code for using the tagged words to extract references
module_rels contains the code for detecting the husband-wife relationships
The /db folder contains some dictionaries required to extract the names from text
  first_name.txt: list of frequent first names in Dutch
  last_name_multiple.txt: list of common last names that consist of more than one word
  starting_words.py: list of words that start a sentence and can be problematic in detecting the correct pattern of names
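To illustrate why a first-name dictionary and a list of sentence-starting words are both useful, here is a minimal sketch of dictionary-based word tagging. It is not the code in module_names.py: the function `tag_words`, the tag labels, and the small in-line word sets (stand-ins for the files in the /db folder) are all assumptions made for this example.

```python
# Hypothetical sketch of dictionary-based tagging; NOT the package's own code.
# The two sets below are tiny stand-ins for the dictionaries in the /db folder.
FIRST_NAMES = {"jan", "pieter", "maria"}
STARTING_WORDS = {"heden", "op"}


def tag_words(text):
    """Return (word, tag) pairs with tag FIRST_NAME, CAPITALIZED or OTHER."""
    tagged = []
    for raw in text.split():
        word = raw.strip(".,;:")  # drop punctuation attached to the word
        lower = word.lower()
        if lower in FIRST_NAMES:
            tag = "FIRST_NAME"
        elif word[:1].isupper() and lower not in STARTING_WORDS:
            # Capitalized and not a known sentence opener: possible name part.
            tag = "CAPITALIZED"
        else:
            tag = "OTHER"
        tagged.append((word, tag))
    return tagged


example = "Heden verscheen Jan Jansen, echtgenoot van Maria Pieters."
for word, tag in tag_words(example):
    print(word, tag)
```

Without the starting-words set, the capitalized sentence opener "Heden" would be tagged as a possible name part, which is the kind of false pattern the starting-words list is described as guarding against.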
According to a first evaluation on 48 notarial acts containing 309 individual names, 278 names were extracted correctly and 31 names were not detected (recall: 90%, precision: 91%).
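The recall figure follows directly from the reported counts, as the short check below shows. The variable names are chosen for this example only. Precision cannot be recomputed the same way, because the number of incorrectly extracted names is not reported here.

```python
# Recall = correctly extracted names / all names present in the evaluation set.
extracted_correctly = 278
undetected = 31

total_names = extracted_correctly + undetected  # 309 names in the 48 acts
recall = extracted_correctly / total_names

print(f"Recall: {recall:.1%}")
```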
This code was developed within the MiSS project (http://swarmlab.unimaas.nl/catch/), funded by NWO. It is free to use. However, the developer would highly appreciate being notified if you use it (email: bij.ranjbar@gmail.com).