This repository contains scripts to support different steps of the analyses performed in the CoNECo paper.
There is a Zenodo project associated with this repository.
Annotation documentation is available through Zenodo and this page: https://katnastou.github.io/annodoc-CoNECo/
There are three scripts in the `corpus_stats` directory to replicate the calculation of corpus statistics described in the Results and Discussion section of the manuscript. You only need to invoke the shell script in the directory:

```shell
./corpus_stats/run.sh
```

For word counting of the documents, BERT basic tokenization is used, with the implementation found here.
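As an illustration only (the statistics scripts use the actual BERT implementation linked above), a rough stdlib-only approximation of BERT's basic tokenization — whitespace splitting followed by splitting off punctuation as separate tokens — could look like this:

```python
import unicodedata

def basic_tokenize(text):
    """Approximate BERT's BasicTokenizer: split on whitespace,
    then split punctuation characters into their own tokens.
    Sketch only; not the implementation used in the paper."""
    def is_punct(ch):
        cp = ord(ch)
        # BERT also treats the ASCII symbol ranges as punctuation
        if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
            return True
        return unicodedata.category(ch).startswith("P")

    tokens = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if is_punct(ch):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens

# Token (word) count for a document is then len(basic_tokenize(doc_text))
print(basic_tokenize("Transcription factor IIH (TFIIH) is a complex."))
```

Note that, as in BERT, punctuation marks count as separate tokens, which affects the reported counts.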
This directory contains the documents in BRAT standoff and CoNLL formats.
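As a quick illustration of the BRAT standoff format, the sketch below parses a single entity line from a `.ann` file; the entity type name `Complex` and the example text are hypothetical, not taken from the corpus:

```python
def parse_brat_entity(line):
    """Parse one contiguous entity line from a BRAT standoff .ann file.

    Entity lines are tab-separated: an ID (e.g. "T1"), a
    "Type start end" span in character offsets, and the covered text.
    (Discontinuous spans, which use ';' separators, are not handled here.)
    """
    ann_id, span, text = line.rstrip("\n").split("\t")
    etype, start, end = span.split(" ")
    return {"id": ann_id, "type": etype,
            "start": int(start), "end": int(end), "text": text}

# Hypothetical example line
entity = parse_brat_entity("T1\tComplex 0 20\tnuclear pore complex")
```

The character offsets refer to the paired `.txt` file, so `text` must equal `doc_text[start:end]`.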
For the error analysis, the evaluation script evalso.py is used to detect false positives and false negatives in each document of the test set. To run it on the entire CoNECo test set as tagged by the Jensenlab tagger, using the annotated CoNECo test set as the gold standard, a shell script is provided.
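Conceptually, detecting false positives and false negatives reduces to comparing predicted entity spans against gold-standard spans. Below is a minimal sketch assuming exact span matching; the actual matching criteria are defined in evalso.py and may differ (e.g. boundary-overlap matching):

```python
def false_pos_neg(gold, pred):
    """Compare gold and predicted (start, end, type) spans for one document.

    A predicted entity absent from the gold standard is a false positive;
    a gold entity missed by the tagger is a false negative.
    Sketch only; not the logic of evalso.py itself.
    """
    gold_set, pred_set = set(gold), set(pred)
    fps = sorted(pred_set - gold_set)
    fns = sorted(gold_set - pred_set)
    return fps, fns

# Hypothetical spans as (start, end, type) tuples for one document
gold = [(0, 20, "Complex"), (35, 48, "Complex")]
pred = [(0, 20, "Complex"), (60, 72, "Complex")]
fps, fns = false_pos_neg(gold, pred)
```

Running the comparison per document, as the shell scripts below do, makes it easy to inspect individual errors rather than only aggregate scores.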
```shell
./error_analysis/jensenlab-tagger/run.sh
```

Similarly, for the Transformer-based tagger, you should run:
```shell
./error_analysis/transformer-tagger/run.sh
```

For large-scale tagging, the tagger needs to be set up first; instructions on how to set it up can be found here. Then execute the shell script below to obtain the results, which are also available on Zenodo:
```shell
./large-scale-jensenlab-tagger/run.sh
```

Please refer to the original repo for instructions on how to train an NER model on CoNECo.
Please refer to the original repo for instructions on how to do a large-scale run using the model trained on CoNECo.