CLUEval is a Python module and command-line interface for span-based evaluation of sequence labelling. It expects two sequences of non-overlapping spans, denoted R (reference spans) and C (candidate spans). These can, for instance, be gold spans and the predictions of an automatic model, the predictions of two different models, or the annotations of two human annotators. CLUEval provides common metrics such as precision, recall, and F1-score, and lets you choose between strict evaluation and several levels of leniency.
pip install git+https://github.com/fau-klue/CLUEval

Dependencies:
- pandas
- numpy
- networkx
CLUEval expects two files with input data in verticalised text format (VRT), where each token is on a separate line and annotated with BIO tags. It assumes at least two columns, the first being the token and the second the annotation, such as:
| Stephanie | B-sensitive |
| works | O |
| at | O |
| city | B-sensitive |
| court | I-sensitive |
| . | O |
It can also contain several different annotation layers, such as
| Stephanie | B-sensitive | B-nat-name | B-high |
| works | O | O | O |
| at | O | O | O |
| city | B-sensitive | B-court-name | B-low |
| court | I-sensitive | I-court-name | I-low |
| . | O | O | O |
Further information, such as token IDs, document IDs, or text domains, can be included as token-level annotation following the annotation layers:
| Stephanie | B-sensitive | B-nat-name | B-high | token0 | doc0 | legal |
| works | O | O | O | token1 | doc0 | legal |
| at | O | O | O | token2 | doc0 | legal |
| city | B-sensitive | B-court-name | B-low | token3 | doc0 | legal |
| court | I-sensitive | I-court-name | I-low | token4 | doc0 | legal |
| . | O | O | O | token5 | doc0 | legal |
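For illustration, such a file can be loaded with pandas; the following is a minimal sketch assuming a tab-separated file with the seven columns from the example above (the column names are illustrative; CLUEval's own reader is wrapped in the Convert class shown further below):

import pandas as pd

# Hypothetical reader for a tab-separated BIO/VRT file (column names assumed)
columns = ["token", "span", "entity", "risk", "token_id", "doc_id", "domain"]
vrt = pd.read_csv("./tests/data/reference.bio", sep="\t", header=None,
                  names=columns, quoting=3)  # quoting=3 = csv.QUOTE_NONE
print(vrt.head())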
- Convert tokens to spans:
  - Extract tagged tokens for each annotation layer (O tags are ignored).
  - Combine consecutive tokens into a whole span.
  - Extract the span label by stripping the B- and I- prefixes.
- Unify spans across annotation layers:
  - Identify spans that overlap across layers.
  - Merge these spans into a unified span via an adjacency matrix.
- Assign span labels:
  - Collect the predicted labels for each combined span.
  - Assign the label according to majority vote.
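These three steps can be illustrated with a minimal sketch; this is not CLUEval's internal implementation, and bio_to_spans as well as the overlap test are simplifying assumptions:

from collections import Counter
import networkx as nx

def bio_to_spans(tags):
    """Collect (start, end, label) spans from a BIO tag sequence; O tokens are skipped."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last open span
        if start is not None and (tag == "O" or tag.startswith("B-")):
            spans.append((start, i - 1, tags[start][2:]))  # strip the "B-" prefix
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

tags = ["B-sensitive", "O", "O", "B-sensitive", "I-sensitive", "O"]
print(bio_to_spans(tags))  # [(0, 0, 'sensitive'), (3, 4, 'sensitive')]

# Step 2: merge overlapping spans from several layers. Overlapping spans form
# the edges of a graph, and each connected component becomes one unified span.
spans = [(0, 0), (3, 4), (3, 3), (4, 4)]
G = nx.Graph()
G.add_nodes_from(spans)
G.add_edges_from((a, b) for a in spans for b in spans
                 if a < b and a[0] <= b[1] and b[0] <= a[1])  # overlap test
unified = [(min(s[0] for s in c), max(s[1] for s in c))
           for c in nx.connected_components(G)]
print(sorted(unified))  # [(0, 0), (3, 4)]

# Step 3: majority vote over the labels collected for a merged span
print(Counter(["court-name", "court-name", "address-name"]).most_common(1)[0][0])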
- Calculation of precision, recall, and F1-score (the harmonic mean of precision and recall)
- Labelled vs. unlabelled evaluation
- When calculating recall, reference spans are classified into true positives (TP) and false negatives (FN).
- CLUEval allows lenient evaluation, which considers more spans than just exact matches as correct.
- Consider the following reference span (R) vs. different kinds of candidate spans (C):
R |==========|
C |----------|        0. exact
  |-------------|     1. contained
  |---||-----|        2. tiled
  |--------||----|    3. covered
  (all other cases)   4. unmatched
- We distinguish four types of reference-candidate alignments, defined from the perspective of the reference span, that may be counted as TP (see the sketch after this list):
  - exact match: The reference span and the candidate span are identical.
  - contained match: The reference span is fully contained in a larger candidate span.
  - tiled match: The reference span matches multiple adjacent candidate spans exactly.
  - covered match: The reference span overlaps with several adjacent candidate spans and does not exceed their combined extent.
- The level of leniency determines which kinds of matches are classified as TP:
  - 0: strict evaluation, i.e. only exact matches are classified as TP
  - 1: incl. contained matches
  - 2: incl. contained + tiled matches
  - 3: incl. contained + tiled + covered matches (default)
- All other kinds of (partially matched or unmatched) reference spans are counted as false negatives (FN).
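These match types can be illustrated with a small sketch (an illustration under simplifying assumptions, not CLUEval's actual code) that classifies one reference span against a list of candidate (start, end) spans:

def classify(ref, candidates):
    start, end = ref
    if ref in candidates:
        return "exact"
    if any(c[0] <= start and end <= c[1] for c in candidates):
        return "contained"
    # candidates overlapping the reference, in corpus order
    overlap = sorted(c for c in candidates if c[0] <= end and start <= c[1])
    if not overlap:
        return "unmatched"
    # adjacent = no gaps between consecutive overlapping candidates
    adjacent = all(a[1] + 1 == b[0] for a, b in zip(overlap, overlap[1:]))
    if adjacent and overlap[0][0] == start and overlap[-1][1] == end:
        return "tiled"
    if adjacent and overlap[0][0] <= start and end <= overlap[-1][1]:
        return "covered"
    return "unmatched"

print(classify((5, 9), [(5, 9)]))           # exact
print(classify((5, 9), [(4, 10)]))          # contained
print(classify((5, 9), [(5, 6), (7, 9)]))   # tiled
print(classify((5, 9), [(4, 6), (7, 10)]))  # covered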
- Precision is calculated by swapping the roles of the two inputs, i.e. recall is computed for the candidate spans with respect to the reference annotation.
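As a worked example, with the counts from the span-wise evaluation shown further below (TP = 60, FP = 10, FN = 10):

tp, fp, fn = 60, 10, 10
precision = tp / (tp + fp)                          # 60/70 ≈ 0.8571
recall = tp / (tp + fn)                             # 60/70 ≈ 0.8571
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.8571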
- Prepare table:
  - Match spans in the reference and candidate tables via their corpus positions (start and end offsets).
  - Identify all exact matches with an inner join and keep them in a separate table (sketched below).
  - For the remaining spans, check for lenient matches and assign the span status accordingly (see Lenient evaluation).
  - Concatenate all spans into a unified dataframe.
- Precision table:
  - Match candidate to reference spans.
  - Show how many candidate spans were correct with respect to the reference.
- Recall table:
  - Match reference to candidate spans.
  - Show how many annotated spans are correctly identified.
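The exact-match step can be sketched with pandas as follows; the two-column tables are a simplification (CLUEval's actual tables contain more columns, see the examples below):

import pandas as pd

ref = pd.DataFrame({"start": [1, 9, 16], "end": [2, 13, 20]})
cand = pd.DataFrame({"start": [9, 16], "end": [13, 21]})

# An inner join on the corpus positions keeps only the exact matches.
exact = ref.merge(cand, on=["start", "end"], how="inner")

# Reference spans without an exact match are checked leniently afterwards.
merged = ref.merge(cand, on=["start", "end"], how="left", indicator=True)
remaining = merged[merged["_merge"] == "left_only"].drop(columns="_merge")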
- Evaluate how accurately the model identifies the spans without considering the labels.
- Compare the start and end positions of reference and candidate spans with respect to the lenient level.
- Filter spans:
  - Select a label column from the reference to filter spans.
  - Specify the column value by which to retain the relevant spans in the precision and recall tables.
  - Compute span-wise metrics for the filtered spans without comparing the label values in R and C.
- Calculate span-wise evaluation metrics with label matching for each label column separately.
- Only spans that satisfy the lenient level and match in labels are counted as true positives (TP).
- All other cases are counted as false negatives (FN) or false positives (FP), respectively.
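For a single span pair, the decision can be sketched as follows (the function name and the set of accepted match types are illustrative assumptions; in practice the accepted set depends on the lenient level):

def is_labelled_tp(status, ref_label, cand_label,
                   accepted=("exact", "contained", "tiled", "covered")):
    # TP only if the match satisfies the lenient level AND the labels agree
    return status in accepted and ref_label == cand_label

print(is_labelled_tp("exact", "hoch", "hoch"))     # True
print(is_labelled_tp("exact", "hoch", "niedrig"))  # False -> counted as FN/FP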
- CLUEval provides a table for error analysis with colour-coded text spans:
  - Green (🟩): Tokens occur in both reference and candidate.
  - Red (🟥): Tokens occur in the reference but are missing in the candidate.
  - Orange (🟧): Tokens appear only in the candidate span.
- The window size of the context shown around each span can be specified.
positional arguments:
reference Path to reference file.
candidate Path to candidate or prediction file.
options:
-h, --help show this help message and exit
-v, --version output version information and exit
-l {0,1,2,3}, --lenient {0,1,2,3}
Level of leniency. 0: exact, 1: incl. contained, 2: incl. contained and tiled, 3: incl. contained, tiled and covered. (default: 3)
-a ANNOTATION_LAYER [ANNOTATION_LAYER ...], --annotation_layer ANNOTATION_LAYER [ANNOTATION_LAYER ...]
Assign names to annotation layers. (default: ['span'])
-le LABELLED_EVAL [LABELLED_EVAL ...], --labelled_eval LABELLED_EVAL [LABELLED_EVAL ...]
Layer name for labelled evaluation. (default: None)
-sle SPAN_LABEL_EVAL, --span_label_eval SPAN_LABEL_EVAL
Label column by which to filter spans. (default: None)
-slv SPAN_LABEL_VALUE, --span_label_value SPAN_LABEL_VALUE
Value by which to filter span labels. (default: None)
-cd DOMAIN_COLUMN, --domain_column DOMAIN_COLUMN
Column index of domain metadata. (default: None)
-ci DOC_ID_COLUMN, --doc_id_column DOC_ID_COLUMN
Column index of document ID. (default: None)
-ct TOKEN_ID_COLUMN, --token_id_column TOKEN_ID_COLUMN
Column index of token IDs. (default: None)
-e [{contained,tiled,covered,unmatched} ...], --error_tables [{contained,tiled,covered,unmatched} ...]
Generate error tables for the specified error types. Defaults to 'unmatched' if no values are given. (default: None)
-m, --match_tables Generate detailed precision and recall matching tables. (default: False)
-n N_ROWS, --n_rows N_ROWS
Number of table rows to display on screen. (default: 10)
-w WRITE_TO_FOLDER, --write_to_folder WRITE_TO_FOLDER
Save tables to files in the specified folder (which will be created automatically in the current working directory); 'tables' is used as the default folder name. (default: None)
- Basic span-wise evaluation
cluevaluate ./tests/data/fiktives-urteil-p1.bio ./tests/data/fiktives-urteil-p2.bio -l 0
- Multilayer evaluation
cluevaluate ./tests/data/fiktives-urteil-p1.bio ./tests/data/fiktives-urteil-p2.bio -a span entity
- Labelled evaluation
cluevaluate ./tests/data/fiktives-urteil-p1.bio ./tests/data/fiktives-urteil-p2.bio -a span entity risk -le entity risk
- Span evaluation filtered by label column
cluevaluate ./tests/data/fiktives-urteil-p1.bio ./tests/data/fiktives-urteil-p2.bio -a span entity -sle entity -slv nat-name
- Include precision and recall tables
cluevaluate ./tests/data/fiktives-urteil-p1.bio ./tests/data/fiktives-urteil-p2.bio -a ner_tags pos_tags -m
- Include error tables; NB: 'unmatched' is the default setting when the -e argument is passed without any additional parameters.
cluevaluate ./tests/data/fiktives-urteil-p1.bio ./tests/data/fiktives-urteil-p2.bio -a ner_tags pos_tags -e unmatched contained

Instead of using the provided executable script, you can also embed the CLUEval module into your evaluation script or notebook. You will need to import `Convert` and `Match` from `spans_table`, as well as `MetricsForSpansAnonymisation` and `MetricsForCategoricalSpansAnonymisation` from `evaluation`.
from clueval.spans_table import Convert, Match
ref_converter = Convert(path_to_file="./tests/data/fiktives-urteil-p1.bio", annotation_layer=["anon", "entity", "risk"])
cand_converter = Convert(path_to_file="./tests/data/fiktives-urteil-p2.bio", annotation_layer=["anon", "entity", "risk"])
reference = ref_converter()
candidate = cand_converter()
# Prepare precision and recall tables for later use by matching spans between reference and candidate
recall_matching = Match(reference, candidate)
precision_matching = Match(candidate, reference)

from clueval.spans_table import Convert
ref_converter = Convert(path_to_file="./tests/data/reference.bio",
annotation_layer=["confidence"],
token_id_column=2,
doc_id_column=3,
domain_column=4)
reference = ref_converter()
reference.head()
# | start | end | token_id_start | token_id_end | text | doc_id | domain | confidence | id |
# |--------:|------:|:-----------------|:---------------|:-----------------------------------|:----------------|:------------------|:-------------|:---------|
# | 1 | 2 | token_1 | token_2 | AMTSGERICHT ERLANGEN | fictitious_1512 | fictitious_domain | niedrig | id000001 |
# | 4 | 8 | token_4 | token_8 | Mozartstraße 23 , 91052 Erlangen | fictitious_1512 | fictitious_domain | hoch | id000002 |
# | 9 | 10 | token_9 | token_10 | HELGA SCHMIDT | fictitious_1512 | fictitious_domain | hoch | id000003 |
# | 12 | 16 | token_12 | token_16 | Schillerstraße 4 , 91058 Erlangen | fictitious_1512 | fictitious_domain | hoch | id000004 |
# | 17 | 20 | token_17 | token_20 | Rechtsanwälte Schneider & Kollegen | fictitious_1512 | fictitious_domain | mittel | id000005 |

start: Span start position in corpus
end: Span end position in corpus
token_id_start: Predefined token ID for span start token in corpus
token_id_end: Predefined token ID for span end token in corpus
text: Extracted text span
doc_id: Document ID
domain: Text domain
id: Span ID
input annotation layers: Tag columns (here: confidence)
# Show first 5 rows from recall_matching dataframe
recall_matching.head()
# | start | end | token_id_start | token_id_end | text | doc_id | domain | anon | entity | risk | token_id_start_Y | token_id_end_Y | text_Y | anon_Y | entity_Y | risk_Y | status | start_Y | end_Y |
# |--------:|------:|:---------------|:-------------|:---------------------------------|:-------|:-------|:-------|:-------------|:--------|:-------------------|:-----------------|:---------------------------------|:---------|:-------------|:---------|:---------|:----------|:--------|
# | 1 | 2 | | | AMTSGERICHT ERLANGEN | | | anon | court-name | niedrig | | | | | | | unmatch | -100 | -100 |
# | 6 | 8 | | | 11 C 122/20 | | | anon | court-docket | niedrig | | | 11 C 122/20 | anon | court-docket | niedrig | exact | 7 | 9 |
# | 9 | 13 | | | Mozartstraße 23 , 91052 Erlangen | | | anon | address-name | hoch | | | Mozartstraße 23 , 91052 Erlangen | anon | address-name | hoch | exact | 10 | 14 |
# | 16 | 20 | | | 09131 / 782 - 01 | | | anon | code-idx | niedrig | | | 09131 / 782 - 01 | anon | code-idx | niedrig | exact | 17 | 21 |
# | 23 | 27 | | | 09131 / 782 - 105 | | | anon | code-idx | niedrig | | | 09131 / 782 - 105 | anon | code-idx | niedrig | exact | 24 | 28 |

start: Span start position in reference corpus
end: Span end position in reference corpus
token_id_start: Predefined token ID for span start token in reference corpus
token_id_end: Predefined token ID for span end token in reference corpus
text: Extracted text span
doc_id: Document ID
domain: Text domain
start_Y: Span start position in candidate corpus
end_Y: Span end position in candidate corpus
token_id_start_Y: Predefined token ID for span start token in candidate corpus
token_id_end_Y: Predefined token ID for span end token in candidate corpus
text_Y: Candidate text span
status: Type of match between reference and candidate span
input annotation layers: Tag columns (here: anon, entity, risk)
from clueval.evaluation import MetricsForSpansAnonymisation
# Span-wise evaluation
span_metrics = MetricsForSpansAnonymisation(precision_table=precision_matching, recall_table=recall_matching)
span_metrics(lenient_level=1, row_name="Span Anonymisation")
# P R F1 TP_Precision ... FN FP Support row_name
# Span 85.71429 85.71429 85.71429 60 ... 10 10 70 Span

from clueval.evaluation import MetricsForCategoricalSpansAnonymisation
# Categorical span evaluation
categorical_metrics = MetricsForCategoricalSpansAnonymisation(precision_table, recall_table, classification_head="risk")
categorical_metrics(lenient_level=0)
# P R F1 TP_Precision TP_Recall FN FP Support
# Hoch 78.37838 87.87879 82.85714 29 29 4 8 33
# Mittel 66.66667 66.66667 66.66667 2 2 1 1 3
# Niedrig 92.85714 78.78788 85.24590 26 26 7 2 33

To generate a table for error assessment, use `ErrorTable` from `error_analysis`. Additionally, in order to retrieve context information, use `BIOToSentenceParser` from `spans_table`, which maps corpus positions to the corresponding tokens.
from clueval.error_analysis import ErrorTable
from clueval.spans_table import BIOToSentenceParser
reference_sentences = BIOToSentenceParser("./tests/data/fiktives-urteil-p1.bio")()
# {'token_ids': [[1, 2, 3, 4], ...], 'sents': [['----------', 'AMTSGERICHT', 'ERLANGEN', '----------'], ...]}
# Example table that contains false negative cases (spans with status 'exact' are ignored here)
fn_table = recall_matching[recall_matching["status"].isin(["unmatch"])]
error_analysis = ErrorTable(match_table=fn_table, candidate_table=candidate, token_position_sentence_mapping=reference_sentences)
erroneous_table = error_analysis(headers=["entity", "risk", "risk_Y"], windows=10)
erroneous_table.head(1)
# | token_id_start | token_id_end | token_id_start_Y | token_id_end_Y | domain | entity | risk | risk_Y | reference | candidate | context | error_type | comment |
# |:-----------------|:---------------|-------------------:|-----------------:|:---------|:-------------|:--------|:---------|:---------------------|:------------|:---------------------------------------------------------------------------------------------------------------------------------------|:-------------|:----------|
# | --- | --- | -100 | -100 | --- | court-name | niedrig | --- | AMTSGERICHT ERLANGEN | --- | ---------- 🟥AMTSGERICHT ERLANGEN🟥 ---------- | unmatch | |

token_id_start: Predefined token ID for span start token in reference corpus
token_id_end: Predefined token ID for span end token in reference corpus
token_id_start_Y: Predefined token ID for span start token in candidate corpus
token_id_end_Y: Predefined token ID for span end token in candidate corpus
domain: Text domain
reference: Reference span
candidate: Candidate text span
context: Highlighted spans + context tokens
error_type: Type of matching error
annotation layers: Tag columns (here: entity and risk)
contained
Reference: 21. 05. 2020
Candidate: 21. 05. 2020 und
Error: contained
Context: [...] durch Richterin am Amtsgericht Arnold aufgrund der mündlichen Verhandlungen vom 🟩21. 05. 2020🟩 🟧und🟧 25. 06. 2020 folgendes
unmatched
Reference: AMTSGERICHT ERLANGEN
Candidate: ---
Error: unmatched
Context: ---------- 🟥AMTSGERICHT ERLANGEN🟥 ----------
Reference: Feldstraße 4 d , 91096 Möhrendorf
Candidate: Feldstraße 4 | , 91096 Möhrendorf
Error: unmatched
Context: LUISE SCHÜTZ , 🟩Feldstraße 4🟩 🟥d🟥 🟩, 91096 Möhrendorf🟩