100 changes: 57 additions & 43 deletions README.md
@@ -1,69 +1,83 @@
# CLUEval

CLUEval is a Python module and command-line interface for evaluating span predictions, e.g. for text anonymisation via token classification. It provides the common metrics precision, recall, and F1-score and lets you choose between strict matching and several levels of lenient evaluation.

## Installation
```sh
pip install git+https://github.com/fau-klue/CLUEval
```
### Dependencies
- pandas
- numpy
- networkx

## Usage
CLUEval expects input data in verticalised text format (VRT) with BIO tagging of tokens (see `tests/data/candidate.bio`):
```
----------|O|O|O
AMTSGERICHT|B-anon|B-court-name|B-niedrig
ERLANGEN|I-anon|I-court-name|I-niedrig
----------|O|O|O
```

Further meta information can be included as further token-level annotation in the VRT file, such as predefined token IDs, document IDs, or text domains (see `tests/data/reference.bio`):

```
----------|O|token_0|fictitious_1512|Fictitious_Domain
AMTSGERICHT|B-niedrig|token_1|fictitious_1512|Fictitious_Domain
ERLANGEN|I-niedrig|token_2|fictitious_1512|Fictitious_Domain
----------|O|token_3|fictitious_1512|Fictitious_Domain
```
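
Internally, such a file boils down to token-level BIO tags. The following is a minimal sketch of how labelled spans can be recovered from one tag column; the `parse_bio_vrt` helper and the pipe-separated layout are illustrative assumptions, not CLUEval's actual API:

```python
def parse_bio_vrt(lines):
    """Return a list of (start, end, label) spans from the BIO tags in column 1.

    Spans are half-open token-index intervals: (start, end) covers
    tokens start .. end - 1.
    """
    spans = []
    start, label = None, None
    for i, line in enumerate(lines):
        tag = line.split("|")[1]
        if tag.startswith("B-"):
            if start is not None:          # close the previous span
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue                       # span continues
        else:                              # "O" or an inconsistent I- tag
            if start is not None:
                spans.append((start, i, label))
                start, label = None, None
    if start is not None:                  # span running to end of file
        spans.append((start, len(lines), label))
    return spans


lines = [
    "----------|O",
    "AMTSGERICHT|B-anon",
    "ERLANGEN|I-anon",
    "----------|O",
]
print(parse_bio_vrt(lines))  # [(1, 3, 'anon')]
```

With multiple tag columns (as in the multihead example above), the same logic is applied once per column.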

## Features

### Metrics

- Precision, recall, and F1-score
- Span-wise evaluation
  - Computes evaluation metrics without taking the span label into account
  - Compares spans according to the chosen lenient level
- Categorical span-wise evaluation
  - The information category of each span is also taken into account
- Labelled vs. unlabelled evaluation
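
The metrics themselves are the standard ones. As a quick, purely illustrative sketch of the arithmetic once spans have been classified as true positives (TP), false positives (FP), and false negatives (FN):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard span-level metrics from TP/FP/FN counts (guarding against 0/0)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# e.g. 2 exactly matched spans out of 19 predicted and 9 gold spans
p, r, f = precision_recall_f1(tp=2, fp=17, fn=7)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.1053 0.2222 0.1429
```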

### Lenient evaluation

- CLUEval allows lenient evaluation, which considers more spans than just exact matches as correct
- Consider the following gold span (G) vs. five different kinds of prediction spans (P):

```
G |==========|

P |----------| 0. Exact match
|-------------| 1. Superset

|---||-----| 2. Tiling
|--------||----| 3. Overlap

(all other cases) 4. False negative
```

- When calculating recall, gold spans are classified into true positives (TP) and false negatives (FN).
- Exact matches are always counted as TP. The level of leniency determines which of the remaining cases are classified as TP.
- Superset: The reference span is contained in the candidate span.
- Tiling: The reference span matches multiple adjacent candidate spans exactly.
- Overlap: The reference span overlaps with several adjacent candidate spans but does not exceed the length of the combined candidate spans.
- Lenient level:
- 0: Strict evaluation, i.e. not lenient (default)
- 1: Superset
- 2: Superset + tiling
- 3: Superset + tiling + overlap
- Precision is calculated analogously, with the roles of gold and prediction spans swapped.
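
The lenient levels above can be sketched with simple interval arithmetic. The function below is an illustrative re-implementation of the described matching rules over half-open (start, end) token intervals, not CLUEval's own code:

```python
def classify_match(gold, candidates, lenient_level=0):
    """Decide whether a gold span counts as TP against a set of candidate spans."""
    gs, ge = gold
    if (gs, ge) in candidates:            # always TP: exact match
        return True
    # level 1 -- superset: gold is contained in a single candidate
    if lenient_level >= 1 and any(cs <= gs and ge <= ce for cs, ce in candidates):
        return True
    # candidates overlapping gold, in textual order
    hits = sorted((cs, ce) for cs, ce in candidates if cs < ge and gs < ce)
    adjacent = len(hits) > 1 and all(a[1] == b[0] for a, b in zip(hits, hits[1:]))
    if adjacent:
        # level 2 -- tiling: adjacent candidates match gold exactly
        if lenient_level >= 2 and hits[0][0] == gs and hits[-1][1] == ge:
            return True
        # level 3 -- overlap: adjacent candidates cover gold,
        # so gold does not exceed their combined extent
        if lenient_level >= 3 and hits[0][0] <= gs and ge <= hits[-1][1]:
            return True
    return False


gold = (5, 10)
print(classify_match(gold, [(5, 10)], 0))          # True  (exact)
print(classify_match(gold, [(4, 12)], 1))          # True  (superset)
print(classify_match(gold, [(5, 7), (7, 10)], 2))  # True  (tiling)
print(classify_match(gold, [(4, 7), (7, 11)], 3))  # True  (overlap)
print(classify_match(gold, [(4, 12)], 0))          # False (strict)
```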

### Join multihead classification
- Combines spans from multiple classification heads into a single span via an adjacency matrix
- The joint label is determined by majority voting across heads
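
One way to picture the join, as an assumed sketch built on the networkx dependency rather than CLUEval's actual implementation: overlapping spans form edges of a graph (the adjacency-matrix view), and each connected component collapses into one span whose label is chosen by majority vote.

```python
from collections import Counter

import networkx as nx


def join_heads(head_spans):
    """Merge overlapping spans from several classification heads.

    head_spans: one list of (start, end, label) spans per head.
    """
    spans = [s for head in head_spans for s in head]
    g = nx.Graph()
    g.add_nodes_from(range(len(spans)))
    for i, (s1, e1, _) in enumerate(spans):
        for j, (s2, e2, _) in enumerate(spans[:i]):
            if s1 < e2 and s2 < e1:  # token ranges overlap -> connect
                g.add_edge(i, j)
    merged = []
    for comp in nx.connected_components(g):
        group = [spans[k] for k in comp]
        start = min(s for s, _, _ in group)
        end = max(e for _, e, _ in group)
        label = Counter(l for _, _, l in group).most_common(1)[0][0]
        merged.append((start, end, label))
    return sorted(merged)


heads = [
    [(1, 3, "anon")],
    [(1, 3, "court-name")],
    [(1, 3, "anon")],
]
print(join_heads(heads))  # [(1, 3, 'anon')]
```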

### Table for error analysis
- CLUEval provides a table for error analysis with colour-coded text spans
  - Green (🟩): Tokens occur in both reference and candidate.
  - Red (🟥): Tokens occur in the reference but are missing from the candidate.
  - Orange (🟧): Tokens appear only in the candidate span.
- The size of the context window shown around each span can be configured
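
Under the hood the colour coding amounts to a token-level set comparison. A minimal illustrative sketch (the function name and token-ID representation are assumptions):

```python
def colour_tokens(reference, candidate):
    """Classify token positions as shared, reference-only, or candidate-only."""
    ref, cand = set(reference), set(candidate)
    return {
        "both": sorted(ref & cand),             # 🟩 green
        "reference_only": sorted(ref - cand),   # 🟥 red
        "candidate_only": sorted(cand - ref),   # 🟧 orange
    }


# "Feldstraße 4 d" vs "Feldstraße 4 |": token IDs 0-2 vs 0, 1, 3
print(colour_tokens([0, 1, 2], [0, 1, 3]))
# {'both': [0, 1], 'reference_only': [2], 'candidate_only': [3]}
```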


### cluevaluate executable script
```
@@ -289,4 +303,4 @@
Reference: Feldstraße 4 d , 91096 Möhrendorf
Candidate: Feldstraße 4 | , 91096 Möhrendorf
Error: unmatch
Context: LUISE SCHÜTZ , 🟩Feldstraße 4🟩 🟥d🟥 🟩, 91096 Möhrendorf🟩
```
3 changes: 2 additions & 1 deletion clueval/evaluation/__init__.py
@@ -7,6 +7,7 @@
)
import pandas as pd


def main(
path_reference: str,
path_candidate: str,
@@ -108,4 +109,4 @@ def main(
categorical_eval_df["Label"] = categorical_eval_df.index
categorical_eval_df.reset_index(drop=True, inplace=True)
spans_eval_df = pd.concat([spans_eval_df, categorical_eval_df]).reset_index(drop=True)
return matched_span_precision, matched_span_recall, spans_eval_df
12 changes: 12 additions & 0 deletions tests/conftest.py
@@ -11,3 +11,15 @@ def p1():
def p2():
""" annotation 1 """
return "tests/data/candidate.bio"


@pytest.fixture
def p1s():
""" annotation 1 """
return "tests/data/reference-short.bio"


@pytest.fixture
def p2s():
""" annotation 1 """
return "tests/data/candidate-short.bio"
57 changes: 57 additions & 0 deletions tests/data/candidate-short.bio
@@ -0,0 +1,57 @@
---------- B-niedrig token_0 fictitious_1512 Fictitious_Domain
AMTSGERICHT I-niedrig token_1 fictitious_1512 Fictitious_Domain
ERLANGEN I-niedrig token_2 fictitious_1512 Fictitious_Domain
---------- I-niedrig token_3 fictitious_1512 Fictitious_Domain

Mozartstraße B-hoch token_4 fictitious_1512 Fictitious_Domain
23 I-hoch token_5 fictitious_1512 Fictitious_Domain
, I-hoch token_6 fictitious_1512 Fictitious_Domain
91052 B-mittel token_7 fictitious_1512 Fictitious_Domain
Erlangen I-mittel token_8 fictitious_1512 Fictitious_Domain

HELGA B-hoch token_9 fictitious_1512 Fictitious_Domain
SCHMIDT I-hoch token_10 fictitious_1512 Fictitious_Domain
, B-niedrig token_11 fictitious_1512 Fictitious_Domain
Schillerstraße B-hoch token_12 fictitious_1512 Fictitious_Domain
4 B-mittel token_13 fictitious_1512 Fictitious_Domain
, O token_14 fictitious_1512 Fictitious_Domain
91058 B-hoch token_15 fictitious_1512 Fictitious_Domain
Erlangen I-hoch token_16 fictitious_1512 Fictitious_Domain

Rechtsanwälte B-mittel token_17 fictitious_1512 Fictitious_Domain
Schneider I-mittel token_18 fictitious_1512 Fictitious_Domain
& I-mittel token_19 fictitious_1512 Fictitious_Domain
Kollegen I-mittel token_20 fictitious_1512 Fictitious_Domain
, O token_21 fictitious_1512 Fictitious_Domain
Wiener B-hoch token_22 fictitious_1512 Fictitious_Domain
Straße B-hoch token_23 fictitious_1512 Fictitious_Domain
12 B-mittel token_24 fictitious_1512 Fictitious_Domain
, O token_25 fictitious_1512 Fictitious_Domain
90431 B-hoch token_26 fictitious_1512 Fictitious_Domain
Nürnberg I-hoch token_27 fictitious_1512 Fictitious_Domain

Steinbrecher B-mittel token_28 fictitious_1512 Fictitious_Domain
+ O token_29 fictitious_1512 Fictitious_Domain
Amberger B-mittel token_30 fictitious_1512 Fictitious_Domain
Rechtsanwälte B-niedrig token_31 fictitious_1512 Fictitious_Domain
PartGmbB O token_32 fictitious_1512 Fictitious_Domain
, B-hoch token_33 fictitious_1512 Fictitious_Domain
Gothestraße I-hoch token_34 fictitious_1512 Fictitious_Domain
1 I-hoch token_35 fictitious_1512 Fictitious_Domain
, I-hoch token_36 fictitious_1512 Fictitious_Domain
91056 B-hoch token_37 fictitious_1512 Fictitious_Domain
Erlangen I-hoch token_38 fictitious_1512 Fictitious_Domain

Die O token_39 fictitious_1512 Fictitious_Domain
Kläger O token_40 fictitious_1512 Fictitious_Domain
sind O token_41 fictitious_1512 Fictitious_Domain
Eigentümer O token_42 fictitious_1512 Fictitious_Domain
des O token_43 fictitious_1512 Fictitious_Domain
Anwesens B-hoch token_44 fictitious_1512 Fictitious_Domain
Feldstraße I-hoch token_45 fictitious_1512 Fictitious_Domain
5 I-hoch token_46 fictitious_1512 Fictitious_Domain
a I-hoch token_47 fictitious_1512 Fictitious_Domain
, I-hoch token_48 fictitious_1512 Fictitious_Domain
91096 I-hoch token_49 fictitious_1512 Fictitious_Domain
Möhrendorf I-hoch token_50 fictitious_1512 Fictitious_Domain
. O token_51 fictitious_1512 Fictitious_Domain
2 changes: 1 addition & 1 deletion tests/data/candidate.bio
@@ -35,7 +35,7 @@ Steinbrecher B-mittel token_28 fictitious_1512 Fictitious_Domain
Amberger B-mittel token_30 fictitious_1512 Fictitious_Domain
Rechtsanwälte B-niedrig token_31 fictitious_1512 Fictitious_Domain
PartGmbB O token_32 fictitious_1512 Fictitious_Domain
, O O O token_33 fictitious_1512 Fictitious_Domain
, O token_33 fictitious_1512 Fictitious_Domain
Gothestraße B-hoch token_34 fictitious_1512 Fictitious_Domain
1 I-hoch token_35 fictitious_1512 Fictitious_Domain
, O token_36 fictitious_1512 Fictitious_Domain
4 changes: 2 additions & 2 deletions tests/data/manual_evaluation.md
@@ -61,7 +61,7 @@ n spans in candidate:
#### Precision
- Hoch
- exact: TP - 1 FN - 11
- lenient 1: TP - 10 FN 2
- lenient 2: TP - 10 FN 2
- lenient 3: TP -10 FN 2

@@ -75,4 +75,4 @@ n spans in candidate:
- exact: TP - 1 FN - 3
- lenient 1: TP - 2 FN - 2
- lenient 2: TP - 2 FN - 2
- lenient 3: TP - 2 FN - 2
57 changes: 57 additions & 0 deletions tests/data/reference-short.bio
@@ -0,0 +1,57 @@
---------- O token_0 fictitious_1512 Fictitious_Domain
AMTSGERICHT B-niedrig token_1 fictitious_1512 Fictitious_Domain
ERLANGEN I-niedrig token_2 fictitious_1512 Fictitious_Domain
---------- O token_3 fictitious_1512 Fictitious_Domain

Mozartstraße B-hoch token_4 fictitious_1512 Fictitious_Domain
23 I-hoch token_5 fictitious_1512 Fictitious_Domain
, I-hoch token_6 fictitious_1512 Fictitious_Domain
91052 I-hoch token_7 fictitious_1512 Fictitious_Domain
Erlangen I-hoch token_8 fictitious_1512 Fictitious_Domain

HELGA B-hoch token_9 fictitious_1512 Fictitious_Domain
SCHMIDT I-hoch token_10 fictitious_1512 Fictitious_Domain
, O token_11 fictitious_1512 Fictitious_Domain
Schillerstraße B-hoch token_12 fictitious_1512 Fictitious_Domain
4 I-hoch token_13 fictitious_1512 Fictitious_Domain
, I-hoch token_14 fictitious_1512 Fictitious_Domain
91058 I-hoch token_15 fictitious_1512 Fictitious_Domain
Erlangen I-hoch token_16 fictitious_1512 Fictitious_Domain

Rechtsanwälte B-mittel token_17 fictitious_1512 Fictitious_Domain
Schneider I-mittel token_18 fictitious_1512 Fictitious_Domain
& I-mittel token_19 fictitious_1512 Fictitious_Domain
Kollegen I-mittel token_20 fictitious_1512 Fictitious_Domain
, O token_21 fictitious_1512 Fictitious_Domain
Wiener B-hoch token_22 fictitious_1512 Fictitious_Domain
Straße I-hoch token_23 fictitious_1512 Fictitious_Domain
12 I-hoch token_24 fictitious_1512 Fictitious_Domain
, I-hoch token_25 fictitious_1512 Fictitious_Domain
90431 I-hoch token_26 fictitious_1512 Fictitious_Domain
Nürnberg I-hoch token_27 fictitious_1512 Fictitious_Domain

Steinbrecher B-mittel token_28 fictitious_1512 Fictitious_Domain
+ I-mittel token_29 fictitious_1512 Fictitious_Domain
Amberger I-mittel token_30 fictitious_1512 Fictitious_Domain
Rechtsanwälte I-mittel token_31 fictitious_1512 Fictitious_Domain
PartGmbB I-mittel token_32 fictitious_1512 Fictitious_Domain
, O token_33 fictitious_1512 Fictitious_Domain
Gothestraße B-hoch token_34 fictitious_1512 Fictitious_Domain
1 I-hoch token_35 fictitious_1512 Fictitious_Domain
, I-hoch token_36 fictitious_1512 Fictitious_Domain
91056 I-hoch token_37 fictitious_1512 Fictitious_Domain
Erlangen I-hoch token_38 fictitious_1512 Fictitious_Domain

Die O token_39 fictitious_1512 Fictitious_Domain
Kläger O token_40 fictitious_1512 Fictitious_Domain
sind O token_41 fictitious_1512 Fictitious_Domain
Eigentümer O token_42 fictitious_1512 Fictitious_Domain
des O token_43 fictitious_1512 Fictitious_Domain
Anwesens O token_44 fictitious_1512 Fictitious_Domain
Feldstraße B-hoch token_45 fictitious_1512 Fictitious_Domain
5 I-hoch token_46 fictitious_1512 Fictitious_Domain
a I-hoch token_47 fictitious_1512 Fictitious_Domain
, I-hoch token_48 fictitious_1512 Fictitious_Domain
91096 I-hoch token_49 fictitious_1512 Fictitious_Domain
Möhrendorf I-hoch token_50 fictitious_1512 Fictitious_Domain
. O token_51 fictitious_1512 Fictitious_Domain
37 changes: 35 additions & 2 deletions tests/test_evaluation.py
@@ -1,18 +1,21 @@
import pandas as pd
from clueval.evaluation import main


def test_evaluate(p1, p2):
main(p1, p2, annotation_layer="confidence")
main(p1, p2, annotation_layer="confidence")
main(p1, p2, annotation_layer="confidence", categorical_evaluation=True, categorical_head="confidence")


def test_evaluate_same_file(p1):
main(p1, p1, annotation_layer="confidence")
main(p1, p1, annotation_layer="confidence")
main(p1, p1, annotation_layer="confidence", categorical_evaluation=True, categorical_head="confidence")


def test_span_evaluation(p1, p2):
# test exact metrics
*_, span_evaluation = main(p1, p2, annotation_layer="confidence")
@@ -44,6 +44,36 @@ def test_span_evaluation(p1, p2):
# test recall TP for lenient level 3
assert span_evaluation["TP_Recall"].values[0] == 6


def test_span_evaluation_short(p1s, p2s):
# test exact metrics
*_, span_evaluation = main(p1s, p2s, annotation_layer="confidence")
assert isinstance(span_evaluation, pd.DataFrame) and not span_evaluation.empty
assert span_evaluation["P"].values.round(4) == round(2/19 * 100, 4)
assert span_evaluation["R"].values.round(4) == round(2/9 * 100, 4)

# test lenient level 1
*_, span_evaluation = main(p1s, p2s, annotation_layer="confidence", lenient_level=1)
assert isinstance(span_evaluation, pd.DataFrame)
assert span_evaluation["P"].values.round(4) == round(15/19 * 100, 4)
assert span_evaluation["R"].values.round(4) == round(4/9 * 100, 4)

# test lenient level 2
*_, span_evaluation = main(p1s, p2s, annotation_layer="confidence", lenient_level=2)
assert isinstance(span_evaluation, pd.DataFrame)
assert span_evaluation["P"].values.round(4) == round(15/19 * 100, 4)
assert span_evaluation["R"].values.round(4) == round(5/9 * 100, 4)

# test lenient level 3
*_, span_evaluation = main(p1s, p2s, annotation_layer="confidence", lenient_level=3)
assert isinstance(span_evaluation, pd.DataFrame)
assert span_evaluation["P"].values.round(4) == round(15/19 * 100, 4)
assert span_evaluation["R"].values.round(4) == round(6/9 * 100, 4)

# test span support
assert span_evaluation["Support"].values[0] == 9


def test_categorical_evaluation(p1, p2):
# test exact metrics
*_, categorical_evaluation = main(p1, p2, annotation_layer="confidence", categorical_evaluation=True, categorical_head="confidence")
@@ -53,7 +86,7 @@ def test_categorical_evaluation(p1, p2):
# Support
assert categorical_evaluation[categorical_evaluation["Label"] == "Hoch"]["Support"].values == 7
assert categorical_evaluation[categorical_evaluation["Label"] == "Mittel"]["Support"].values == 2
assert categorical_evaluation[categorical_evaluation["Label"] == "Niedrig"]["Support"].values == 2

# test number of TP_Recall / FN
assert categorical_evaluation[categorical_evaluation["Label"] == "Hoch"]["TP_Recall"].values == 1
@@ -69,4 +102,4 @@ def test_categorical_evaluation(p1, p2):
assert categorical_evaluation[categorical_evaluation["Label"] == "Mittel"]["TP_Precision"].values == 1
assert categorical_evaluation[categorical_evaluation["Label"] == "Mittel"]["FP"].values == 5
assert categorical_evaluation[categorical_evaluation["Label"] == "Niedrig"]["TP_Precision"].values == 1
assert categorical_evaluation[categorical_evaluation["Label"] == "Niedrig"]["FP"].values == 3