41 commits
59a696e
State in the readme that this is a fork
matej-ibis-ai Feb 29, 2024
4947a91
Log progress of loading lookup structs
matej-ibis-ai Feb 29, 2024
d7a68ad
Titlecase all-caps street names
matej-ibis-ai Mar 1, 2024
7c9abb8
Enable skipping of pickling lookup structs
matej-ibis-ai Mar 4, 2024
535999e
Proofread CONTRIBUTING.md
matej-ibis-ai Mar 4, 2024
026218f
Test with all-caps "IJSWEG"
matej-ibis-ai Mar 4, 2024
6062d46
Titlecase also when loading resources
matej-ibis-ai Mar 4, 2024
712ebe4
Make `TestLookupStruct` work regardless of cwd
matej-ibis-ai Mar 4, 2024
c5bc0b6
Use `pytest-datadir` for data fixtures
matej-ibis-ai Mar 4, 2024
9ebca78
Minimize data fixtures for tests
matej-ibis-ai Mar 4, 2024
7924ff3
Make `ensure_path` a plain util function
matej-ibis-ai Mar 4, 2024
f2d9675
Add a rant (a FIXME) about transformations
matej-ibis-ai Mar 5, 2024
7f46345
Reproduce the "de Quervain" issue
matej-ibis-ai Mar 5, 2024
3620b08
Test overzealous matching of patients
matej-ibis-ai Mar 5, 2024
cabbeca
Label only entities with all pat subtags as pat
matej-ibis-ai Mar 5, 2024
3b7c20c
Add a more extensive test for patient name
matej-ibis-ai Mar 6, 2024
1eadb3c
Titlecase patient names when matching
matej-ibis-ai Mar 6, 2024
010840b
Make (patient, persoon) become persoon
matej-ibis-ai Mar 6, 2024
4ec5836
Retain first-/surname distinction longer...
matej-ibis-ai Mar 6, 2024
ca4977f
Fix tests for `PersonAnnotationConverter`
matej-ibis-ai Mar 6, 2024
8710bcd
Update documentation slightly
matej-ibis-ai Mar 6, 2024
bf9a910
Enable matching tags of tokens
matej-ibis-ai Mar 6, 2024
27c27ce
(Almost) automatically format code
matej-ibis-ai Mar 7, 2024
6a8d1f5
Address issues reported by Flake8
matej-ibis-ai Mar 7, 2024
0c10ba4
Simplify `MultiTokenLookupAnnotator`...
matej-ibis-ai Mar 7, 2024
7cc1538
Properly assign priority to patient v. other tags
matej-ibis-ai Mar 7, 2024
f299b14
Make pylint happier
matej-ibis-ai Mar 8, 2024
21546da
Reduce num of args of `_apply_context_pattern`
matej-ibis-ai Mar 11, 2024
ff0fd50
Move `SequenceTokenizer` to Docdeid
matej-ibis-ai Mar 11, 2024
ece3fa9
Replace `_DIRECTION_MAP` with an enum
matej-ibis-ai Mar 11, 2024
bb7aa3c
Improve and test `annos_by_token()`
matej-ibis-ai Mar 11, 2024
481b23f
Fix a one-off error in iterating context tokens
matej-ibis-ai Mar 12, 2024
d259c0d
Document how to run tests better + cosmetics
matej-ibis-ai Jul 12, 2024
043618b
Merge remote-tracking branch 'origin/main'
matej-ibis-ai Jul 12, 2024
82c04ed
Resolve minor issues from the merge
matej-ibis-ai Jul 12, 2024
1cc4bda
Pass older annotations to iterations of the conx annor
matej-ibis-ai Oct 28, 2024
d772c9c
Merge changes from Deduce-3.0.3
matej-ibis-ai Oct 28, 2024
6bd6ce4
Move `annos_by_token` to `Document`
matej-ibis-ai Jan 8, 2025
392e21d
Rename `SequenceAnnotator.dicts` to `ds`
matej-ibis-ai Jan 8, 2025
34a6716
Replace `list(map(f, xs))` with list comprehension
matej-ibis-ai Jan 8, 2025
02af57b
Re-add `MultiTokenLookupAnnotator` accepting a `LookupSet`
matej-ibis-ai Jan 8, 2025
42 changes: 21 additions & 21 deletions CONTRIBUTING.md
@@ -3,42 +3,42 @@
Thanks for considering making an addition to this project! These contributing guidelines should help make your life easier.

Before starting, some things to consider:
* For larger features, it would be helpful to get in touch first (through issue/email)
* For larger features, it would be helpful to get in touch first (through issue/email).
* A lot of the logic is in `docdeid`, please consider making a PR there for things that are not specific to `deduce`.
* `deduce` is a rule-based de-identifier
* In case you would like to see any rules added/removed/changed, a decent substantiation (with examples) of the potential improvement is useful
* `deduce` is a rule-based de-identifier.
* In case you would like to see any rules added/removed/changed, a decent substantiation (with examples) of the potential improvement is useful.

## Setting up the environment

* This project uses poetry for package management. Install it with ```pip install poetry```
* Set up the environment is easy, just use ```poetry install```
* This project uses poetry for package management. Install it with ``pip install poetry``.
* Setting up the environment is easy, just use ``poetry install``.
* The makefile contains some useful commands when developing:
* `make format` formats the package code
* `make lint` runs the linters (check the output)
* `make clean` removes build/test artifacts, etc
* `make format` formats the package code;
* `make lint` runs the linters (check the output);
* `make clean` removes build/test artifacts, etc.
* And for docs:
* `make build-docs` builds the docs
* `make build-docs` builds the docs.

## Runing the tests
## Running the tests

```bash
pytest .
poetry run pytest .
```

## PR checlist
## PR checklist

* Verify that tests are passing
* Verify that tests are updated/added according to changes
* Run the formatters (`make format`)
* Run the linters (`make lint`)
* Add a section to the changelog
* Add a description to your PR
* Verify that tests are passing.
* Verify that tests are updated/added according to changes.
* Run the formatters (`make format`).
* Run the linters (`make lint`).
* Add a section to the changelog.
* Add a description to your PR.

If all the steps above are followed, this ensures a quick review and release of your contribution.

## Releasing
* Readthedocs has a webhook connected to pushes on the main branch. It will trigger and update automatically.
* Create a [release on github](https://github.com/vmenger/docdeid/releases/new), create a tag with the right version, manually copy and paste from the changelog
* Build pipeline and release to PyPi trigger automatically on release
* Create a [release on Github](https://github.com/vmenger/docdeid/releases/new), create a tag with the right version, manually copy and paste from the changelog.
* Build pipeline and release to PyPI trigger automatically on release.

Any other questions/issues not covered here? Please just get in touch!
Any other questions/issues not covered here? Please just get in touch!
14 changes: 13 additions & 1 deletion README.md
@@ -7,6 +7,18 @@
![license](https://img.shields.io/github/license/vmenger/deduce)
[![black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

# About this fork

This is Matěj Korvas's fork of the original Deduce tool, available at
https://github.com/vmenger/deduce, forked on 2024-02-29. The latest version
available here has some extra or different functionality on top of the
original tool -- which is maybe obvious but the license requires me to
state it clearly.

Use at your own risk.

Original Readme documentation follows.

# deduce

> Deduce 3.0.0 is out! It is way more accurate, and faster too. It's fully backward compatible, but some functionality is scheduled for removal, read more about it here: [docs/migrating-to-v3](https://deduce.readthedocs.io/en/latest/migrating.html)
@@ -141,4 +153,4 @@ For setting up the dev environment and contributing guidelines, see: [docs/contr

## License

This project is licensed under the GNU General Public License v3.0 - see the [LICENSE.md](LICENSE.md) file for details
This project is licensed under the GNU General Public License v3.0 - see the [LICENSE.md](LICENSE.md) file for details
46 changes: 34 additions & 12 deletions base_config.json
@@ -14,7 +14,7 @@
"redactor_close_char": "]",
"annotators": {
"prefix_with_initial": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "names",
"args": {
"tag": "prefix+initiaal",
@@ -37,7 +37,7 @@
}
},
"prefix_with_interfix": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "names",
"args": {
"tag": "prefix+interfix+naam",
@@ -56,7 +56,7 @@
}
},
"prefix_with_name": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "names",
"args": {
"tag": "prefix+naam",
@@ -79,7 +79,7 @@
}
},
"interfix_with_name": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "names",
"args": {
"tag": "interfix+achternaam",
@@ -102,7 +102,7 @@
}
},
"initial_with_name": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "names",
"args": {
"tag": "initiaal+naam",
@@ -128,7 +128,7 @@
}
},
"initial_interfix": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "names",
"args": {
"tag": "initiaal+interfix+naam",
@@ -177,6 +177,32 @@
"args": {
"iterative": true,
"pattern": [
{
"name": "patient_left",
"direction": "left",
"pre_tag": [
"achternaam_patient"
],
"tag": "voornaam_patient+achternaam_patient",
"pattern": [
{
"tag": "vornaam_patient"
}
]
},
{
"name": "patient_right",
"direction": "right",
"pre_tag": [
"voornaam_patient"
],
"tag": "voornaam_patient+achternaam_patient",
"pattern": [
{
"tag": "achternaam_patient"
}
]
},
{
"name": "interfix_right",
"direction": "right",
@@ -295,11 +321,7 @@
"skip": ["."],
"pattern": [
{
"and": [
{
"lookup": "prefix"
}
]
"lookup": "prefix"
}
]
}
@@ -325,7 +347,7 @@
}
},
"street_pattern": {
"annotator_type": "deduce.annotator.TokenPatternAnnotator",
"annotator_type": "docdeid.process.SequenceAnnotator",
"group": "locations",
"args": {
"pattern": [
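
The new `patient_left` and `patient_right` context patterns above extend an existing patient-name annotation when the neighbouring annotation carries the complementary tag, producing the combined tag `voornaam_patient+achternaam_patient`. Below is a minimal, self-contained sketch of that merging behaviour; the `Span` dataclass, the `extend_patient_name` helper, and the example name are illustrative assumptions, not the actual docdeid/deduce API.

```python
# Illustrative sketch of the behaviour described by the "patient_left" /
# "patient_right" context patterns above; not the real docdeid/deduce API.
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    tag: str


def extend_patient_name(spans: list[Span]) -> list[Span]:
    """Merge adjacent (voornaam_patient, achternaam_patient) spans into one."""
    merged: list[Span] = []
    i = 0
    while i < len(spans):
        cur = spans[i]
        nxt = spans[i + 1] if i + 1 < len(spans) else None
        if (
            nxt is not None
            and cur.tag == "voornaam_patient"
            and nxt.tag == "achternaam_patient"
        ):
            # The combined span gets the joint tag, as in the config above.
            merged.append(
                Span(f"{cur.text} {nxt.text}", "voornaam_patient+achternaam_patient")
            )
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged


# Hypothetical example: a first name followed by the patient's surname is
# merged into a single span tagged "voornaam_patient+achternaam_patient".
print(extend_patient_name([Span("Jan", "voornaam_patient"),
                           Span("Jansen", "achternaam_patient")]))
```
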
107 changes: 75 additions & 32 deletions deduce/annotation_processor.py
@@ -6,9 +6,16 @@


class DeduceMergeAdjacentAnnotations(dd.process.MergeAdjacentAnnotations):
"""Merge adjacent tags, according to deduce logic: adjacent annotations with mixed
patient/person tags are replaced with a patient annotation, in other cases only
annotations with equal tags are considered adjacent."""
"""
Merges adjacent tags, according to Deduce logic:

- adjacent annotations with mixed patient/person tags are replaced
with the "persoon" annotation;
- adjacent annotations with patient tags of which one is the surname
are replaced with the "patient" annotation; and
- adjacent annotations with other patient tags are replaced with
the "part_of_patient" annotation.
"""

def _tags_match(self, left_tag: str, right_tag: str) -> bool:
"""
@@ -23,10 +30,15 @@ def _tags_match(self, left_tag: str, right_tag: str) -> bool:
``True`` if tags match, ``False`` otherwise.
"""

return (left_tag == right_tag) or {left_tag, right_tag} == {
"patient",
"persoon",
}
patient_part = [tag.endswith("_patient") for tag in (left_tag, right_tag)]
# FIXME Ideally, we should be first looking for a `*_patient` tag in
# both directions and only failing that, merge with an adjacent
# "persoon" tag.
return (
left_tag == right_tag
or all(patient_part)
or (patient_part[0] and right_tag == "persoon")
)

def _adjacent_annotations_replacement(
self,
@@ -37,14 +49,22 @@ def _adjacent_annotations_replacement(
"""
Replace two annotations that have equal tags with a new annotation.

If one of the two annotations has the patient tag, the new annotation will also
be tagged patient. In other cases, the tags are already equal.
If one of the two annotations has the "patient" tag (and the other is either
"patient" or "persoon"), the other annotation will be used. In other cases, the
tags are always equal.
"""

if left_annotation.tag != right_annotation.tag:
replacement_tag = "patient"
else:
replacement_tag = left_annotation.tag
ltag = left_annotation.tag
rtag = right_annotation.tag
replacement_tag = (
ltag
if ltag == rtag
else "persoon"
if rtag == "persoon"
else "patient"
if any(tag.startswith("achternaam") for tag in (ltag, rtag))
else "part_of_patient"
)

return dd.Annotation(
text=text[left_annotation.start_char : right_annotation.end_char],
@@ -59,20 +79,28 @@ class PersonAnnotationConverter(dd.process.AnnotationProcessor):
Responsible for processing the annotations produced by all name annotators (regular
and context-based).

Any overlap with annotations that are contain "pseudo" in their tag are removed, as
are those annotations. Then resolves overlap between remaining annotations, and maps
the tags to either "patient" or "persoon", based on whether "patient" is in the tag
(e.g. voornaam_patient => patient, achternaam_onbekend => persoon).
Any overlap with annotations that contain "pseudo" in their tag is removed, as are
those annotations. Then resolves overlap between remaining annotations, and maps the
tags to either "patient" or "persoon", based on whether "patient" is in all
constituent tags (e.g. voornaam_patient+achternaam_patient => patient,
achternaam_onbekend => persoon).
"""

def __init__(self) -> None:
def map_tag_to_prio(tag: str) -> int:
if "pseudo" in tag:
return 0
if "patient" in tag:
return 1

return 2
def map_tag_to_prio(tag: str) -> tuple[int, int, int]:
"""
Maps from the tag of a mention to its priority. The lower, the higher
priority.

The return value is a tuple of:
1. Is this a pseudo tag? If it is, it's a priority.
2. How many subtags does the tag have? The more, the higher priority.
3. Is this a patient tag? If it is, it's a priority.
"""
is_pseudo = "pseudo" in tag
num_subtags = tag.count("+") + 1
is_patient = tag.count("patient") == num_subtags
return (-int(is_pseudo), -num_subtags, -int(is_patient))

self._overlap_resolver = dd.process.OverlapResolver(
sort_by=("tag", "length"),
@@ -89,15 +117,30 @@ def process_annotations(
annotations, text=text
)

return dd.AnnotationSet(
real_annos = (
anno
for anno in new_annotations
if "pseudo" not in anno.tag and anno.text.strip()
)
with_patient = (
dd.Annotation(
text=annotation.text,
start_char=annotation.start_char,
end_char=annotation.end_char,
tag="patient" if "patient" in annotation.tag else "persoon",
text=anno.text,
start_char=anno.start_char,
end_char=anno.end_char,
tag=PersonAnnotationConverter._resolve_tag(anno.tag),
)
for annotation in new_annotations
if ("pseudo" not in annotation.tag and len(annotation.text.strip()) != 0)
for anno in real_annos
)
return dd.AnnotationSet(with_patient)

@classmethod
def _resolve_tag(cls, tag: str) -> str:
if "+" not in tag:
return tag if "patient" in tag else "persoon"
return (
"patient"
if all("patient" in part for part in tag.split("+"))
else "persoon"
)


@@ -114,7 +157,7 @@


class CleanAnnotationTag(dd.process.AnnotationProcessor):
"""Cleans annotation tags based on the corresponding mapping."""
"""Renames tags using a mapping."""

def __init__(self, tag_map: dict[str, str]) -> None:
self.tag_map = tag_map
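
To make the new priority and tag-resolution rules in `PersonAnnotationConverter` concrete, here is a small standalone sketch that mirrors the `map_tag_to_prio` and `_resolve_tag` logic from the diff above on a few example tags. The standalone copies are assumed to behave like the methods in the module; this is an illustration, not an import of the real code.

```python
# Standalone illustration of the tag priority and resolution logic added in
# deduce/annotation_processor.py above (not the real module).

def map_tag_to_prio(tag: str) -> tuple[int, int, int]:
    # Lower tuples sort first: pseudo tags win, then tags with more subtags,
    # then tags whose subtags all refer to the patient.
    is_pseudo = "pseudo" in tag
    num_subtags = tag.count("+") + 1
    is_patient = tag.count("patient") == num_subtags
    return (-int(is_pseudo), -num_subtags, -int(is_patient))


def resolve_tag(tag: str) -> str:
    # An annotation resolves to "patient" only if every subtag is a patient tag.
    if "+" not in tag:
        return tag if "patient" in tag else "persoon"
    return (
        "patient"
        if all("patient" in part for part in tag.split("+"))
        else "persoon"
    )


tags = [
    "voornaam_patient+achternaam_patient",   # all subtags patient -> patient
    "voornaam_patient+achternaam_onbekend",  # mixed subtags       -> persoon
    "achternaam_onbekend",                   # single non-patient  -> persoon
]
# Sorting by map_tag_to_prio puts the fully-patient, multi-subtag tag first.
for t in sorted(tags, key=map_tag_to_prio):
    print(t, "->", resolve_tag(t))
```
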