Course project for Formale Semantik (University of Heidelberg). We investigate Named Entity Recognition & Classification (NERC) under increasing label granularity (from coarse-grained up to ultra-fine entity typing).
OntoNotes: The 90% Solution (Hovy et al., NAACL 2006)
Fine-Grained Entity Recognition (FIGER) (Ling and Weld, AAAI 2012)
Ultra-Fine Entity Typing (Choi et al., ACL 2018)
This subproject focuses on the analysis of datasets used for Named Entity Recognition and Classification (NERC) as well as Fine-Grained Entity Typing.
The goal is to examine the characteristics and challenges of the following datasets:
- OntoNotes (Hovy et al., NAACL 2006)
- FIGER – Fine-Grained Entity Recognition (Ling & Weld, AAAI 2012)
- Ultra-Fine Entity Typing (Choi et al., ACL 2018)
The analysis focuses on:
- Label granularity
- Label distribution
- Ambiguous entities
- Multi-word entities
- Challenges for T5-based models
- Challenges for NLI-based approaches
- Possible preprocessing strategies
| Dataset | Task | Granularity | Multi-Label |
|---|---|---|---|
| OntoNotes | Classical NER | Coarse | No |
| FIGER | Fine-Grained Entity Typing | Fine | Yes |
| Ultra-Fine | Ultra-Fine Entity Typing | Very fine | Yes |
These three datasets represent different levels of complexity in entity typing.
The OntoNotes dataset is a well-established benchmark for classical Named Entity Recognition (NER). Its entity types include:
- PERSON
- ORG
- GPE
- LOC
- EVENT
- PRODUCT
- LANGUAGE
- DATE
- MONEY
- WORK_OF_ART
OntoNotes uses coarse-grained labels, meaning relatively general entity categories.
| Entity | Label |
|---|---|
| Barack Obama | PERSON |
| Apple | ORG |
| Berlin | GPE |
These categories are broad and do not distinguish finer subtypes.
Example issue:
Apple → ORG
However, the label alone does not reveal whether the mention refers to:
- a company
- a brand
- a product
The label distribution shows a clear class imbalance.
Frequent classes:
- PERSON
- ORG
- GPE
Rare classes:
- EVENT
- LAW
- LANGUAGE
This imbalance can hurt model performance, particularly on the rare classes.
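A minimal sketch of how the imbalance could be quantified, assuming the annotations are available as a flat list of label strings. The `labels` list and the helper `label_distribution` are made up for illustration and do not reflect real OntoNotes counts.

```python
from collections import Counter

# Toy label list (hypothetical, not real OntoNotes statistics).
labels = ["PERSON", "ORG", "PERSON", "GPE", "ORG",
          "PERSON", "EVENT", "GPE", "LAW", "PERSON"]


def label_distribution(labels):
    """Return (label, count, relative frequency), sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


distribution = label_distribution(labels)
for label, n, share in distribution:
    print(f"{label:<8} {n:>3} {share:6.1%}")
```

Running the same function on the real training split would show how far the rare classes fall behind the frequent ones.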
An example of an ambiguous entity is:
Amazon
Possible meanings:
- ORG (company)
- LOC (river)
- GPE (region)
The correct classification strongly depends on the context.
The dataset contains many multi-token entities.
Examples:
- New York City
- Bank of America
- United Nations
Models therefore need to detect contiguous token spans.
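Span detection is often framed as token tagging. The sketch below assumes a BIO tagging scheme (an assumption, the text above does not prescribe a scheme) and shows how contiguous spans could be recovered from per-token tags. The function name `bio_to_spans` and the example sentence are hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Collect contiguous entity spans from BIO tags as (text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and current_label == tag[2:]
        # A B- tag, an O tag, or an I- tag with a different label closes any open span.
        if not continues and current_tokens:
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if starts_new or (tag.startswith("I-") and not continues):
            # A stray I- tag is treated leniently as the start of a new span.
            current_tokens, current_label = [token], tag[2:]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["She", "works", "at", "Bank", "of", "America", "in", "New", "York", "City"]
tags = ["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-GPE", "I-GPE", "I-GPE"]
spans = bio_to_spans(tokens, tags)
print(spans)
```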
The FIGER dataset extends NER to fine-grained entity typing.
FIGER defines 112 entity types.
Examples:
- /person/actor
- /person/politician
- /location/city
- /organization/company
- /organization/sports_team
Example:
Entity: Barack Obama
Possible labels:
- person
- politician
- president
- author
The labels form a hierarchy: subtypes such as politician are nested under a parent type such as person.
An entity can have multiple types simultaneously.
Example:
Elon Musk
Labels:
- person
- entrepreneur
- engineer
- businessman
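For a classifier, multiple simultaneous types are commonly represented as a multi-hot vector over the label vocabulary. This is a minimal sketch under that assumption; the vocabulary `label_vocab` and the function `multi_hot` are illustrative and not part of FIGER itself.

```python
def multi_hot(entity_labels, label_vocab):
    """Encode a set of labels as a 0/1 vector aligned with label_vocab."""
    index = {label: i for i, label in enumerate(label_vocab)}
    vector = [0] * len(label_vocab)
    for label in entity_labels:
        vector[index[label]] = 1
    return vector


# Hypothetical toy vocabulary built from the example above.
label_vocab = ["person", "organization", "entrepreneur", "engineer", "businessman"]
vector = multi_hot(["person", "entrepreneur", "engineer", "businessman"], label_vocab)
print(vector)  # [1, 0, 1, 1, 1]
```

Each position can then be trained with an independent binary decision, so one entity may activate several labels at once.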
FIGER also exhibits strong class imbalance.
Frequent:
- person
- organization
- location
Rare:
- person/skateboarder
- person/cartoonist
Examples of multi-word entities:
- Los Angeles Lakers
- United States of America
- New York Stock Exchange
The Ultra-Fine Entity Typing dataset extends entity typing to extremely fine-grained categories.
Labels are often free-form natural language descriptions.
Examples:
- person
- father
- songwriter
- politician
- skyscraper
An entity can have labels at multiple levels of specificity.
Example:
Trump
Possible labels:
- person
- businessman
- president
- politician
- celebrity
Many labels appear only a few times in the dataset, resulting in a long-tail distribution.
T5 is a generative sequence-to-sequence model.
Challenges for NERC tasks include:
- multiple labels per entity
- structured output requirements
- open label vocabularies (especially in Ultra-Fine)
Example output:
Barack Obama → person, politician, president
The model must therefore generate all correct labels within a single output sequence.
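One way to handle multiple labels with a sequence-to-sequence model is to serialize them into a single target string and parse the generated text back into a label list. The sketch below assumes the comma-separated format from the example output; the lowercasing and de-duplication steps, as well as the function names, are assumptions for illustration. No model is called here.

```python
def encode_target(labels, sep=", "):
    """Serialize a label list into one target string for training."""
    return sep.join(labels)


def decode_target(text, sep=","):
    """Split generated text into a de-duplicated label list, preserving order."""
    seen, labels = set(), []
    for part in text.split(sep):
        label = part.strip().lower()
        if label and label not in seen:
            seen.add(label)
            labels.append(label)
    return labels


target = encode_target(["person", "politician", "president"])
print(target)  # person, politician, president

# Generated text may contain duplicates, stray whitespace, or empty fields.
parsed = decode_target("person, politician,, Person , president")
print(parsed)  # ['person', 'politician', 'president']
```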
In NLI-based approaches, entity typing is formulated as a textual inference problem.
Example hypothesis:
The entity is a politician.
Problem:
Each candidate label requires its own hypothesis.
For example:
100 labels
→ 100 NLI inferences per entity
This significantly increases computational cost.
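The cost argument can be made concrete with a small sketch. It builds one premise-hypothesis pair per candidate label using the hypothesis wording from the example; the function names, the premise sentence, and the label list are hypothetical, and no NLI model is invoked.

```python
def build_nli_pairs(premise, labels, template="The entity is a {label}."):
    """Create one (premise, hypothesis) pair per candidate label."""
    return [(premise, template.format(label=label)) for label in labels]


def inference_count(num_entities, num_labels):
    """Each entity is checked against every label, so the cost is the product."""
    return num_entities * num_labels


pairs = build_nli_pairs("Barack Obama visited Berlin.", ["person", "politician", "city"])
for premise, hypothesis in pairs:
    print(premise, "=>", hypothesis)

print(inference_count(num_entities=1, num_labels=100))  # 100
print(inference_count(num_entities=2, num_labels=100))  # 200
```

A fixed template like this one does not handle article choice ("a" vs. "an"), which a real setup would need to address.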
First, the entity spans are identified and marked in the text.
Example:
[Barack Obama] visited [Berlin]
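The bracket notation from the example could be produced as follows. This is a sketch that assumes tokenized input and non-overlapping spans given as `(start, end)` token indices with an exclusive end; `mark_spans` is a hypothetical helper.

```python
def mark_spans(tokens, spans):
    """Wrap each (start, end) token span (end exclusive) in square brackets."""
    starts = {start for start, _ in spans}
    ends = {end - 1 for _, end in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            token = "[" + token
        if i in ends:
            token = token + "]"
        out.append(token)
    return " ".join(out)


tokens = ["Barack", "Obama", "visited", "Berlin"]
marked = mark_spans(tokens, [(0, 2), (3, 4)])
print(marked)  # [Barack Obama] visited [Berlin]
```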
Next, label names can be normalized into natural-language form.
Example:
sports_team → sports team
film_actor → actor
This simplifies generative modeling.
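A minimal sketch of such a normalization step. Replacing underscores with spaces covers the `sports_team` case, while `film_actor → actor` is not a mechanical rewrite, so the sketch assumes a hand-written override table. `OVERRIDES` and `normalize_label` are illustrative names.

```python
# Hand-written exceptions (hypothetical); extend as needed.
OVERRIDES = {"film_actor": "actor"}


def normalize_label(label, overrides=OVERRIDES):
    """Map a dataset label to a natural-language phrase."""
    if label in overrides:
        return overrides[label]
    return label.replace("_", " ")


print(normalize_label("sports_team"))  # sports team
print(normalize_label("film_actor"))   # actor
```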
Hierarchical labels can also be split into their individual levels.
Example:
person/politician
becomes:
person
politician
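Splitting a label path into its levels can be sketched as below, assuming `/` is the only separator and that a leading slash (as in `/person/actor`) should be ignored. `flatten_label` is a hypothetical helper.

```python
def flatten_label(path):
    """Split a hierarchical label path into its individual levels."""
    return [part for part in path.split("/") if part]


print(flatten_label("person/politician"))  # ['person', 'politician']
print(flatten_label("/location/city"))     # ['location', 'city']
```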
Possible strategies for handling rare labels:
- Removing extremely rare labels
- Merging similar labels
- Applying few-shot learning techniques
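The first strategy, removing extremely rare labels, might look like the sketch below. It assumes examples are stored as `(mention, labels)` pairs and uses a frequency threshold `min_count`; both the data layout and the threshold value are assumptions for illustration, and the toy data is made up.

```python
from collections import Counter


def drop_rare_labels(examples, min_count=2):
    """Remove labels seen fewer than min_count times; drop examples left empty."""
    counts = Counter(label for _, labels in examples for label in labels)
    filtered = []
    for mention, labels in examples:
        kept = [label for label in labels if counts[label] >= min_count]
        if kept:
            filtered.append((mention, kept))
    return filtered


# Toy data (hypothetical), reusing entities mentioned above.
examples = [
    ("Barack Obama", ["person", "politician", "president"]),
    ("Elon Musk", ["person", "entrepreneur"]),
    ("Trump", ["person", "politician", "celebrity"]),
]
filtered = drop_rare_labels(examples, min_count=2)
for mention, labels in filtered:
    print(mention, labels)
```

The threshold trades coverage for stability: a higher `min_count` shrinks the label set but discards exactly the fine-grained types that make the task interesting.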
The three datasets differ significantly in their complexity:
| Dataset | Granularity | Labels | Difficulty |
|---|---|---|---|
| OntoNotes | Coarse | ~18 | Low |
| FIGER | Fine | 112 | Medium |
| Ultra-Fine | Very fine | Thousands | High |
As granularity increases, the challenges also grow:
- Multi-label classification
- Long-tail label distributions
- Context-dependent interpretation of entities
Status: Work in progress (this repo will evolve as experiments and structure solidify!).