This pipeline is designed to identify and categorize named entities in text using a custom NER model. It aims to extract entities such as people (PER), organizations (ORG), and locations (LOC).
- Load the Pre-trained Model
  - Import the required packages (e.g. spacy, pandas).
  - Load the input articles.
- Remove In-text Hyperlinks
  - Iterate through the sentences, then detect and delete all-capitalized sentences embedded in the main content.
- Enable Entity Linker
  - Connect to the Wikidata knowledge base, group the different names used for the same interest group, and write the primary name to a column called "Entity".
- Run the Model and Keep Results
  - Identify entities, list them in a column called "Pattern", and categorize them as Names, Organizations, or Locations.
  - Append each entity, together with its article ID, publication date, and article URL, to a new dataframe.
- Write Output to New File
  - Write the dataframe to a new CSV file as the result.
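The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the model name `en_core_web_md`, the helper names, and the metadata column names (`article_id`, `publication_date`, `url`) are assumptions, and the standard-library `csv` module stands in for pandas so the helpers run without extra dependencies. The linking step relies on the `entityLinker` component from `spacy_entity_linker` and assumes its knowledge base has already been downloaded.

```python
import csv
import re


def remove_all_caps_sentences(text):
    """Drop sentences written entirely in capitals (treated here as in-text link text)."""
    # Naive split after ., ! or ? followed by whitespace; an assumption,
    # not necessarily the segmenter used in the notebook.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # str.isupper() is True only when every cased character is uppercase.
    return " ".join(s for s in sentences if not s.isupper())


def build_nlp(model_name="en_core_web_md"):
    """Load a spaCy model and attach the spacy_entity_linker component."""
    import spacy  # lazy import so the other helpers run without spaCy installed

    nlp = spacy.load(model_name)  # assumption: the source does not name the model
    nlp.add_pipe("entityLinker", last=True)  # registered by spacy_entity_linker
    return nlp


def extract_rows(nlp, text, article_id, pub_date, url):
    """Return one dict per linked entity, carrying the article metadata along."""
    doc = nlp(remove_all_caps_sentences(text))
    return [
        {
            "Pattern": ent.get_span().text,  # surface form found in the article
            "Entity": ent.get_label(),       # primary Wikidata label
            "article_id": article_id,        # hypothetical column name
            "publication_date": pub_date,    # hypothetical column name
            "url": url,                      # hypothetical column name
        }
        for ent in doc._.linkedEntities
    ]


FIELDNAMES = ["Pattern", "Entity", "article_id", "publication_date", "url"]


def write_rows(rows, path):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

A typical run would call `build_nlp()` once, collect the output of `extract_rows(...)` for each article, and pass the combined list to `write_rows(...)`. Filtering the linked entities down to organizations is left out of this sketch.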
- spacy-entity-linker.ipynb: Contains the pipeline implementation, training, evaluation, and analysis workflows.
- emission_extracted_entities_org_linker.csv: The resulting output file.
- Clone the repository:

      git clone https://github.com/infoqualitylab/NER-Model
      cd <repository-folder>

- Install dependencies:

      pip install spacy
      pip install pandas
      pip install spacy_entity_linker

- Prepare your list of news articles, named "test_articles.csv", and your list of commenters of interest, named "patterns.csv".
- Ensure that the required datasets are available in the same local directory.
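Before running the notebook, it may help to confirm that both input files are in place. The check below is a small sketch; the helper name `missing_inputs` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Input files named in the setup steps.
REQUIRED_FILES = ("test_articles.csv", "patterns.csv")


def missing_inputs(directory="."):
    """Return the required input files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```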
The pipeline exhibited two main types of validation errors:
- Highlighted during annotation, but not recognized by the pipeline
- Recognized by the pipeline, but not highlighted during annotation
| Entity Type | Recognized by Pipeline | Annotated Manually |
|---|---|---|
| ORG | 345 | 309 |

| Recognized by Pipeline (rows) \ Annotated w/ Doccano (columns) | No | Yes |
|---|---|---|
| No | 133 (Not labeled with ORG) | 34 (Annotated Not Recognized) |
| Yes | 50 (Not Annotated Recognized) | 275 (Annotated Recognized) |
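
The counts in the table above can be summarized as precision and recall for the ORG label. The sketch below assumes the manual annotation is the ground truth, so "Annotated Recognized" (275) counts as true positives, "Not Annotated Recognized" (50) as false positives, and "Annotated Not Recognized" (34) as false negatives. These metrics are derived here for illustration and are not reported in the original results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)  # share of recognized entities that were annotated
    recall = tp / (tp + fn)     # share of annotated entities that were recognized
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the table cells, with manual annotation as the reference.
precision, recall, f1 = precision_recall_f1(tp=275, fp=50, fn=34)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts, precision comes out at about 0.85 and recall at about 0.89.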
- The pipeline identifies extra “organizations”. For example, “EV”, which stands for “Electric Vehicle”, was not highlighted during manual annotation (left), but the pipeline recognizes it (right).
- Not all interest groups are linked. For instance, “Alliance for Automotive Innovation” is not linked because it is missing from Wikidata, which spaCy checks for Entity Linking.
- Our pipeline only supports English. Adaptation for multiple languages would be possible using more spaCy features.
- Manual validation of outputs introduces potential for human error, impacting the reliability of descriptive statistics.
- In the future, we will use Media Bias/Fact Check’s ratings to identify news outlets often read by Republicans vs. Democrats so that we can compare which interest groups appear in each.
- We will look for names of individuals in news outlets using NER for Person. We could then match them with individuals who commented on emissions standards.
- Experiment with different architectures and embeddings.
- [Xiaoran Zhou]