This repo contains all the code used to build a knowledge graph from 2081 articles published by News24 on the topic of the Zondo Commission, as described in the project report submitted to the University of London. As per the agreement with Media24 (who owns the articles), a sample of 30 of the unlocked articles, i.e. those not behind the paywall, is supplied with the code in Parquet format (see source_data/sample_text_30_unlocked.pq) and can be read using the Pandas library.
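For example, the sample can be loaded as follows (the column names are not assumed here, so the snippet simply prints them):

```python
import pandas as pd

# Load the sample of 30 unlocked articles supplied with the repo
df = pd.read_parquet("source_data/sample_text_30_unlocked.pq")

# Inspect the shape and available columns before further processing
print(df.shape)
print(df.columns.tolist())
```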
Where very specific ideas found in blogs, Stack Overflow and other online resources have been used, they are acknowledged inline with the relevant code. I also made use of Toqan.ai to get 'starter code' for various low-level functions along the way; this code was then expanded upon or reworked before being incorporated into the code base.
- KG - knowledge graph
- HITL dataset - human-in-the-loop dataset (sample of 30 annotated articles used for evaluation)
- NER - named entity recognition
- CR - coreference resolution
- REX - relation extraction
- EL - entity linking
This folder contains all the required dataclasses, functions and methods that were used to build and test the knowledge graph. What follows is an overview of the role of each one:
Includes data classes for structuring key elements:
Information extraction:
- Article
- NamedEntity
- CrCluster (containing Mention)
- Relation
Building the KG:
- KGData (used to create an instance of a KG)
- KGEntity
- KGRelation
It also includes methods to export to Label Studio to facilitate creation of the HITL dataset.
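As an illustration of the pattern (the field names below are hypothetical, not the actual definitions; see the module itself for those):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NamedEntity:
    text: str   # surface form as it appears in the article
    label: str  # entity type, e.g. PERSON or ORG
    start: int  # character offset where the mention starts
    end: int    # character offset where the mention ends

@dataclass
class Article:
    article_id: str
    text: str
    entities: List[NamedEntity] = field(default_factory=list)
```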
Includes all the functions required to build and/or extend a KG, article by article, from an Article class instance. The main algorithm is update_kg_from_article, which calls the sub-algorithm process_entity; the latter performs EL where possible (up to a maximum of 5 retries).
It also includes the methods prepare_kg_neo4j_files and prepare_kg_nx_files, which export the KGData instance to the appropriate formats to be read in by neo4j and the NetworkX library respectively.
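A minimal sketch of the retry pattern around EL (the linker interface here is a placeholder; only the cap of 5 retries comes from the code base):

```python
import time

MAX_RETRIES = 5  # cap on EL attempts per entity

def link_entity_with_retries(entity, linker, max_retries=MAX_RETRIES):
    """Attempt entity linking, retrying on transient failures.

    `linker.link` stands in for whatever EL backend is configured;
    returns the linked identifier, or None if all attempts fail.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return linker.link(entity)  # hypothetical EL call
        except Exception:
            time.sleep(attempt)  # simple linear back-off between attempts
    return None
```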
Contains the model setup, the code used for the initial NER run (Round 1), evaluation options and the option to import from Label Studio.
Contains the model setup, the code used for the initial CR run (Round 1), evaluation options and the option to import from Label Studio.
Contains the model setup, the code used for the initial REX run (Round 1), evaluation options and the option to import from Label Studio.
Contains the enhancements to NER and CR developed for Round 2, described in 5.2 Round 2 – improving baseline model outputs.
Contains the enhancements to REX developed for Round 2, described in 5.2 Round 2 – improving baseline model outputs and 5.4 Round 4 – improving the KG.
General functions used across the project.
The file reference_info/rebel_flair_selected.csv contains the offline work that was done to arrive at the final ontology described in 4.3 Ontology requirements definition.
The following notebooks reflect the sequential stages of development and testing as described in the accompanying project report:
A. round1_initial_model_outputs.ipynb was used to obtain the baseline model outputs (and runtimes) for each model tested for the following tasks: NER, CR and REX.
B. round1_initial_model_evaluation.ipynb was used to evaluate the baseline model outputs against the HITL dataset.
C. round2_model_outputs.ipynb was used to include several measures designed to improve the results from Round 1.
D. round2_model_evaluation.ipynb was used to evaluate the measures from Round 2 against the HITL dataset of 30 articles.
E. round3_first_kg_build.ipynb was used to build the first knowledge graph using the 30 sample articles. Evaluation was done offline using the triples from the HITL dataset as the base.
F. round4_second_kg_build.ipynb was used to build the second knowledge graph using the 30 sample articles again. Evaluation was again done offline using the triples from the HITL dataset as the base.
G. end_to_end_on_sample_30_unlocked_article.ipynb puts it all together. This notebook can be run on the source_data/sample_text_30_unlocked.pq file supplied with the project, and was also used to compile the final knowledge graph on all 2081 articles. Key events and datapoints were output to kg_builder.log for tracking and troubleshooting.
H. load_data_to_neo4j.ipynb is a small notebook that loads the neo4j text files into neo4j to construct the final KG.
I. networkx_analysis.ipynb reads the final entities and relations from csv_data and does an EDA on the graph and its outputs (a sketch of this kind of analysis follows below).
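As a rough indication of what such an EDA looks like (the file and column names below are assumptions, not the actual csv_data layout):

```python
import networkx as nx
import pandas as pd

# Assumed file names and columns; adjust to the actual csv_data exports
entities = pd.read_csv("csv_data/entities.csv")
relations = pd.read_csv("csv_data/relations.csv")

# Build a directed graph from the exported entities and relations
G = nx.DiGraph()
for _, row in entities.iterrows():
    G.add_node(row["entity_id"], label=row["label"])
for _, row in relations.iterrows():
    G.add_edge(row["head_id"], row["tail_id"], relation=row["relation"])

# Basic structural summary of the KG
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
top10 = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:10]
print("Most central entities:", top10)
```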
Additional utility notebooks are included for reference as follows:
0. assemble_source_data.ipynb was used to extract the articles from their sources and prepare them for further processing (this includes the full set of articles as well as the sample of 30 used for evaluation).
1. compile_ontology_data.ipynb was used to update the ontology information when adjustments were made.
Finally, requirements.txt lists the versions of all the main libraries used on the project.