This is a prototype implementation of a concept for extracting dependencies from an ontology to improve existing error detection methods. In addition to the existing Denial Dependency and Matching Dependency, it introduces the following new dependency types:
- Device-Link Dependency
- Temporal Dependency
- Locality Dependency
- Monitoring Dependency
- Capability Dependency
Contains two datasets: Hospital and IoT:
- Hospital is a commonly used benchmark dataset from the US Health Service and already contains typos.
- IoT is self-collected via a Smart Home Application with four temperature sensors. To inject errors, use inject.py.
- Definition of dependencies as Python classes
- Definition of an ontology loader to parse from a file or a database
- Contains SPARQL Queries to extract dependencies from ontologies
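As an illustration of what a dependency expressed as a Python class could look like, here is a minimal sketch. The class names, fields, and the reading of a Locality Dependency as "sensors in the same location should report similar values" are assumptions made for this example and are not taken from the repository's code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Dependency:
    """Hypothetical base class: a named rule over one or more attributes."""
    name: str
    attributes: Tuple[str, ...]


@dataclass(frozen=True)
class LocalityDependency(Dependency):
    """Hypothetical rule: two co-located sensors should not differ too much."""
    max_difference: float = 2.0  # assumed tolerance, in the unit of the readings

    def is_violated(self, value_a: float, value_b: float) -> bool:
        # A violation is flagged when the readings differ by more than the tolerance.
        return abs(value_a - value_b) > self.max_difference


# Example usage with made-up readings from two sensors in the same room.
rule = LocalityDependency("living-room-temperature", ("temperature",), 2.0)
flagged = rule.is_violated(20.0, 25.0)
```

Keeping each dependency type in its own small class makes it easy to check a pair of readings in isolation before handing the rule to an error detection tool.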
Contains runnables to test data validation:
- Execution of HoloClean with Hospital dataset
- Execution of HoloClean with IoT dataset
- Execution of Raha with IoT dataset
- Injection of outliers in IoT dataset
- Execution of dBoost with outlier IoT dataset
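The idea behind injecting outliers can be sketched as follows. This is not the repository's inject.py. The function name, the fraction of corrupted values, and the offset are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def inject_outliers(
    values: Sequence[float],
    fraction: float = 0.05,
    offset: float = 30.0,
    seed: int = 0,
) -> Tuple[List[float], List[int]]:
    """Return a corrupted copy of `values` and the indices that were changed.

    Roughly `fraction` of the entries (at least one, if there are any) are
    shifted up or down by `offset`.
    """
    rng = random.Random(seed)  # fixed seed keeps the injection reproducible
    count = max(1, round(len(values) * fraction)) if values else 0
    indices = sorted(rng.sample(range(len(values)), count))
    corrupted = list(values)
    for i in indices:
        corrupted[i] += rng.choice((-offset, offset))
    return corrupted, indices


# Example usage with made-up temperature readings.
readings = [21.0] * 40
noisy, outlier_positions = inject_outliers(readings, fraction=0.1)
```

Returning the changed indices alongside the data gives a ground truth against which a detector such as dBoost can be scored.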
Ontologies are used to extract dependencies from the context of the data and to find relations. These dependencies are evaluated for use in the rest of the pipeline.
The concept is implemented to work with HoloClean and Raha. The HoloClean framework is enhanced with the extracted information to improve its error detection capabilities.
Hint: You need to build error-generator. Change every occurrence of "get_values()" to "values", since get_values() is deprecated in pandas but was not updated in that project.
Hint: If you are running Python 3.8 or above, you need to change all occurrences of time.clock() to time.time(), because time.clock() was removed in Python 3.8. This is a known issue of HoloClean.
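The replacements named in the hints can be applied by hand or with a small script. The following is a minimal sketch, assuming a plain textual search-and-replace over all .py files is sufficient. The function name patch_sources is hypothetical, and the paths in the usage note are placeholders for your local checkouts.

```python
from pathlib import Path
from typing import Dict, List

# Replacements taken from the hints above.
REPLACEMENTS: Dict[str, str] = {
    "get_values()": "values",
    "time.clock()": "time.time()",
}


def patch_sources(root: str, replacements: Dict[str, str] = REPLACEMENTS) -> List[Path]:
    """Rewrite every .py file below `root` in place and return the changed files."""
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        original = path.read_text(encoding="utf-8")
        patched = original
        for old, new in replacements.items():
            patched = patched.replace(old, new)
        if patched != original:  # only touch files that actually need a change
            path.write_text(patched, encoding="utf-8")
            changed.append(path)
    return changed
```

Call it once per checkout, for example `patch_sources("../holoclean")`, and review the returned list (or the diff in version control) before continuing.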
This guide should give you an idea of how to run this prototype. The following instructions show how to run the prototype with the Hospital dataset.
- Clone this repository
- Clone HoloClean to the same parent folder
- See Hints for HoloClean and change the code accordingly
- Set up HoloClean following the instructions in the HoloClean repo, but use Python 3.8 instead of the version given there
- Start Postgres DB
- To check whether the installation was successful, execute examples/holoclean_repair_example.py in HoloClean's folder
- Install the Python packages from the requirements
- To make sure the dependencies are extracted from the ontology (and not read from the cache), delete data/hospital-scenario/hospital-dependencies.txt
- Run validation/holoclean/holoclean_hospital.py
- Results are printed to the console
The execution with the IoT dataset works analogously.
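The last steps for the Hospital dataset (clearing the cached dependencies and starting the run) can be scripted. This is a minimal sketch that assumes the script is started from the root of this repository with the interpreter of the prepared environment. The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Paths as named in the instructions above, relative to the repository root.
CACHE_FILE = Path("data/hospital-scenario/hospital-dependencies.txt")
SCRIPT = Path("validation/holoclean/holoclean_hospital.py")


def clear_dependency_cache(repo_root: str) -> bool:
    """Delete the cached dependencies; return True if a cache file existed."""
    cache = Path(repo_root) / CACHE_FILE
    existed = cache.exists()
    cache.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return existed


def run_hospital_validation(repo_root: str) -> int:
    """Clear the cache, run the Hospital validation, and return its exit code."""
    clear_dependency_cache(repo_root)
    result = subprocess.run([sys.executable, str(SCRIPT)], cwd=repo_root, check=False)
    return result.returncode
```

Calling `run_hospital_validation(".")` from the repository root then forces a fresh extraction from the ontology before the results are printed to the console.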
- Python (Version 3.8.x)
- rdflib
- pyfuseki
- pandas