
Data Provenance


Data provenance refers to a metadata record documenting the origin of a data product and its transformations.

In DACCS, ultimately, we'd like to be able to automatically create provenance documents accompanying DataArray objects resulting from a climate science workflow, recording the source datasets, as well as the methods and functions applied to generate a final output.

This of course raises a host of questions regarding

  • the granularity of provenance information
  • the customization of provenance information to reflect climate science's semantics and the scientist's intent.

Existing provenance documents use directed acyclic graphs (DAGs) to store the entities, agents and activities involved in the production of a data set. Visualization tools can scale these graphs, hiding or showing levels of detail. Provenance records typically rely on the PROV-O ontology, a grammar of expressions for describing provenance relationships. Actual provenance files can be saved to disk using a number of different serializations: PROV-XML, PROV-N, JSON, JSON-LD, etc.

Existing implementations

provenance uses a decorator to record how artifacts are modified by functions.

import time

import provenance as p

p.load_config(...)

@p.provenance()
def expensive_add(a, b):
    time.sleep(2)
    return a + b
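
The decorated function is then called normally; here is a minimal usage sketch, assuming the return value behaves like the plain result while exposing the recorded artifact through an artifact attribute (an assumption based on the project's documentation, not verified here):

result = expensive_add(1, 2)   # runs the computation and records its provenance
print(result)                  # behaves like the plain return value, i.e. 3
print(result.artifact.id)      # assumption: identifier of the recorded artifact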

The last release was in December 2020 and there have been no commits since then.

By importing recipy at the beginning of a script, Python's import machinery is patched so that IO-related functions log information into a provenance database. For example, numpy.load can be wrapped so that it logs the fact that an input file was loaded. recipy only tracks inputs and outputs and (so far) treats the script itself as a black box. It does not use the PROV ontology.
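
A minimal sketch of what this looks like in practice, assuming recipy is installed (the single import is the only change required; the file names are made up):

import recipy  # must be the first import, so that IO functions get patched

import numpy as np

data = np.load("input.npy")      # recipy logs this file as an input
np.save("output.npy", data * 2)  # recipy logs this file as an output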

sumatra is a command line tool to store metadata about a script run:

$ smt configure --executable=python --main=main.py

There is also an API that can be used to integrate it into libraries directly. It does not use the PROV ontology.
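
Each run is then captured with smt run; a sketch, where the reason string and parameter file name are made up (smt run and smt list are documented sumatra commands):

$ smt run --reason "testing provenance capture" input.param
$ smt list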

AiiDA describes itself as a "Python infrastructure to help researchers with automating, managing, persisting, sharing and reproducing the complex workflows associated with modern computational science and all associated data."

One needs to create Node objects for the data and wrap functions using a decorator, so the integration is apparently not subtle.
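
A minimal sketch of this pattern using AiiDA's calcfunction decorator, assuming a profile has already been configured (the names and values are illustrative):

from aiida import load_profile, orm
from aiida.engine import calcfunction

load_profile()  # requires a previously configured AiiDA profile

@calcfunction
def add(x, y):
    # Inputs and outputs must be AiiDA Node subclasses so that the
    # call can be recorded in the provenance graph.
    return orm.Int(x.value + y.value)

result = add(orm.Int(1), orm.Int(2))  # the call, its inputs and output are stored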

With metaclipR, each function from the climate4R library has an analog function that builds the corresponding provenance graph. A script can thus call the climate4R function to create the data product, and then the matching metaclipR function to create the provenance document.

Ontologies

PROV-O is the W3C recommended standard ontology for provenance. The starting point is made of these core classes:

  • prov:Entity
  • prov:Activity
  • prov:Agent

and relations:

  • prov:wasGeneratedBy
  • prov:wasDerivedFrom
  • prov:wasAttributedTo
  • prov:startedAtTime
  • prov:used
  • prov:wasInformedBy
  • prov:endedAtTime
  • prov:wasAssociatedWith
  • prov:actedOnBehalfOf
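
As an illustration, here is a minimal sketch building a tiny graph from these classes and relations with the prov Python package (discussed under Experience so far below); the namespace and identifiers are made up:

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# The three core classes
dataset = doc.entity("ex:regridded_dataset")
regrid = doc.activity("ex:regrid")
scientist = doc.agent("ex:scientist")

# A few of the relations listed above
doc.wasGeneratedBy(dataset, regrid)
doc.wasAttributedTo(dataset, scientist)
doc.wasAssociatedWith(regrid, scientist)

print(doc.get_provn())  # PROV-N serialization
print(doc.serialize())  # JSON serialization (the default)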

ProvONE is defined as an extension of PROV, aiming to capture the most relevant information concerning scientific workflow computational processes, and providing extension points to accommodate the specificities of particular scientific workflow systems. https://www.semanticscholar.org/paper/ProvONE%3A-extending-PROV-to-support-the-DataONE-Cao-Jones/819971a7f4fd83d1f585d253f6d496ea7e763540

ProvONE+

https://link.springer.com/chapter/10.1007/978-3-030-62008-0_30

METAdata for CLImate Products (METACLIP) aims to encode the metadata required to ensure the traceability and reproducibility of any kind of climate product (data files, plots, maps...), which requires a comprehensive framework to track the operations undertaken through often complex data workflows.

Objectives

Although data provenance is not a new concept and is potentially very valuable for tracking bugs and ensuring the scientific integrity of data products, there are few popular implementations in the wild. One possible explanation is that it requires familiarity with the provenance ontology and graph manipulation libraries, and coding the provenance logic can be time-consuming. Also, the mechanism for embedding provenance recording into typical scientific workflows is not obvious. As mentioned, existing provenance-tracking software requires scientists to modify their scripts to fit into its framework (e.g. creating special objects, or applying decorators to functions).

Due to the special expertise and time required, it's unlikely that an individual scientist would invest time and energy recording provenance information from scratch. To become mainstream, provenance tracking has to be developed by specialists, and then turned on/off with a switch. That doesn't mean provenance should be hard-coded into library code, however, as different users of the same software are likely to assign different meanings to the same operations. My guess is that libraries should have hooks to which thematic provenance libraries can attach. For example, biologists and physicists using numpy could connect different provenance semantics to the same operations.

Roadmap

These are the very vague steps that I think are required to build a provenance tracking mechanism.

  • Create hooks within xarray that provide enough context information to build a provenance graph (function name, values of parameters).
  • Write generic provenance templates that apply to most statements, e.g. such function from such library was called at that time with such parameters.
  • Write custom provenance templates overriding the generic templates.
  • Build a template matching system, so that calling a function triggers the provenance hook, which fetches the most appropriate provenance template and fills it (see the sketch after this list).
  • Aggregate the provenance graphs from a workflow into a provenance document.
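
A minimal sketch of the hook-and-template idea; all names here are hypothetical (no such hook exists in xarray today), and a real implementation would emit PROV statements rather than plain strings:

import datetime
import functools

# Records accumulated during a workflow, to be aggregated into a document.
PROVENANCE_LOG = []

# Registry of provenance templates; thematic libraries would register
# custom entries that override the generic one for specific functions.
TEMPLATES = {
    "generic": "{func} from {module} was called at {time} with {params}",
}

def provenance_hook(func):
    """Record enough context (function name, parameter values) to fill a template."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        template = TEMPLATES.get(func.__name__, TEMPLATES["generic"])
        PROVENANCE_LOG.append(
            template.format(
                func=func.__name__,
                module=func.__module__,
                time=datetime.datetime.now().isoformat(),
                params={"args": args, "kwargs": kwargs},
            )
        )
        return func(*args, **kwargs)
    return wrapper

@provenance_hook
def regrid(data, resolution=1.0):
    """Stand-in for a real xarray operation."""
    return data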

Libraries to explore or manipulate graphs

Experience so far

Tried to track provenance in cf-xarray with prov and rdflib. prov offers a pythonic interface to PROV statements, but I'm struggling to understand how to weave the METACLIP ontology into it.

RDFLib more closely resembles igraph, the R graph library used in metaclipR, which makes the translation easier.
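
For instance, here is a minimal sketch of expressing PROV statements with RDFLib (the PROV namespace URI is standard; the example nodes are made up):

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)
g.add((EX.dataset, RDF.type, PROV.Entity))
g.add((EX.regrid, RDF.type, PROV.Activity))
g.add((EX.dataset, PROV.wasGeneratedBy, EX.regrid))

print(g.serialize(format="turtle"))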

NetworkX looks like a more generic framework for a larger variety of graphs.

See https://github.com/xarray-contrib/cf-xarray/issues/228 for discussions and PRs.
