Skip to content

What is the expected behaviour of Primary and Foreign Keys when referencing List Columns? #899

@roblinksdata

Description

@roblinksdata

It is possible in the CSVW standard to apply a separator to a column which allows multiple values per cell:

separator

An atomic property that MUST have a single string value that is the string used to separate items in the string value of the cell. If null (the default) or unspecified, the cell does not contain a list. Otherwise, application MUST split the string value of the cell on the specified separator and parse each of the resulting strings separately. The cell's value will then be a list.

https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#cell-separator

It is also possible to specify whether these values are ordered or unordered:

A boolean atomic property taking a single value which indicates whether a list that is the value of the cell is ordered (if true) or unordered (if false). The default is false. This property is irrelevant if the separator is null or undefined, but this is not an error.

https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#cell-ordered

This behaviour is quite expressive and useful, however it brings up some uncertainty as to how this is applied to primary and foreign keys when one of the columns is a list columns.

Below I'm going to touch on some of my uncertainty, and perhaps someone will be able to point me to a mailing list item, or previous github post which clarifies the expected/defined behaviour. I haven't managed to find anything so far discussing this topic, and I'm not aware of any test cases which clarify the expected behaviour.

Primary Keys

Single List Column

Should the primary key ensure that set of all individual values inside the columns are unique in the table? Or should it simply ensure that either the ordered or unordered aggregation of the list items is unique in the table?

Consider the following table where myId is an ordered list column with separator , and the primary key is defined as ['myId']:

myId
A,B
B,A
C

Which of the following should be considered unique?

  1. { A, B, B, A, C } (PK violated)
  2. { [A,B], [B,A], C } (PK valid)

Here I use the python-esque notation of curly braces { Z, ..., A } to represent an unordered set and square brackets [ A, ..., Z ] to represent an ordered list.

i.e. should the above table be considered valid or invalid according to the primary key constraint that I have described? If the myId column were to be marked as unordered then presumably the above table would be considered invalid regardless of approach.

References multiple columns, at least one of which is a List Column

I think the question becomes a little more complicated when you extend a primary key to represent two columns, one of which is a List column.

Consider the following table where Categories is defined as an unordered list column with separator , and a primary key defined as ['location', 'categories']:

location categories
One A,B
One B,C
Two A,B,C

Should the primary key enforce that every individual combination of values should be unique w.r.t. the table, or should it consider the aggregated value, i.e. which of the following should be unique?

  1. { [One, {A, B}], [One, {B, C}], [Two, {A, B, C}]} (PK valid - but I'm not sure of a scenario where this would be useful)
  2. { [One, A], [One, B], [One, B], [One, C], [Two A], [Two, B], [Two, C] } (PK voilated - but it feels odd to have multiple PK values per row)

Again, I'm unsure about whether the primary key definition I have described should render the above table valid or invalid. The open question about ordering is still pertinant here.

Foreign Keys

Now to the point that actually brings me here, I'd like to apply a foreign key constraint to a list column:

Given the following parent table (parent.csv) with a primary key defined as ['countryCode']:

countryCode name
UK United Kingdom
IE Republic of Ireland
FR France
DE Germany
CA Canada

I would like to define a child table (child.csv) with a list column countries referencing parent.csv#countryCode:

group countries description
G7 UK,FR,DE,CA,... The Group of Seven (G7) is an intergovernmental political and economic forum
EU FR,DE,IE,... The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe.

In this way I could ensure that the countries defined in child.csv#countries are defined in a way that is much more user-friendly than the longer form approach better represented across two tables as in the more traditional approach. This approach felt like the natural one which should work.

A further question presents itself when dealing with multiple columns in a foreign key. If my desired approach here were valid and correct then it stands to reason that a foreign key referencing multiple columns where at least one of them was a list column would have to ensure that all possible combinations of values for a given row were unique; and I'm not sure that I can think of a scenario where this would be truly predictable or useful.

N.B. I'm not even going to think about the situation in which one of the parent table's candidate key columns is a list column.

The more traditional approach

This may be obvious, feel free to skip it. It is more verbose than desired, but it is fairly obvious how to apply foreign keys to achieve the desired result here.

group description
G7 The Group of Seven (G7) is an intergovernmental political and economic forum
EU The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe.
group country
G7 UK
G7 FR
G7 DE
G7 CA
G7 ...
EU FR
EU DE
EU IE
EU ...

AboutUrls

One of the ways that I have tried to interpret this is to make an analogy between a primary key for a row and the aboutUrl template that could be generated for one as an identifier. Given the strong relationship between the CSV on the web standard and RDF, I would argue that it only really makes sense to enforce primary key constraints for identifiers which are serialisable into RDF IRIs.

Since there is no way to create IRIs mapping to individual items within the list column I think that the most consistent (overall) approach would be to say that since we can only convert a list column's values into a single aggregated URI, the primary key should also consider the values in an aggregated way.

Now on to the question of ordering (which would not be relevant if we considered each combination of values per row to be unique):

Noting that the values in a list column are serialised according to rfc6570, I can define aboutUrls in a number of ways, including the following applied to the single list column example:

https://example.com/{?myId} https://example.com/{/myId}
https://example.com/?myId=A&myId=B https://example.com/A/B
https://example.com/?myId=B&myId=A https://example.com/B/A
https://example.com/?myId=C https://example.com/C

To my mind you could consider the query string URI approach to represent either an ordered or an unordered list, but it's questionable whether that ambiguity is widely accepted (see here). I certainly find nothing in rfc6570 which specifies a way to serialise the list values in a way which I would consider truly unordered; and at the end of the day it's going to be a string representation so fundamentally it's always going to have some order.

So I wonder whether it make senses for primary/foreign keys to ignore any ordered property set against the column definition when considering uniqueness, however I understand this might be very counterintuitive to the user who configures the CSV-W.

To effectively contradict everything I've argued for in this section, it is still possible to serialise these values into RDF both preserving or ignoring the order, so I don't feel like I can be decisive on any of this.

What do I want

  1. Does anyone have any recollection of similar issues being discussed previously, if so do you know where they might be available?
  2. Does anyone have some other way of interpreting things that I'm not seeing?
  3. What should the behaviour be in these situations?
    • Should list columns just be excluded from primary/foreign keys?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions