What is the expected behaviour of Primary and Foreign Keys when referencing List Columns?

It is possible in the CSVW standard to apply a separator to a column which allows multiple values per cell:

> separator
>
>    An atomic property that MUST have a single string value that is the string used to separate items in the string value of the cell. If null (the default) or unspecified, the cell does not contain a list. Otherwise, application MUST split the string value of the cell on the specified separator and parse each of the resulting strings separately. __The cell's value will then be a list__.
> 
> <https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#cell-separator>

It is also possible to specify whether these values are ordered or unordered:

> A boolean atomic property taking a single value which indicates whether a list that is the value of the cell is ordered (if true) or unordered (if false). The default is false. This property is irrelevant if the separator is null or undefined, but this is not an error. 
>
> <https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#cell-ordered>

This behaviour is quite expressive and useful, however it brings up some uncertainty as to how this is applied to primary and foreign keys when one of the columns is a list columns.

Below I'm going to touch on some of my uncertainty, and perhaps someone will be able to point me to a mailing list item, or previous github post which clarifies the expected/defined behaviour. I haven't managed to find anything so far discussing this topic, and I'm not aware of any [test cases](https://w3c.github.io/csvw/tests/#manifest-validation) which clarify the expected behaviour.

## Primary Keys

### Single List Column

Should the primary key ensure that set of all individual values inside the columns are unique in the table? Or should it simply ensure that either the ordered or unordered aggregation of the list items is unique in the table?

Consider the following table where `myId` is an __ordered__ list column with separator `,` and the primary key is defined as `['myId']`:

| myId |
|:-----|
| A,B  |
| B,A  |
| C    |

Which of the following should be considered unique?

1. `{ A, B, B, A, C }` (PK violated)
2. `{ [A,B], [B,A], C }` (PK valid)

Here I use the python-esque notation of curly braces `{ Z, ..., A }` to represent an unordered set and square brackets `[ A, ..., Z ]` to represent an ordered list.

i.e. should the above table be considered valid or invalid according to the primary key constraint that I have described? If the `myId` column were to be marked as __unordered__ then presumably the above table would be considered invalid regardless of approach.

### References multiple columns, at least one of which is a List Column

I think the question becomes a little more complicated when you extend a primary key to represent two columns, one of which is a List column. 

Consider the following table where `Categories` is defined as an __unordered__ list column with separator `,` and a primary key defined as `['location', 'categories']`:

| location | categories |
|:---------|:-----------|
| One      | A,B        |
| One      | B,C        |
| Two      | A,B,C      |

Should the primary key enforce that every individual combination of values should be unique w.r.t. the table, or should it consider the aggregated value, i.e. which of the following should be unique?

1. `{ [One, {A, B}], [One, {B, C}], [Two, {A, B, C}]}` (PK valid - but I'm not sure of a scenario where this would be useful)
2. `{ [One, A], [One, B], [One, B], [One, C], [Two A], [Two, B], [Two, C] }` (PK voilated - but it feels odd to have multiple PK values per row)

Again, I'm unsure about whether the primary key definition I have described should render the above table valid or invalid. The open question about ordering is still pertinant here.

## Foreign Keys

Now to the point that actually brings me here, I'd like to apply a foreign key constraint to a list column:

Given the following parent table (`parent.csv`) with a primary key defined as `['countryCode']`: 

| countryCode | name                |
|:------------|:--------------------|
| UK          | United Kingdom      |
| IE          | Republic of Ireland |
| FR          | France              |
| DE          | Germany             |
| CA          | Canada              |

I would like to define a child table (`child.csv`) with a list column `countries` referencing `parent.csv#countryCode`:

| group | countries       | description                                                                                                                       |
|:------|:----------------|:----------------------------------------------------------------------------------------------------------------------------------|
| G7    | UK,FR,DE,CA,... | The Group of Seven (G7) is an intergovernmental political and economic forum                                                      |
| EU    | FR,DE,IE,...    | The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe. |

In this way I could ensure that the countries defined in `child.csv#countries` are defined in a way that is much more user-friendly than the longer form approach better represented across two tables as in [the more traditional approach](#the-more-traditional-approach). This approach felt like the natural one which should work.

A further question presents itself when dealing with multiple columns in a foreign key. If my desired approach here were valid and correct then it stands to reason that a foreign key referencing multiple columns where at least one of them was a list column would have to ensure that all possible combinations of values for a given row were unique; and I'm not sure that I can think of a scenario where this would be truly predictable or useful.

__N.B. I'm not even going to think about the situation in which one of the parent table's candidate key columns is a list column.__

### The more traditional approach

This may be obvious, feel free to skip it. It is more verbose than desired, but it is fairly obvious how to apply foreign keys to achieve the desired result here.

| group | description                                                                                                                       |
|:------|:----------------------------------------------------------------------------------------------------------------------------------|
| G7    | The Group of Seven (G7) is an intergovernmental political and economic forum                                                      |
| EU    | The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe. |

| group | country |
|:------|:--------|
| G7    | UK      |
| G7    | FR      |
| G7    | DE      |
| G7    | CA      |
| G7    | ...     |
| EU    | FR      |
| EU    | DE      |
| EU    | IE      |
| EU    | ...     |


## AboutUrls

One of the ways that I have tried to interpret this is to make an analogy between a primary key for a row and the `aboutUrl` template that could be generated for one as an identifier. Given the strong relationship between the CSV on the web standard and RDF, I would argue that it only really makes sense to enforce primary key constraints for identifiers which are serialisable into RDF IRIs. 

Since [there is no way to create IRIs mapping to individual items within the list column](https://lists.w3.org/Archives/Public/public-csvw/2016Aug/0001.html) I think that the most consistent (overall) approach would be to say that since we can only convert a list column's values into a single aggregated URI, the primary key should also consider the values in an aggregated way. 

Now on to the question of ordering (which would not be relevant if we considered each combination of values per row to be unique):

Noting that the values in a list column are serialised according to [rfc6570](https://datatracker.ietf.org/doc/html/rfc6570), I can define `aboutUrl`s in a number of ways, including the following applied to the [single list column](#single-list-column) example:

| `https://example.com/{?myId}`        | `https://example.com/{/myId}` |
|:-------------------------------------|:------------------------------|
| <https://example.com/?myId=A&myId=B> | <https://example.com/A/B>     |
| <https://example.com/?myId=B&myId=A> | <https://example.com/B/A>     |
| <https://example.com/?myId=C>        | <https://example.com/C>       |

To my mind you could consider the query string URI approach to represent either an __ordered__ or an __unordered__ list, but it's questionable whether that ambiguity is widely accepted ([see here](https://stackoverflow.com/a/70264621)). I certainly find nothing in [rfc6570](https://datatracker.ietf.org/doc/html/rfc6570) which specifies a way to serialise the list values in a way which I would consider truly unordered; and at the end of the day it's going to be a string representation so fundamentally it's always going to have some order. 

So I wonder whether it make senses for primary/foreign keys to ignore any `ordered` property set against the column definition when considering uniqueness, however I understand this might be very counterintuitive to the user who configures the CSV-W. 

To effectively contradict everything I've argued for in this section, it is still possible to serialise these values into RDF both preserving or ignoring the order, so I don't feel like I can be decisive on any of this.

## What do I want

1. Does anyone have any recollection of similar issues being discussed previously, if so do you know where they might be available? 
2. Does anyone have some other way of interpreting things that I'm not seeing?
3. What should the behaviour be in these situations?
    - Should list columns just be excluded from primary/foreign keys?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What is the expected behaviour of Primary and Foreign Keys when referencing List Columns? #899

Primary Keys

Single List Column

References multiple columns, at least one of which is a List Column

Foreign Keys

The more traditional approach

AboutUrls

What do I want

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

countryCode	name
UK	United Kingdom
IE	Republic of Ireland
FR	France
DE	Germany
CA	Canada

group	countries	description
G7	UK,FR,DE,CA,...	The Group of Seven (G7) is an intergovernmental political and economic forum
EU	FR,DE,IE,...	The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe.

`https://example.com/{?myId}`	`https://example.com/{/myId}`
https://example.com/?myId=A&myId=B	https://example.com/A/B
https://example.com/?myId=B&myId=A	https://example.com/B/A
https://example.com/?myId=C	https://example.com/C

What is the expected behaviour of Primary and Foreign Keys when referencing List Columns? #899

Description

Primary Keys

Single List Column

References multiple columns, at least one of which is a List Column

Foreign Keys

The more traditional approach

AboutUrls

What do I want

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions