
Conversation

@lkitching
Contributor

Add implementations of the csv2rdf RowSource protocol which allow
transformed versions of pipeline input files to be passed directly.

The RowSource protocol represents tabular resources as a logical
sequence of records, each containing the source row number and parsed
data cells. Pipelines previously wrote transformed versions of the
input files to disk so they could be passed to the CSVW process.

Add implementations of the RowSource protocol which perform the
transformation in memory and present the transformed row records
directly to the CSVW process.
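The idea can be sketched in Python (the real protocol lives in csv2rdf and is Clojure; all names here — `Row`, `CsvRowSource`, `TransformedRowSource` — are illustrative, not the actual API):

```python
import csv
import io
from typing import Callable, Iterator, List, NamedTuple


class Row(NamedTuple):
    source_row_number: int  # 1-based row number in the source file
    cells: List[str]        # parsed data cells for that row


class CsvRowSource:
    """Parses CSV text into a sequence of Row records."""

    def __init__(self, text: str):
        self.text = text

    def rows(self) -> Iterator[Row]:
        for n, cells in enumerate(csv.reader(io.StringIO(self.text)), start=1):
            yield Row(n, cells)


class TransformedRowSource:
    """Wraps another row source and applies a transformation to each
    row's cells in memory, instead of writing a transformed copy of
    the input file to disk first."""

    def __init__(self, inner, transform: Callable[[List[str]], List[str]]):
        self.inner = inner
        self.transform = transform

    def rows(self) -> Iterator[Row]:
        for row in self.inner.rows():
            yield Row(row.source_row_number, self.transform(row.cells))


# Consumer sees transformed rows directly; no intermediate file is written.
source = TransformedRowSource(CsvRowSource("a,b\n1,2\n"),
                              lambda cells: [c.upper() for c in cells])
for row in source.rows():
    print(row)
```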

The number of component specifications derived within cube-pipeline
is expected to be quite small. Load these into memory and add
a RowSource implementation which returns the corresponding tabular
rows to csv2rdf.
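For the small set of derived component specifications, the corresponding implementation can simply hold the rows in memory; a hypothetical sketch (again illustrative Python, not the csv2rdf Clojure API):

```python
from typing import List, Tuple


class InMemoryRowSource:
    """Serves a pre-built list of cell rows, numbering them from 1.
    Suitable when the full set of rows is small enough to hold in memory,
    e.g. derived component specifications."""

    def __init__(self, cell_rows: List[List[str]]):
        self._cell_rows = cell_rows

    def rows(self) -> List[Tuple[int, List[str]]]:
        return [(n, cells) for n, cells in enumerate(self._cell_rows, start=1)]


specs = InMemoryRowSource([["component", "property"],
                           ["dimension", "http://example.org/dim"]])
print(specs.rows())
```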

Update the tests which check the format of the intermediate
transformed data to use the transformed row sources.

@Robsteranium
Contributor

I guess this complements #27? Whereas that deals with the metadata, this deals with the tables themselves?

Sorry this hasn't been reviewed sooner @lkitching .

Is it still valid? I guess we'll need to update it to resolve merge conflicts. I wonder if there's any interaction now with #120?

Ideally we'll reach the point where we can run as either a) csv->csvw or b) csv->rdf, to support interop/scrutability and overall efficiency respectively.

@lkitching
Contributor Author

@Robsteranium - I'm not sure we want to use this any more. We don't use #27 any more either within #120, since we always write the CSVW to disk. We could resurrect this approach in future since the infrastructure still exists within csv2rdf, but it's probably more effort than it's worth for now.

@Robsteranium
Contributor

Robsteranium commented Apr 21, 2020

There was a reason for doing this though wasn't there... Was it an OOME or that it was faster without I/O? I can't remember!

Let's leave the PR and branch open in case we want to reintroduce it later.
