Description
This is a catch-all issue to remind everyone that we want to collect 'relevant' datasets (whatever that means -- there is really no perfect way to determine that). These would be datasets we imagine could be used as inputs for one or several of our workflows.
In addition to the datasets themselves, we want to collect as much metadata about them as possible (origin/authors, description, schema, how they were produced, whether and how they were already pre-processed, etc.).
Also, it would be great to describe each dataset's imperfections, what would ideally be needed to make it a 'perfect' input to a workflow, and what it would look like once that was done (meaning no further pre-processing steps would be needed and it could be reliably used as an input without worrying about data quality).
Task list
- determine how/where to store those datasets
- create a minimal schema of information we want to collect for each dataset (see the sketch after this list)
- create a schema for optional metadata we would like to collect for each dataset, if easily possible
- collect datasets (open-ended)
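
As a starting point for the two schema tasks above, here is a rough sketch of what the required and optional fields mentioned in the description could look like, written as Python dataclasses. All field names and types here are assumptions meant to kick off the discussion, not decisions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the minimal (required) metadata per dataset.
# Field names and types are assumptions and open for discussion.
@dataclass
class DatasetRecord:
    name: str               # short identifier for the dataset
    origin: str             # where the data comes from (URL, institution, ...)
    authors: list[str]      # people/organisations that produced the data
    description: str        # free-text summary of the content
    schema: str             # description of (or link to) the data schema
    storage_location: str   # where we keep our copy (see first task above)

# Optional metadata, collected only if easily available.
@dataclass
class DatasetExtras:
    production_method: Optional[str] = None  # how the dataset was produced
    preprocessing: Optional[str] = None      # pre-processing already applied, if any
    known_imperfections: list[str] = field(default_factory=list)  # issues blocking use as a 'perfect' workflow input
    ideal_state: Optional[str] = None        # what the dataset would look like once those issues were fixed
```

Whether this ends up as Python classes, a YAML/JSON template, or columns in a shared spreadsheet is exactly what the first two task items should decide; the sketch only lists the fields we already mentioned in the description.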