[Catalog]: DATASET & Determine how to store catalog - externally versioned repo? #746

@edwardchalstrey1

Description

Initial discussion

  • Games catalog could live outside gambit repo in a separate repo that generates the dataset
  • Should have its own versioning
  • Could be a submodule in Gambit
  • Put data releases on Zenodo (with a DOI), Harvard Dataverse, or HuggingFace
  • Data should be output in Croissant format, or be convertible to it via an external utility script
  • The Gambit catalog module could pull games from a pinned catalog version, e.g. v1.1.x, with the smallest (patch) version reserved for fixes such as typos
  • The catalog might not ship with Gambit at all: a download function fetches and caches it, and that cached copy is versioned
  • Perhaps this is an issue for Gambit 17; that version would use published datasets
  • Ultimately all games will be files and not code

NeurIPS Evaluation and Benchmarks

NeurIPS Evaluation and Benchmarks might have a specific platform to use:

...authors should clearly explain what claims the dataset is intended to support (e.g., improved model performance, fairness, robustness, safety, or other model characteristics), under what assumptions those claims are valid, and what limitations constrain them.

...data-centric and benchmarking submissions historically welcomed by the track remain fully in scope. These include, but are not limited to: new datasets and dataset collections...

We strongly encourage all authors to release code whenever feasible to promote transparency and reproducibility. However, code release is required at submission when the primary contribution is a reusable executable artifact, such as a benchmark suite, evaluation environment, data generator, or software tool, whose functionality must be inspected in order to evaluate the scientific claims.

Ed's thoughts on where/how to host the catalog

  • I think we should use HuggingFace to maximise visibility to the ML community; no reason we can't also put it on Harvard Dataverse for academic visibility
  • HuggingFace will be a good place from which to consume the data, if the Gambit catalog module pulls from an external repo: https://huggingface.co/datasets
  • Hosting: HuggingFace has 928K datasets, Harvard has 295K (and 8K "dataverses")
  • Seems pretty easy to use the croissant package to generate metadata: https://github.com/mlcommons/croissant although there is a specific way they want us to generate it:
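For a sense of what generating the metadata involves: a Croissant record is just JSON-LD, so a minimal one can be sketched with the stdlib alone. Every concrete value below (dataset name, URLs, license) is a placeholder for illustration, not a decision, and the property set is only a skeleton of the full Croissant vocabulary.

```python
import json

# A minimal sketch of a Croissant (JSON-LD) record for the catalog.
# All names/URLs here are placeholders, not decided values.
metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "gambit-games-catalog",  # hypothetical dataset id
    "description": "Catalog of games in EFG/NFG format.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "FileObject",
            "@id": "catalog.zip",
            "contentUrl": "https://example.org/catalog.zip",  # placeholder
            "encodingFormat": "application/zip",
        }
    ],
}

croissant_json = json.dumps(metadata, indent=2)
```

In practice the mlcommons `croissant` tooling (or the hosting platform, see below) would generate this for us; the point is only that the format is plain JSON-LD we can post-process.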

The dataset hosting process as part of submitting to the Evaluations and Datasets Track involves:

  1. Choosing among 4 options to host your dataset: Harvard Dataverse, Kaggle, Hugging Face, and OpenML
  2. Using platform tooling to download the automatically generated Croissant file
  3. Completing the Croissant file with Responsible AI (RAI) metadata; they aim to provide additional tooling for this
  4. Including a URL to your dataset and uploading the generated Croissant file in OpenReview
  5. If your submission is accepted: making your dataset public by the camera ready deadline
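Steps 2-3 above could be scripted roughly as follows. `add_rai_metadata` is a hypothetical helper, and the `rai:`-prefixed property names are my reading of the Croissant RAI extension, so they should be checked against the current spec before submission.

```python
def add_rai_metadata(croissant, data_collection, data_biases):
    """Return a copy of an auto-generated Croissant record with
    Responsible AI (RAI) fields filled in. Property names are assumed
    from the Croissant RAI extension, not verified against the spec."""
    filled = dict(croissant)
    filled["rai:dataCollection"] = data_collection
    filled["rai:dataBiases"] = data_biases
    return filled

# Stand-in for the Croissant file downloaded from the hosting platform:
filled = add_rai_metadata(
    {"@type": "Dataset", "name": "gambit-games-catalog"},
    "Games transcribed from the game theory literature.",
    "Coverage skews toward small textbook games.",
)
```

The filled record would then be uploaded alongside the dataset URL in OpenReview (step 4).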

Questions

  • What is the format of our data? Just the set of EFG/NFG files?
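If the answer is "yes, just EFG/NFG files", then at least the payoff-list flavour of Gambit's .nfg format is trivially machine-readable. A stdlib-only parser sketch (it ignores comments and the outcome-list flavour, and assumes a well-formed file):

```python
import math
import shlex

def parse_nfg(text):
    """Parse the payoff-list flavour of Gambit's .nfg format.

    Sketch only. Payoffs are listed with the first player's strategy
    varying fastest, as in Gambit's documentation.
    """
    tokens = shlex.split(text)  # shlex handles the quoted names
    if tokens[:2] != ["NFG", "1"]:
        raise ValueError("not an NFG version 1 file")
    i = tokens.index("{")                     # player-name block
    j = tokens.index("}", i)
    players = tokens[i + 1 : j]
    k = tokens.index("{", j)                  # strategy-count block
    m = tokens.index("}", k)
    shape = [int(t) for t in tokens[k + 1 : m]]
    payoffs = [float(t) for t in tokens[m + 1 :]]
    if len(payoffs) != len(players) * math.prod(shape):
        raise ValueError("wrong number of payoffs")
    return players, shape, payoffs

sample = """NFG 1 R "Prisoner's Dilemma" { "Alice" "Bob" } { 2 2 }

2 2 3 0 0 3 1 1
"""
players, shape, payoffs = parse_nfg(sample)
```

Being plain text with a fixed grammar also makes the files easy to describe in a Croissant `recordSet`, or to convert with a utility script.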

Metadata

Labels

catalog (Issues that relate to Gambit's Catalog of games), documentation
