
Should we import datasets as CAR files? #23

@hannahhoward

Related to #18, I noticed that when we write tests that transfer a dataset (such as a folder or file) rather than randomly generated bytes, we import the data into UnixFS from the file system (in normal system format). If our intent is to truly test some of the datasets on https://awesome.ipfs.io/datasets/ as they exist on IPFS, the reliable way to bring them in is to export them as CAR files and import those into the tests. The reason is that a UnixFS import from system files is not guaranteed to produce the exact same DAG or root CID. Several variables affect how the DAG is built, such as chunking strategy, use of raw leaves, etc. The only reliable way to know you have the exact same DAG is to use CAR files.

This might also make sense in terms of writing scripts to download datasets. As long as IPFS is, unfortunately, not as fast as HTTP on a fast hosted site, it's going to be much more efficient to import into the seed's blockstore from a CAR file on a CDN. Plus, that means we don't have to download ahead of time: we can probably just include it as part of the test, which makes things more reproducible on CI.
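To make the point about non-deterministic imports concrete, here is a rough toy sketch in Python. It is not real UnixFS and does not compute real CIDs. It only assumes a simplified two-level Merkle structure (SHA-256 over fixed-size chunks, then SHA-256 over the concatenated leaf hashes). The function name `toy_dag_root` and the chunk sizes are made up for illustration. It shows the same bytes yielding different roots once the chunk size changes.

```python
import hashlib


def toy_dag_root(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a DAG import (NOT real UnixFS).

    Splits the data into fixed-size chunks, hashes each chunk as a leaf,
    then hashes the concatenated leaf digests to get a single root.
    """
    leaves = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


# 1 MiB of deterministic bytes standing in for a dataset file.
data = bytes(range(256)) * 4096

root_256k = toy_dag_root(data, 256 * 1024)   # four leaves
root_1m = toy_dag_root(data, 1024 * 1024)    # one leaf

# Same bytes, same settings: the root is reproducible.
print(root_256k == toy_dag_root(data, 256 * 1024))  # True
# Same bytes, different chunking: the root changes.
print(root_256k == root_1m)                          # False
```

A CAR file sidesteps this because it carries the already-built blocks and their root, so nothing has to be re-chunked on import.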

Anyway, curious to get your thoughts @adlrocha -- also does this make sense? It may not be obvious if you haven't worked a lot with UnixFS files and DAG structures.
