Skip to content

new index proposal (car v3) #643

@gammazero

Description

@gammazero

New Index - CAR v3 Proposal

Goals

  1. Better support for random access and streaming - know what to expect in stream before consuming entire stream.
  2. Support reading specific blocks from CAR files using HTTP range requests
  3. Know what part of DAG a CAR is part of, and prove that it is that part of the DAG.
  4. More reliable streaming (truncation detection, including on block boundary).
  5. Standardize CAR data representation for interoperability and optional determinism

Solution

Address problems by introducing a new standard, CARv3. Creating a new CAR version, instead of adding capabilities to V2, will make it easier to signal support for new features and detect support across the ecosystem.

Main characteristics:

  1. Allow CAR index at start of stream
    1. Or always have it retrievable separately from CAR data
    2. CAR header should contain index length in addition to offset, unless CAR format changes to have index at fixed offset.
  2. CAR index should contain block offsets and lengths to support range requests. This allows someone to receive only the v3 header, and then request specific blocks in the CAR file.
  3. CAR index optionally provides additional DAG path information
    1. Optional DAG Parent
    2. Optional DAG Children
  4. CAR index should contain its own length header to detect truncation when streamed separately from CAR file. This will allow detection of truncation at a block boundary, which is not currently possible with V1.
  5. Support creation of byte-for-byte deterministic CARs. At very least CAR should self-describe order and duplicate block behavior.
  6. Updated specification and working implementation in go-car project.

Should a CAR file always have an index, or should it be optional?

When streaming a subset of a CAR file, an index can be generated for the stream.

Index Ordering and Memory Optimization (Eytzinger)

One reason that the index is loaded into memory is so that it can be binary searched efficiently. A binary search can be performed without needing to store the index in memory, and only needing to read farther into the data stream, if the index is sorted into Eytzinger order.

Eytzinger orders the index as a binomial heap and a binary search only needs to read farther into the stream as the search proceeds. Previous parts of the index do not need to be retained, as would be necessary for a traditional binary search. This minimizes streaming/reading the index, because the reader can skip forward as the heap-ordered search proceeds. Prefetching can optimize this process even more. See the following for an explanation of Eytzinger binary search:

Alternative solutions

Consider using the current CARv2 specification, but with a new index type. Defining alternate indexes is already allowed. The new index could solve the truncation detection problem in v2 by including a size field at the start of the index.

The availability of the index, without needing to consume the CAR data payload, could be addressed by having the index available separately from its associated CAR data.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions