Evolving the dataset formats#26

Merged
jungerm2 merged 41 commits into main from datasets
Feb 23, 2026

Conversation

@jungerm2 (Member) commented Feb 5, 2026

Dataset format changes:
Metadata from the simulation is no longer saved or returned as a transforms.json file. Instead, metadata is written directly to a SQLite database when ground truths are rendered. This new format, composed of 3 tables defined in code in simulate/schema.py, allows many workers to write metadata to shared databases without contention. These databases are defined per data type: one for depths, another for normals, etc. This disentangles the datasets, with each having its own independent metadata.

Previously, ground-truth intensity frames and composited frames were conflated: if a Blender scene had an existing compositor tree set up, for example to add glare or change the image contrast, its output was saved as the ground-truth frames instead of the unadulterated render. Composited output now corresponds to a new "composites" output type, which comes with its own include_composites() method.
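Since the schema itself isn't shown here, the following is only a minimal sketch of how contention-free multi-worker writes to a shared SQLite database can work (WAL journaling plus short transactions); the table name and columns below are placeholders, not the real tables from simulate/schema.py:

```python
import os
import sqlite3
import tempfile

# Placeholder database; the real one is the per-data-type transforms.db.
db_path = os.path.join(tempfile.mkdtemp(), "transforms.db")
con = sqlite3.connect(db_path, timeout=30)

# WAL mode lets readers and a writer proceed concurrently, so many render
# workers can append metadata rows with minimal lock contention.
con.execute("PRAGMA journal_mode=WAL")
con.execute(
    "CREATE TABLE IF NOT EXISTS frames "
    "(id INTEGER PRIMARY KEY, path TEXT, pose TEXT)"
)

# `with con:` wraps the insert in a transaction, keeping each write atomic.
with con:
    con.execute(
        "INSERT INTO frames (path, pose) VALUES (?, ?)",
        ("0000/000.png", "[...]"),
    )

rows = con.execute("SELECT path FROM frames").fetchall()
print(rows)  # [('0000/000.png',)]
con.close()
```

The `timeout` argument makes a worker that hits a momentary write lock retry for up to 30 seconds instead of failing immediately.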

The directory structure has also changed, all data is now saved in numbered subfolders, with previews (previously dubbed "debug" views) saved in a separate folder (only shown for composites here for brevity):

└── <SCENE-NAME>
    ├── composites
    │   ├── 0000
    │   │   ├── 000.png
    │   │   ├── 001.png
    │   │   └── ...
    │   ├── 0001/
    │   ├── ...
    │   └── transforms.db
    ├── depths/
    ├── flows/
    ├── frames/
    ├── normals/
    ├── segmentations/
    └── previews
        ├── depths/
        ├── flows/
        │   └── forward/
        ├── normals/
        └── segmentations/

A Metadata class and its related classes (Camera, Data, Frame) have been added in dataset/models.py as the main way to interact with the metadata files, both the .db and .json variants. These classes are Pydantic models that mirror the data schema used by the simulation code, provide data validation, extend and supersede the previous Nerfstudio-esque JSON schema, and offer utilities to convert between the formats. Users should use these classes instead of interacting with the schema variants directly.
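As a rough illustration of the Pydantic approach (the field names here are hypothetical; the real definitions live in dataset/models.py), validation happens at construction time and the models round-trip cleanly through JSON:

```python
from pydantic import BaseModel

# Hypothetical fields for illustration only; see dataset/models.py for
# the actual Camera/Data/Frame/Metadata definitions.
class Camera(BaseModel):
    model: str = "PINHOLE"
    fx: float
    fy: float
    cx: float
    cy: float

class Frame(BaseModel):
    path: str
    transform: list[list[float]]  # 4x4 camera-to-world matrix

cam = Camera(fx=800.0, fy=800.0, cx=320.0, cy=240.0)
frame = Frame(
    path="0000/000.png",
    transform=[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
)

# Pydantic v2 serializes to JSON and validates back, which is the kind of
# round-trip that converting between .db and .json variants relies on.
payload = frame.model_dump_json()
restored = Frame.model_validate_json(payload)
assert restored == frame
```

A malformed record (say, a string where fx is expected) raises a ValidationError instead of silently propagating bad metadata.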

The dataset loaders have been rewritten to accommodate these changes too, with NpyDataset and ImgDataset replaced by a combined Dataset class that can load either a plain set of images/EXRs or a properly formed .db or .json dataset, supporting image, EXR, and NumPy formats (bitpacked too!).
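A minimal sketch of the kind of format detection such a combined loader needs; this detect_format helper is hypothetical and not part of the actual Dataset API:

```python
from pathlib import Path

def detect_format(root: Path) -> str:
    """Guess which dataset variant lives under `root` (sketch only)."""
    # Prefer the structured metadata variants when present.
    if (root / "transforms.db").exists():
        return "db"
    if (root / "transforms.json").exists():
        return "json"
    # Otherwise fall back to raw arrays or image files.
    if any(root.glob("**/*.npy")):
        return "npy"
    if any(root.glob("**/*.png")) or any(root.glob("**/*.exr")):
        return "images"
    raise ValueError(f"Unrecognized dataset layout: {root}")
```

The real Dataset class presumably also inspects the metadata to decide how to decode each file (including unpacking bitpacked arrays), but the dispatch idea is the same.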

CLI Changes:

  • Remove the dataset CLI imgs-to-npy and npy-to-imgs commands
  • Add CLI for dataset.convert to go from a .db to a .json dataset
  • Emulate CLI:
    • Force emulate.spad to save bitpacked npy's
    • All emulate tasks now work on a dataset or directory of frames, using the --pattern switch
    • All emulate tasks now respect the --force flag
  • Rename interpolate.frames to interpolate.dataset, make it work with frames and datasets, using the --pattern switch
  • Transforms CLI:
    • Remove complicated dataloading schemes with in-collate processing in favor of simpler code (potentially slower)
    • Change all default patterns to account for nested data (e.g.: "flow_*.exr" -> "**/*.exr")
    • Rename transforms.tonemap-exrs to transforms.tonemap-frames to make it clear this is for image data, not depths (also works with composites)
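For context on the "bitpacked npy's" item above: binary SPAD frames hold one bit per pixel, so NumPy's packbits stores 8 pixels per byte, an 8x size reduction. A sketch of the round trip:

```python
import numpy as np

# Synthetic binary SPAD-like frames: 4 frames of 16x16 pixels, values 0/1.
frames = (np.random.default_rng(0).random((4, 16, 16)) > 0.5).astype(np.uint8)

# Pack along the last axis: 16 one-bit pixels become 2 bytes per row.
packed = np.packbits(frames, axis=-1)
print(packed.shape)  # (4, 16, 2)

# `count` trims any padding bits so the round trip is exact.
unpacked = np.unpackbits(packed, axis=-1, count=frames.shape[-1])
assert np.array_equal(unpacked, frames)
```

How the packing axis and any offset are recorded in the metadata is up to the dataset schema; this only shows the NumPy mechanics.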

MISC:

  • Use ElapsedProgress instead of Progress where possible.
  • Use Path as inputs for CLI tasks directly, offloading conversions to tyro
  • Remove opencv (only used by rife for image loading, replaced with imageio)
  • Remove torchvision (unused)
  • Remove jsonschema (replaced by pydantic)
  • Clean up typing/typing_extensions/collections.abc imports
  • Move interpolate_poses into pose.py
  • Clean up unused rife code and enable it to work with a list of frames or with input-dir and pattern, which allows saving interpolated frames to subfolders
  • Add py.typed to enable end-users to use vsim types

TODO:

  • Add docstrings to all new methods, ensure they render properly in Sphinx docs
  • Refine schema/model interplay:
    • We can/should add extra info in the schema (fps, key-frame multiplier, arclen?, etc), especially since dataset.info and blender.sequence_info have been removed.
  • Add dataset.merge CLI that can merge different (but similar!) transforms files, renaming the path parameters as needed.
    • Note: This will break if there are any per-frame attributes such as bitpack_dim or offset. These only relate to NPY dataformats, so perhaps this is an OK limitation for now.
  • Add additional tests as needed
  • Validate that all new code passes typing checks, add types where needed, and add type checking to CI.
  • Update Data Format and Loading docs, merge with data conventions PR

Will punt on the following:

  • Emulate SPAD needs to handle Alpha + binomial frames (see this PR)
  • Only the PINHOLE camera is supported at the moment. What info should we save for other camera types? How should they convert to the nerfstudio-esque JSON format? The distortion parameters in the schema are currently unused. Further, Blender's fisheye model is different from OpenCV's.

Questions:

  • How future-proof is this new data format? Can it be applied to all sensor modalities we wish to emulate?
  • What additional attributes should we extract/capture in the .db vs the .json formats?
  • Is the naming confusing around schema.Metadata vs models.Metadata (same for the other classes)? They mirror one another, but I don't want users accidentally importing the wrong one. These have been marked as private.

📚 Documentation preview 📚: https://visionsim--26.org.readthedocs.build/en/26/

@jungerm2 (Member, Author) commented Feb 9, 2026

@shantanu-gupta I'm still working my way through the TODOs above, but since we chatted about these features, let me know if you have any high-level feedback on the implementation.

@shantanu-gupta (Contributor) commented:

@jungerm2 Thanks for looping me in. I'll take a look.

Just to start:

> Force emulate.spad to save bitpacked npy's

This should first confirm that the SPAD data is being generated with 1-bit precision; I take care of this in the other PR. There is sometimes a practical case for not generating binary SPAD data, instead using 3-bit or 5-bit data at a correspondingly lower frame rate.

@jungerm2 (Member, Author) commented Feb 10, 2026 via email

@jungerm2 (Member, Author) commented:

I think this PR's just about done. The plan is to merge it with main, fast-forward the sensor sim branch and merge it there, then update the quickstart docs as needed. Thoughts @shantanu-gupta ?

@jungerm2 (Member, Author) commented:

I think I've addressed most of the issues here, so I'll go ahead and merge this and go work on the other branch. Going forward, it might be a good idea to add some database migrations when we modify the schema. I added the structure to support this with example code in the comments.

jungerm2 merged commit bd50b1c into main Feb 23, 2026
13 checks passed
jungerm2 deleted the datasets branch February 23, 2026 16:53