feat: support optional datasetUrl in metadata for remote parquet loading (#162)
Conversation
Allow static exports to load parquet data from an external URL by adding an optional `datasetUrl` field to the metadata database config. When present, DuckDB WASM uses this URL instead of the default relative `dataset.parquet` path. This enables hosting large datasets separately from the viewer (e.g., in a cloud storage bucket or HuggingFace dataset repository) while keeping the static export lightweight. Fully backward-compatible: without `datasetUrl`, behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
donghaoren left a comment:
This looks good! The Python CLI should already support passing a URL to a dataset via pandas' fsspec support. It will load the data with pandas, do some preprocessing (if the user wants to compute embeddings), put it into DuckDB on the server, and tell the client to use the server's DuckDB.
Thanks for the quick review! The use case I had in mind is for static exports via `make_archive()`. Right now this requires manually editing `metadata.json` after export. For the Python CLI, it could be nice to have a flag, e.g. `--dataset-url`.
Thanks for sharing the use case! I assume that in order to get the parquet file we already need to unzip the bundle, so it shouldn't be much effort to set the `datasetUrl` there.
A PR to add:

- an optional `datasetUrl` field to the metadata `database` config
- when present, DuckDB WASM uses this URL instead of the default relative `dataset.parquet` path

This enables hosting large parquet datasets separately from the viewer (e.g., in a cloud storage bucket or dataset repository) while keeping the static export lightweight. Fully backwards-compatible: without `datasetUrl` in metadata, behaviour is unchanged.

You can see an example here: https://huggingface.co/spaces/davanstrien/test-atlas-remote, where `metadata.json` (https://huggingface.co/spaces/davanstrien/test-atlas-remote/blob/main/data/metadata.json) points to a dataset hosted here: https://huggingface.co/datasets/davanstrien/test-atlas-remote-data/tree/main, i.e. in a dataset repo. This should enable remote hosting of fairly large datasets. I will continue testing with larger examples.

Motivation
When deploying static exports for large datasets, bundling the parquet file alongside the viewer can hit storage limits (e.g., Hugging Face Spaces has a ~10GB limit). With this change, you can store the parquet separately and point the viewer at it via metadata:
```json
{
  "isStatic": true,
  "database": {
    "type": "wasm",
    "load": true,
    "datasetUrl": "https://example.com/path/to/dataset.parquet"
  }
}
```

The viewer (~100MB) and data (potentially many GB) can then be hosted independently.
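As a rough illustration of how a config like this could be consumed, here is a minimal TypeScript sketch. The `DatabaseConfig` shape and `resolveDatasetUrl` helper are hypothetical simplifications for this example; the real types live in `packages/viewer/src/app/backend_data_source.ts` and may differ.

```typescript
// Hypothetical, simplified types; the actual interface in
// backend_data_source.ts may look different.
interface DatabaseConfig {
  type: "wasm";
  load: boolean;
  /** Optional remote URL for the parquet file; default relative path when absent. */
  datasetUrl?: string;
}

// Pick the URL DuckDB WASM should fetch: the remote one if configured,
// otherwise the bundled relative path.
function resolveDatasetUrl(db: DatabaseConfig): string {
  return db.datasetUrl ?? "dataset.parquet";
}

console.log(resolveDatasetUrl({ type: "wasm", load: true }));
// → dataset.parquet
console.log(
  resolveDatasetUrl({
    type: "wasm",
    load: true,
    datasetUrl: "https://example.com/path/to/dataset.parquet",
  })
);
// → https://example.com/path/to/dataset.parquet
```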
Change
Two lines in `packages/viewer/src/app/backend_data_source.ts`:

- add `datasetUrl?: string` to the `Metadata.database` interface
- when present, use it instead of the default relative `dataset.parquet` path

Test plan
- static exports without `datasetUrl` work as before

Follow-up
I think it would be nice to expose the `datasetUrl` through the Python CLI (e.g., `--dataset-url`) and the `DataSource.make_archive()` method so exports can be pre-configured for remote loading without manually editing `metadata.json`, but I wanted to open this first to discuss whether you're open to that and what your preferred approach would be.