
feat: support optional datasetUrl in metadata for remote parquet loading#162

Merged
donghaoren merged 1 commit into apple:main from davanstrien:feat/remote-dataset-url
Feb 13, 2026

Conversation

@davanstrien
Contributor

A PR to add:

  • an optional datasetUrl field to the metadata database config
  • when present in a static export, DuckDB WASM loads the parquet from this URL instead of the default relative dataset.parquet path

This enables hosting large parquet datasets separately from the viewer (e.g., in a cloud storage bucket or dataset repository) while keeping the static export lightweight. Fully backwards-compatible — without datasetUrl in metadata, behaviour is unchanged.

You can see an example here: https://huggingface.co/spaces/davanstrien/test-atlas-remote

There, metadata.json (https://huggingface.co/spaces/davanstrien/test-atlas-remote/blob/main/data/metadata.json) points to a dataset hosted at https://huggingface.co/datasets/davanstrien/test-atlas-remote-data/tree/main, i.e. in a dataset repo. This should enable remote hosting of fairly large datasets. I will continue testing with larger examples.

Motivation

When deploying static exports for large datasets, bundling the parquet file alongside the viewer can hit storage limits (e.g., Hugging Face Spaces has a ~10GB limit). With this change, you can store the parquet separately and point the viewer at it via metadata:

{
  "isStatic": true,
  "database": {
    "type": "wasm",
    "load": true,
    "datasetUrl": "https://example.com/path/to/dataset.parquet"
  }
}

The viewer (~100MB) and data (potentially many GB) can then be hosted independently.

Change

Two lines in packages/viewer/src/app/backend_data_source.ts:

  1. Added datasetUrl?: string to the Metadata.database interface
  2. Changed parquet URL construction to use it with fallback:
    let datasetUrl = metadata.database?.datasetUrl ?? joinUrl(this.serverUrl, "dataset.parquet");

Test plan

  • Tested with a static export deployed to a HuggingFace Space loading 25,000 rows from a remote parquet file hosted in a separate HF dataset repository
  • Verified backward compatibility — exports without datasetUrl work as before
  • Prettier formatting passes

Follow-up

I think it would be nice to expose the datasetUrl through the Python CLI (e.g., --dataset-url) and the DataSource.make_archive() method so exports can be pre-configured for remote loading without manually editing metadata.json, but I wanted to open this first to discuss whether you're open to that and what your preferred approach would be.

Allow static exports to load parquet data from an external URL by
adding an optional `datasetUrl` field to the metadata database config.
When present, DuckDB WASM uses this URL instead of the default relative
`dataset.parquet` path. This enables hosting large datasets separately
from the viewer (e.g., in a cloud storage bucket or HuggingFace dataset
repository) while keeping the static export lightweight.

Fully backward-compatible — without `datasetUrl`, behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@donghaoren donghaoren left a comment


This looks good! The Python CLI should already support passing a URL to a dataset through pandas' support of fsspec. It will load the data with pandas, do some preprocessing (if the user wants to compute embeddings), put it into DuckDB on the server, and tell the client to use the server's DuckDB.

@davanstrien
Contributor Author

Thanks for the quick review!

The use case I had in mind is for static exports via --export-application, when you want to host the exported visualisation as a lightweight static site (e.g., a Hugging Face Space with sdk: static) but the parquet file is too large to bundle inside the ZIP.

Currently, --export-application bundles everything (viewer + parquet), which can hit storage limits for large datasets. With this change, you could generate a static export as usual, but host the parquet separately (e.g., in an HF dataset repo, S3 bucket, etc.) and set datasetUrl in metadata.json to point at it. The viewer assets (~100MB) can then be deployed as a static site, with DuckDB WASM loading the parquet directly from the remote URL.

Right now this requires manually editing metadata.json after export; that's what I did here: https://huggingface.co/spaces/davanstrien/test-atlas-remote.
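The manual edit described above can be scripted; a hedged sketch, where the paths, URL, and helper name are illustrative rather than part of the tooling:

```python
import json
from pathlib import Path


def set_dataset_url(metadata_path: str, dataset_url: str) -> None:
    """Point a static export's viewer at a remotely hosted parquet file."""
    path = Path(metadata_path)
    metadata = json.loads(path.read_text())
    # setdefault keeps any existing database config (type, load, ...).
    metadata.setdefault("database", {})["datasetUrl"] = dataset_url
    path.write_text(json.dumps(metadata, indent=2))
```

After unzipping the export, you would run something like set_dataset_url("data/metadata.json", "https://example.com/dataset.parquet") before deploying the static site.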

For the Python CLI, it could be nice to have a flag (e.g., --dataset-url on --export-application) so the metadata URL can be set automatically.

@donghaoren
Collaborator

Thanks for sharing the use case!

I assume in order to get the parquet file we already need to unzip the bundle, so it shouldn't be much effort to set the datasetUrl in the metadata.json file before uploading the static site. With access to the metadata.json, you can also set other parts of the metadata to configure the default charts (e.g., color the embedding by a field by default), and in the future we could even allow loading in a pre-configured UI state. Adding a --dataset-url flag would be a bit limiting. Maybe we can have a --export-application-metadata-overrides flag (need a better name) and you can pass in a custom JSON that will merge into the default metadata? Another option is to have --export-application support exporting to a folder instead of a zip file, so it's a bit easier to modify.
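The metadata-overrides idea floated above could work as a recursive merge; a sketch under the assumption that the override is a JSON object merged key-by-key into the default metadata (the flag name and semantics are hypothetical):

```python
def merge_metadata(default: dict, overrides: dict) -> dict:
    """Recursively merge an overrides dict into a default metadata dict.

    Nested dicts are merged key-by-key, so an override like
    {"database": {"datasetUrl": "..."}} adds one key without
    replacing the rest of the "database" object; any other value
    type simply replaces the default.
    """
    merged = dict(default)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metadata(merged[key], value)
        else:
            merged[key] = value
    return merged


default = {"isStatic": True, "database": {"type": "wasm", "load": True}}
overrides = {"database": {"datasetUrl": "https://example.com/dataset.parquet"}}
merged = merge_metadata(default, overrides)
```

Here merged keeps type and load from the defaults while gaining datasetUrl, which is the behavior a hypothetical overrides flag would want.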

@donghaoren donghaoren merged commit 2e55595 into apple:main Feb 13, 2026
7 checks passed