feat: support optional datasetUrl in metadata for remote parquet loading (#162)
Conversation
Allow static exports to load parquet data from an external URL by adding an optional `datasetUrl` field to the metadata database config. When present, DuckDB WASM uses this URL instead of the default relative `dataset.parquet` path. This enables hosting large datasets separately from the viewer (e.g., in a cloud storage bucket or HuggingFace dataset repository) while keeping the static export lightweight. Fully backward-compatible: without `datasetUrl`, behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
donghaoren left a comment:
This looks good! The Python CLI should already support passing a URL to a dataset via pandas' fsspec support. It will load the data with pandas, do some preprocessing (if the user wants to compute embeddings), put it into DuckDB on the server, and tell the client to use the server's DuckDB.
Thanks for the quick review! The use case I had in mind is for static exports via `make_archive()`. Right now this requires manually editing `metadata.json` after export. For the Python CLI, it could be nice to have a flag, e.g. `--dataset-url`.
Thanks for sharing the use case! I assume that in order to get the parquet file we already need to unzip the bundle, so it shouldn't be much effort to set the `datasetUrl` there.
A PR to add:

- an optional `datasetUrl` field to the metadata `database` config
- when present, DuckDB WASM uses this URL instead of the default relative `dataset.parquet` path

This enables hosting large parquet datasets separately from the viewer (e.g., in a cloud storage bucket or dataset repository) while keeping the static export lightweight. Fully backwards-compatible: without `datasetUrl` in metadata, behaviour is unchanged.

You can see an example here: https://huggingface.co/spaces/davanstrien/test-atlas-remote, where `metadata.json` (https://huggingface.co/spaces/davanstrien/test-atlas-remote/blob/main/data/metadata.json) points to a dataset hosted here: https://huggingface.co/datasets/davanstrien/test-atlas-remote-data/tree/main, i.e. in a dataset repo. This should enable remote hosting of fairly large datasets. I will continue testing with larger examples.

Motivation
When deploying static exports for large datasets, bundling the parquet file alongside the viewer can hit storage limits (e.g., Hugging Face Spaces has a ~10GB limit). With this change, you can store the parquet separately and point the viewer at it via metadata:
```json
{
  "isStatic": true,
  "database": {
    "type": "wasm",
    "load": true,
    "datasetUrl": "https://example.com/path/to/dataset.parquet"
  }
}
```

The viewer (~100MB) and data (potentially many GB) can then be hosted independently.
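As a rough illustration of how a config like this could be consumed, here is a minimal TypeScript sketch. The `DatabaseConfig` shape and `resolveDatasetUrl` helper are hypothetical simplifications for this example; the real types live in `packages/viewer/src/app/backend_data_source.ts` and may differ.

```typescript
// Hypothetical, simplified types; the actual interface in
// backend_data_source.ts may look different.
interface DatabaseConfig {
  type: "wasm";
  load: boolean;
  /** Optional remote URL for the parquet file; default relative path when absent. */
  datasetUrl?: string;
}

// Pick the URL DuckDB WASM should fetch: the remote one if configured,
// otherwise the bundled relative path.
function resolveDatasetUrl(db: DatabaseConfig): string {
  return db.datasetUrl ?? "dataset.parquet";
}

console.log(resolveDatasetUrl({ type: "wasm", load: true }));
// → dataset.parquet
console.log(
  resolveDatasetUrl({
    type: "wasm",
    load: true,
    datasetUrl: "https://example.com/path/to/dataset.parquet",
  })
);
// → https://example.com/path/to/dataset.parquet
```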
Change
Two lines in `packages/viewer/src/app/backend_data_source.ts`:

- add `datasetUrl?: string` to the `Metadata.database` interface
- when present, use it instead of the default relative `dataset.parquet` path

Test plan
- static exports without `datasetUrl` work as before

Follow-up
I think it would be nice to expose the `datasetUrl` through the Python CLI (e.g., `--dataset-url`) and the `DataSource.make_archive()` method so exports can be pre-configured for remote loading without manually editing `metadata.json`, but I wanted to open this first to discuss whether you're open to that and what your preferred approach would be.