[FEATURE] Add S3 write support to cli datagen #70

@prantogg

Is your feature request related to a problem? Please describe.

When generating large-scale SpatialBench datasets (e.g., SF1000 or higher), there is currently no way to write generated data directly to S3. This creates several significant limitations:

  • Local storage bottlenecks: Large-scale datasets can be hundreds of GBs or TBs in size, quickly exhausting local disk space. For example, the SF1000 Trip table alone can exceed 500GB.
  • Workflow inefficiency: The current workflow requires generating data locally first, then manually uploading it to S3 with a separate tool (AWS CLI, rclone, etc.), which is time-consuming and error-prone.

Describe the solution you'd like

Add support for S3 URIs in the --output-dir parameter, enabling the tool to stream generated data directly to S3 without requiring local storage:

# Current workflow:
spatialbench-cli --scale-factor 1000 --output-dir ./data
# Then manually: aws s3 cp ./data s3://my-bucket/spatialbench/sf1000 --recursive

# Proposed workflow:
spatialbench-cli --scale-factor 1000 --output-dir s3://my-bucket/spatialbench/sf1000
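One piece of the proposal is deciding whether an --output-dir value is a local path or an S3 URI, and splitting the latter into a bucket and key prefix. The sketch below shows one way that check could behave. The function name parse_output_dir and its output format are hypothetical, chosen only for illustration; it is not part of spatialbench-cli and makes no claim about how the tool would implement this.

```shell
# Hypothetical helper (not part of spatialbench-cli): classify an
# --output-dir value and split S3 URIs into bucket and key prefix.
# Prints "s3 <bucket> <prefix>" or "local <path>".
parse_output_dir() {
  local dest="$1"
  case "$dest" in
    s3://?*)
      local rest="${dest#s3://}"   # strip the scheme
      local bucket="${rest%%/*}"   # text before the first slash
      local prefix=""
      if [ -z "$bucket" ]; then
        echo "error: missing bucket name in '$dest'" >&2
        return 1
      fi
      # Only set a prefix when something follows the bucket name.
      if [ "$rest" != "$bucket" ]; then
        prefix="${rest#*/}"
      fi
      printf 's3 %s %s\n' "$bucket" "$prefix"
      ;;
    *)
      # Anything without the s3:// scheme is treated as a local path.
      printf 'local %s\n' "$dest"
      ;;
  esac
}

parse_output_dir ./data
parse_output_dir s3://my-bucket/spatialbench/sf1000
```

With a split like this, the existing local-write path would stay unchanged for plain directories, and only values starting with s3:// would be routed to an uploader.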

Labels

enhancement (New feature or request)
