
Support for generating and inserting results from object storage such as Amazon S3 #27

@azimov

Currently, this package only directly supports uploads of files from a directory structure.
However, this is limiting for many projects because it may be significantly faster to asynchronously produce results and export them to simple object stores such as Amazon S3.

Furthermore, many of the tasks that execute are likely to be database intensive rather than CPU intensive. Requiring EC2 nodes or other services that write to a disk is likely an expensive solution when results can easily be unloaded from databases into object stores asynchronously.

Proposals:

  • Define interfaces for importing files from S3 buckets, Google Cloud Storage, and similar object stores
  • Support a load table solution where results can be imported into load tables in databases in a thread-safe manner:
      • Upload CSV objects, then copy them to the main table one at a time so that race conditions don't lock up tables
  • Support creating manifests that can be transferred. E.g. results are generated by some analytics package and a JSON file is created listing the bucket/object store and file reference as well as the result model spec
  • Support a simple table back end (in lieu of a message queue/broker) that stores and logs the state of the results insert
  • Make a simple Plumber API that lets you initiate an upload from a given manifest (entries are hashed to prevent multiple requests for identical uploads)
  • Cleanup/Garbage collection step: Delete objects from object stores when inserts are successful
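
To make the proposals above more concrete, here is a rough sketch of how the pieces might fit together: a manifest is hashed to reject duplicate requests, a state table records progress, each CSV object is staged in a load table before being copied to the main table, and source objects are deleted once the inserts succeed. It is written in Python with SQLite purely for illustration. Everything named here is an assumption and not part of this package: the manifest layout, the `upload_state` and `cohort_counts` tables, the `FAKE_STORE` dictionary standing in for a real object store client, and the `ALLOWED_TABLES` allow-list standing in for the result model spec.

```python
import csv
import hashlib
import io
import json
import sqlite3

# Hypothetical stand-in for an object store: (bucket, key) -> CSV text.
FAKE_STORE = {
    ("results-bucket", "run1/cohort_counts.csv"): "cohort_id,n\n1,100\n2,250\n",
}
# Hypothetical allow-list standing in for the result model spec.
ALLOWED_TABLES = {"cohort_counts": ("cohort_id", "n")}


def manifest_hash(manifest):
    """Hash a canonical JSON encoding so identical manifests get the same hash."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def init_db(conn):
    """Create the state table and one example results table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS upload_state (
            manifest_hash TEXT PRIMARY KEY,
            status TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS cohort_counts (cohort_id INTEGER, n INTEGER);
    """)


def submit_manifest(conn, manifest):
    """Register a manifest; return False if an identical one already exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO upload_state VALUES (?, 'pending')",
                (manifest_hash(manifest),),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def load_file(conn, bucket, key, table):
    """Stage one CSV object in a load table, then copy it to the main table."""
    columns = ALLOWED_TABLES[table]  # KeyError for tables outside the spec
    rows = list(csv.DictReader(io.StringIO(FAKE_STORE[(bucket, key)])))
    load = f"load_{table}"
    placeholders = ", ".join(f":{c}" for c in columns)
    # Empty temp table with the same columns as the main table.
    conn.execute(f"CREATE TEMP TABLE {load} AS SELECT * FROM {table} WHERE 0")
    try:
        with conn:  # one transaction: stage the rows, then copy them across
            conn.executemany(f"INSERT INTO {load} VALUES ({placeholders})", rows)
            conn.execute(f"INSERT INTO {table} SELECT * FROM {load}")
    finally:
        conn.execute(f"DROP TABLE {load}")


def process_manifest(conn, manifest):
    """Load every file in the manifest, mark it complete, then clean up."""
    for entry in manifest["files"]:
        load_file(conn, manifest["bucket"], entry["key"], entry["table"])
    with conn:
        conn.execute(
            "UPDATE upload_state SET status = 'complete' WHERE manifest_hash = ?",
            (manifest_hash(manifest),),
        )
    # Garbage collection: delete source objects only after all inserts succeeded.
    for entry in manifest["files"]:
        FAKE_STORE.pop((manifest["bucket"], entry["key"]), None)
```

A caller would run `submit_manifest` first and only call `process_manifest` when it returns `True`. In a real deployment the Plumber endpoint would play the role of `submit_manifest`, and a proper object store client would replace the in-memory dictionary.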

Potential Issues:

  • Storage of access keys for buckets
