SCOPING: reduce cloud storage costs #1820

@becky-gilbert

TL;DR

In order to make CHS more sustainable long term, we should consider ways to reduce cloud storage costs.

Here's a non-mutually-exclusive set of options for clearing out our cloud storage, along with the estimated effort and storage savings for each.

Current state

We have ~880k video files totaling ~11.5 TB on AWS. In the past few weeks, our storage has increased by about 0.5 TB (500GB) per week.
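To put the growth rate in perspective, here is a rough projection. It is only a sketch: it assumes the recent ~0.5 TB/week rate stays constant, which may not hold, and the function name is made up for illustration.

```python
def projected_storage_tb(current_tb: float, weekly_growth_tb: float, weeks: int) -> float:
    """Linear extrapolation of total storage.

    Assumption: weekly growth stays constant at the recently observed rate.
    """
    return current_tb + weekly_growth_tb * weeks


# Figures from this issue: ~11.5 TB today, ~0.5 TB added per week.
for weeks in (13, 26, 52):
    total = projected_storage_tb(11.5, 0.5, weeks)
    print(f"after {weeks} weeks: ~{total:.1f} TB")
```

Under that assumption, storage would more than triple within a year if nothing is deleted.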

The bulk of the storage is in our production buckets. We have 3 separate buckets for Pipe, RecordRTC, and jsPsych experiments.

Pipe: 6.0 TB
RecordRTC: 5.1 TB
jsPsych: 196.9 GB

These buckets are increasing at different rates: jsPsych is growing the fastest (as a percentage of its total size), and Pipe has the lowest relative increase in size.

In our video database, we have ~854k objects:

  • Preview: 157,977 (18%)
  • Non-preview: 696,397 (82%)

Below are some options for automated deletion of video files, along with the associated impact on storage.

Videos associated with preview responses after some duration

Here is the number of videos associated with preview responses, where the preview response is older than the given duration:

  • > 30 days: 155,032 (18% of 854k total videos)
  • > 90 days: 145,935 (17%)
  • > 1 year: 121,628 (14%)
  • > 2 years: 95,548 (11%)
  • > 3 years: 73,960 (9%)
  • > 4 years: 52,366 (6%)

Videos associated with any response after some duration

  • > 1 year: 666,901 (78% of 854k total videos)
  • > 2 years: 473,941 (55%)
  • > 3 years: 347,477 (41%)
  • > 4 years: 220,989 (26%)

Videos associated with a study after some duration/event that indicates the study has become inactive/stale

  • Study status is archived
  • Last (non-preview?) response older than some duration

More info to come!

Videos associated with a response when consent is rejected

  • > 30 days: 66,509 (8% of 854k total videos)
  • > 90 days: 61,075 (7%)
  • > 1 year: 48,140 (6%)
  • > 2 years: 33,267 (4%)
  • > 3 years: 22,618 (3%)
  • > 4 years: 13,932 (2%)
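The response-level criteria above (preview, consent rejected, any response past some age) could be combined into a single predicate. This is a minimal sketch, not our actual schema: `VideoRecord`, `RetentionPolicy`, their fields, and the default durations are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class VideoRecord:
    # Hypothetical stand-in for a video joined to its response.
    key: str
    response_date: datetime
    is_preview: bool
    consent_rejected: bool


@dataclass
class RetentionPolicy:
    # Each rule is disabled when set to None. Durations are placeholders.
    preview_max_age: Optional[timedelta] = timedelta(days=30)
    rejected_max_age: Optional[timedelta] = timedelta(days=90)
    any_max_age: Optional[timedelta] = None


def is_deletable(video: VideoRecord, policy: RetentionPolicy, now: datetime) -> bool:
    """Return True if any enabled rule says the video is past its retention age."""
    age = now - video.response_date
    if video.is_preview and policy.preview_max_age is not None:
        if age > policy.preview_max_age:
            return True
    if video.consent_rejected and policy.rejected_max_age is not None:
        if age > policy.rejected_max_age:
            return True
    if policy.any_max_age is not None and age > policy.any_max_age:
        return True
    return False


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old_preview = VideoRecord("a.mp4", now - timedelta(days=45), True, False)
recent_real = VideoRecord("b.mp4", now - timedelta(days=45), False, False)
print(is_deletable(old_preview, RetentionPolicy(), now))  # True
print(is_deletable(recent_real, RetentionPolicy(), now))  # False
```

Keeping the rules independent means we can enable them one at a time and compare the savings against the counts listed above.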

Work estimate

To address this, we would need to:

  • Add a Celery task that checks for studies/responses that meet the relevant criteria and deletes any associated videos from S3. This should be straightforward since our existing tasks do similar things. Estimate: 3-5 days
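The deletion step of such a task might look roughly like the sketch below. It assumes a boto3-style S3 client is passed in (anything exposing `delete_objects`), and the helper names `chunked` and `delete_videos` are hypothetical. The one hard constraint is that S3's DeleteObjects call accepts at most 1000 keys per request, so keys are sent in batches.

```python
from typing import Iterable, Iterator, List

# S3's DeleteObjects API accepts at most 1000 keys per request.
S3_DELETE_BATCH_LIMIT = 1000


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def delete_videos(s3_client, bucket: str, keys: Iterable[str]) -> int:
    """Delete keys in batches; return how many keys were not reported as errors.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    deleted = 0
    for batch in chunked(keys, S3_DELETE_BATCH_LIMIT):
        response = s3_client.delete_objects(
            Bucket=bucket,
            # Quiet mode: the response lists only the keys that failed.
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
        deleted += len(batch) - len(response.get("Errors", []))
    return deleted
```

In the real task this would be wrapped in a Celery task and fed the keys of videos that meet the chosen criteria. Passing the client in keeps the function easy to test with a fake.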

If we delete only the video files from storage (and keep our records of the associated response/study), we could add a column to the response/video table indicating that videos once existed for this response but have been deleted. Options:

  • Keep all Video objects in the DB: add a ‘deleted’ column to the Video table, which would allow researchers to see the number of video files associated with a response and the file names. (This solution would not reduce the database size.)
  • Delete the Video objects from the DB: add a ‘videos_deleted’ flag to the associated response object. This would inform users that videos used to exist for this response but have been deleted. Researchers would not know the number of files or their names. (This solution would reduce the size of the Video table in proportion to the number of video files deleted.)

Both of the options above would require a database migration and some UI updates.
Estimate: 5 days
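To make the difference between the two record-keeping options concrete, here is a small sketch using an in-memory SQLite database. The table and column names are simplified placeholders, not our real schema, and the actual change would be a Django migration rather than raw SQL.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a toy schema with one response that has two videos."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE response (
            id INTEGER PRIMARY KEY,
            videos_deleted INTEGER NOT NULL DEFAULT 0  -- option 2 flag
        );
        CREATE TABLE video (
            id INTEGER PRIMARY KEY,
            response_id INTEGER NOT NULL REFERENCES response(id),
            filename TEXT NOT NULL,
            deleted INTEGER NOT NULL DEFAULT 0         -- option 1 flag
        );
        INSERT INTO response (id) VALUES (1);
        INSERT INTO video (response_id, filename) VALUES (1, 'a.mp4'), (1, 'b.mp4');
        """
    )
    return db


def mark_videos_deleted(db: sqlite3.Connection, response_id: int) -> None:
    """Option 1: keep the Video rows and flag them as deleted."""
    db.execute("UPDATE video SET deleted = 1 WHERE response_id = ?", (response_id,))


def drop_video_rows(db: sqlite3.Connection, response_id: int) -> None:
    """Option 2: remove the Video rows and flag the response instead."""
    db.execute("DELETE FROM video WHERE response_id = ?", (response_id,))
    db.execute("UPDATE response SET videos_deleted = 1 WHERE id = ?", (response_id,))


db1 = make_db()
mark_videos_deleted(db1, 1)
# File names and counts are still available to researchers.
print(db1.execute("SELECT filename, deleted FROM video ORDER BY id").fetchall())

db2 = make_db()
drop_video_rows(db2, 1)
# Only the fact that videos once existed survives.
print(db2.execute("SELECT COUNT(*) FROM video").fetchone()[0])
print(db2.execute("SELECT videos_deleted FROM response WHERE id = 1").fetchone()[0])
```

With option 1 the Video table keeps its size but retains per-file detail. With option 2 the table shrinks and only the response-level flag remains.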

Related issues: #1431
I suggest we address this issue at the same time.
