
Clean up old taxonomy databases #6099

@maverbiest

Description


Since #6018, we have been downloading taxonomic information from NCBI and storing it as an SQLite database on S3. The database is around 93 MB once gzipped.

The DB creation script runs once per month in a cronjob through this action. The current behaviour is to store a new versioned database on each run, using the current date as a timestamp. This keeps previous versions of the taxonomy around and would allow different preprocessing pipeline versions to pin a specific NCBI taxonomy version.

However, we currently store a new DB each month regardless of whether the taxonomy has actually been updated, and there is no clean-up logic to remove old, unused database versions. Over time, this could leave us paying for storage that we aren't using.

Possible improvements

  1. Regularly clean up databases that are, e.g., more than 6 months old, either manually or in an automated job.
  2. Have the DB creation action compute the hash of the newly created database and upload it to S3 only if the hash differs from that of the latest version on S3.
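The first improvement boils down to a small retention policy over the versioned keys. A minimal sketch, assuming a hypothetical key naming scheme like `taxonomy_YYYY-MM-DD.sqlite.gz` (the real key format on S3 may differ); the actual deletion would then go through boto3's `delete_objects`, omitted here:

```python
import re
from datetime import date, timedelta

# Hypothetical key naming scheme: taxonomy_YYYY-MM-DD.sqlite.gz
KEY_RE = re.compile(r"taxonomy_(\d{4}-\d{2}-\d{2})\.sqlite\.gz$")

def keys_to_delete(keys, today, max_age_days=183):
    """Return versioned DB keys older than ~6 months.

    The most recent version is always kept, even if it is older
    than the cutoff, so a valid database always remains available.
    """
    dated = []
    for key in keys:
        m = KEY_RE.search(key)
        if m:
            dated.append((date.fromisoformat(m.group(1)), key))
    if not dated:
        return []
    dated.sort()
    newest = dated[-1][1]  # always keep the most recent version
    cutoff = today - timedelta(days=max_age_days)
    return [key for d, key in dated if d < cutoff and key != newest]
```

Keeping the newest version unconditionally guards against the edge case where the monthly job has been failing for a long time and every version has aged past the cutoff.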
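The second improvement amounts to a content-hash check before upload. A minimal sketch, assuming the hash of the latest uploaded version is available to the action (e.g., stored in S3 object metadata or a sidecar file; the issue does not specify a mechanism):

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def should_upload(new_db_path, latest_remote_sha256):
    """Upload only when the new database's content hash differs
    from the recorded hash of the latest version on S3."""
    return file_sha256(new_db_path) != latest_remote_sha256
```

One caveat worth checking before relying on this: hashing the gzipped artifact can yield false positives, since gzip embeds a timestamp and SQLite files are not guaranteed to be byte-identical across rebuilds of identical content. Hashing a canonical export (e.g., a sorted SQL dump) may be more robust.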


Labels

github_actions: Pull requests that update GitHub Actions code
preprocessing: Issues related to the preprocessing component
