
Clean up old taxonomy databases #6099

@maverbiest

Description


Since #6018, we have been downloading taxonomic information from NCBI and storing it as an SQLite database on S3. The database is around 93 MB once gzipped.

The DB creation script runs once per month in a cronjob through this action. The current behaviour is to store a new versioned database on each run, using the current date as a timestamp. This keeps previous versions of the taxonomy around and would allow different preprocessing pipeline versions to pin a specific NCBI taxonomy version.

However, we currently store a new DB each month regardless of whether the taxonomy has actually been updated, and there is no clean-up logic to remove old, unused database versions. Over time, this could leave us paying for storage that we aren't using.

Possible improvements

  1. Regularly clean up databases that are, e.g., more than 6 months old, either manually or in an automated job.
  2. Have the DB creation action compute the hash of the newly created database and upload it to S3 only if the hash differs from that of the latest version on S3.
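The first improvement boils down to a small retention policy over the versioned keys. A minimal sketch, assuming a hypothetical key naming scheme like `taxonomy_YYYY-MM-DD.sqlite.gz` (the real key format on S3 may differ); the actual deletion would then go through boto3's `delete_objects`, omitted here:

```python
import re
from datetime import date, timedelta

# Hypothetical key naming scheme: taxonomy_YYYY-MM-DD.sqlite.gz
KEY_RE = re.compile(r"taxonomy_(\d{4}-\d{2}-\d{2})\.sqlite\.gz$")

def keys_to_delete(keys, today, max_age_days=183):
    """Return versioned DB keys older than ~6 months.

    The most recent version is always kept, even if it is older
    than the cutoff, so a valid database always remains available.
    """
    dated = []
    for key in keys:
        m = KEY_RE.search(key)
        if m:
            dated.append((date.fromisoformat(m.group(1)), key))
    if not dated:
        return []
    dated.sort()
    newest = dated[-1][1]  # always keep the most recent version
    cutoff = today - timedelta(days=max_age_days)
    return [key for d, key in dated if d < cutoff and key != newest]
```

Keeping the newest version unconditionally guards against the edge case where the monthly job has been failing for a long time and every version has aged past the cutoff.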
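The second improvement amounts to a content-hash check before upload. A minimal sketch, assuming the hash of the latest uploaded version is available to the action (e.g., stored in S3 object metadata or a sidecar file; the issue does not specify a mechanism):

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def should_upload(new_db_path, latest_remote_sha256):
    """Upload only when the new database's content hash differs
    from the recorded hash of the latest version on S3."""
    return file_sha256(new_db_path) != latest_remote_sha256
```

One caveat worth checking before relying on this: hashing the gzipped artifact can yield false positives, since gzip embeds a timestamp and SQLite files are not guaranteed to be byte-identical across rebuilds of identical content. Hashing a canonical export (e.g., a sorted SQL dump) may be more robust.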


Labels

github_actions: Pull requests that update GitHub Actions code
preprocessing: Issues related to the preprocessing component
