Open
Labels: github_actions (Pull requests that update GitHub Actions code), preprocessing (Issues related to the preprocessing component)
Description
Since #6018 we have been downloading taxonomic information from NCBI and storing it as an SQLite DB on S3. The DB is around 93 MB once gzipped.
The DB creation script runs once per month as a cronjob through this action. The current behaviour is to store a versioned database on each run, using the current date as a timestamp. This lets us keep previous versions of the taxonomy around, and could allow different preprocessing pipeline versions to pin a specific NCBI taxonomy version.
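As a sketch of the versioning scheme described above, each monthly run could write the database under a date-stamped key. The prefix and filename pattern here are assumptions for illustration, not the repository's actual S3 layout:

```python
from datetime import date


def versioned_key(build_date: date, prefix: str = "taxonomy") -> str:
    """Date-stamped S3 key for one monthly taxonomy build (layout is hypothetical)."""
    return f"{prefix}/ncbi-taxonomy-{build_date.isoformat()}.sqlite.gz"


# Example: the May 2024 run would land at
# taxonomy/ncbi-taxonomy-2024-05-01.sqlite.gz
```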
However, we currently store a new DB every month regardless of whether the taxonomy has actually changed, and there is no clean-up logic to remove old, unused database versions. Over time, this could mean paying for storage that we aren't using.
Possible improvements
- Regularly clean up old database versions, e.g. those more than six months old, either manually or via automation.
- Have the DB creation action compute the hash of the newly created database and upload it to S3 only if the hash differs from that of the latest version on S3.
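The two improvements above can be sketched with stdlib-only helpers: a streaming hash of the gzipped DB (to compare against a hash recorded for the latest S3 version, e.g. in object metadata) and a selector for versions older than roughly six months. The key pattern, the 183-day cutoff, and the "keep the newest version regardless of age" rule are assumptions, and the actual S3 calls (e.g. boto3 `head_object` / `delete_object`) are left as comments:

```python
import hashlib
import re
from datetime import date, timedelta

# Hypothetical versioned-key layout: .../ncbi-taxonomy-YYYY-MM-DD.sqlite.gz
KEY_RE = re.compile(r"ncbi-taxonomy-(\d{4}-\d{2}-\d{2})\.sqlite\.gz$")


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the gzipped DB through SHA-256 so large files never sit in memory.

    The action would compare this against the hash stored for the latest S3
    version (e.g. via boto3 head_object metadata) and skip the upload on a match.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_versions(keys: list[str], today: date, max_age_days: int = 183) -> list[str]:
    """Return versioned keys older than the cutoff, always keeping the newest.

    `keys` would come from a boto3 list_objects_v2 call; the returned keys
    would then be passed to delete_object.
    """
    dated = sorted(
        (date.fromisoformat(m.group(1)), key)
        for key in keys
        if (m := KEY_RE.search(key))
    )
    if not dated:
        return []
    newest = dated[-1][1]  # never delete the most recent version
    cutoff = today - timedelta(days=max_age_days)
    return [key for d, key in dated if d < cutoff and key != newest]
```

A cleanup run dated 2024-09-01 with a January and an August version would select only the January key for deletion.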