Embed Copernicus product UUID in filenames ({product_id}__{safe_name}.ext)
so that different queries returning the same tile share the download.
- Add find_product_on_disk() to detect already-downloaded products by UUID
- Add zip integrity check to catch corrupted/truncated downloads
- Update process_products() to skip downloads for existing products
- Update S1/S2 filename format to include product ID
- Add 16 tests covering dedup, corruption detection, and cross-bbox scenarios
Product-level deduplication for Copernicus downloads
Problem
The old cache was keyed on exact request parameters (bbox, dates, etc.). Shifting the bbox by even 1 meter produced a different cache key and re-downloaded the same Copernicus tiles.
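To make the failure mode concrete, here is a minimal sketch of a request-level cache key. How the old key was actually computed is not shown in this PR, so the function name `request_cache_key`, the hashing scheme, and the example coordinates are all assumptions for illustration only.

```python
import hashlib
import json


def request_cache_key(bbox, start, end):
    """Hypothetical request-level key: a hash of the exact query parameters."""
    payload = json.dumps(
        {"bbox": list(bbox), "start": start, "end": end}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Two queries whose bboxes differ by 0.00001 degrees of longitude
# (roughly one meter). Both would return the same Copernicus tiles.
key_a = request_cache_key((11.0, 46.0, 11.5, 46.5), "2024-01-01", "2024-01-31")
key_b = request_cache_key((11.00001, 46.0, 11.5, 46.5), "2024-01-01", "2024-01-31")

# The keys differ, so a cache keyed this way treats the second query as new.
print(key_a != key_b)
```

Any key derived from the raw request parameters behaves this way: it identifies the query, not the products the query returns.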
Solution
Deduplication now happens at the product level instead of the request level. The Copernicus product UUID is embedded directly in the filename:
{product_id}__{safe_name}.ext
Before downloading, process_products() globs for {product_id}__* in the cache directory. If found, the download is skipped. The filesystem is the registry, so there are no extra state files to manage.
Zip files are also validated on lookup (zipfile.testzip()) so truncated/corrupted downloads from interrupted connections are detected, cleaned up, and re-downloaded automatically.
What changed
- common.py: added find_product_on_disk(), _is_valid_zip(), updated process_products() with dedup logic
- s1.py / s2.py: filename format changed to {product_id}__{safe_name}.ext
- tests/test_product_dedup.py: 16 tests covering dedup, corruption detection, cross-bbox scenarios
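The lookup and validation described above could be sketched roughly as follows. The function names come from this PR, but the signatures, return types, and bodies below are assumptions and not the code in common.py. The sketch also assumes product IDs are plain UUIDs with no glob metacharacters.

```python
import zipfile
from pathlib import Path
from typing import Optional


def _is_valid_zip(path: Path) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # Truncated files usually fail here: the end-of-archive record is missing.
        return False


def find_product_on_disk(cache_dir: Path, product_id: str) -> Optional[Path]:
    """Return a cached file for product_id, or None if nothing usable exists.

    Corrupt zip files are deleted so the caller can download them again.
    """
    for candidate in sorted(cache_dir.glob(f"{product_id}__*")):
        if candidate.suffix.lower() == ".zip" and not _is_valid_zip(candidate):
            candidate.unlink()  # remove the broken download
            continue
        return candidate
    return None
```

A caller such as process_products() would then skip the download whenever find_product_on_disk() returns a path, and fetch the product only when it returns None.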