@SuYe99 I'm looking at how to best integrate this code with watch, and I'm trying to connect the steps you're taking in prepare_ard.py with the data we have available.
One thing prepare_ard does is make a lot of assumptions about filenames. This is something that kwcoco helps to abstract away. Filename assumptions should be fine for the packed data (although I'd recommend avoiding them, as in my opinion they are usually an anti-pattern), so the goal is just to get data from a kwcoco file into something pycold will accept.
The following snippets are what I've done so far. First here is a bit of demo code that will build a kwcoco dataset with landsat8 data assuming you have the watch repo and appropriate AWS credentials to pull data from the USGS stac catalog.
(Note, it does depend on a few tweaks and fixes I haven't landed in main yet, so let me know if you actually need to run this and I'll land those).
DATASET_SUFFIX=DemoKHQ-2022-06-10-V2
DEMO_DPATH=$HOME/.cache/watch/demo/datasets
REGION_FPATH="$HOME/.cache/watch/demo/annotations/KHQ_R001.geojson"
SITE_GLOBSTR="$HOME/.cache/watch/demo/annotations/KHQ_R001_sites/*.geojson"
START_DATE=$(jq -r '.features[] | select(.properties.type=="region") | .properties.start_date' "$REGION_FPATH")
END_DATE=$(jq -r '.features[] | select(.properties.type=="region") | .properties.end_date' "$REGION_FPATH")
REGION_ID=$(jq -r '.features[] | select(.properties.type=="region") | .properties.region_id' "$REGION_FPATH")
SEARCH_FPATH=$DEMO_DPATH/stac_search.json
RESULT_FPATH=$DEMO_DPATH/all_sensors_kit/${REGION_ID}.input
mkdir -p "$DEMO_DPATH"
# Create the search json wrt the sensors and processing level we want
python -m watch.stac.stac_search_builder \
--start_date="$START_DATE" \
--end_date="$END_DATE" \
--cloud_cover=40 \
--sensors=landsat-c2ard-bt,landsat-c2ard-sr \
--out_fpath "$SEARCH_FPATH"
cat "$SEARCH_FPATH"
# Delete this to prevent duplicates
rm -f "$RESULT_FPATH"
# Create the .input file
python -m watch.cli.stac_search \
-rf "$REGION_FPATH" \
-sj "$SEARCH_FPATH" \
-m area \
--verbose 2 \
-o "${RESULT_FPATH}"
# Construct the TA2-ready dataset
python -m watch.cli.prepare_ta2_dataset \
--dataset_suffix=$DATASET_SUFFIX \
--s3_fpath "${RESULT_FPATH}" \
--collated False \
--dvc_dpath="$DEMO_DPATH" \
--aws_profile=iarpa \
--region_globstr="$REGION_FPATH" \
--site_globstr="$SITE_GLOBSTR" \
--fields_workers=8 \
--convert_workers=8 \
--align_workers=26 \
--cache=0 \
--ignore_duplicates=0 \
--visualize=True \
--requester_pays=True \
--api_key=None \
    --serial=True --run=0
This makes a small dataset of the Kitware HQ building and an associated ~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json kwcoco dataset, which registers the paths to all of the images along with metadata about which sensor they come from and when they were taken.
Given this I wrote a small script to attempt to emulate what you are doing in prepare_ard to construct blocked arrays for a single image:
import ubelt as ub
import kwcoco
import numpy as np
import einops

fpath = ub.Path('~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json').expand()
dset = kwcoco.CocoDataset(fpath)
# available channels
# sr_qa_aerosol,qa_pixel,qa_radsat
# 'coastal|blue|green|red|nir|swir16|swir22|lwir11|lwir12'
for vidid in dset.videos():
    images = dset.images(vidid=vidid)
    for coco_image in images.coco_images:
        print(f'coco_image.channels={coco_image.channels}')
        assert coco_image.img['sensor_coarse'] == 'L8'
        intensity_channels = 'blue|green|red|nir|swir16|swir22|lwir11'
        quality_channels = 'qa_pixel'
        delayed_intensity = coco_image.delay(
            channels=intensity_channels, space='video', nodata_method='ma')
        delayed_qa = coco_image.delay(
            channels=quality_channels, space='video', nodata_method='ma',
            antialias=False)
        # hacks: should be available in main API
        delayed_intensity._set_nested_params(interpolation='cubic')
        delayed_qa._set_nested_params(interpolation='nearest')
        delayed_intensity = delayed_intensity.optimize()
        delayed_qa = delayed_qa.optimize()
        intensity_data = delayed_intensity.finalize()
        qa_data = delayed_qa.finalize()
        im_band = np.concatenate([intensity_data, qa_data], axis=2)
        im_band_data = im_band.data
        # im_band_mask = im_band.mask
        h, w, c = im_band_data.shape
        n_block_x = 20
        n_block_y = 20
        padded_w = int(np.ceil(w / n_block_x) * n_block_x)
        padded_h = int(np.ceil(h / n_block_y) * n_block_y)
        # Pad data out to fit in block sizes
        pad_x = padded_w - w
        pad_y = padded_h - h
        padded_data = np.pad(im_band_data, [(0, pad_y), (0, pad_x), (0, 0)])
        bw = int(padded_w / n_block_x)  # width of a block
        bh = int(padded_h / n_block_y)  # height of a block
        blocks = einops.rearrange(
            padded_data, '(nby bh) (nbx bw) c -> nbx nby bh bw c', bw=bw, bh=bh)
Note that in a prepared kwcoco file, using CocoImage.delay(space='video') will return an object that resamples any bands (which may exist at different scales on disk) on the fly. This resampling also takes care of aligning pixels across multiple images in a sequence according to their geo-metadata.
This means I can load all of the data in a fairly concise command. The next step is to construct the block structure you expect. I'm using einops to express the appropriate operation. It's probably not as efficient as your implementation, but writing it this way will let us generalize and reduce the number of assumptions we make. I think we can improve the efficiency if it turns out to be a bottleneck.
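For reference, the same rearrange can be expressed with plain numpy reshape/transpose, which doubles as a shape sanity check. The array here is synthetic and the dimension names just mirror the snippet above:

```python
import numpy as np

# Synthetic padded image: (padded_h, padded_w, c) with distinct values.
padded_h, padded_w, c = 40, 60, 8
n_block_y, n_block_x = 20, 20
bh = padded_h // n_block_y  # height of a block (2)
bw = padded_w // n_block_x  # width of a block (3)
padded_data = np.arange(padded_h * padded_w * c).reshape(padded_h, padded_w, c)

# (nby*bh, nbx*bw, c) -> (nby, bh, nbx, bw, c) -> (nbx, nby, bh, bw, c)
blocks = (padded_data
          .reshape(n_block_y, bh, n_block_x, bw, c)
          .transpose(2, 0, 1, 3, 4))

assert blocks.shape == (n_block_x, n_block_y, bh, bw, c)
# Block (i, j) is the tile at rows j*bh:(j+1)*bh, cols i*bw:(i+1)*bw
assert np.array_equal(blocks[1, 0], padded_data[0:bh, bw:2 * bw])
```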
It looks like I'll also need to check the QA band to determine if I should write a block or not.
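As a rough sketch of what that check could look like, here is a per-block usability test based on the Landsat Collection 2 QA_PIXEL bit flags. The choice of bits and the 0.5 threshold are my assumptions, not necessarily what pycold expects:

```python
import numpy as np

# QA_PIXEL bits (Landsat Collection 2): bit 0 = fill, bit 1 = dilated cloud,
# bit 3 = cloud, bit 4 = cloud shadow. Which bits should disqualify a pixel
# is an assumption here.
BAD_BITS = (1 << 0) | (1 << 1) | (1 << 3) | (1 << 4)


def block_is_usable(qa_block, min_clear_frac=0.5):
    """qa_block: (bh, bw) uint16 array of QA_PIXEL values."""
    bad = (qa_block & BAD_BITS) != 0
    clear_frac = 1.0 - bad.mean()
    return clear_frac >= min_clear_frac


qa = np.zeros((2, 3), dtype=np.uint16)
assert block_is_usable(qa)        # fully clear block
qa[:] = 1 << 3                    # every pixel flagged as cloud
assert not block_is_usable(qa)
```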
What would be helpful is a precise description of the file format that you would expect.
It looks like there needs to be a starting_last_dates.txt with the times for the images in the sequence, and then the images are stored in "partition subfolders" where each image has shape (NY, NX, BH, BW, C) in npy format?
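If that guess is right, a minimal sketch of writing one image in that layout might look like this. The partition and file names are placeholders; the real convention is exactly what I'm asking you to confirm:

```python
import numpy as np
import tempfile
from pathlib import Path

# Hypothetical layout: one .npy per image holding the (NY, NX, BH, BW, C)
# block array, grouped under partition subfolders.
out_dpath = Path(tempfile.mkdtemp()) / 'partition_0001'
out_dpath.mkdir(parents=True, exist_ok=True)

blocks = np.zeros((20, 20, 2, 3, 8), dtype=np.int16)  # (NY, NX, BH, BW, C)
fpath = out_dpath / 'image_0000.npy'
np.save(fpath, blocks)

# npy roundtrips shape and dtype exactly, and mmap_mode='r' gives cheap
# random access to individual blocks without reading the whole file.
loaded = np.load(fpath, mmap_mode='r')
assert loaded.shape == (20, 20, 2, 3, 8)
assert loaded.dtype == np.int16
```

The mmap point is also relevant to the npy-vs-hdf5 question below: plain npy already supports partial reads via memory mapping.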
I'm wondering whether it would be less efficient to use hdf5 here instead, or if the npy format is just superior?
Also, how are you handling data whose shape isn't evenly divisible by the block size? I'm just padding; there is probably a more efficient way to handle that part.