
Integration with WATCH #7

@Erotemic

@SuYe99 I'm looking at how to best integrate this code with watch, and I'm trying to connect the steps you're taking in prepare_ard.py with the data we have available.

One thing prepare_ard does is make a lot of assumptions about filenames. This is something that kwcoco helps to abstract away. Filename assumptions should be fine for the packed data (although I'd recommend avoiding them, as in my opinion they are usually an anti-pattern), so the goal is just to get data from a kwcoco file into something pycold will accept.
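
For reference, this is the sort of access pattern kwcoco enables. A minimal sketch using kwcoco's built-in demo data (the B1/B8 band names are part of the demo, not our landsat data):

    import kwcoco
    # Demo dataset with multispectral bands; stands in for real imagery
    dset = kwcoco.CocoDataset.demo('vidshapes2-multispectral')
    for gid in dset.images():
        coco_image = dset.coco_image(gid)
        # Bands are addressed by channel name, not by filename conventions
        print(f'coco_image.channels={coco_image.channels}')
        data = coco_image.delay(channels='B1|B8').finalize()
        print(f'data.shape={data.shape}')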

The following snippets are what I've done so far. First here is a bit of demo code that will build a kwcoco dataset with landsat8 data assuming you have the watch repo and appropriate AWS credentials to pull data from the USGS stac catalog.

(Note, it does depend on a few tweaks and fixes I haven't landed in main yet, so let me know if you actually need to run this and I'll land those).

        DATASET_SUFFIX=DemoKHQ-2022-06-10-V2
        DEMO_DPATH=$HOME/.cache/watch/demo/datasets
        REGION_FPATH="$HOME/.cache/watch/demo/annotations/KHQ_R001.geojson"
        SITE_GLOBSTR="$HOME/.cache/watch/demo/annotations/KHQ_R001_sites/*.geojson"
        START_DATE=$(jq -r '.features[] | select(.properties.type=="region") | .properties.start_date' "$REGION_FPATH")
        END_DATE=$(jq -r '.features[] | select(.properties.type=="region") | .properties.end_date' "$REGION_FPATH")
        REGION_ID=$(jq -r '.features[] | select(.properties.type=="region") | .properties.region_id' "$REGION_FPATH")
        SEARCH_FPATH=$DEMO_DPATH/stac_search.json
        RESULT_FPATH=$DEMO_DPATH/all_sensors_kit/${REGION_ID}.input

        mkdir -p "$DEMO_DPATH"

        # Create the search json wrt the sensors and processing level we want
        python -m watch.stac.stac_search_builder \
            --start_date="$START_DATE" \
            --end_date="$END_DATE" \
            --cloud_cover=40 \
            --sensors=landsat-c2ard-bt,landsat-c2ard-sr \
            --out_fpath "$SEARCH_FPATH"
        cat "$SEARCH_FPATH"

        # Delete this to prevent duplicates
        rm -f "$RESULT_FPATH"
        # Create the .input file
        python -m watch.cli.stac_search \
            -rf "$REGION_FPATH" \
            -sj "$SEARCH_FPATH" \
            -m area \
            --verbose 2 \
            -o "${RESULT_FPATH}"

        # Construct the TA2-ready dataset
        python -m watch.cli.prepare_ta2_dataset \
            --dataset_suffix=$DATASET_SUFFIX \
            --s3_fpath "${RESULT_FPATH}" \
            --collated False \
            --dvc_dpath="$DEMO_DPATH" \
            --aws_profile=iarpa \
            --region_globstr="$REGION_FPATH" \
            --site_globstr="$SITE_GLOBSTR" \
            --fields_workers=8 \
            --convert_workers=8 \
            --align_workers=26 \
            --cache=0 \
            --ignore_duplicates=0 \
            --visualize=True \
            --requester_pays=True \
            --api_key=None \
            --serial=True --run=0

This makes a small dataset of the Kitware HQ building along with an associated kwcoco dataset at ~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json, which registers the paths to all of the images along with metadata about which sensor each one came from and when it was taken.
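
To sanity check the result, you can load the file and print summary statistics (there is also a kwcoco stats CLI that does the same from the shell):

    import kwcoco
    import ubelt as ub
    fpath = ub.Path('~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json').expand()
    dset = kwcoco.CocoDataset(fpath)
    # Counts of videos / images / annotations registered in the dataset
    print(dset.basic_stats())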

Given this, I wrote a small script that attempts to emulate what you are doing in prepare_ard to construct blocked arrays for a single image:

    import ubelt as ub
    import kwcoco
    import numpy as np
    fpath = ub.Path('~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json').expand()
    dset = kwcoco.CocoDataset(fpath)

    # available channels
    # sr_qa_aerosol,qa_pixel,qa_radsat
    # 'coastal|blue|green|red|nir|swir16|swir22|lwir11|lwir12'

    for vidid in dset.videos():
        images = dset.images(vidid=vidid)

        for coco_image in images.coco_images:
            print(f'coco_image.channels={coco_image.channels}')
            assert coco_image.img['sensor_coarse'] == 'L8'
            intensity_channels = 'blue|green|red|nir|swir16|swir22|lwir11'
            quality_channels = 'qa_pixel'
            delayed_intensity = coco_image.delay(channels=intensity_channels, space='video', nodata_method='ma')
            delayed_qa = coco_image.delay(
                channels=quality_channels, space='video', nodata_method='ma', antialias=False)

            # hacks: should be available in main API
            delayed_intensity._set_nested_params(interpolation='cubic')
            delayed_qa._set_nested_params(interpolation='nearest')

            delayed_intensity = delayed_intensity.optimize()
            delayed_qa = delayed_qa.optimize()

            intensity_data = delayed_intensity.finalize()
            qa_data = delayed_qa.finalize()
            # Use the masked variant so the nodata mask survives concatenation
            im_band = np.ma.concatenate([intensity_data, qa_data], axis=2)
            im_band_data = im_band.data
            # im_band_mask = im_band.mask

            h, w, c = im_band_data.shape
            n_block_x = 20
            n_block_y = 20

            padded_w = int(np.ceil(w / n_block_x) * n_block_x)
            padded_h = int(np.ceil(h / n_block_y) * n_block_y)

            # Pad data out to fit in block sizes
            pad_x = padded_w - w
            pad_y = padded_h - h
            padded_data = np.pad(im_band_data, [(0, pad_y), (0, pad_x), (0, 0)])

            bw = int(padded_w / n_block_x)  # width of a block
            bh = int(padded_h / n_block_y)  # height of a block

            import einops
            blocks = einops.rearrange(padded_data, '(nby bh) (nbx bw) c -> nbx nby bh bw c', bw=bw, bh=bh)

Note that in a prepared kwcoco file, using CocoImage.delay(space='video') will return an object that resamples any bands (which may exist at different scales on disk) on the fly. This resampling also takes care of aligning pixels across multiple images in a sequence according to their geo-metadata.

This means I can load all of the data fairly concisely. The next step is to construct the block structure you expect. I'm using einops to express the appropriate operation. It's probably not as efficient as your implementation, but writing it this way will let us generalize and reduce the number of assumptions we make; I think we can improve the efficiency if it turns out to be a bottleneck.
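
As a standalone illustration of what that rearrange does (on random data, independent of the kwcoco loading above), note that the operation is lossless and invertible:

    import numpy as np
    import einops
    # Toy padded image: 40 x 60 pixels, 3 channels, cut into 20 x 20 blocks
    padded_data = np.random.rand(40, 60, 3)
    bh, bw = 20, 20
    blocks = einops.rearrange(
        padded_data, '(nby bh) (nbx bw) c -> nbx nby bh bw c', bh=bh, bw=bw)
    assert blocks.shape == (3, 2, bh, bw, 3)  # (nbx, nby, bh, bw, c)
    # The inverse rearrange recovers the original image exactly
    recon = einops.rearrange(blocks, 'nbx nby bh bw c -> (nby bh) (nbx bw) c')
    assert np.array_equal(recon, padded_data)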

It looks like I'll also need to check the QA band to determine if I should write a block or not.
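
For instance, the per-block check might look something like the following. I'm assuming the Landsat Collection 2 QA_PIXEL bit layout (bit 0 = fill, bit 1 = dilated cloud, bit 3 = cloud, bit 4 = cloud shadow), and the clear-fraction threshold is a made-up placeholder; the exact rule pycold expects is part of what I'd like to confirm:

    import numpy as np

    def block_is_usable(qa_block, min_clear_frac=0.25):
        """
        Decide if a block has enough clear pixels to be worth writing.
        Assumes Landsat Collection 2 QA_PIXEL values; the threshold is
        a placeholder, not pycold's actual rule.
        """
        qa = np.asarray(qa_block).astype(np.uint16)
        bad = (
            ((qa >> 0) & 1) |  # bit 0: fill
            ((qa >> 1) & 1) |  # bit 1: dilated cloud
            ((qa >> 3) & 1) |  # bit 3: cloud
            ((qa >> 4) & 1)    # bit 4: cloud shadow
        )
        clear_frac = (bad == 0).mean()
        return clear_frac >= min_clear_frac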

What would be helpful is a precise description of the file format you expect.
It looks like there needs to be a starting_last_dates.txt with the times for the images in the sequence, and then the images are stored in "partition subfolders", where each image has shape (NY, NX, BH, BW, C) in npy format?
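
To make sure I'm reading that right, here is roughly what I'd expect the writer side to look like; the folder and file naming here are my guesses, not the real spec:

    import numpy as np
    import ubelt as ub

    def write_blocked_image(out_dpath, partition, image_id, blocks):
        """
        Write one image's full block array as a single npy file inside
        a partition subfolder. The (NY, NX, BH, BW, C) layout matches
        my reading above; the naming here is hypothetical.
        """
        part_dpath = (ub.Path(out_dpath) / f'partition_{partition:02d}').ensuredir()
        # The blocks from the rearrange above are (nbx, nby, bh, bw, c);
        # transpose to (NY, NX, BH, BW, C) if that is what is expected
        np.save(part_dpath / f'{image_id}.npy', blocks.transpose(1, 0, 2, 3, 4))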

I'm also wondering whether it would be inefficient to use HDF5 here instead, or whether the npy format is simply superior?
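
For comparison, a chunked HDF5 layout via h5py would look something like this sketch, with one chunk per spatial block so individual blocks can be read without touching the rest of the file:

    import h5py
    import numpy as np

    # Toy stand-in for one image's blocked array
    blocks = np.random.rand(20, 20, 16, 16, 8).astype(np.float32)

    with h5py.File('blocked_image.h5', 'w') as h5:
        h5.create_dataset(
            'blocks', data=blocks,
            chunks=(1, 1) + blocks.shape[2:],  # one chunk per block
            compression='lzf')

    with h5py.File('blocked_image.h5', 'r') as h5:
        one_block = h5['blocks'][3, 7]  # reads only that chunk from disk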

Also, how are you handling data whose shape the block size doesn't neatly divide? I'm just using some padding; there is probably a more efficient way to handle that part.
