diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 0000000..742d048 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,7 @@ +[submodule "SEN12MS"] + path = SEN12MS + url = https://github.com/Berkeley-Data/SEN12MS.git + branch = taeil +[submodule "OpenSelfSup"] + path = OpenSelfSup + url = https://github.com/Berkeley-Data/OpenSelfSup.git diff --git a/.idea/.gitignore b/.idea/.gitignore new file mode 100644 index 0000000..73f69e0 --- /dev/null +++ b/.idea/.gitignore @@ -0,0 +1,8 @@ +# Default ignored files +/shelf/ +/workspace.xml +# Datasource local storage ignored files +/dataSources/ +/dataSources.local.xml +# Editor-based HTTP Client requests +/httpRequests/ diff --git a/.idea/deployment.xml b/.idea/deployment.xml new file mode 100644 index 0000000..fde1520 --- /dev/null +++ b/.idea/deployment.xml @@ -0,0 +1,22 @@ + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/.idea/hpt.iml b/.idea/hpt.iml new file mode 100644 index 0000000..3e2e3fe --- /dev/null +++ b/.idea/hpt.iml @@ -0,0 +1,19 @@ + + + + + + + + + + + + + \ No newline at end of file diff --git a/.idea/inspectionProfiles/profiles_settings.xml b/.idea/inspectionProfiles/profiles_settings.xml new file mode 100644 index 0000000..105ce2d --- /dev/null +++ b/.idea/inspectionProfiles/profiles_settings.xml @@ -0,0 +1,6 @@ + + + + \ No newline at end of file diff --git a/.idea/misc.xml b/.idea/misc.xml new file mode 100644 index 0000000..8598883 --- /dev/null +++ b/.idea/misc.xml @@ -0,0 +1,7 @@ + + + + + + + \ No newline at end of file diff --git a/.idea/modules.xml b/.idea/modules.xml new file mode 100644 index 0000000..b0929b6 --- /dev/null +++ b/.idea/modules.xml @@ -0,0 +1,8 @@ + + + + + + + + \ No newline at end of file diff --git a/.idea/other.xml b/.idea/other.xml new file mode 100644 index 0000000..a708ec7 --- /dev/null +++ b/.idea/other.xml @@ -0,0 +1,6 @@ + + + + + \ No newline at end of file diff --git a/.idea/runConfigurations/single_train.xml b/.idea/runConfigurations/single_train.xml new file mode 100644 index 0000000..c310ab9 --- /dev/null +++ b/.idea/runConfigurations/single_train.xml @@ -0,0 +1,30 @@ + + + + + \ No newline at end of file diff --git a/.idea/vcs.xml b/.idea/vcs.xml new file mode 100644 index 0000000..b120a90 --- /dev/null +++ b/.idea/vcs.xml @@ -0,0 +1,8 @@ + + + + + + + + \ No newline at end of file diff --git a/OpenSelfSup b/OpenSelfSup new file mode 160000 index 0000000..6573f31 --- /dev/null +++ b/OpenSelfSup @@ -0,0 +1 @@ +Subproject commit 6573f31a4f12a3a5a4c87a40624c9ed339049028 diff --git a/README.md b/README.md index 492d5dd..dce761f 100644 --- a/README.md +++ b/README.md @@ -1,303 +1,225 @@ -# Hierarchical Pretraining: Research Repository +## Abstract -This is a research repository for the submission "Self-Supervised Pretraining Improves Self-Supervised Pretraining" +We present a sensor-based location invariance momentum contrast for unsupervised visual representation learning in remote sensing application, where unlabeled data is well-known challenges to deep learning domain and accurate training data remains comparably scarce. In this study, we first introduce the use of SEN12MS datasets, a curated large-scale training data that include versatile remote sensing information from different sensors with global scene distributions. To continually bridge the gap between supervised and unsupervised learning on computer vision tasks in remote sensing applications, we exploit the geo-alignment data structure from SEN12MS to propose two training methods. 
The first constructs sensor-based, geo-aligned positive pairs for contrastive learning, serving as a natural form of augmentation. The second fuses data from different sensors with the objective of learning better representations. [last sentence subject to changes] Our experiments show that the proposed method outperforms its supervised counterpart when transferring to downstream scene classification tasks on remote sensing data.
-For initial setup, refer to [setup instructions](references/setup.md).
-## Setup Weight & Biases Tracking
+## Introduction
+The performance of deep convolutional neural networks depends on their capacity and on the amount of training data. Datasets are growing in every domain, and architectures such as [VGG](https://arxiv.org/pdf/1409.1556.pdf), [GoogLeNet](https://arxiv.org/pdf/1409.4842.pdf), [ResNet](https://arxiv.org/pdf/1512.03385.pdf), and [DenseNet](https://arxiv.org/pdf/1608.06993.pdf) have increased the capacity of network models.
-```bash
-export WANDB_API_KEY=
-export WANDB_ENTITY=cal-capstone
-export WANDB_PROJECT=hpt
-#export WANDB_MODE=dryrun
-```
+However, collecting and annotating large-scale datasets is time-consuming and expensive. To avoid this cost, many self-supervised methods have been proposed that learn visual features from large-scale unlabeled data without any human annotations. Contrastive learning of visual representations has emerged as the front-runner for self-supervision and has demonstrated superior performance on downstream tasks. All contrastive learning frameworks involve maximizing agreement between positive image pairs relative to negative/different images via a contrastive loss function; this pretraining paradigm forces the model to learn good representations. These approaches typically differ in how they generate positive and negative image pairs from unlabeled data and how the data are sampled during pretraining.
-## Base Training
+Self-supervised approaches such as Momentum Contrast (MoCo) ([He et al., 2019](https://arxiv.org/pdf/1911.05722.pdf); [Chen et al., 2020](https://arxiv.org/pdf/2003.04297.pdf)) can leverage unlabeled data to produce pre-trained models for subsequent fine-tuning on labeled data. In addition to MoCo, these include frameworks such as SimCLR ([Chen et al., 2020](https://arxiv.org/pdf/2002.05709.pdf)) and PIRL ([Misra and Maaten, 2020](https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf)).
-[OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup)
+Remote sensing data has become broadly available at the petabyte scale, offering unprecedented visibility into natural and human activity across the Earth. In remote sensing, labeled data is usually scarce and hard to obtain. Given the success of self-supervised learning methods, we explore their application to large-scale remote sensing datasets.
-Right now we assume ImageNet base trained models.
-```bash
-cd OpenSelfSup/data/basetrain_chkpts/
-./download-pretrained-models.sh
-```
+While most self-supervised image analysis techniques focus on natural imagery, remote sensing differs in several critical ways. Natural imagery often has a single subject, whereas remote sensing images contain numerous objects such as buildings, trees, roads, and rivers. Additionally, the important content can change unpredictably within just a few pixels, or between images of the same location taken at different times.
Multiple satellites capture images of the same locations on Earth with a wide variety of resolutions, spectral bands (channels), and revisit rates, such that any specific problem can require a different combination of sensor inputs ([Reiche et al., 2018](https://doi.org/10.1016/j.rse.2017.10.034), [Rustowicz et al., 2019](https://openaccess.thecvf.com/content_CVPRW_2019/papers/cv4gc/Rustowicz_Semantic_Segmentation_of_Crop_Type_in_Africa_A_Novel_Dataset_CVPRW_2019_paper.pdf)).
-## Pretraining With a New Dataset
+While MoCo and other contrastive learning methods have demonstrated promising results on natural image classification tasks, their application to remote sensing has been limited.
-[hpt](https://github.com/Berkeley-Data/hpt)
+Traditional contrastive learning uses augmentation to generate positive pairs. Inspired by recent success (Geo-aware Paper) in using natural augmentation to create positive pairs, we propose to use positive pairs taken from different sensors at the same location.
-We have a handy set of config generators to make pretraining with a new dataset easy and consistent!
+In this work, we demonstrate that pre-training [MoCo-v2](https://openaccess.thecvf.com/content_CVPR_2020/papers/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.pdf) on data from multiple sensors leads to improved representations for remote sensing applications.
+![](web/images/architectures_1_and_2.png)
-**FIRST**, you will need the image pixel mean/std of your dataset, if you don't have it, you can do:
-```bash
-cd src/data/
+## Related Work
+#### Self-supervised contrastive learning
+Many self-supervised learning methods for visual feature learning have been developed without using any human-annotated labels. Whereas supervised learning requires data pairs (Xi, Yi) in which the label Yi is annotated by human labor, self-supervised learning trains on data Xi together with a pseudo label Pi that is generated automatically for a pre-defined pretext task, without any human annotation. The pseudo label Pi can be derived from attributes of the images or videos themselves, such as image context, or produced by traditional hand-designed methods. As long as the pseudo labels are generated automatically without human annotation, the method is considered self-supervised. Recently, self-supervised learning methods have achieved great progress.
-# for sen12ms, run multiples times replacing --use_s1 by --use_s2 or --use_RGB
-./compute-dataset-pixel-mean-std-sen12ms.py --data_dir /storage/sen12ms_x --data_index_dir /scratch/crguest/hpt/data --use_s1 --numworkers 1
+Self-supervised contrastive learning approaches such as [MoCo](https://arxiv.org/pdf/1911.05722.pdf), [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf), [SimCLR](https://arxiv.org/pdf/2002.05709.pdf), and [PIRL](https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf) have demonstrated superior performance and have emerged as the front-runners on various downstream tasks. The intuition behind these methods is to learn representations by pulling positive image pairs from the same instance closer together in latent space while pushing negative pairs from different instances further apart. The methods differ in the type of contrastive loss, the generation of positive and negative pairs, and the sampling strategy.
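To make the contrastive objective described above concrete, the following is a minimal InfoNCE-style sketch in PyTorch. It is illustrative only: the function name, the use of a fixed negative bank, and the temperature value are assumptions, not the exact MoCo implementation used in this repository.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, temperature=0.2):
    """Generic InfoNCE: pull the query toward its positive key, push it away from negatives.

    q:     (B, D) query embeddings
    k_pos: (B, D) positive key embeddings (another view of the same instance)
    k_neg: (N, D) negative key embeddings (e.g., a MoCo-style queue)
    """
    q, k_pos, k_neg = (F.normalize(t, dim=1) for t in (q, k_pos, k_neg))

    # Dot-product similarity to the positive (B, 1) and to every negative (B, N).
    l_pos = torch.einsum("bd,bd->b", q, k_pos).unsqueeze(1)
    l_neg = torch.einsum("bd,nd->bn", q, k_neg)

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```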
-# for others -./compute-dataset-pixel-mean-std.py --data /scratch/crguest/data/sen12ms_small --numworkers 20 --batchsize 256 + Contrastive learning of visual representations using MoCo ([**MoCo-v2**](https://arxiv.org/pdf/2003.04297.pdf) - Chen, et + al., Facebook AI Research, 2020) has emerged as the front-runner for self-supervision and has demonstrated superior performance on downstream tasks. -where image-folder has the structure from ImageFolder in pytorch -class/image-name.jp[e]g -or whatever image extension you're using -``` -if your dataset is not arranged in this way, you can either: -(i) use symlinks to put it in this structure -(ii) update the above script to read in your data +#### Performance gap in Satellite imagery +There is a performance gap between supervised learning using labels and self-supervised contrastive learning method, [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf), on remote + sensing datasets. For instance, on the Functional Map of the World ([fMoW](https://arxiv.org/abs/1711.07846)) image classification + benchmark, there is an 8% gap in top 1 accuracy between supervised and self-supervised methods. By leveraging spatially aligned + images over time to construct temporal positive pairs in contrastive learning and geo-location in the design of pre-text tasks, **[Geography-Aware + Self-supervised Learning](https://arxiv.org/pdf/2011.09980.pdf)** (Ayush, et al., Stanford University, 2020) were able to + close the gap between self-supervised and supervised learning on image classification, object detection and semantic + segmentation on remote sensing and other geo-tagged image datasets. -NOTE: For sen12ms, the code is not working as expected (refer to [this issue](https://github.com/Berkeley-Data/hpt/issues/24), until then use the following. -``` -bands_mean = {'s1_mean': [-11.76858, -18.294598], - 's2_mean': [1226.4215, 1137.3799, 1139.6792, 1350.9973, 1932.9058, - 2211.1584, 2154.9846, 2409.1128, 2001.8622, 1356.0801]} +In this work, we provide an effective approach for improving representation learning using data from different satellite imagery using [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf). -bands_std = {'s1_std': [4.525339, 4.3586307], - 's2_std': [741.6254, 740.883, 960.1045, 946.76056, 985.52747, - 1082.4341, 1057.7628, 1136.1942, 1132.7898, 991.48016]} -``` +## Problem Definition +Does contrastive pre-training with data from multiple sensors lead to improved representations for remote sensing applications? -**NEXT**, copy the pretraining template -```bash -cd src/utils -cp templates/pretraining-config-template.sh pretrain-configs/sen12ms-small.sh -# edit pretrain-configs/sen12ms-small.sh +Pre-train the contrastive model using unlabeled data from multiple satellites and use that model for downstream remote sensing tasks. -# once edited, generate the project -./gen-pretrain-project.sh pretrain-configs/my-dataset-config.sh -``` +We want to show that our approach to using images from different satellites for the same location as naturally augmented images as input to the MoCo-v2 method provides high-quality representations and transferable initializations for satellite imagery interpretation. Despite many differences in the data and task properties between natural image classification and satellite imagery interpretation, we want to show the benefit of MoCo-v2 pretraining across multiple patches from different satellites for satellite imagery and investigate representation transfer to a target dataset. -What just happened? 
We generated a bunch of pretraining configs in the following location (take a look at all of these files to get a feel for how this works): -``` -OpenSelfSup/configs/hpt-pretrain/${shortname} -``` +### Datasets +- [todo] keep only sen12ms. +- +To validate our ideas, we did experiments on datasets with different satellite imageries with variations in dataset size, channels, and image ground resolutions. The statistics of these datasets are given below. Readers are requested to see the the supplementary materials for examples and additional details of these datasets. +| Dataset | Satellites | Number of Images | Image Size | Labels | Notes | +|---|---|---|---|---|---| +| [BigEarthNet](https://arxiv.org/pdf/1902.06148.pdf) | Sentinel-2A/B |590,326 patches; 12 Bands | 20x20 to 120x120 | Multiple, 43 Full and 19 Simplified | No overlapping; 10 European Countries | +| [SEN12MS](https://arxiv.org/pdf/1906.07789.pdf) | Sentinel-1A/B; Sentinel-2A/B; MODIS (Terra and Aqua) | 541,986 patches; 180662 triplets (3\*180662); 4, 2 and 13 Bands | 256X256 | Single, 17 Full and 10 Simplified | Partial overlapping | +| [FMoW](https://arxiv.org/abs/1711.07846) | QuickBird-2; GeoEye-1; WorldView-2; WorldView-3 | 1,047,691 patches; 4, 8 and RGB Bands | Variable Over 2500x2500 | Multiple, up to 63; Bounding Box Annotations | Includes False Detection; Variable timestamp overlapping | -**NEXT**, you're ready to kick off a trial run to make sure the pretraining is working as expected =) +##### SEN12MS +The SEN12MS dataset contains 180,662 patch triplets of corresponding Sentinel-1 dual-pol SAR data, Sentinel-2 multi-spectral images, and MODIS-derived land cover maps. The patches are distributed across the land masses of the Earth and spread over all four meteorological seasons. This is reflected by the dataset structure. The captured scenes were tiled into patches of 256 X 256 pixels in size and implemented a stride of 128 pixels, resulting in an overlap between adjacent patches. +Only 3847 patches do not have any overlap with adjacent patches. +Most of the overlap occurs around 25% and 50% of the area with few patches overlapping less than 15% and more than 75%. -```bash -# the `-t` flag means `trial`: it'll only run a 50 iter pretraining - ./utils/pretrain-runner.sh -t -d OpenSelfSup/configs/hpt-pretrain/${shortname} -``` + All patches are provided in the form of 16-bit GeoTiffs containing the following specific information: +* Sentinel-1 SAR: 2 channels corresponding to sigma nought backscatter values in dB scale for VV and VH polarization. +* Sentinel-2 Multi-Spectral: 13 channels corresponding to the 13 spectral bands (B1, B2, B3, B4, B5, B6, B7, B8, B8a, B9, B10, B11, B12). +* MODIS Land Cover: 4 channels corresponding to IGBP, LCCS Land Cover, LCCS Land Use, and LCCS Surface Hydrology layers. -**NEXT**, if this works, kick off the full training. NOTE: you can kick this off multiple times as long as the config directories share the same filesystem -```bash -# simply removing the `-t` flag from above - ./utils/pretrain-runner.sh -d OpenSelfSup/configs/hpt-pretrain/${shortname} -``` +(TODO for Ernesto) should discuss dataset split for training and test (holdout set) +- 32K set +- potential issue with full set +- test set +- 1K set -**NEXT**, if you want to perform BYOL pretraining, add `-b` flag. -```bash -# simply add the `-b` flag to above. 
- ./utils/pretrain-runner.sh -d OpenSelfSup/configs/hpt-pretrain/${shortname} -b -``` +## Method +In this section, we briefly review Contrastive Learning Framework for unsupervised learning and detail our proposed approach to improve Moco-v2, a recent contrastive learning framework, on satellite imagery from multiple sensors data. +**Multiple-Sensor** +Update on different bands, different satellites etc. with images. -Congratulations: you've launch a full hierarchical pretraining experiment. +#### 1. Contrastive Learning Framework +Contrastive methods attempt to learn a mapping fq from raw pixels to semantically meaningful representations z in an unsupervised way. The training objective encourages representations corresponding to pairs of images that are known a priori to be semantically similar (positive pairs) to be closer to each other than typical unrelated pairs (negative pairs). With similarity measured by dot product, recent approaches in contrastive learning differ in the type of contrastive loss and generation of positive and negative pairs. In this work, we focus on the state-of-the-art contrastive learning framework [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf), an improved version of [MoCo](https://arxiv.org/pdf/1911.05722.pdf), and study improved methods for the construction of positive and negative pairs tailored to remote sensing applications. -**FAQs/PROBLEMS?** -* How does `pretrain-runner.sh` keep track of what's been pretrained? - * In each config directory, it creates a `.pretrain-status` folder to keep track of what's processing/finished. See them with e.g. `find OpenSelfSup/configs/hpt-pretrain -name '.pretrain-status'` -* How to redo a pretraining, e.g. because it crashed or something changed? Remove the - * Remove the associate `.proc` or `.done` file. Find these e.g. - ```bash - find OpenSelfSup/configs/hpt-pretrain -name '.proc' - find OpenSelfSup/configs/hpt-pretrain -name '.done' - ``` +#### 2. Sensors-based Geo-alignment Positive Pairs +Given the SEN12MS that provides the largest remote sensing dataset available with its global scene distribution and the wealth of versatile remote sensing information, It is natural to leverage the geo-alignment imagery from different remote sensing sensors while constructing positive or negative pairs . For example, Sentinel 1 consists of two images (vertical and horizontal polarization) and Sentinel 2 consist of thirteen images (different wavelength bands) of the same patch. Any combination from the same patch would correspond to a positive pair without the need of additional augmentation, while negative pairs would correspond to any image from different patches without restriction of the same or different satellites. -## Evaluating Pretrained Representations -This has been simplified to simply: -```bash -./utils/pretrain-evaluator.sh -b OpenSelfSup/work_dirs/hpt-pretrain/${shortname}/ -d OpenSelfSup/configs/hpt-pretrain/${shortname} -``` -where `-b` is the backbone directory and `-d` is the config directory. This command also works for cross-dataset evaluation (e.g. evaluate models trained on Resic45 and evaluate on UC Merced dataset). 
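To illustrate how such geo-aligned sensor pairs could be fed to the query and key encoders, the sketch below returns a Sentinel-1 patch and the Sentinel-2 patch of the same location as a positive pair. The patch index and the `load_patch` helper are hypothetical placeholders for however the SEN12MS files are actually read.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class GeoAlignedPairDataset(Dataset):
    """Yields (query, key) = (S1 patch, S2 patch) for the same geo-location."""

    def __init__(self, patch_ids, load_patch, transform=None):
        self.patch_ids = patch_ids    # one id per geo-aligned patch triplet
        self.load_patch = load_patch  # load_patch(patch_id, sensor) -> np.ndarray of shape (C, H, W)
        self.transform = transform    # optional extra augmentation applied to each view

    def __len__(self):
        return len(self.patch_ids)

    def __getitem__(self, idx):
        pid = self.patch_ids[idx]
        s1 = self.load_patch(pid, "s1")  # 2 bands (VV, VH backscatter)
        s2 = self.load_patch(pid, "s2")  # multi-spectral bands
        q = torch.from_numpy(s1.astype(np.float32))
        k = torch.from_numpy(s2.astype(np.float32))
        if self.transform is not None:
            q, k = self.transform(q), self.transform(k)
        return q, k  # geo-aligned positive pair for the MoCo-v2 query/key encoders
```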
+In short, given an image x_i(s1) collected from sentinel 1, we can randomly select another image x_i(s2) collected from sentinel 2 that is geographically aligned with the x_i(s1), and then have them passthrough MoCo-v2 to the geographically aligned image pair x_i(s1) and x_i(s2), which provides us with a sensor-based geo-alignment positive pair ( v and v’ in Figure xxx) that can be used for training the contrastive learning framework by the query and key encoders in MoCo-v2. -**FAQ** +For a sample of x_i(s1), our GeoSensorInforNCE objective loss can be demonstrated as follows: -Where are the checkpoints and logs? E.g., if you pass in `configs/hpt-pretrain/resisc` as the config directory, then the working directories for this evalution is e.g. `work_dirs/hpt-pretrain/resisc/linear-eval/...`. If w&b is enabled, it will be logged on weight & biases +***INSERT FIGURE HERE (GeoSensorInfoNCE)*** -## Finetuning -Assuming you generated the pretraining project as specified above, finetuning is as simple as: +where z_i(s1) and z_i(s2) are the encoded representation of the randomly geo-aligned positive pair x_i(s1) and x_i(s2). N denotes the number of negative samples, {k_j}j=1_n are the encoded negative pairs, and \lambda is the temperature hyperparameter. -```bash -./utils/finetune-runner.sh -d ./OpenSelfSup/configs/hpt-pretrain/${shortname}/finetune/ -b ./OpenSelfSup/work_dirs/hpt-pretrain/${shortname}/ -``` -where `-b` is the backbone directory and `-d` is the config directory -Note: to finetune using other backbones, simply pass in a different backbone directory (the script searches for `final_backbone.pth` files in the provided directory tree) +What we used are the actual images from the same location but different sensors. With the inspiration of the success of geography-aware self-supvervised learing (insert ref -- **geo xxxxx paper** )that constructs temporal pairs from real images, we also rely on the assumptions that the actual images for positive pairs encourages the entire network to learn better representations for real sensors data than the one focusing on augmentation strategies and synthetic images. +xxxxx +#### 3. Geo-alignment Data Fusion -## Finetuning only on pretrained checkpoints with BEST linear analysis - -First, specify the pretraining epochs which gives the best linear evaluation result in `./utils/top-linear-analysis-ckpts.txt`. Here is an example: - -``` -# dataset best-moco-bt best-sup-bt best-no-bt -chest_xray_kids 5000 10000 100000 -resisc 5000 50000 100000 -chexpert 50000 50000 400000 -``` -, in which for `chest_xray_kids` dataset, `5000`-iters, `10000`-iters, `100000`-iters are the best pretrained models under `moco base-training`, `imagenet-supervised base-training`, and `no base-training`, respectively. - -Second, run the following command to perform finetuning only on the best checkpoints (same as above, except that the change of script name): -```bash -./utils/finetune-runner-top-only.sh -d ./OpenSelfSup/configs/hpt-pretrain/${shortname}/finetune/ -b ./OpenSelfSup/work_dirs/hpt-pretrain/${shortname} -``` - - - -## Pretraining on top of pretraining -Using the output of previously pretrained models, it is very easy to correctly setup pretraining on top of the pretraining. 
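Below is a minimal sketch of the 1x1 convolution input block discussed in this subsection: it performs cross-channel mixing so that inputs with 2, 12, or 13 bands are mapped to the 3 channels a standard ResNet-50 encoder expects. The module name, the BatchNorm layer, and the 128-d output head are assumptions for illustration.

```python
import torch.nn as nn
import torchvision

class Conv1x1Input(nn.Module):
    """1x1 convolution block that aligns channel depth across sensors (Network-in-Network style)."""

    def __init__(self, in_bands, out_channels=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_bands, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),  # non-linearity introduced before the encoder
        )

    def forward(self, x):
        return self.block(x)

# Example: a Sentinel-2 branch (13 bands) feeding a ResNet-50 backbone with a 128-d projection output.
encoder_s2 = nn.Sequential(
    Conv1x1Input(in_bands=13),
    torchvision.models.resnet50(num_classes=128),
)
```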
-Simply create a new config -``` -utils/pretrain-configs/dataname1-dataname2.sh -``` -(see `resisc-ucmerced.sh` for an example) +Instead of the first approach, we data fusioned sentinel 1 (2 bands) and sentinel 2 images (10 bands) together with the same locations and apply a set of combinations of images including sentinel 1 and sentinel 2 together, sentinel 2 only, and sentinel 1 only to construct one fusioned image. In a sense that we build a straightforward constraving learning directly under MoCo v2. -and then set the basetrained models to be the `final_backbone.pth` from the output of the last pretrained. e.g. for using resisc-45 outputs: +#### 4. 1x1 Convolution filters -``` -export basetrain_weights=( - "work_dirs/hpt-pretrain/resisc/moco_v2_800ep_basetrain/50000-iters/final_backbone.pth" +From the above perspective of the training methods in contrastive learning, including naturally augmented positive and negative pairs and the data fusion approach, we noticed that the volume (bands) of the inputs from different sensors are different. In order to match the typical dimensions of the image channels, our study also applies the Network in Network concept (Min Lin et al)(insert ref --**NIN**) to the sourced images in the sensor-based geo-aligment pairs scheme, as well as the data fusioned image in our second set of experiments. As such, we introduced an extra layer of one by one convolution filter block to perform cross channel sampling, thereby matching and aligning the depth of the channels from different sensor images while introducing non-linearity before the MoCo v2 encoding. With the implementation, we leverage this trick to carry out a pretty non-trivial computation on the input volume whereas we hope to increase the generalization capability in the network. - "work_dirs/hpt-pretrain/resisc/imagenet_r50_supervised_basetrain/50000-iters/final_backbone.pth" - "work_dirs/hpt-pretrain/resisc/no_basetrain/200000-iters/final_backbone.pth" -) -``` -(see `resisc-ucmerced.sh` for an example) +## Experiments +#### Pre-training on SEN12MS -To select which backbones to use, evaluate the linear performance from the various source outputs (e.g. all the resisc pretrained outputs) on the target data (e.g. on uc-merced data). +Pre-training is performed twice for comparison proposes. First, examples from all patches are included (180,662). Second, pre-train includes a sample of the dataset which patches do not overlap with their adjacent patches. This sample of the dataset is selected on firs come first serve basis and any adjacent overlapping patch is ignored. The selection consist of 35,792 patches. -Then simply generate the project and execute the pretraining as normal: +The model is pre-trained on different scenarios to compare the performance of the model. ***First, the model is trained by using the original approach of MoCo V2. The input image is augmented by gaussian blur, elastic transformation, vertical and horizontal flip***. Second, the model with the approach proposed in this work that is using images from different satellites as positive pairs. ***Third, in order to generalize the model, augmentation is applied to both satellites during training***. The pre-train is also done with both the complete dataset and the non-overlapping sample described in the previous section. -``` -./gen-pretrain-project.sh pretrain-configs/dataname1-dataname2.sh +The encoders have ***ResNet50*** architecture (50 layers deep, 2048 nodes) with 128 output nodes. 
+These encoders are designed for a RGB input (3 bands) and Sen12MS data set is 2, 4 and 13 bands for S1, LC and S2 respectively. +To overcome this structure constrain, a convolutional layer is included before the encoders to map the input with different bands to 3. +***The weights of this layer are not updated during training***. +The momentum constant (***m***) is ***0.9*** and the learning rate is ***0.03***. The temperature factor for the loss function is ***0.2***. The batch size is ***64***. -./pretrain-runner.sh -d OpenSelfSup/configs/hpt-pretrain/$dataname1-dataname2 -``` +#### Transfer Learning Experiments -## Object Detection / Semantic Segmentation -Object detection/segmentation uses detectron2 and takes place in the directory -``` -OpenSelfSup/benchmarks/detection -``` +We compared supervised learning with HPT model -**First:** Check if the dataset configs you need are already present in `configs`. E.g. if you're working with CoCo, you'll see the following 2 configs: -``` -configs/coco_R_50_C4_2x.yaml -configs/coco_R_50_C4_2x_moco.yaml -``` -We'll use the config with the `_moco` suffix for all obj det and segmentation. If your configs already exist, skip the next step. +1. SEN12MS - Scene Classification -**Next:** assuming your configs do not exist, set up the configs you need for your dataset by copying an existing set of configs -``` -cp configs/coco_R_50_C4_2x.yaml ${MYDATA}_R50_C4_2x.yaml -cp configs/coco_R_50_C4_2x_moco.yaml ${MYDATA}_R50_C4_2x_moco.yaml -``` -Edit `${MYDATA}_R50_C4_2x.yaml` and set `MIN_SIZE_TRAIN` and `MIN_SIZE_TEST` to be appropriate for your dataset. Also, rename `TRAIN` and `TEST` to have your dataset name, set `MASK_ON` to `True` if doing semantic segmentation, and update `STEPS` and `MAX_ITER` if running the training for a different amount of time is appropriate (check relevant publications / codebases to set the training schedule). +- Supervised Learning Benchmark vs HPT model -Edit `${MYDATA}_R50_C4_2x_moco.yaml` and set `PIXEL_MEAN` and `PIXEL_STD` (use `compute-dataset-pixel-mean-std.py` script above, if you don't know them). +**Implementation Details ** -Then, edit `train_net.py` and add the appropriate data registry lines for your train/val data -``` -register_coco_instances("dataname_train", {}, "obj-labels-in-coco-format_train.json", "datasets/dataname/dataname_train") -register_coco_instances("dataname_val", {}, "obj-labels-in-coco-format_val.json", "datasets/dataname/dataname_val") -``` +- downloaded pretrained models from t +- the original IGBP land cover scheme has 17 classes. +- the simplified version of IGBP classes has 10 classes, which derived and consolidated from the original 17 classes. -Then, setup symlinks to your data under `datasets/dataname/dataname_train` and `datasets/dataname/dataname_val`, where you replace dataname with your dataname used in the config/registry. +**Qualitative Analyis** -**Next**, convert your backbone(s) to detectron format, e.g. (NOTE: I recommend keeping backbones in the same directory that they are originally present in, and appending a `-detectron2` suffix) -``` -python convert-pretrain-to-detectron2.py ../../data/basetrain_chkpts/imagenet_r50_supervised.pth ../../data/basetrain_chkpts/imagenet_r50_supervised-detectron2.pth -``` +- Supervised training (full dataset) + - baseline: downloaded the pre-trained the models and evaluate without finetuning. 
+- Supervised training (1k dataset) + - Supervised: original ResNet50 used by Sen12ms + - Supervised_1x1: adding conv1x1 block to the ResNet50 used by Sen12ms +- Finetune/transfer learning (1k dataset) + - Moco: the ResNet50 used by Sen12ms is initialized with the weight from Moco backbone + - Moco_1x1: adding conv1x1 block to the ResNet50 used by Sen12ms and both input module and ResNet50 layers are initialized with the weight from Moco + - Moco_1x1Rnd: adding conv1x1 block to the ResNet50 used by Sen12ms. ResNet50 layers are initialized with the weight from Moco but input module is initialized with random weights +- Finetune v2 (1k dataset) + - freezing ResNet50 fully or partially does not seem to help with accuracy. We will continue explore and share the results once we are sure there is no issue with implementation. + +other findings: +- ResNet50_1x1 (s2) 100 epoch and 500 epoch shows similar accuracy. (especially for multi-label). +- ResNet50_1x1 (s2) shows significantly better result with 0.001 than 0.00001 (both single label and multi-label) -**Next** kick off training -``` -python train_net.py --config-file configs/DATANAME_R_50_C4_24k_moco.yaml --num-gpus 4 OUTPUT_DIR results/${UNIQUE_DATANAME_EXACTLY_DESCRIBING_THIS_RUN}/ TEST.EVAL_PERIOD 2000 MODEL.WEIGHTS ../../data/basetrain_chkpts/imagenet_r50_supervised-detectron2.pth SOLVER.CHECKPOINT_PERIOD ${INT_HOW_OFTEN_TO_CHECKPOINT} -``` -results will be in `results/${UNIQUE_DATANAME_EXACTLY_DESCRIBING_THIS_RUN}`, and you can use tensorboard to view them. +(findings pending verifications) +- By looking at the results between models with 1x1 conv and without 1x1 conv counterparts, almost all models with 1x1 conv block underperform the ones without 1x1 conv block. It appears that adding 1x1 conv layer as a volumn filters may loss some bands information overall with the finetune evalutions. -## Commit and Share Results -Run the following command to grab all results (linear analysis, finetunes, etc) and put them into the appropriate json results file in `results/`: -``` -./utils/update-all-results.sh -``` +### Results -You can verify the results in `results` and then add the new/updated results file to git and commit. +#### Sensor-based Augmentation -**Did you get an error message such as:** -``` -!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +Our sensor-based geo-alignment postive pair approach took sentinel 2 images with 10 bands and sentinel 1 images with 2 bands with the same locations, and had them pass through a convolution 1x1 block before the MoCo v2 framework. Thereafter each of the images outputted a generalizable 3 channels wide images from sentinel 2 and sentinel 1 separately for us to construct the query and key encoder under MoCo v2. -Please investigate as your results may not be complete. -(see errors in file: base-training/utils/tmp/errors.txt) +The evaluations utilizing SEN12MS sense classification pipeline. Overall, multi-label accuracy resulted better than single-label accuracy across supervised and MoCo models. In general, due to the label noise for the SEN12MS dataset, the highest accuracy we could get may introduce irreducible errors. Knowing the provided supervised pre-train models on the full dataset does not contain s1-only data, In order to bring comparisons with the provided supervised model (full dataset) to our approach, our finetune started with 1k dataset and applied both supervised models and MoCo models with s2 dataset as well as s1/s2 dataset. 
In addition, we applied different finetune strategies with or without introducing the 1x1 conv block outputted weights from MoCo. -will not include partial result for /home/XXX/base-training/utils/../OpenSelfSup/work_dirs/hpt-pretrain/resisc/finetune/1000-labels/imagenet_r50_supervised_basetrain/50000-iters-2500-iter-0_01-lr-finetune/20200911_170916.log.json -!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! -``` -This means that this particular evaluation run did not appear to run for enough iterations. Investigate the provided log file, rerun any necessary evaluations, and remove the offending log file. +As a result of the finetune/ transfer learning, introducing 1x1 convolution weight from MoCo underperforms the ones without 1x1 convolution block, it appears models adding 1x1 convolution block from MoCo may distort the finetune evaluations, suggesting the representation of the learning may not be optimal. We continue to explore transfer learning using either the simplified dataset and evaluations, or the dataset that has less label noises. -**Debugging this script** this script finds the top val accuracy, and save the corresponding test accuracy using the following script: -``` -./utils/agg-results.sh -``` -which outputs results to `utils/tmp/results.txt` and errors to `utils/tmp/errors.txt`. Look at this file if your results aren't being generated correctly. +#### Geo-alignment Data Fusion -## Generate plots +aug set 1: resizecrop +aug set 2: resizecrop, blur +aug set 3: aug set 2 + color jittering/ grayscale (optional for now) -```bash -cd utils -python plot-results.py -``` +* all fusion: s1/s2 stacked image are augmented and used as q, k. +* partial fusion: s1, s2, s1/s2 image are equally mixed in the train dataset -See plots in directory `plot-results` -(you can also pass in a `--data` flag to only generate plots for a specific dataset, e.g. `python plot-results.py --data resisc`) +### Ablation +#### SEN12MS evaluation +scence classification (multi-label) -**To plot the eval & test acc curves**, use `./utils/plot.py` -```bash -cd utils -python plot.py --fname PLOT_NAME --folder FOLDER_CONTAINING_DIFFERENT_.PTH_FOLDERs -``` +**?Quesiont/Discussion -- (1) should we bring dp place down, or perhaps transform to % number? as other paper like moco, moco v2, simclr all show more succinet number format in terms of the acc results. and it looks more clear. (2) should we temporarily choose the best resuls for each of the dataset. moco has a randomly sample mechencism unless we further details and fintune with different paramters. but now, would it be goood if showcase our results that we are in the promising direction (and for presenation). we definilty need to visit back for this part tho.** -**To Generate plot for Exp-2-finetuning**, do -```bash -bash utils/plot-results-exp-2.sh -``` +https://wandb.ai/cal-capstone/scene_classification/reports/Evaluation--Vmlldzo1OTgzNjA -See plot in directory `plot-results/exp-2`. -**To Generate plot for Exp-3-Hierarchical Pretraining**, do -```bash -bash utils/plot-results-exp-3.sh -``` +### para +We perform transfer learning experiments with our two proposed methods on land cover scene classification across xx label class using SEN12MS dataset to understand the quality of the learned representations. 
Given that the image samples are randomly sampled to positive and negative pairs in MoCo v2 to compute the loss during training processes, we have finetune with different hyper-parameters and selected the best results on each of the evaluation dataset in our reporting. -See plot in directory `plot-results/exp-3`. +Using the pre-trained with our sensor-based geo-aligned pairs approach, our finetune on the 1,024 samples for the multi class classification shows that the performance on s1 and s2 data together has better performance than the supervised model with the same size of samples. Due to our learned representation being constructed by encoders from two sensor images data with multiple bands information, we argue that the downstream results would perform better on the dataset including both s1 and s2 data, as it appears in the results. -## Getting activations for similarity measures +Whereas by looking at the evaluation results with our geo-alignment data fusion approach on 1,024 samples, as shown in the table, we can see that -Run `get_acts.py` with a model used for a classifaction task -(one that has a test/val set).\ -Alternatively, run dist_get_acts as follows: -```shell -bash dist_get_acts.sh ${CFG} ${CHECKPOINT} [--grab_conv...] -``` -Default behavior is to grab the entire batch of linear layers. -Setting `--grab_conv` will capture a single batch of all convolutional layers.\ -Layers will be saved in `${WORK_DIR}/model_acts.npz`. -The npz contains a dictionary which maps layer names to the activations. + - (1) In general, the downstream multi-class accuracy of our data fusion on moco framework results in better performance of the supervised counterparts applying on s1 and s2 data together. Similar to our first approach result, as our learned representation for the data fusion is on sets of combinations of s1 and s2 data, the downstream results perform better on the dataset including s1 and s2 data. + - (2) The results of the optional fusion approach also perform slightly less accuracy than the full fusion and partial fusion strategy across the combination of the evaluation dataset. -## Debugging and Developing Within OpenSelfSup + - (3) By only looking at the pre-trained model. The evaluation on both s1 and s2’s performance shows better accuracy compared to s1 or s2 alone. This is inline with our expectations. + + +#### BigEarthNet Evaluation (TBD) + +https://wandb.ai/cal-capstone/scene_classification/reports/Evaluation--Vmlldzo1OTgzNjA + +### Relationship between different batch size and training epochs + +hypothesis: the impact of batch size when models are trained and evaluated on. When training epochs is small(100, 200), larger batch sizes have a significant advantage over the smaller ones. [Having more training epochs, we also shows that the gaps would decrease between different batch size]. As such, our pre-trained approach on the MoCo v2 contrastive learning framework in remote sensing imagery helps stabilize the model performance as a function of labeled data set size, compared to supervised model across the same size of data. 
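As a rough illustration of the transfer-learning setup described above (a ResNet-50 backbone initialized from MoCo weights and fine-tuned on a small labeled SEN12MS subset), one possible sketch follows. The checkpoint layout, the 10-class simplified IGBP label space, and the hyperparameters are assumptions, not the exact configuration behind the reported numbers.

```python
import torch
import torch.nn as nn
import torchvision

def build_finetune_model(moco_backbone_path, num_classes=10):
    """ResNet-50 initialized from a MoCo backbone, with a freshly initialized classification head."""
    model = torchvision.models.resnet50(num_classes=num_classes)
    state = torch.load(moco_backbone_path, map_location="cpu")  # assumed to be a plain backbone state_dict
    backbone = {k: v for k, v in state.items() if not k.startswith("fc.")}  # drop any old head weights
    model.load_state_dict(backbone, strict=False)
    return model

def finetune(model, train_loader, epochs=100, lr=1e-3, device="cuda"):
    """Plain supervised fine-tuning on the small labeled subset (e.g., 1,024 samples)."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()  # single-label scene classification
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
    return model
```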
-Here's a command that will allow breakpoints (WARNING: the results with the debug=true flag SHOULD NOT BE USED -- they disable sync batch norms and are not comparable to other results): -```bash -# from OpenSelfSup/ -# replace with your desired config -python tools/train.py configs/hpt-pretrain/resisc/moco_v2_800ep_basetrain/500-iters.py --work_dir work_dirs/debug --debug -``` diff --git a/SEN12MS b/SEN12MS new file mode 160000 index 0000000..a6012cc --- /dev/null +++ b/SEN12MS @@ -0,0 +1 @@ +Subproject commit a6012cc292f79147d85ae4e2d658476a5a5d7fd3 diff --git a/metrics.md b/metrics.md new file mode 100644 index 0000000..ba11f9d --- /dev/null +++ b/metrics.md @@ -0,0 +1,125 @@ + +#### Fusion approach + +**SEN12MS (1024)** +| aug set 2| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised () | ? | ?| ? | | +| [all fusion]() | ? | ? | ? | running | +| [partial fusion]() | ? | ? | ? | done | +| [optional fusion]() | ? | ? | ? | done | + +**SEN12MS (512)** +| aug set 2| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised () | ? | ?| ? | | +| [all fusion]() | ? | ? | ? | running | +| [partial fusion]() | ? | ? | ? | done | +| [optional fusion]() | ? | ? | ? | done | + +**BigEarthNet** +| aug set 2| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised (1024) | ? | ?| ? | running | +| [all fusion]() | ? | ? | ? | running | +| [partial fusion]() | ? | ? | ? | done | +| [optional fusion]() | ? | ? | ? | done | + +**BigEarthNet (512)** +| aug set 2| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised (1024) | ? | ?| ? | running | +| [all fusion]() | ? | ? | ? | running | +| [partial fusion]() | ? | ? | ? | done | +| [optional fusion]() | ? | ? | ? | done | + + +#### sensor augmentation + +| | Metrics|single-label |multi-label | Note | +| --- | --- | --- | --- | --- | +| | | | | | +| full dataset | Supervised s2 | .57 | .60| | +| | Supervised s1/s2 | .45 | .64|| +| | Supervised RGB | .45 | .58| | +| | | | | | +|s2 | Supervised 1x1 | .3863 | .4893 | | +| | Supervised | .4355 | .5931 | too good?| +| | Moco 1x1 RND | .4345 | .6004 | | +| | Moco 1x1 | .4469 | **.601**| not necessarily better | +| | Moco 1x1 RND (1000ep) | .4264 | .5757 | overfitting? | +| | Moco 1x1 (1000ep) | .4073 | .5622 | overfitting? 
| + +| | Metrics|single-label |multi-label | Note | +| --- | --- | --- | --- | --- | +|s1/s2 | :white_check_mark: Supervised 1x1 | .4094 | .5843 | | +| | :white_check_mark: Supervised | .4426 | .4678 | | +| | :no_entry_sign: Moco 1x1 RND | .4477 | .5317 | | +| | :no_entry_sign: Moco 1x1 | .4474 | .5302 | no conv1 weight transfer | +| | :no_entry_sign: **Moco** | .4718 | **.6697** | no conv1 weight transfer | + +- single-label: Average Accuracy +- multi-label: Overall Accuracy + + +crimson-pyramid + +**aug set 1(TBD)** + +| aug set 1| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised (full) | xx | xx | xx | xx | +| Supervised (1024) | xx | xx | xx | xx | +| --- | --- | --- | --- | --- | +| [sensor-based augmentation] | xx | xx | xx | xx | +| [all fusion] | xx | xx| xx | xx | +| [partial fusion] | xx | xx | xx | xx | +| [optional fusion] | xx | xx | xx | xx| + + +**aug set 2** + +| aug set 2| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised (full) | [Pretrained model is not provided](https://syncandshare.lrz.de/getlink/fiCDbqiiSFSNwot5exvUcW1y/trained_models) | [.60](https://wandb.ai/cal-capstone/sup_scene_cls/runs/3mg9zr5t) | [.64](https://wandb.ai/cal-capstone/sup_scene_cls/runs/2lda2016) | need to retest s1, s2 with zero padding | +| Supervised (1024) | [0.4003](https://wandb.ai/cal-capstone/sup_scene_cls/runs/555fv4cb) | [0.6108](https://wandb.ai/cal-capstone/sup_scene_cls/runs/3m1h27zt) | [.5856](https://wandb.ai/cal-capstone/sup_scene_cls/runs/dpwjby4o) | | +| --- | --- | --- | --- | --- | +| [sensor-based augmentation] | - | [0.6277](https://wandb.ai/cal-capstone/SEN12MS/runs/2826nuca) | [0.6697](https://wandb.ai/cal-capstone/SEN12MS/runs/22tv0kud) | xx | +| [all fusion](https://wandb.ai/cal-capstone/hpt4/runs/ak0xdbfu/overview) | xx | [.6251]? | [.5957](https://wandb.ai/cal-capstone/scene_classification/runs/2y2q8boi) | | +| [partial fusion](https://wandb.ai/cal-capstone/hpt4/runs/367tz8vs) | [.4729](https://wandb.ai/cal-capstone/scene_classification/runs/1qx384cs) | [.5812](https://wandb.ai/cal-capstone/scene_classification/runs/1bdmms2d) |[.6072](https://wandb.ai/cal-capstone/scene_classification/runs/1meu9iym) | | +| [optional fusion](https://wandb.ai/cal-capstone/hpt4/runs/2iu8yfs6) | [.4824](https://wandb.ai/cal-capstone/scene_classification/runs/tu3vuefx) | [.5601](https://wandb.ai/cal-capstone/scene_classification/runs/2hdbuxtv) | [.5884](https://wandb.ai/cal-capstone/scene_classification/runs/y5x2xce6) | | + + +- Supervised (full) s1, s2 need to be retested with zero padding 12 channel. + + +#### BigEarthNet Evaluation (TBD) +scence classification (multi or single label?) 
+ +**aug set 1(TBD)** + + +| aug set 1| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised (full) | xx | xx | xx | xx | +| Supervised (1024) | xx | xx | xx | xx | +| --- | --- | --- | --- | --- | +| [sensor-based augmentation] | xx | xx | xx | xx | +| [all fusion] | xx | xx | xx | xx | +| [partial fusion] | xx | xx | xx | xx | +| [optional fusion] | xx | xx | xx | xx| + + +**aug set 2** + +| aug set 2| s1 | s2 | s1/s2 | Note | +| --- | --- | --- | --- | --- | +| Supervised (full) | xx | xx | xx | xx | +| Supervised (1024) | [.4008](https://wandb.ai/cal-capstone/sup_scene_cls/runs/1lnfsmdi) | [.5496](https://wandb.ai/cal-capstone/sup_scene_cls/runs/3fpzht5f) | [.5423](https://wandb.ai/cal-capstone/sup_scene_cls/runs/1qma48o1) | xx | +| --- | --- | --- | --- | --- | +| [sensor-based augmentation] | xx | xx | xx | xx | +| [all fusion] | xx | xx | xx | xx | +| [partial fusion] | [.4279](https://wandb.ai/cal-capstone/scene_classification/runs/2a1tlnbv) | [.5351](https://wandb.ai/cal-capstone/scene_classification/runs/2f0pjxwx) | [.5352](https://wandb.ai/cal-capstone/scene_classification/table?workspace=user-kenhan) | xx | +| [optional fusion] | [.4478](https://wandb.ai/cal-capstone/scene_classification/runs/36c8z6ae) | [.5120](https://wandb.ai/cal-capstone/scene_classification/runs/3oazvjke) | [.5294](https://wandb.ai/cal-capstone/scene_classification/runs/nar53xcn) | xx| + +- Supervised (full) s1, s2 need to be retested with zero padding 12 channel. diff --git a/paper_draft.md b/paper_draft.md deleted file mode 100644 index 0f36c1b..0000000 --- a/paper_draft.md +++ /dev/null @@ -1,193 +0,0 @@ -## Abstract - -We present a sensor-based location invariance momentum contrast for unsupervised visual representation learning in remote sensing application, where unlabeled data is well-known challenges to deep learning domain and accurate training data remains comparably scarce. In this study, we first introduce the use of SEN12MS datasets, a curated large-scale training data that include versatile remote sensing information from different sensors with global scene distributions. To continually bridge the gap between supervised and unsupervised learning on computer vision tasks in remote sensing application, we exploit the geo-alignment data structure from SEN12MS and construct sensor-based geo-alignment positive pairs in contrastive learning to design the natural augmentation. [last sentence subject to changes] Our experiments show that the proposed method outperforms the supervised learning counterpart when transferring to downstream tasks in scene classification for remote sensing data. - - -## Introduction -The performance of deep convolutional neural networks depends on their capability and the amount of training data. The datasets are becoming larger in every domain and different kinds of network architectures like [VGG](https://arxiv.org/pdf/1409.1556.pdf), [GoogLeNet](https://arxiv.org/pdf/1409.4842.pdf), [ResNet](https://arxiv.org/pdf/1512.03385.pdf), [DenseNet](https://arxiv.org/pdf/1608.06993.pdf), etc., increased network models' capacity. - -However, the collection and annotation of large-scale datasets are time-consuming and expensive. Many self-supervised methods were proposed to learn visual features from large-scale unlabeled data without using any human annotations to avoid time-consuming and costly data annotations. 
Contrastive learning of visual representations has emerged as the front-runner for self-supervision and has demonstrated superior performance on downstream tasks. All contrastive learning frameworks involve maximizing agreement between positive image pairs relative to negative/different images via a contrastive loss function; this pretraining paradigm forces the model to learn good representations. These approaches typically differ in how they generate positive and negative image pairs from unlabeled data and how the data are sampled during pretraining. - -Self-supervised approaches such as Momentum Contrast (MoCo) ([He et al., 2019](https://arxiv.org/pdf/1911.05722.pdf); [Chen et al.,2020](https://arxiv.org/pdf/2003.04297.pdf)) can leverage unlabeled data to produce pre-trained models for subsequent fine-tuning on labeled data. In addition to MoCo, these include frameworks such as SimCLR ([Chen et al., 2020](https://arxiv.org/pdf/2002.05709.pdf)) and PIRL ([Misra and Maaten, 2020](https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf)). - -Remote sensing data has become broadly available at the petabyte scale, offering unprecedented visibility into natural and human activity across the Earth. In remote sensing, labeled data is usually scarce and hard to obtain. Due to the success of self-supervised learning methods, we explore their application to large-scale remote sensing datasets. - -While most self-supervised image analysis techniques focus on natural imagery, remote sensing differs in several critical ways. Natural imagery often has one subject; remote sensing images contain numerous objects such as buildings, trees, roads, rivers, etc. Additionally, the important content changes unpredictably within just a few pixels or between images at the same location from different times. Multiple satellites capture images of the same locations on earth with a wide variety of resolutions, spectral bands (channels), and revisit rates, such that any specific problem can require a different -combination of sensor inputs([Reiche et al., 2018](https://doi.org/10.1016/j.rse.2017.10.034),[Rustowicz et al., 2019](https://openaccess.thecvf.com/content_CVPRW_2019/papers/cv4gc/Rustowicz_Semantic_Segmentation_of_Crop_Type_in_Africa_A_Novel_Dataset_CVPRW_2019_paper.pdf)). - -While MoCo and other contrastive learning methods have demonstrated promising results on natural image classification tasks, their application to remote sensing applications has been limited. - -Traditional contrative learning utilizes augmentation to generate positive pair. Inspired by recent success (Geo-aware Paper) using natural augmentation to create positive pairs, we propose to use positive pairs from different sensors for the same location. - -In this work, we demonstrate that pre-training [MoCo-v2](https://openaccess.thecvf.com/content_CVPR_2020/papers/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.pdf) on data from multiple sensors lead to improved representations for remote sensing applications. - -## Related Work -#### Self-supervised contrastive learning -Many self-supervised learning methods for visual feature learning have been developed without using any - human-annotated labels. 
Compared to supervised learning methods which require a data pair Xi - and Yi while Yi is annotated by human labors, self-supervised learning also trained with data Xi along - with its pseudo label Pi while Pi is automatically generated for a pre-defined pretext task without involving any - human annotation. The pseudo label Pi can be generated by using attributes of images or videos such as the context of - images or by traditional hand-designed methods. As long as the pseudo labels P are automatically generated - without involving human annotations, then the methods belong to self-supervised learning. Recently, self-supervised - learning methods have achieved great progress. - - Self-supervised contrastive learning approaches such as [MoCo](https://arxiv.org/pdf/1911.05722.pdf) , - [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf), [SimCLR](https://arxiv.org/pdf/2002.05709.pdf), and [PIRL](https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf) have demonstrated - superior performance and have emerged as the fore-runner on various downstream tasks. The intuition behind these - methods are to learn representations by pulling positive image pairs from the same instance closer in latent space - while pushing negative pairs from difference instances further away. These methods, on the other hand, differ in the - type of contrastive loss, generation of positive and negative pairs, and sampling method. - - Contrastive learning of visual representations using MoCo ([**MoCo-v2**](https://arxiv.org/pdf/2003.04297.pdf) - Chen, et - al., Facebook AI Research, 2020) has emerged as the front-runner for self-supervision and has demonstrated superior performance on downstream tasks. - -#### Performance gap in Satellite imagery -There is a performance gap between supervised learning using labels and self-supervised contrastive learning method, [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf), on remote - sensing datasets. For instance, on the Functional Map of the World ([fMoW](https://arxiv.org/abs/1711.07846)) image classification - benchmark, there is an 8% gap in top 1 accuracy between supervised and self-supervised methods. By leveraging spatially aligned - images over time to construct temporal positive pairs in contrastive learning and geo-location in the design of pre-text tasks, **[Geography-Aware - Self-supervised Learning](https://arxiv.org/pdf/2011.09980.pdf)** (Ayush, et al., Stanford University, 2020) were able to - close the gap between self-supervised and supervised learning on image classification, object detection and semantic - segmentation on remote sensing and other geo-tagged image datasets. - -In this work, we provide an effective approach for improving representation learning using data from different satellite imagery using [MoCo-v2](https://arxiv.org/pdf/2003.04297.pdf). - -## Problem Definition -Does contrastive pre-training with data from multiple sensors lead to improved representations for remote sensing applications? - -Pre-train the contrastive model using unlabeled data from multiple satellites and use that model for downstream remote sensing tasks. - -We want to show that our approach to using images from different satellites for the same location as naturally augmented images as input to the MoCo-v2 method provides high-quality representations and transferable initializations for satellite imagery interpretation. 
Despite many differences in the data and task properties between natural image classification and satellite imagery interpretation, we want to show the benefit of MoCo-v2 pretraining across multiple patches from different satellites for satellite imagery and investigate representation transfer to a target dataset. - -### Datasets -- [todo] keep only sen12ms. -- -To validate our ideas, we did experiments on datasets with different satellite imageries with variations in dataset size, channels, and image ground resolutions. The statistics of these datasets are given below. Readers are requested to see the the supplementary materials for examples and additional details of these datasets. -| Dataset | Satellites | Number of Images | Image Size | Labels | Notes | -|---|---|---|---|---|---| -| [BigEarthNet](https://arxiv.org/pdf/1902.06148.pdf) | Sentinel-2A/B |590,326 patches; 12 Bands | 20x20 to 120x120 | Multiple, up to 43 | No overlapping; 10 European Countries | -| [SEN12MS](https://arxiv.org/pdf/1906.07789.pdf) | Sentinel-1A/B; Sentinel-2A/B; MODIS (Terra and Aqua) | 541,986 patches; 180662 triplets (3\*180662); 4, 2 and 13 Bands | 256X256 | Single, 17 Full and 10 Simplified | Partial overlapping | -| [FMoW](https://arxiv.org/abs/1711.07846) | QuickBird-2; GeoEye-1; WorldView-2; WorldView-3 | 1,047,691 patches; 4, 8 and RGB Bands | Variable Over 2500x2500 | Multiple, up to 63; Bounding Box Annotations | Includes False Detection; Variable timestamp overlapping | - -##### SEN12MS -The SEN12MS dataset contains 180,662 patch triplets of corresponding Sentinel-1 dual-pol SAR data, Sentinel-2 multi-spectral images, and MODIS-derived land cover maps. The patches are distributed across the land masses of the Earth and spread over all four meteorological seasons. This is reflected by the dataset structure. The captured scenes were tiled into patches of 256 X 256 pixels in size and implemented a stride of 128 pixels, resulting in an overlap between adjacent patches. -Only 3847 patches do not have any overlap with adjacent patches. -Most of the overlap occurs around 25% and 50% of the area with few patches overlapping less than 15% and more than 75%. - - All patches are provided in the form of 16-bit GeoTiffs containing the following specific information: -* Sentinel-1 SAR: 2 channels corresponding to sigma nought backscatter values in dB scale for VV and VH polarization. -* Sentinel-2 Multi-Spectral: 13 channels corresponding to the 13 spectral bands (B1, B2, B3, B4, B5, B6, B7, B8, B8a, B9, B10, B11, B12). -* MODIS Land Cover: 4 channels corresponding to IGBP, LCCS Land Cover, LCCS Land Use, and LCCS Surface Hydrology layers. - -## Method -In this section, we briefly review Contrastive Learning Framework for unsupervised learning and detail our proposed approach to improve Moco-v2, a recent contrastive learning framework, on satellite imagery from multiple sensors data. - -**Multiple-Sensor** -Update on different bands, different satellites etc. with images. - -![](web/images/moco_framework.png)![](web/images/current_approach.png) - -#### 1. Contrastive Learning Framework -Contrastive methods attempt to learn a mapping fq from raw pixels to semantically meaningful representations z in an unsupervised way. The training objective encourages representations corresponding to pairs of images that are known a priori to be semantically similar (positive pairs) to be closer to each other than typical unrelated pairs (negative pairs). 
#### 2. Sensor-based Geo-alignment Positive Pairs
Given that SEN12MS is one of the largest remote sensing datasets available, with a global scene distribution and a wealth of versatile remote sensing information, it is natural to leverage its geo-aligned imagery from different sensors when constructing positive and negative pairs. For example, for the same patch, Sentinel-1 provides two channels (vertical and horizontal polarization) and Sentinel-2 provides thirteen channels (different wavelength bands). Any combination drawn from the same patch corresponds to a positive pair without the need for additional augmentation, while a negative pair is any image from a different patch, regardless of whether it comes from the same or a different satellite.

In short, given an image x_i(s1) collected by Sentinel-1, we randomly select an image x_i(s2) collected by Sentinel-2 that is geographically aligned with x_i(s1). This geographically aligned pair x_i(s1), x_i(s2) provides a sensor-based geo-alignment positive pair (v and v' in Figure xxx) that is passed through the query and key encoders of MoCo-v2 to train the contrastive learning framework.

For a sample x_i(s1), our GeoSensorInfoNCE objective takes the standard InfoNCE form:

L_i = -log [ exp(z_i(s1) · z_i(s2) / λ) / ( exp(z_i(s1) · z_i(s2) / λ) + Σ_{j=1..N} exp(z_i(s1) · k_j / λ) ) ]

where z_i(s1) and z_i(s2) are the encoded representations of the geo-aligned positive pair x_i(s1) and x_i(s2), N denotes the number of negative samples, {k_j}_{j=1..N} are the encoded negative keys, and λ is the temperature hyperparameter.

Our positive pairs are actual images of the same location captured by different sensors. Inspired by the success of geography-aware self-supervised learning (insert ref -- **geo xxxxx paper**), which constructs temporal positive pairs from real images, we rely on the assumption that using real images as positive pairs encourages the network to learn better representations for real sensor data than approaches that focus on augmentation strategies and synthetic views.

#### 3. 1x1 Convolution Filters

When constructing naturally augmented positive and negative pairs as described above, the input volumes (numbers of bands) from different sensors differ. To match the typical image channel dimensions, our study applies the Network in Network concept (Min Lin et al.) (insert ref -- **NIN**) to the source images: we introduce an extra 1x1 convolution filter block that performs cross-channel pooling, matching and aligning the channel depth of images from different sensors while introducing non-linearity before the MoCo-v2 encoder. This lets the network carry out a non-trivial computation on the input volume, and we hope it increases the generalization capability of the network. A minimal sketch of such an input block follows.
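For illustration, here is a minimal sketch of a 1x1 convolution input block that maps an arbitrary number of sensor bands (e.g. 2 for Sentinel-1 or 13 for Sentinel-2) to the 3 channels a standard ResNet-50 stem expects. The module name and its exact composition (batch norm, ReLU) are assumptions for the sketch, not the repository's implementation.

```python
# Minimal sketch of a 1x1 conv input block placed in front of a standard ResNet-50 encoder.
import torch
import torch.nn as nn
import torchvision


class SensorInputAdapter(nn.Module):
    """Map `in_bands` sensor channels to 3 RGB-like channels via a 1x1 convolution."""

    def __init__(self, in_bands, out_channels=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_bands, out_channels, kernel_size=1, bias=False),  # cross-channel mixing
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),  # non-linearity before the encoder
        )

    def forward(self, x):
        return self.block(x)


# Example: a 13-band Sentinel-2 patch through the adapter and a ResNet-50 with a 128-d head.
encoder = nn.Sequential(SensorInputAdapter(in_bands=13),
                        torchvision.models.resnet50(num_classes=128))
dummy_s2 = torch.randn(2, 13, 256, 256)
print(encoder(dummy_s2).shape)  # torch.Size([2, 128])
```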
## Experiments
#### Pre-training on SEN12MS

Pre-training is performed twice for comparison purposes. First, examples from all patches are included (180,662). Second, pre-training uses a subset of the dataset in which patches do not overlap with their neighbors; this subset is selected on a first-come, first-served basis, skipping any patch that overlaps an already selected one, and consists of 35,792 patches.

The model is pre-trained under different scenarios to compare performance. ***First, the model is trained with the original MoCo-v2 approach, where the input image is augmented with Gaussian blur, elastic transformation, and vertical and horizontal flips.*** Second, the model is trained with the approach proposed in this work, which uses images from different satellites as positive pairs. ***Third, to help the model generalize, augmentation is additionally applied to the images from both satellites during training.*** Pre-training is carried out on both the complete dataset and the non-overlapping subset described above.

The encoders use the ***ResNet-50*** architecture (50 layers deep, 2048-dimensional features) with 128 output nodes. These encoders are designed for RGB input (3 bands), while SEN12MS provides 2, 4, and 13 bands for S1, LC, and S2, respectively. To overcome this structural constraint, a convolutional layer is added before the encoders to map inputs with different numbers of bands to 3 channels. ***The weights of this layer are not updated during training.*** The momentum constant (***m***) is ***0.9***, the learning rate is ***0.03***, the temperature for the loss function is ***0.2***, and the batch size is ***64***.


#### Transfer Learning Experiments

We compare supervised learning against the HPT model.

1. SEN12MS - Scene Classification
   - Supervised learning benchmark vs. HPT model

**Implementation Details**

- downloaded pretrained models from t
- the original IGBP land cover scheme has 17 classes.
- the simplified version of the IGBP scheme has 10 classes, derived and consolidated from the original 17 classes.

**Qualitative Analysis**

- Supervised training (full dataset)
  - baseline: download the pre-trained models and evaluate without finetuning.
- Supervised training (1k dataset)
  - Supervised: the original ResNet-50 used by Sen12ms
  - Supervised_1x1: a conv1x1 block added to the ResNet-50 used by Sen12ms
- Finetune/transfer learning (1k dataset)
  - Moco: the ResNet-50 used by Sen12ms is initialized with the weights of the Moco backbone
  - Moco_1x1: a conv1x1 block is added to the ResNet-50 used by Sen12ms, and both the input module and the ResNet-50 layers are initialized with the weights from Moco
  - Moco_1x1Rnd: a conv1x1 block is added to the ResNet-50 used by Sen12ms; the ResNet-50 layers are initialized with the weights from Moco, but the input module is initialized with random weights
- Finetune v2 (1k dataset)
  - Freezing the ResNet-50 fully or partially does not seem to help accuracy. We will continue to explore and share the results once we are sure there is no issue with the implementation.

(A minimal sketch of how the MoCo backbone weights are loaded for finetuning is shown below.)
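As referenced above, the following is a minimal sketch of initializing a ResNet-50 classifier from extracted MoCo backbone weights before finetuning. The checkpoint path is a hypothetical placeholder, and the key-matching logic is an illustrative assumption rather than the repository's exact loading code.

```python
# Minimal sketch: initialize a classifier from an extracted MoCo backbone before finetuning.
import torch
import torchvision


def load_moco_backbone(classifier, backbone_path):
    """Copy weights whose names and shapes match; the classification head stays randomly initialized."""
    state = torch.load(backbone_path, map_location="cpu")
    state = state.get("state_dict", state)  # some checkpoints nest weights under 'state_dict'
    own = classifier.state_dict()
    matched = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    result = classifier.load_state_dict(matched, strict=False)
    print(f"loaded {len(matched)} tensors; {len(result.missing_keys)} keys left at random init")


model = torchvision.models.resnet50(num_classes=17)            # 17 IGBP classes (10 simplified)
load_moco_backbone(model, "work_dirs/.../final_backbone.pth")  # hypothetical path
```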
The results on SEN12MS scene classification are summarized below; "(full)" denotes training on the full labeled dataset and "(1024)" denotes finetuning on 1,024 labeled samples.

| Model | Single-label Average Accuracy | Multi-label Overall Accuracy |
| --- | --- | --- |
| Supervised s2 (full) | .57 | .60 |
| Supervised s1/s2 (full) | .45 | .64 |
| Supervised RGB (full) | .45 | .58 |
| --- | --- | --- |
| Supervised s2 (1024) | **.4355** | .5931 |
| Supervised s1/s2 (1024) | .4652 | .4652 |
| Supervised 1x1 s2 (1024) | **.3863** | .4893 |
| Supervised 1x1 s1/s2 (1024) | .4094 | .5843 |
| Moco s2 (1024) | .4545 | **.6277** |
| Moco s1/s2 (1024) | .4514 | **.6697** |
| Moco 1x1 s2 (1024) | .4454 | **.601** |
| Moco 1x1 s1/s2 (1024) | _.425_ (?) | .5302 |
| Moco 1x1 RND s2 (1024) | .371 | .5374 |
| Moco 1x1 RND s1/s2 (1024) | .4477 | .5152 |

(before): result recorded before the learning rate adjustment

Other findings:
- ResNet50_1x1 (s2) shows similar accuracy at 100 and 500 epochs, especially for multi-label classification.
- ResNet50_1x1 (s2) performs significantly better with a learning rate of 0.001 than with 0.00001 (both single-label and multi-label).

(Findings pending verification)
- Comparing models with and without the 1x1 conv block, almost all models with the 1x1 conv block underperform their counterparts without it. It appears that adding a 1x1 conv layer as a volume filter may lose some band information, which shows up in the finetuning evaluations.

## Conclusion



## References
TODO: Use APA style later. Once the draft is ready, take the links in the document, number them, and format them with an APA style generator.
[1]
[2]
[3]
[4]
[5]
diff --git a/references/hpt_repo.md b/references/hpt_repo.md new file mode 100644 index 0000000..c43f900 --- /dev/null +++ b/references/hpt_repo.md @@ -0,0 +1,351 @@
# Hierarchical Pretraining: Research Repository

This is a research repository for the submission "Self-Supervised Pretraining Improves Self-Supervised Pretraining"

For initial setup, refer to [setup instructions](setup_pretraining.md).

## Setup Weight & Biases Tracking

```bash
export WANDB_API_KEY=
export WANDB_ENTITY=cal-capstone
export WANDB_PROJECT=scene_classification
#export WANDB_MODE=dryrun
```

## Base Training

[OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup)

Right now we assume ImageNet base-trained models.
```bash
cd OpenSelfSup/data/basetrain_chkpts/
./download-pretrained-models.sh
```

## Pretraining With a New Dataset

[hpt](https://github.com/Berkeley-Data/hpt)

We have a handy set of config generators to make pretraining with a new dataset easy and consistent!

**FIRST**, you will need the image pixel mean/std of your dataset. If you don't have it, you can run:
```bash
cd src/data/

# for sen12ms, run multiple times, replacing --use_s1 with --use_s2 or --use_RGB
./compute-dataset-pixel-mean-std-sen12ms.py --data_dir /storage/sen12ms_x --data_index_dir /scratch/crguest/hpt/data --use_s1 --numworkers 1

# for others
./compute-dataset-pixel-mean-std.py --data /scratch/crguest/data/sen12ms_small --numworkers 20 --batchsize 256
```
Here the image folder is expected to have the PyTorch `ImageFolder` structure, i.e. `class/image-name.jp[e]g` (or whatever image extension you're using).
If your dataset is not arranged this way, you can either:
(i) use symlinks to put it in this structure, or
(ii) update the above script to read in your data.

NOTE: For sen12ms, the code is not working as expected (refer to [this issue](https://github.com/Berkeley-Data/hpt/issues/24)); until then, use the following values.
```
bands_mean = {'s1_mean': [-11.76858, -18.294598],
              's2_mean': [1226.4215, 1137.3799, 1139.6792, 1350.9973, 1932.9058,
                          2211.1584, 2154.9846, 2409.1128, 2001.8622, 1356.0801]}

bands_std = {'s1_std': [4.525339, 4.3586307],
             's2_std': [741.6254, 740.883, 960.1045, 946.76056, 985.52747,
                        1082.4341, 1057.7628, 1136.1942, 1132.7898, 991.48016]}
```

## Pre-training with SEN12MS Dataset
[OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup)
- see `src/utils/pretrain-runner.sh` for an end-to-end run (requires creating the config files first).

Check the installation by pretraining with MoCo-v2, extracting the model weights, evaluating the representations, and then viewing the results on tensorboard or [wandb](https://wandb.ai/cal-capstone/hpt):

Set up experiment tracking and model versioning:
```bash
export WANDB_API_KEY=
export WANDB_ENTITY=cal-capstone
export WANDB_PROJECT=hpt4
```

#### Run pre-training
```bash
cd OpenSelfSup

# set which GPUs to use
# CUDA_VISIBLE_DEVICES=1
# CUDA_VISIBLE_DEVICES=0,1,2,3

# (sanity check) single-GPU training on the small dataset
./tools/single_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_basetrain_aug_20ep.py --debug

# (sanity check) single-GPU training on the small sen12ms fusion dataset
./tools/single_train.sh configs/selfsup/moco/r50_v2_sen12ms_fusion_in_smoke_aug.py --debug

# (sanity check) 4-GPU training on the small dataset
./tools/dist_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_basetrain_aug_20ep.py 4

# (sanity check) 4-GPU training on the small fusion dataset
./tools/dist_train.sh configs/selfsup/moco/r50_v2_sen12ms_fusion_in_smoke_aug.py 4

# distributed full training
./tools/dist_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_fulltrain_20ep.py 4
```

#### (OPTIONAL) download pre-trained models

Some of the key pre-trained models are on S3 (s3://sen12ms/pretrained):
- [200 epochs w/o augmentation: vivid-resonance-73](https://wandb.ai/cjrd/BDOpenSelfSup-tools/runs/3qjvxo2p/overview?workspace=user-cjrd)
- [20 epochs w/o augmentation: silvery-oath7-2rr3864e](https://wandb.ai/cal-capstone/hpt2/runs/2rr3864e?workspace=user-taeil)
- [sen12ms-baseline: soft-snowflake-3.pth](https://wandb.ai/cal-capstone/SEN12MS/runs/3gjhe4ff/overview?workspace=user-taeil)

```
aws configure
aws s3 sync s3://sen12ms/pretrained . --dryrun
aws s3 sync s3://sen12ms/pretrained_sup . --dryrun
```

#### Extract pre-trained model
Any other model can be restored by run ID if it is stored with W&B.
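Checkpoints stored with a W&B run can also be fetched programmatically with the public `wandb` API. A minimal sketch (the run path matches the example below; the `.pth` filter is an assumption):

```python
# Minimal sketch: download *.pth files attached to a W&B run.
import wandb

api = wandb.Api()
run = api.run("cal-capstone/hpt2/3l4yg63k")  # entity/project/run_id
for f in run.files():
    if f.name.endswith(".pth"):
        f.download(replace=True)  # saved relative to the current directory
        print("downloaded", f.name)
```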
Go to files section under the run to find `*.pth` files + +```bash +BACKBONE=work_dirs/selfsup/moco/r50_v2_sen12ms_in_basetrain_20ep/epoch_20_moco_in_baseline.pth + +# method 1: From working dir(same system for pre-training) +# CHECKPOINT=work_dirs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep/epoch_20.pth + +# method 2: from W&B, {projectid}/{W&B run id} (any system) +CHECKPOINT=hpt2/3l4yg63k + +# Extract the backbone +python tools/extract_backbone_weights.py ${BACKBONE} ${CHECKPOINT} + +``` + + +## Evaluating Pretrained Representations + +Using OpenSelfSup +```bash +python tools/train.py $CFG --pretrained $PRETRAIN + +# RESISC finetune example +tools/train.py --local_rank=0 configs/benchmarks/linear_classification/resisc45/r50_last.py --pretrained work_dirs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep/epoch_20_moco_in_basetrain.pth --work_dir work_dirs/benchmarks/linear_classification/resisc45/moco-selfsup/r50_v2_resisc_in_basetrain_20ep-r50_last --seed 0 --launcher=pytorch + + + +``` + + +Using Sen12ms +```bash +``` + + + + + +#### Previous +``` +# Evaluate the representations (NOT SURE) +./benchmarks/dist_train_linear.sh configs/benchmarks/linear_classification/resisc45/r50_last.py ${BACKBONE} +``` + +This has been simplified to simply: +```bash +./utils/pretrain-evaluator.sh -b OpenSelfSup/work_dirs/hpt-pretrain/${shortname}/ -d OpenSelfSup/configs/hpt-pretrain/${shortname} +``` +where `-b` is the backbone directory and `-d` is the config directory. This command also works for cross-dataset evaluation (e.g. evaluate models trained on Resic45 and evaluate on UC Merced dataset). + +**FAQ** + +Where are the checkpoints and logs? E.g., if you pass in `configs/hpt-pretrain/resisc` as the config directory, then the working directories for this evalution is e.g. `work_dirs/hpt-pretrain/resisc/linear-eval/...`. If w&b is enabled, it will be logged on weight & biases + +## Finetuning +Assuming you generated the pretraining project as specified above, finetuning is as simple as: + +```bash +./utils/finetune-runner.sh -d ./OpenSelfSup/configs/hpt-pretrain/${shortname}/finetune/ -b ./OpenSelfSup/work_dirs/hpt-pretrain/${shortname}/ +``` +where `-b` is the backbone directory and `-d` is the config directory +Note: to finetune using other backbones, simply pass in a different backbone directory (the script searches for `final_backbone.pth` files in the provided directory tree) + + +## Finetuning only on pretrained checkpoints with BEST linear analysis + +First, specify the pretraining epochs which gives the best linear evaluation result in `./utils/top-linear-analysis-ckpts.txt`. Here is an example: + +``` +# dataset best-moco-bt best-sup-bt best-no-bt +chest_xray_kids 5000 10000 100000 +resisc 5000 50000 100000 +chexpert 50000 50000 400000 +``` +, in which for `chest_xray_kids` dataset, `5000`-iters, `10000`-iters, `100000`-iters are the best pretrained models under `moco base-training`, `imagenet-supervised base-training`, and `no base-training`, respectively. + +Second, run the following command to perform finetuning only on the best checkpoints (same as above, except that the change of script name): +```bash +./utils/finetune-runner-top-only.sh -d ./OpenSelfSup/configs/hpt-pretrain/${shortname}/finetune/ -b ./OpenSelfSup/work_dirs/hpt-pretrain/${shortname} +``` + + + +## Pretraining on top of pretraining +Using the output of previously pretrained models, it is very easy to correctly setup pretraining on top of the pretraining. 
+Simply create a new config +``` +utils/pretrain-configs/dataname1-dataname2.sh +``` +(see `resisc-ucmerced.sh` for an example) + +and then set the basetrained models to be the `final_backbone.pth` from the output of the last pretrained. e.g. for using resisc-45 outputs: + +``` +export basetrain_weights=( + "work_dirs/hpt-pretrain/resisc/moco_v2_800ep_basetrain/50000-iters/final_backbone.pth" + + "work_dirs/hpt-pretrain/resisc/imagenet_r50_supervised_basetrain/50000-iters/final_backbone.pth" + + "work_dirs/hpt-pretrain/resisc/no_basetrain/200000-iters/final_backbone.pth" +) +``` +(see `resisc-ucmerced.sh` for an example) + +To select which backbones to use, evaluate the linear performance from the various source outputs (e.g. all the resisc pretrained outputs) on the target data (e.g. on uc-merced data). + +Then simply generate the project and execute the pretraining as normal: + +``` +./gen-pretrain-project.sh pretrain-configs/dataname1-dataname2.sh + +./pretrain-runner.sh -d OpenSelfSup/configs/hpt-pretrain/$dataname1-dataname2 +``` + + +## Object Detection / Semantic Segmentation +Object detection/segmentation uses detectron2 and takes place in the directory +``` +OpenSelfSup/benchmarks/detection +``` + +**First:** Check if the dataset configs you need are already present in `configs`. E.g. if you're working with CoCo, you'll see the following 2 configs: +``` +configs/coco_R_50_C4_2x.yaml +configs/coco_R_50_C4_2x_moco.yaml +``` +We'll use the config with the `_moco` suffix for all obj det and segmentation. If your configs already exist, skip the next step. + +**Next:** assuming your configs do not exist, set up the configs you need for your dataset by copying an existing set of configs +``` +cp configs/coco_R_50_C4_2x.yaml ${MYDATA}_R50_C4_2x.yaml +cp configs/coco_R_50_C4_2x_moco.yaml ${MYDATA}_R50_C4_2x_moco.yaml +``` +Edit `${MYDATA}_R50_C4_2x.yaml` and set `MIN_SIZE_TRAIN` and `MIN_SIZE_TEST` to be appropriate for your dataset. Also, rename `TRAIN` and `TEST` to have your dataset name, set `MASK_ON` to `True` if doing semantic segmentation, and update `STEPS` and `MAX_ITER` if running the training for a different amount of time is appropriate (check relevant publications / codebases to set the training schedule). + +Edit `${MYDATA}_R50_C4_2x_moco.yaml` and set `PIXEL_MEAN` and `PIXEL_STD` (use `compute-dataset-pixel-mean-std.py` script above, if you don't know them). + +Then, edit `train_net.py` and add the appropriate data registry lines for your train/val data +``` +register_coco_instances("dataname_train", {}, "obj-labels-in-coco-format_train.json", "datasets/dataname/dataname_train") +register_coco_instances("dataname_val", {}, "obj-labels-in-coco-format_val.json", "datasets/dataname/dataname_val") +``` + +Then, setup symlinks to your data under `datasets/dataname/dataname_train` and `datasets/dataname/dataname_val`, where you replace dataname with your dataname used in the config/registry. + +**Next**, convert your backbone(s) to detectron format, e.g. 
(NOTE: I recommend keeping backbones in the same directory that they are originally present in, and appending a `-detectron2` suffix) +``` +python convert-pretrain-to-detectron2.py ../../data/basetrain_chkpts/imagenet_r50_supervised.pth ../../data/basetrain_chkpts/imagenet_r50_supervised-detectron2.pth +``` + +**Next** kick off training +``` +python train_net.py --config-file configs/DATANAME_R_50_C4_24k_moco.yaml --num-gpus 4 OUTPUT_DIR results/${UNIQUE_DATANAME_EXACTLY_DESCRIBING_THIS_RUN}/ TEST.EVAL_PERIOD 2000 MODEL.WEIGHTS ../../data/basetrain_chkpts/imagenet_r50_supervised-detectron2.pth SOLVER.CHECKPOINT_PERIOD ${INT_HOW_OFTEN_TO_CHECKPOINT} +``` +results will be in `results/${UNIQUE_DATANAME_EXACTLY_DESCRIBING_THIS_RUN}`, and you can use tensorboard to view them. + +## Commit and Share Results +Run the following command to grab all results (linear analysis, finetunes, etc) and put them into the appropriate json results file in `results/`: +``` +./utils/update-all-results.sh +``` + +You can verify the results in `results` and then add the new/updated results file to git and commit. + +**Did you get an error message such as:** +``` +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + +Please investigate as your results may not be complete. +(see errors in file: base-training/utils/tmp/errors.txt) + +will not include partial result for /home/XXX/base-training/utils/../OpenSelfSup/work_dirs/hpt-pretrain/resisc/finetune/1000-labels/imagenet_r50_supervised_basetrain/50000-iters-2500-iter-0_01-lr-finetune/20200911_170916.log.json +!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! +``` +This means that this particular evaluation run did not appear to run for enough iterations. Investigate the provided log file, rerun any necessary evaluations, and remove the offending log file. + +**Debugging this script** this script finds the top val accuracy, and save the corresponding test accuracy using the following script: +``` +./utils/agg-results.sh +``` +which outputs results to `utils/tmp/results.txt` and errors to `utils/tmp/errors.txt`. Look at this file if your results aren't being generated correctly. + +## Generate plots + +```bash +cd utils +python plot-results.py +``` + +See plots in directory `plot-results` +(you can also pass in a `--data` flag to only generate plots for a specific dataset, e.g. `python plot-results.py --data resisc`) + + +**To plot the eval & test acc curves**, use `./utils/plot.py` +```bash +cd utils +python plot.py --fname PLOT_NAME --folder FOLDER_CONTAINING_DIFFERENT_.PTH_FOLDERs +``` + +**To Generate plot for Exp-2-finetuning**, do +```bash +bash utils/plot-results-exp-2.sh +``` + +See plot in directory `plot-results/exp-2`. + +**To Generate plot for Exp-3-Hierarchical Pretraining**, do +```bash +bash utils/plot-results-exp-3.sh +``` + +See plot in directory `plot-results/exp-3`. + + +## Getting activations for similarity measures + +Run `get_acts.py` with a model used for a classifaction task +(one that has a test/val set).\ +Alternatively, run dist_get_acts as follows: +```shell +bash dist_get_acts.sh ${CFG} ${CHECKPOINT} [--grab_conv...] +``` +Default behavior is to grab the entire batch of linear layers. +Setting `--grab_conv` will capture a single batch of all convolutional layers.\ +Layers will be saved in `${WORK_DIR}/model_acts.npz`. +The npz contains a dictionary which maps layer names to the activations. 
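A minimal sketch of inspecting the saved activations afterwards (the work-dir path is a placeholder; layer names depend on the model config):

```python
# Minimal sketch: list the layers and activation shapes stored in model_acts.npz.
import numpy as np

acts = np.load("work_dirs/.../model_acts.npz")  # placeholder path
for layer_name in acts.files:
    print(layer_name, acts[layer_name].shape)
```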
## Debugging and Developing Within OpenSelfSup

Here's a command that allows breakpoints (WARNING: results produced with the debug=true flag SHOULD NOT BE USED -- the flag disables sync batch norm, so they are not comparable to other results):

```bash
# from OpenSelfSup/
# replace with your desired config
python tools/train.py configs/hpt-pretrain/resisc/moco_v2_800ep_basetrain/500-iters.py --work_dir work_dirs/debug --debug
```

diff --git a/references/model_architectures.md b/references/model_architectures.md new file mode 100644 index 0000000..7215889 --- /dev/null +++ b/references/model_architectures.md @@ -0,0 +1,50 @@
#### Key model architectures and terms:
- Supervised training (full dataset)
  - baseline: download the pre-trained models and evaluate without finetuning.
- Supervised training (1k dataset)
  - Supervised: the original ResNet-50 used by Sen12ms
  - Supervised_1x1: a conv1x1 block added to the ResNet-50 used by Sen12ms
- Finetune/transfer learning (1k dataset)
  - Moco: the ResNet-50 used by Sen12ms is initialized with the weights of the Moco backbone
  - Moco_1x1: a conv1x1 block is added to the ResNet-50 used by Sen12ms, and both the input module and the ResNet-50 layers are initialized with the weights from Moco
  - Moco_1x1Rnd: a conv1x1 block is added to the ResNet-50 used by Sen12ms; the ResNet-50 layers are initialized with the weights from Moco, but the input module is initialized with random weights
- Finetune v2 (1k dataset)
  - Freezing the ResNet-50 fully or partially does not seem to help accuracy. We will continue to explore and share the results once we are sure there is no issue with the implementation.

#### Key pretrained models

![[pretraining_loss_comparisions.png]]

Some pretrained models:

**Sensor Augmentation**
- [dainty-dragon-14](https://wandb.ai/cal-capstone/hpt3/runs/b2de56v2) hpt3

(old)
- [vivid-resonance-73](https://wandb.ai/cjrd/BDOpenSelfSup-tools/runs/3qjvxo2p)
- [silvery-oath-7](https://wandb.ai/cal-capstone/hpt2/runs/2rr3864e)
- sen12_crossaugment_epoch_1000.pth: 1000 epochs

**Data Fusion - Augmentation Set 2**
- [(optional fusion) crimson-pyramid-70](https://wandb.ai/cal-capstone/hpt4/runs/2iu8yfs6): 200 epochs
- [(partial fusion) decent-bird-80](https://wandb.ai/cal-capstone/hpt4/runs/yuy7sdav): to replace [(partial fusion) laced-water-61](https://wandb.ai/cal-capstone/hpt4/runs/367tz8vs) and [visionary-lake-62](https://wandb.ai/cal-capstone/hpt4/runs/1srlc7jr) due to consistent kernel size
- [(full fusion, 200 epochs) volcanic-disco-84](https://wandb.ai/cal-capstone/hpt4/runs/21toacw1)
- [(full fusion, 500 epochs) pleasant-moon-88](https://wandb.ai/cal-capstone/hpt4/runs/11yc8up0)
- [(full fusion, 900 epochs) major-sky-90](https://wandb.ai/cal-capstone/hpt4/runs/3l1wwwvo)

- [(full fusion, 200 epochs - 180K) stilted-mountain-91](https://wandb.ai/cal-capstone/hpt4/runs/xcthtqmn) - not evaluated yet


**Data Fusion - Augmentation Set 1**
- [(optional fusion) proud-snowball-86](https://wandb.ai/cal-capstone/hpt4/runs/3lsgncpe)
- [silvery-meadow-88](https://wandb.ai/cal-capstone/hpt4/runs/1jkg2ym0)

**Archived**
- [(full fusion) electric-mountain-33](https://wandb.ai/cal-capstone/hpt4/runs/ak0xdbfu)
- [(partial fusion) visionary-lake-62](https://wandb.ai/cal-capstone/hpt4/runs/1srlc7jr/overview?workspace=user-taeil) should deprecate:
different number of epochs from other pretrained models + + +#### running + +volcacine 128_64 all : gpu 9 +silvery-meadow-88: gpu 7 diff --git a/references/setup.md b/references/setup.md deleted file mode 100644 index 728009e..0000000 --- a/references/setup.md +++ /dev/null @@ -1,160 +0,0 @@ - - -## (optional) GPU instance - -Use `Deep Learning AMI (Ubuntu 18.04) Version 40.0` AMI -- on us-west-2, ami-084f81625fbc98fa4 -- additional disk may be required for data - -Once logged in -``` -# update conda to the latest -conda update -n base conda - -conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch - -``` - -## Installation - -**Dependency repo** -- [modified OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup) -- [modified SEN12MS](https://github.com/Berkeley-Data/SEN12MS) -- [modified irrigation_detection](https://github.com/Berkeley-Data/irrigation_detection) - -```bash -# clone dependency repo on the same levels as this repo and cd into this repo - -# setup environment -conda create -n hpt python=3.7 ipython -conda activate hpt - -# NOTE: if you are not using CUDA 10.2, you need to change the 10.2 in this command appropriately. Make sure to use torch 1.6.0 -# (check CUDA version with e.g. `cat /usr/local/cuda/version.txt`) -# latest -conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch - -# 1.6 torch (no support for torchvision transform on tensor) -conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch -#colorado machine -conda install pytorch==1.2.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch - -# install local submodules -cd OpenSelfSup -pip install -v -e . -``` - -## Data installation - -Installing and setting up all 16 datsets is a bit of work, so this tutorial shows how to install and setup RESISC-45, and provides links to repeat those steps with other datasets. - -### RESISC-45 -RESISC-45 contains 31,500 aerial images, covering 45 scene classes with 700 images in each class. 
- -``` shell -# cd to the directory where you want the data, $DATA -wget -q https://bit.ly/3pfkHYp -O resisc45.tar.gz -md5sum resisc45.tar.gz # this should be 964dafcfa2dff0402d0772514fb4540b -tar xf resisc45.tar.gz - -mkdir ~/data -mv resisc45 ~/data - -# replace/set $DATA and $CODE as appropriate -# e.g., ln -s /home/ubuntu/data/resisc45 /home/ubuntu/hpt/OpenSelfSup/data/resisc45/all -ln -s $DATA/resisc45 $CODE/OpenSelfSup/data/resisc45/all - -e.g., ln -s /home/ubuntu/data/resisc45 /home/ubuntu/hpt/OpenSelfSup/data/resisc45/all -``` - -### Download Pretrained Models -``` shell -cd OpenSelfSup/data/basetrain_chkpts/ -./download-pretrained-models.sh -``` - -## Verify Install With RESISC DataSet -[OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup) - -Check installation by pretraining using mocov2, extracting the model weights, evaluating the representations, and then viewing the results on tensorboard or [wandb](https://wandb.ai/cal-capstone/hpt): - - -```bash -export WANDB_API_KEY= -export WANDB_ENTITY=cal-capstone -export WANDB_PROJECT=hpt2 -#export WANDB_MODE=dryrun - -cd OpenSelfSup - -# Sanity check with single train and single epoch -CUDA_VISIBLE_DEVICES=1 ./tools/single_train.sh configs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep.py --debug - -CUDA_VISIBLE_DEVICES=1 ./tools/single_train.sh /scratch/crguest/OpenSelfSup/configs/selfsup/moco/r50_v2_sen12ms_in_basetrain_20ep.py --work_dir work_dirs/selfsup/moco/r50_v2_sen12ms_in_basetrain_20ep/ --debug - -# Sanity check: MoCo for 20 epoch on 4 gpus -./tools/dist_train.sh configs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep.py 4 - -# if debugging, use -tools/train.py configs/selfsup/moco/r50_v2_resisc_in_basetrain_1ep.py --work_dir work_dirs/selfsup/moco/r50_v2_resisc_in_basetrain_1ep/ --debug - -# make some variables so its clear what's happening -CHECKPOINT=work_dirs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep/epoch_20.pth -BACKBONE=work_dirs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep/epoch_20_moco_in_basetrain.pth -# Extract the backbone -python tools/extract_backbone_weights.py ${CHECKPOINT} ${BACKBONE} - -# Evaluate the representations -./benchmarks/dist_train_linear.sh configs/benchmarks/linear_classification/resisc45/r50_last.py ${BACKBONE} - -# View the results (optional if wandb is not configured) -cd work_dirs -# you may need to install tensorboard -tensorboard --logdir . 
-``` - - -## Verify Install With SEN12MS Dataset -[OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup) - -Check installation by pretraining using mocov2, extracting the model weights, evaluating the representations, and then viewing the results on tensorboard or [wandb](https://wandb.ai/cal-capstone/hpt): - -```bash -export WANDB_API_KEY= -export WANDB_ENTITY=cal-capstone -export WANDB_PROJECT=hpt2 - -cd OpenSelfSup - -# single GPU training -CUDA_VISIBLE_DEVICES=1 ./tools/single_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_basetrain_20ep.py --debug - -CUDA_VISIBLE_DEVICES=1 ./tools/single_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_fulltrain_20ep.py --debug - - -# command for remote debugging, use full path -python /scratch/crguest/OpenSelfSup/tools/train.py /scratch/crguest/OpenSelfSup/configs/selfsup/moco/r50_v2_sen12ms_in_fulltrain_20ep.py --debug - -CUDA_VISIBLE_DEVICES=1 python ./tools/single_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_fulltrain_20ep.py --debug - -# Sanity check: MoCo for 20 epoch on 4 gpus -#CUDA_VISIBLE_DEVICES=0,1,2,3 -CUDA_VISIBLE_DEVICES=1 ./tools/dist_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_basetrain_20ep.py 4 - -# distributed training -#CUDA_VISIBLE_DEVICES=0,1,2,3 -./tools/dist_train.sh configs/selfsup/moco/r50_v2_sen12ms_in_fulltrain_20ep.py 4 - -BACKBONE=work_dirs/selfsup/moco/r50_v2_sen12ms_in_basetrain_20ep/epoch_20_moco_in_baseline.pth -# method 1: from working dir -CHECKPOINT=work_dirs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep/epoch_20.pth -# method 2: from W&B, {projectid}/{W&B run id} -CHECKPOINT=hpt2/3l4yg63k - -# Extract the backbone -python tools/extract_backbone_weights.py ${BACKBONE} ${CHECKPOINT} - -# Evaluate the representations -./benchmarks/dist_train_linear.sh configs/benchmarks/linear_classification/resisc45/r50_last.py ${BACKBONE} - -``` \ No newline at end of file diff --git a/references/setup_pretraining.md b/references/setup_pretraining.md new file mode 100644 index 0000000..e9e2ae3 --- /dev/null +++ b/references/setup_pretraining.md @@ -0,0 +1,125 @@ + + +## (optional) GPU instance + +Use `Deep Learning AMI (Ubuntu 18.04) Version 40.0` AMI +- on us-west-2, ami-084f81625fbc98fa4 +- additional disk may be required for data + +Once logged in +``` +# update conda to the latest +conda update -n base conda + +conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch + +``` + +## Installation + +**Dependency repo** +- [modified OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup) +- [modified SEN12MS](https://github.com/Berkeley-Data/SEN12MS) +- [modified irrigation_detection](https://github.com/Berkeley-Data/irrigation_detection) + +```bash +# clone dependency repo on the same levels as this repo and cd into this repo + +# setup environment +conda create -n hpt python=3.7 ipython +conda activate hpt + +# NOTE: if you are not using CUDA 10.2, you need to change the 10.2 in this command appropriately. Make sure to use torch 1.6.0 +# (check CUDA version with e.g. `cat /usr/local/cuda/version.txt`) + +# latest torch +conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch + +# 1.6 torch (no support for torchvision transform on tensor) +conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch + +#llano machine +conda install pytorch==1.2.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch + +# install local submodules +cd OpenSelfSup +pip install -v -e . 
+``` + +## Data installation + +Installing and setting up all 16 datsets is a bit of work, so this tutorial shows how to install and setup RESISC-45, and provides links to repeat those steps with other datasets. + +### RESISC-45 +RESISC-45 contains 31,500 aerial images, covering 45 scene classes with 700 images in each class. + +``` shell +# cd to the directory where you want the data, $DATA +wget -q https://bit.ly/3pfkHYp -O resisc45.tar.gz +md5sum resisc45.tar.gz # this should be 964dafcfa2dff0402d0772514fb4540b +tar xf resisc45.tar.gz + +mkdir ~/data +mv resisc45 ~/data + +# replace/set $DATA and $CODE as appropriate +# e.g., ln -s /home/ubuntu/data/resisc45 /home/ubuntu/OpenSelfSup/data/resisc45/all +ln -s $DATA/resisc45 $CODE/OpenSelfSup/data/resisc45/all + +e.g., ln -s /home/ubuntu/data/resisc45 /home/ubuntu/hpt/OpenSelfSup/data/resisc45/all +``` + +### Download Pretrained Models +``` shell +tools/download-pretrained-models.sh +mkdir OpenSelfSup/data/basetrain_chkpts +mv +``` + +## Verify Install With RESISC DataSet +[OpenSelfSup](https://github.com/Berkeley-Data/OpenSelfSup) + +Check installation by pretraining using mocov2, extracting the model weights, evaluating the representations, and then viewing the results on tensorboard or [wandb](https://wandb.ai/cal-capstone/hpt): + + +```bash +cd OpenSelfSup + +CUDA_VISIBLE_DEVICES=0,1,2,3 + +# Sanity check with single train and single epoch +./tools/single_train.sh configs/selfsup/moco/r50_v2_resisc_in_basetrain_1ep.py --debug + +# Sanity check: MoCo for 20 epoch on 4 gpus + ./tools/dist_train.sh configs/selfsup/moco/r50_v2_resisc_in_basetrain_20ep.py 4 +``` + + +## setup sub-modules for sen12ms and openselfsup repo + +Cloning +```console +git clone --recurse-submodules https://github.com/Berkeley-Data/hpt.git + +``` + +or alternatiely +``` +git submodule init +git submodule update +``` + +additional config +``` +git config push.recurseSubmodules on-demand +# show status including submodule +git config status.submodulesummary 1 +``` + +update +``` +git submodule update --remote +``` + +For mroe info: [7.11 Git Tools - Submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules) + \ No newline at end of file