- This repository contains the work done as part of a project-based course under Dr. Pratik Narang, CSIS, BITS Pilani.
- Adversarial Camouflage Review Paper.pdf – This review paper was written in partial fulfilment of the selection process for this course.
Visual Question Answering (VQA) on the FloodNet dataset with multiple architectures.
- Task
  - Visual question answering on the FloodNet Challenge @ EARTHVISION 2021 – Track 2 dataset.
  - Given a flood-scene image and a natural-language question, the model predicts a categorical answer (e.g., a condition label such as flooded or non flooded, a count, or yes/no).
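Because the answer is categorical, each (question, answer) pair can be reduced to a sequence of token ids plus a class index. The following is a minimal sketch of that encoding in plain Python. The helper names (`tokenize`, `build_vocab`, `encode_question`, `build_answer_index`), the reserved ids and `max_len` are illustrative assumptions, not taken from the notebooks.

```python
import re

PAD, UNK = 0, 1  # reserved token ids (hypothetical convention)

def tokenize(question):
    """Lowercase a question and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())

def build_vocab(questions):
    """Map each distinct token to an integer id, reserving PAD and UNK."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for q in questions:
        for tok in tokenize(q):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_question(question, vocab, max_len):
    """Convert a question to a fixed-length list of token ids (truncate or pad)."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(question)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

def build_answer_index(answers):
    """Assign one class index per distinct answer string."""
    return {a: i for i, a in enumerate(sorted(set(answers)))}

# Questions and answers taken from the examples in this README.
questions = [
    "what is the overall condition of the given image?",
    "is the entire road non flooded?",
    "how many buildings are in the image?",
]
answers = ["non flooded", "No", "16"]

vocab = build_vocab(questions)
answer_index = build_answer_index(answers)
encoded = encode_question(questions[1], vocab, max_len=10)
```

With this encoding, counting answers such as `16` are treated as ordinary class labels, which matches the classification framing described above.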
- Key Result
  - Implemented and evaluated multiple architectures for the VQA task, including VGG–LSTM, parallel co-attention and LXMERT models. The best model (parallel co-attention) reached 90.25% test accuracy on the released FloodNet dataset.
- Basic VGG–LSTM VQA (Single-Stream Attention)
  - CNN backbone: VGG-based visual feature extractor on FloodNet images.
  - Text encoder: LSTM over tokenized questions.
  - Fusion: concatenation / dense layers over image and question embeddings for answer classification.
  - Model diagram: basic_VGG_LSTM_floodnet_vqamodel.png – Basic VGG–LSTM VQA pipeline.
  - Model weights: Basic/BaselineModel/best.hdf5, Basic1/BaselineModel/best.hdf5
  - Implementation: Basic_Modeling.ipynb
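The fusion step above can be illustrated with a dependency-free sketch: concatenate the image and question embeddings, pass them through a dense layer, and apply a softmax over the answer classes. This is not the code from Basic_Modeling.ipynb. The layer sizes, the ReLU activation, the random weights and the function names are all assumptions chosen for illustration, and the random vectors stand in for real VGG features and LSTM states.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init(rows, cols, rng):
    """Small random weight matrix (rows x cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def classify(image_emb, question_emb, params):
    """Concatenate both embeddings, then dense -> ReLU -> dense -> softmax."""
    joint = image_emb + question_emb  # list concatenation = feature concatenation
    hidden = relu(dense(joint, params["W1"], params["b1"]))
    return softmax(dense(hidden, params["W2"], params["b2"]))

rng = random.Random(0)
IMG_DIM, Q_DIM, HIDDEN, NUM_ANSWERS = 8, 6, 5, 4  # toy sizes (hypothetical)
params = {
    "W1": init(HIDDEN, IMG_DIM + Q_DIM, rng), "b1": [0.0] * HIDDEN,
    "W2": init(NUM_ANSWERS, HIDDEN, rng), "b2": [0.0] * NUM_ANSWERS,
}
image_emb = [rng.random() for _ in range(IMG_DIM)]    # stand-in for VGG features
question_emb = [rng.random() for _ in range(Q_DIM)]   # stand-in for the LSTM output
probs = classify(image_emb, question_emb, params)     # one probability per answer class
```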
- Parallel Co-Attention VQA
  - Jointly models image regions and question words with parallel co-attention.
  - Learns attention maps over both modalities to focus on the most relevant visual areas and question components.
  - Achieves the best performance (90.25% test accuracy) on FloodNet among the implemented models.
  - Model diagram: parallel_coattention_VGG_emb512_floodnet_vqamod.png – Parallel co-attention model with VGG image encoder and 512-dim joint embedding.
  - Implementation: Parallel_CoAttention_Modeling.ipynb
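To make the idea of attending over both modalities concrete, here is a sketch of one common formulation of parallel co-attention: an affinity matrix between question words and image regions, followed by a softmax attention map over each modality. The notebook may implement this differently. The parameter names (`Wb`, `Wv`, `Wq`, `w_hv`, `w_hq`), the toy dimensions and the random inputs are assumptions, and plain Python lists are used only so the sketch runs without dependencies.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def tanh_of_sum(a, b):
    """Element-wise tanh(a + b) for two same-shaped matrices."""
    return [[math.tanh(x + y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, rows):
    """Attention-weighted sum of feature rows."""
    return [sum(w * row[j] for w, row in zip(weights, rows))
            for j in range(len(rows[0]))]

def parallel_coattention(V, Q, p):
    """V: N regions x d, Q: T words x d. Returns attention maps and attended vectors."""
    # Affinity between every question word and every image region (T x N).
    C = [[math.tanh(x) for x in row]
         for row in matmul(matmul(Q, p["Wb"]), transpose(V))]
    VW = matmul(V, p["Wv"])                          # N x k
    QW = matmul(Q, p["Wq"])                          # T x k
    Hv = tanh_of_sum(VW, matmul(transpose(C), QW))   # image side sees the question
    Hq = tanh_of_sum(QW, matmul(C, VW))              # question side sees the image
    a_v = softmax([sum(h * w for h, w in zip(row, p["w_hv"])) for row in Hv])
    a_q = softmax([sum(h * w for h, w in zip(row, p["w_hq"])) for row in Hq])
    return a_v, a_q, weighted_sum(a_v, V), weighted_sum(a_q, Q)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N, T, D, K = 6, 4, 5, 3  # toy sizes (hypothetical); the model above uses a 512-dim embedding
params = {
    "Wb": rand_matrix(D, D, rng),
    "Wv": rand_matrix(D, K, rng),
    "Wq": rand_matrix(D, K, rng),
    "w_hv": [rng.uniform(-0.5, 0.5) for _ in range(K)],
    "w_hq": [rng.uniform(-0.5, 0.5) for _ in range(K)],
}
V = rand_matrix(N, D, rng)  # stand-in for VGG region features
Q = rand_matrix(T, D, rng)  # stand-in for question word features
a_v, a_q, v_hat, q_hat = parallel_coattention(V, Q, params)
```

Here `a_v` and `a_q` are the attention maps over regions and words, and `v_hat` and `q_hat` are the attended feature vectors that a downstream classifier would combine.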
The table below compares the performance of different model architectures on the FloodNet VQA task:
| Model | Optimizer | Scheduler | Batch Size | Val Loss | Val Acc | Test Loss | Test Acc | Dataset Split |
|---|---|---|---|---|---|---|---|---|
| VGG+LSTM_1 | Adam | No | 64 | -- | 67.85% | -- | -- | FloodNet (8:2) |
| VGG+LSTM_2 | AdaBound | No | 16 | 0.7334 | 80.27% | 0.5448 | 84.05% | FloodNet (7:1:2) |
| VGG+LSTM_2 (SGD) | SGD | No | 16 | 0.5850 | 86.70% | 0.4444 | 88.82% | FloodNet (7:1:2) |
| VGG(448×448)+Parallel-CoAttention | Adam | Exp decay | 16 | 0.5818 | 88.25% | 0.4830 | 90.25% | FloodNet (7:1:2) |
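The Dataset Split column refers to train/validation/test ratios. A minimal sketch of a reproducible 7:1:2 split is shown below. The function name, the seed and the choice to split a flat list of sample ids are assumptions, since the source does not say whether the split is made per image or per question.

```python
import random

def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test

sample_ids = list(range(100))           # hypothetical sample identifiers
train, val, test = split_dataset(sample_ids)
```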
Training details:
- All models used early stopping and model checkpointing based on validation metrics.
- The Parallel Co-Attention model used higher-resolution inputs (448×448) than the basic models.
- The best test accuracy, 90.25%, was achieved by the Parallel Co-Attention architecture.
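The training controls mentioned above (exponential learning-rate decay and early stopping with checkpointing) can be sketched without any framework. The hyperparameters, the validation losses and the choice to monitor validation loss are illustrative assumptions. The source states only that validation metrics were used.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` steps: initial_lr * decay_rate^(step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

class EarlyStopping:
    """Stop when the monitored validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = None
        self.wait = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best:
            # Improvement: this is where a checkpoint of the weights would be saved.
            self.best, self.best_epoch, self.wait = val_loss, epoch, 0
            return False
        self.wait += 1
        return self.wait >= self.patience

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65]  # made-up values for illustration
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    lr = exponential_decay(1e-3, 0.9, 1, epoch)    # hypothetical schedule settings
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

In this toy run the best loss occurs at epoch 2, so the checkpoint from that epoch would be the one kept after training stops.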
The following examples are taken from the basic VGG–LSTM baseline using the three images shown below (in the same order as listed here).
- Question: what is the overall condition of the given image?
  - Ground-truth answer: non flooded
  - Top predicted answers: non flooded – 99.079765; flooded – 0.9049095; flooded,non flooded – 0.0145151122– 0.00012647729; Yes – 0.00009099581
- Question: is the entire road non flooded?
  - Ground-truth answer: No
  - Top predicted answers: No – 97.32986; Yes – 2.59499488– 0.070233216– 0.00173389881– 0.00057414506
- Question: what is the condition of road?
  - Ground-truth answer: non flooded
  - Top predicted answers: non flooded – 99.98999; flooded,non flooded – 0.004487529; flooded – 0.0028996274; No – 0.00255007132– 0.00003084845
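The scores listed with each predicted answer above appear to be class probabilities expressed as percentages, since they sum to roughly 100. A minimal sketch of how such a top-5 ranking can be produced from classifier logits follows. The label list, the logit values and the function names are made up for illustration.

```python
import math

def softmax_percent(logits):
    """Convert logits to probabilities expressed as percentages."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]

def top_k_answers(logits, labels, k=5):
    """Return the k (label, percentage) pairs with the highest scores."""
    scores = softmax_percent(logits)
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

labels = ["flooded", "non flooded", "flooded,non flooded", "Yes", "No", "2"]
logits = [1.0, 6.0, -2.0, -4.0, -3.0, -5.0]  # hypothetical classifier outputs
top = top_k_answers(logits, labels, k=5)
```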
The following examples demonstrate the parallel co-attention model's performance on various question types.
- Question: what is the overall condition of the given image?
  - Ground-truth answer: non flooded
  - Top predicted answers: non flooded – 99.925476; flooded – 0.06995208; flooded,non flooded – 0.00249820091– 0.000760826342– 0.00061414845
- Question: what is the condition of road?
  - Ground-truth answer: non flooded
  - Top predicted answers: non flooded – 99.89973; flooded – 0.0904831; flooded,non flooded – 0.005562062– 0.0017628605; No – 0.00074605073
- Question: how many buildings are in the image?
  - Ground-truth answer: 16
  - Top predicted answers: 7 – 10.11590813– 8.8536484– 8.07526416– 6.86858123– 6.1892276
- Question: is the entire road non flooded?
  - Ground-truth answer: Yes
  - Top predicted answers: Yes – 98.16781; No – 1.76087762– 0.0167330843– 0.0113962138– 0.00597608
- FloodNet Challenge @ EARTHVISION 2021 – Track 2
  - Flooded urban and semi-urban scenes captured from UAVs.
  - Question types include condition recognition, counting, and yes/no.
To reproduce experiments, you need access to the official FloodNet dataset and must respect its usage/license terms.






