Hi, 12-in-1 is a very interesting work based on ViLBERT.
However, I am confused about the extract_features_from_gt.py script.
In the README under the data/ directory, you say that to extract features, users should first convert all ground-truth annotations to the following format:
{
{
'file_name': 'name_of_image_file',
'file_path': '<path_to_image_file_on_your_disk>',
'bbox': array([
[ x1, y1, width1, height1],
[ x2, y2, width2, height2],
...
]),
'num_box': 2
},
....
}
However, I notice that the extract_features_from_gt.py script does not convert the boxes from xywh to xyxy format, so the features would be extracted from the wrong regions.
I am not sure whether this is a deliberate design choice or a bug.
Further, if this is a bug, what about the features used in ViLBERT and 12-in-1? Were they extracted using the correct bounding boxes?
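
For reference, this is roughly the conversion I expected to find before the boxes are passed to the detector. It is only a minimal sketch: the function name `xywh_to_xyxy` is my own and is not taken from the repository, and I am assuming the boxes use continuous coordinates (x2 = x + width), whereas some pipelines use x + width - 1 instead.

```python
import numpy as np


def xywh_to_xyxy(boxes):
    """Convert [x, y, width, height] boxes to [x1, y1, x2, y2].

    Hypothetical helper, not part of the repository.
    Assumes x2 = x + width and y2 = y + height.
    """
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    xyxy = boxes.copy()
    xyxy[:, 2] = boxes[:, 0] + boxes[:, 2]  # x2 = x + width
    xyxy[:, 3] = boxes[:, 1] + boxes[:, 3]  # y2 = y + height
    return xyxy


# Two boxes in the README's xywh format
example = xywh_to_xyxy([[10, 20, 30, 40], [0, 0, 5, 5]])
print(example)
```

If the script already does this somewhere else and I missed it, please point me to it.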