Detection false positives? #4
Open
When evaluating whether an input is adversarial, the framework first checks whether the model's classification matches the ground-truth label; only if it does not does it go on to check the Lp distortion bound and then the detection mechanism. Only when all three conditions hold (misclassified, within the norm bound, and not flagged by the detector) does the framework count the input as a successful adversarial example.
selfstudy-adversarial-robustness/common/framework.py
Lines 217 to 234 in 15d1c01
```python
def evaluate(self, defense, example_idx, true_label,
             src_example, adv_example,
             src_pred, adv_pred,
             src_detector, adv_detector):
    # Verify that the label is now incorrect
    if np.argmax(adv_pred) == true_label:
        return False, "Label {} matches true label {}".format(np.argmax(adv_pred), true_label)
    # Verify that example is within the allowed Lp norm
    distortion = np.linalg.norm((src_example - adv_example).flatten(), ord=self.norm)
    if distortion > self.threshold + 1e-3:
        return False, "Distortion {} exceeds bound {}".format(distortion, self.threshold)
    # Verify that it's not detected as adversarial
    if adv_detector > defense.threshold:
        return False, "Adversarial example rejected by detector with score {}.".format(adv_detector)
    return True, None
```
My expectation was that correctly classified inputs also ought to be rejected if they trip the detector, but because L223 returns early this can never happen. This is particularly pronounced in the transform defense where a non-trivial majority of the benign inputs would be rejected by the "stable prediction" detector. Is this intentional? It’s a little weird to force the attacker to defeat some objective that the defender can almost never achieve.
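For concreteness, here is a minimal sketch of the expected behavior, with the detector check moved ahead of the label check so that detected inputs are rejected regardless of classification. This is a hypothetical reordering for illustration, not the repository's actual code; the function name and the standalone signature are invented, and the distortion check is omitted to focus on the ordering question.

```python
import numpy as np

def evaluate_detector_first(adv_pred, adv_detector, true_label, detector_threshold):
    """Hypothetical check order: the detector runs first, so it can
    reject an input even when the classification is correct."""
    # Reject anything the detector flags, regardless of the predicted label.
    if adv_detector > detector_threshold:
        return False, "Rejected by detector with score {}".format(adv_detector)
    # Only undetected inputs can count as adversarial.
    if np.argmax(adv_pred) == true_label:
        return False, "Label matches true label {}".format(true_label)
    return True, None
```

Under this ordering, a correctly classified benign input that trips the detector would be counted as rejected rather than short-circuiting at the label check.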