Detection false positives? #4

@dxoigmn

Description
When evaluating whether inputs are adversarial, the framework first checks whether the classification of the input matches the ground-truth label. Only if it does not match does it go on to check the distortion bound and the detection mechanism, and only when all three conditions are satisfied does the framework consider the input adversarial.

```python
def evaluate(self, defense, example_idx, true_label,
             src_example, adv_example,
             src_pred, adv_pred,
             src_detector, adv_detector):
    # Verify that the label is now incorrect
    if np.argmax(adv_pred) == true_label:
        return False, "Label {} matches true label {}".format(np.argmax(adv_pred), true_label)
    # Verify that example is within the allowed Lp norm
    distortion = np.linalg.norm((src_example - adv_example).flatten(), ord=self.norm)
    if distortion > self.threshold + 1e-3:
        return False, "Distortion {} exceeds bound {}".format(distortion, self.threshold)
    # Verify that it's not detected as adversarial
    if adv_detector > defense.threshold:
        return False, "Adversarial example rejected by detector with score {}.".format(adv_detector)
    return True, None
```

My expectation was that correctly classified inputs also ought to be rejected if they trip the detector, but because the label check at L223 returns early, the detector is never consulted in that case. This is particularly pronounced in the transform defense, where a non-trivial majority of the benign inputs would be rejected by the "stable prediction" detector. Is this intentional? It seems odd to force the attacker to defeat an objective that the defender can almost never achieve.
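To make the ordering issue concrete, here is a minimal standalone sketch of the behavior being described. The function name and the toy inputs are hypothetical, not from the framework; it simply reorders the checks so the detector is consulted before the label comparison, which is the behavior the paragraph above says one might expect:

```python
import numpy as np

def evaluate_detector_first(adv_pred, true_label, adv_detector, detector_threshold):
    """Hypothetical reordering of the checks: consult the detector first,
    so an input that trips the detector is rejected even when its
    classification happens to match the ground-truth label."""
    if adv_detector > detector_threshold:
        return False, "rejected by detector"
    if np.argmax(adv_pred) == true_label:
        return False, "label matches true label"
    return True, None

# A correctly classified input whose detector score (0.8) exceeds the
# threshold (0.5): with the original ordering, the label check would
# return early and the detector would never fire; here it is rejected.
ok, reason = evaluate_detector_first(np.array([0.1, 0.9]), 1, 0.8, 0.5)
print(ok, reason)  # False rejected by detector
```

This is only an illustration of the early-return semantics in question, not a proposed patch.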
