Detection by Attack: Detecting Adversarial Samples by Undercover Attack
Abstract
The safety of artificial intelligence systems has raised great concern due to the vulnerability of deep neural networks. Studies show that malicious modifications to the inputs of a network classifier can fool the classifier and lead to wrong predictions. These modified inputs are called adversarial samples. To address this challenge, this paper proposes a novel and effective framework called Detection by Attack (DBA), which detects adversarial samples by means of an Undercover Attack. DBA converts the difficult adversarial detection problem into a simpler attack problem, an idea inspired by espionage techniques: it appears to attack the system, but it is actually defending the system. A review of the literature shows that this paper is the first attempt to introduce a detection method that can effectively detect adversarial samples in both images and texts. Experimental results show that the DBA scheme yields state-of-the-art detection performance in both detector-unaware (95.66% detection accuracy on average) and detector-aware (2.10% attack success rate) scenarios. Furthermore, DBA is robust to the perturbation size and confidence of adversarial samples.