Defending against neural network model stealing attacks using deceptive perturbations
Machine learning architectures are readily available, but obtaining high-quality labeled data for training is costly. An attacker can query pre-trained models offered as cloud services to generate this labeled data and then use it to train a replica, effectively stealing the model. Limiting the information provided by cloud-based models by omitting class probabilities has been proposed as a means of protection, but it significantly reduces the utility of the models. In this work, we illustrate how cloud-based models can still provide useful class probability information to users while significantly limiting the ability of an adversary to steal the model. Our defense perturbs the model's final activation layer, slightly altering the output probabilities. This forces the adversary to discard the class probabilities and therefore to issue significantly more queries before they can train a model with comparable performance. We evaluate our defense under diverse scenarios and against defense-aware attacks. Our evaluation shows that the defense can reduce the accuracy of the stolen model by at least 20%, or increase the number of queries required by an adversary 64-fold, all with a negligible decrease in the protected model's accuracy.
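To make the idea of perturbing the final activation layer concrete, the following is a minimal sketch under stated assumptions, not the perturbation evaluated in this work. It uses a simple stand-in (raising each probability to a power and renormalizing) chosen only because it distorts the probability values while provably keeping the top-1 class unchanged. The function names `softmax` and `perturb_probabilities` and the parameter `flatten` are hypothetical and introduced purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perturb_probabilities(probs, flatten=0.5):
    """Return a distorted probability vector with the same top-1 class.

    Illustrative stand-in only (an assumption, not the paper's function):
    each probability is raised to the power `flatten` (0 < flatten <= 1)
    and the result is renormalized. Because x -> x ** flatten is strictly
    increasing on [0, 1], the class ranking, and hence the predicted
    label, is preserved, while the reported confidences are altered.
    """
    if not 0.0 < flatten <= 1.0:
        raise ValueError("flatten must be in the interval (0, 1]")
    powered = [p ** flatten for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


if __name__ == "__main__":
    clean = softmax([2.0, 1.0, 0.1])        # what the unprotected model would return
    served = perturb_probabilities(clean)   # what the protected service returns
    print("clean :", clean)
    print("served:", served)
```

A legitimate user who only needs the predicted label, or a rough sense of confidence, still gets a usable answer from `served`. An adversary training a replica on these values is fitting distorted targets. Note that a fixed monotone transform like this one could be inverted by an attacker who knows it, so it should be read as a demonstration of the interface rather than as a robust defense.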