Comparing models

Different models can make the same prediction but arrive at their answers through very different reasoning. Here, we use influence functions to compare two different models – the Inception network [3] and a radial basis function SVM – trained on the same image classification task.

Adversarial training examples

Models that concentrate a lot of influence on a few training points can be vulnerable to perturbations of those points, posing a serious security risk in real-world ML systems where attackers can influence the training data.

Influence functions can be used to craft adversarial training images that are visually indistinguishable from the originals and yet flip a model's prediction on a separate test image. In contrast, recent work has generated adversarial test images that are visually indistinguishable from real images but fool a classifier [4]. To the best of our knowledge, this is the first proof-of-concept that visually indistinguishable training-set attacks can be executed on otherwise highly accurate neural networks.

Approach

We want to understand the effect of each training point on the model. Formally, how would the predictions change if we put an additional weight $\epsilon$ on a training point $z$?

Consider an empirical risk minimization setting:

Training points: $z_1, \dots, z_n$, with $z_i = (x_i, y_i)$. Loss: $L(z, \theta)$. Params: $\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$.

Let $\hat{\theta}_{\epsilon, z} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$ be the parameters after upweighting $z$ by $\epsilon$. We can compute the influence function:
$$\mathcal{I}_{\text{up,params}}(z) = \frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}),$$

where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian of the empirical risk. This tells us how much the model parameters change with the upweighting.
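To make this concrete, here is a minimal sketch for binary logistic regression with labels in $\{-1, +1\}$, where the gradient and Hessian have closed forms. It assumes a trained parameter vector `theta` and a training matrix `X` of shape (n, d); forming the Hessian explicitly is only practical when d is small, and the function names and damping value are ours, not the paper's released code.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_loss(x, y, theta):
    # Gradient of L(z, theta) = log(1 + exp(-y * theta^T x)), with y in {-1, +1}.
    return -y * sigmoid(-y * np.dot(theta, x)) * x

def empirical_hessian(X, theta, damping=1e-2):
    # H = (1/n) sum_i s_i (1 - s_i) x_i x_i^T, with damping added for invertibility.
    s = sigmoid(X @ theta)
    H = (X * (s * (1.0 - s))[:, None]).T @ X / X.shape[0]
    return H + damping * np.eye(X.shape[1])

def influence_up_params(x, y, X, theta):
    # I_up,params(z) = -H^{-1} grad_theta L(z, theta)
    H = empirical_hessian(X, theta)
    return -np.linalg.solve(H, grad_loss(x, y, theta))
```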

We can use this to calculate the effect of the training point $z$ on the loss at any test point $z_{\text{test}}$:
$$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}).$$
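Continuing the sketch above with the same assumptions, the influence on a test point's loss is just the test gradient dotted with an inverse-Hessian-vector product:

```python
import numpy as np

def influence_up_loss(g_test, g_train, H):
    # I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z)
    # g_test, g_train: outputs of grad_loss(...) above; H: the damped empirical Hessian.
    return -g_test @ np.linalg.solve(H, g_train)
```

A negative value means that upweighting the training point decreases the test loss (a helpful point); a large positive value flags a harmful one.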

If we instead modified the value of $z = (x, y)$ to $z_\delta = (x + \delta, y)$, we can derive an analogous expression:
$$\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})^\top = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_x \nabla_\theta L(z, \hat{\theta}).$$

This tells us how much the loss at the test point $z_{\text{test}}$ changes as we change the value of $x$.

We can evaluate these expressions efficiently using second-order optimization techniques, such as conjugate-gradient methods or stochastic Hessian inversion [2]. Altogether, this gives us a computationally tractable way of evaluating the effect of training-set perturbations on the model, without needing to retrain it.
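For models with many parameters, $H_{\hat{\theta}}$ is never formed explicitly. Below is a minimal conjugate-gradient sketch that approximately solves $H s = b$ using only a Hessian-vector-product oracle `hvp(v)`; the stochastic estimator of [2] is an alternative not shown here. The function name, iteration budget, and tolerance are our illustrative choices.

```python
import numpy as np

def conjugate_gradient(hvp, b, iters=100, tol=1e-10):
    # Approximately solve H x = b, where hvp(v) returns H v (ideally with a
    # small damping term added so that H is positive definite).
    x = np.zeros_like(b)
    r = b - hvp(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Usage sketch:
#   s_test = conjugate_gradient(hvp, grad_loss(x_test, y_test, theta))
#   The influence of z_i on z_test is then approximately
#   -s_test @ grad_loss(x_i, y_i, theta).
```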

Here, we assumed that the loss is differentiable and convex. We deal with the non-differentiable and non-convex cases in our preprint.

Understanding Black-box Predictions via Influence Functions
Pang Wei Koh and Percy Liang

Department of Computer Science, Stanford University

References
[1] Cook and Weisberg, Technometrics, 1980.
[2] Agarwal et al., arXiv, 2016.
[3] Szegedy et al., CVPR, 2016.
[4] Goodfellow et al., ICLR, 2015.

For more details, please read our preprint at https://arxiv.org/abs/1703.04730.

Overview

We use influence functions – a classic technique from robust statistics – to trace a model's prediction through the learning algorithm and back to its training data, identifying the points most responsible for a given prediction. On linear models and ConvNets, we show that influence functions can be used to understand model behavior, debug models and detect dataset errors, and even identify and exploit vulnerabilities to adversarial training-set attacks.

Given a machine learning model, we might want to ask: “Why did it make this prediction?” Unfortunately, high-performing models, like deep neural networks, are often complicated black-boxes whose predictions are hard to explain.

We tackle this question by tracing a model's predictions through its learning algorithm and back to its training data, where the model parameters ultimately come from. To formalize the impact of a training point on a prediction, we ask the counterfactual: what would happen if we removed this training point, or if its values were changed slightly?

We show how to answer this counterfactual with influence functions, a classic technique from robust statistics [1] that tells us how the model parameters change as we upweight a training point by an infinitesimal amount. Using second-order optimization techniques, we can efficiently approximate influence in any differentiable black-box model, only requiring oracle access to gradients and Hessian-vector products.
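As a sketch of the oracle access this requires, here is a Hessian-vector product computed by double backpropagation in PyTorch (any autodiff framework would do). The function name and the optional damping term, useful for non-convex models, are our additions; `v` is assumed to be a list of tensors shaped like the model parameters.

```python
import torch

def hessian_vector_product(loss, params, v, damping=0.0):
    # First backward pass: grad_theta L, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Differentiate the inner product (grad^T v) w.r.t. the parameters to obtain H v.
    inner = sum((g * vi).sum() for g, vi in zip(grads, v))
    hvp = torch.autograd.grad(inner, params)
    return [h + damping * vi for h, vi in zip(hvp, v)]
```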

Fixing noisy labels

Labels in the real world are noisy. Human experts can sometimes fix them, but in most applications the datasets are too large to review manually. By identifying the training points that most influence the model, we can help experts prioritize which examples to check.
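A minimal sketch of this prioritization, the ranking behind Fig 2: score each training point by its self-influence $\mathcal{I}_{\text{up,loss}}(z_i, z_i) = -\nabla_\theta L(z_i, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta})$ and review the highest-magnitude points first. The per-example gradients and the `inverse_hvp` routine are assumed to come from helpers like those sketched in the Approach section; the names are ours.

```python
import numpy as np

def prioritize_by_self_influence(per_example_grads, inverse_hvp):
    # per_example_grads: list of grad_theta L(z_i, theta_hat), one per training point.
    # inverse_hvp: function v -> H^{-1} v (e.g. conjugate_gradient above).
    scores = np.array([g @ inverse_hvp(g) for g in per_example_grads])
    # Larger g^T H^{-1} g means larger |I_up,loss(z_i, z_i)|: check those labels first.
    return np.argsort(-scores)
```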

Fig 1. Bottom left: Scatter plot of influence vs. distance. Green dots are fish and red dots are dogs (the other class in this classification problem). Bottom right: The two most helpful training images, for each model, on the test image.

Fig 2. On an email spam classification dataset, we randomly flipped 10% of the training labels and used different algorithms to prioritize which training labels to check. Prioritizing with influence functions recovered significantly more accuracy for a fixed fraction of training points checked than baselines such as flagging the training points with the highest loss.

Fig 3. A small perturbation to just one training example can flip the test prediction. This was carried out on a dog vs. fish image classification task, with n=1,800 training images, and using logistic regression run on the top layer of the Inception network [3].
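The attack behind Fig 3 repeatedly nudges a single training image in the direction given by $\mathcal{I}_{\text{pert,loss}}$, retraining the model between steps. Below is a hedged PyTorch sketch of one such step. It assumes a differentiable `model`, a `loss_fn`, a batched training example `(x, y)` with pixels in [0, 1], and a precomputed `s_test` $= H_{\hat{\theta}}^{-1} \nabla_\theta L(z_{\text{test}}, \hat{\theta})$ stored as a list of tensors shaped like the parameters; the step size, clamping range, and all names are our illustrative choices rather than the authors' released code.

```python
import torch

def attack_step(model, loss_fn, x, y, s_test, step_size=0.02):
    # One perturbation step on training input x that aims to *increase* the loss
    # on the target test point, following the sign of I_pert,loss.
    params = [p for p in model.parameters() if p.requires_grad]
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # s_test^T grad_theta L(z, theta); its gradient w.r.t. x equals -I_pert,loss,
    # so we step against its sign to raise the test loss.
    inner = sum((g * s).sum() for g, s in zip(grads, s_test))
    grad_x = torch.autograd.grad(inner, x)[0]
    x_new = x.detach() - step_size * torch.sign(grad_x)
    return torch.clamp(x_new, 0.0, 1.0)  # keep a valid pixel range
```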