Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Mandoline: Model Evaluation under Distribution ShiftMayee Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, Christopher Ré
Stanford University
Motivation
Q: How do we evaluate model performance during deployment?
● Model’s deployment setting ≠ training setting
Motivation
Q: How do we evaluate model performance during deployment?
Source data distribution
evalu
ate
Model
Training set Validation set
Labeled
learn
● Model’s deployment setting ≠ training setting
Motivation
Q: How do we evaluate model performance during deployment?
Source data distribution
evalu
ate
Model
Target data distributionTraining set Validation set
Mandoline: user-guided framework for evaluation under distribution shiftLabeled Unlabeled
learn deploy
Distribution shift
evaluate?
● Model’s deployment setting ≠ training setting
Common approach: importance weightingSource data distribution Target data distribution
Common approach: importance weightingSource data distribution Target data distribution
Density ratio to estimate
● High dimensional data : harder to compute
Common approach: importance weighting
Problems:
● Support shift - what if ?
Source data distribution Target data distribution
Density ratio to estimate
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Source Accuracy: 91%
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
I love eating ice-cream.
He loved walking on the beach.
He didn’t like drinking coffee.
(Source) Labeled Validation Set
-1
Slices
-1
1
-1
1
1
1
1
-1
Model
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Source Accuracy: 91%
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
He does not love eating scones.
He loves taking risks.
She likes drinking coffee.
(Target) Unlabeled Test Set Slices
1
-1
-1
1
1
-1
1
1
-1
I love eating ice-cream.
He loved walking on the beach.
He didn’t like drinking coffee.
(Source) Labeled Validation Set
-1
Slices
-1
1
-1
1
1
1
1
-1
Model
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Source Accuracy: 91% Target Accuracy:
84%
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
He does not love eating scones.
He loves taking risks.
She likes drinking coffee.
(Target) Unlabeled Test Set Slices
1
-1
-1
1
1
-1
1
1
-1
I love eating ice-cream.
He loved walking on the beach.
He didn’t like drinking coffee.
(Source) Labeled Validation Set
-1
Slices
-1
1
-1
1
1
1
1
-1
Model
Mandoline
Results
Prop 1: if k slices capture all “relevant” distributional shift
between and , then reweighting with recovers .
● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!
● Dimensionality: reduce from d → k
Results
● How to compute ? Use any density ratio estimation method on g(x)○ Kullback-Leibler Importance Estimation Procedure (KLIEP)
■ Extend to correct for noisily-defined, incomplete slices
Prop 1: if k slices capture all “relevant” distributional shift
between and , then reweighting with recovers .
● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!
● Dimensionality: reduce from d → k
Results
Results
CelebA SNLI → MNLI
Importance weighting on x
On g(x)
Summary
Thank you! Contact: [email protected]
Model evaluation under distribution shift:● When user-specified slices capture relevant distribution shift, can reweight
using them● Can mitigate 1) support shift and 2) high dimensionality in standard
importance reweighting● Future steps: slice design - frameworks for how to construct good g?