Mandoline: Model Evaluation under Distribution Shift

Mayee Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, Christopher Ré

Stanford University

Motivation

Q: How do we evaluate model performance during deployment?

● Model’s deployment setting ≠ training setting

[Diagram: a model is learned from a labeled training set and evaluated on a labeled validation set, both drawn from the source data distribution. The model is then deployed on an unlabeled target data distribution, which is separated from the source by distribution shift. How to evaluate on the target?]

Mandoline: user-guided framework for evaluation under distribution shift


Common approach: importance weighting

[Diagram: source data distribution p_s and target data distribution p_t]

Density ratio to estimate: w(x) = p_t(x) / p_s(x)

Problems:

● Support shift - what if p_s(x) = 0 but p_t(x) > 0?

● High-dimensional data x: w(x) is harder to compute
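To make the reweighting step concrete, here is a minimal Python sketch of importance-weighted accuracy. It assumes the density ratio for each source validation example is already known. The function name, the toy numbers, and the choice of a self-normalized estimate are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def importance_weighted_accuracy(correct, weights):
    """Estimate target accuracy from labeled source examples.

    correct: 1 where the model is right on a source example, else 0.
    weights: assumed density ratio (target / source) for each source example.
    """
    correct = np.asarray(correct, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Self-normalized estimate: weighted mean of per-example correctness.
    return float(np.sum(weights * correct) / np.sum(weights))

# Toy data (made up): the one misclassified example is over-represented
# in the target, so the estimated target accuracy drops below 0.75.
correct = [1, 1, 0, 1]
weights = [0.5, 0.5, 2.0, 1.0]
print(importance_weighted_accuracy(correct, weights))            # 0.5
print(importance_weighted_accuracy(correct, [1.0, 1.0, 1.0, 1.0]))  # 0.75
```

With uniform weights the estimate reduces to plain source accuracy. The hard part in practice is obtaining the weights, which is where the problems listed above arise.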

Mandoline: Slice-based reweighting framework

Slice: user-defined grouping of data

● negation: contains "not", "n’t"

● male pronoun: contains "he", "him"

● strong sentiment: contains "love", "adore"
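The three slices above can be sketched as small Python functions that map a sentence to +1 (in the slice) or -1 (not in the slice). The matching rules are assumptions made for illustration: tokens are matched as whole words so that "She" does not count as "he", and prefixes are matched so that "loved" counts as "love". The slides do not specify how the slices were implemented.

```python
import re

def tokens(text):
    # Lowercase, normalize the curly apostrophe, and split into word tokens.
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))

def negation(text):
    return 1 if any(t == "not" or t.endswith("n't") for t in tokens(text)) else -1

def male_pronoun(text):
    return 1 if any(t in {"he", "him"} for t in tokens(text)) else -1

def strong_sentiment(text):
    return 1 if any(t.startswith(("love", "adore")) for t in tokens(text)) else -1

def g(text):
    """Slice vector g(x): one +1/-1 entry per slice."""
    return (negation(text), male_pronoun(text), strong_sentiment(text))

# Example sentences from the slides.
for sentence in ["I love eating ice-cream.",
                 "He didn\u2019t like drinking coffee.",
                 "She likes drinking coffee."]:
    print(g(sentence), sentence)
```

Each sentence is thereby reduced from raw text to a k-dimensional vector (here k = 3), which is the representation the reweighting operates on.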


(Source) Labeled Validation Set

Sentence                        | negation | male pronoun | strong sentiment
I love eating ice-cream.        |    -1    |      -1      |         1
He loved walking on the beach.  |    -1    |       1      |         1
He didn’t like drinking coffee. |     1    |       1      |        -1

(Target) Unlabeled Test Set

Sentence                        | negation | male pronoun | strong sentiment
He does not love eating scones. |     1    |       1      |         1
He loves taking risks.          |    -1    |       1      |         1
She likes drinking coffee.      |    -1    |      -1      |        -1

[Diagram labels: Model, Mandoline]

Source Accuracy: 91%

Target Accuracy: 84%

Results

Prop 1: if k slices g(x) capture all “relevant” distributional shift between the source distribution p_s and the target distribution p_t, then reweighting with w(g(x)) = p_t(g(x)) / p_s(g(x)) recovers the target performance.

● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!

● Dimensionality: reduce from d → k

● How to compute w(g(x))? Use any density ratio estimation method on g(x)
    ○ Kullback-Leibler Importance Estimation Procedure (KLIEP)
        ■ Extend to correct for noisily-defined, incomplete slices
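As a rough illustration of reweighting on slices rather than on raw inputs, the sketch below estimates the density ratio on g(x) by counting how often each slice vector occurs in the source and target sets. This count-based estimator is a deliberately simple stand-in chosen for the example. It is not KLIEP and not the authors' procedure, and the function name and toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def slice_weights(g_source, g_target):
    """Count-based estimate of the target/source density ratio on slice vectors.

    Returns one weight per source row. Slice vectors that occur only in the
    target are ignored, and source vectors absent from the target get weight 0.
    """
    src = [tuple(row) for row in g_source]
    tgt = [tuple(row) for row in g_target]
    src_counts, tgt_counts = Counter(src), Counter(tgt)
    n_s, n_t = len(src), len(tgt)
    return np.array([(tgt_counts[v] / n_t) / (src_counts[v] / n_s) for v in src])

# Toy data (made up): k = 2 slices, four source and four target examples.
g_source = [[1, -1], [1, -1], [1, -1], [-1, -1]]
g_target = [[1, -1], [-1, -1], [-1, -1], [-1, -1]]
correct = np.array([1.0, 1.0, 1.0, 0.0])  # model correctness on source rows

w = slice_weights(g_source, g_target)
source_acc = correct.mean()
target_acc_estimate = float(np.sum(w * correct) / np.sum(w))
print(source_acc, round(target_acc_estimate, 3))  # 0.75 0.25
```

Counting is only workable when k is small and every slice vector of interest appears in the source set. For noisier or larger slice sets, a proper density ratio estimator is the more plausible choice.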

[Figure: results on CelebA and SNLI → MNLI, comparing importance weighting on x with reweighting on g(x)]

Summary

Model evaluation under distribution shift:

● When user-specified slices capture relevant distribution shift, can reweight using them

● Can mitigate 1) support shift and 2) high dimensionality in standard importance reweighting

● Future steps: slice design - frameworks for how to construct good g?

Thank you! Contact: [email protected]
