17
Mandoline: Model Evaluation under Distribution Shift Mayee Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, Christopher R é Stanford University

Mandoline: Model Evaluation under

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mandoline: Model Evaluation under

Mandoline: Model Evaluation under Distribution ShiftMayee Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, Christopher Ré

Stanford University

Page 2: Mandoline: Model Evaluation under

Motivation

Q: How do we evaluate model performance during deployment?

● Model’s deployment setting ≠ training setting

Page 3: Mandoline: Model Evaluation under

Motivation

Q: How do we evaluate model performance during deployment?

Source data distribution

evalu

ate

Model

Training set Validation set

Labeled

learn

● Model’s deployment setting ≠ training setting

Page 4: Mandoline: Model Evaluation under

Motivation

Q: How do we evaluate model performance during deployment?

Source data distribution

evalu

ate

Model

Target data distributionTraining set Validation set

Mandoline: user-guided framework for evaluation under distribution shiftLabeled Unlabeled

learn deploy

Distribution shift

evaluate?

● Model’s deployment setting ≠ training setting

Page 5: Mandoline: Model Evaluation under

Common approach: importance weightingSource data distribution Target data distribution

Page 6: Mandoline: Model Evaluation under

Common approach: importance weightingSource data distribution Target data distribution

Density ratio to estimate

Page 7: Mandoline: Model Evaluation under

● High dimensional data : harder to compute

Common approach: importance weighting

Problems:

● Support shift - what if ?

Source data distribution Target data distribution

Density ratio to estimate

Page 8: Mandoline: Model Evaluation under

Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data

Page 9: Mandoline: Model Evaluation under

Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data

negationcontains not, n’t

male pronouncontains he, him

strong sentimentcontains love, adore

Page 10: Mandoline: Model Evaluation under

Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data

Source Accuracy: 91%

negationcontains not, n’t

male pronouncontains he, him

strong sentimentcontains love, adore

I love eating ice-cream.

He loved walking on the beach.

He didn’t like drinking coffee.

(Source) Labeled Validation Set

-1

Slices

-1

1

-1

1

1

1

1

-1

Model

Page 11: Mandoline: Model Evaluation under

Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data

Source Accuracy: 91%

negationcontains not, n’t

male pronouncontains he, him

strong sentimentcontains love, adore

He does not love eating scones.

He loves taking risks.

She likes drinking coffee.

(Target) Unlabeled Test Set Slices

1

-1

-1

1

1

-1

1

1

-1

I love eating ice-cream.

He loved walking on the beach.

He didn’t like drinking coffee.

(Source) Labeled Validation Set

-1

Slices

-1

1

-1

1

1

1

1

-1

Model

Page 12: Mandoline: Model Evaluation under

Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data

Source Accuracy: 91% Target Accuracy:

84%

negationcontains not, n’t

male pronouncontains he, him

strong sentimentcontains love, adore

He does not love eating scones.

He loves taking risks.

She likes drinking coffee.

(Target) Unlabeled Test Set Slices

1

-1

-1

1

1

-1

1

1

-1

I love eating ice-cream.

He loved walking on the beach.

He didn’t like drinking coffee.

(Source) Labeled Validation Set

-1

Slices

-1

1

-1

1

1

1

1

-1

Model

Mandoline

Page 13: Mandoline: Model Evaluation under

Results

Prop 1: if k slices capture all “relevant” distributional shift

between and , then reweighting with recovers .

● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!

● Dimensionality: reduce from d → k

Page 14: Mandoline: Model Evaluation under

Results

● How to compute ? Use any density ratio estimation method on g(x)○ Kullback-Leibler Importance Estimation Procedure (KLIEP)

■ Extend to correct for noisily-defined, incomplete slices

Prop 1: if k slices capture all “relevant” distributional shift

between and , then reweighting with recovers .

● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!

● Dimensionality: reduce from d → k

Page 15: Mandoline: Model Evaluation under

Results

Page 16: Mandoline: Model Evaluation under

Results

CelebA SNLI → MNLI

Importance weighting on x

On g(x)

Page 17: Mandoline: Model Evaluation under

Summary

Thank you! Contact: [email protected]

Model evaluation under distribution shift:● When user-specified slices capture relevant distribution shift, can reweight

using them● Can mitigate 1) support shift and 2) high dimensionality in standard

importance reweighting● Future steps: slice design - frameworks for how to construct good g?