Three Assertions about Interactive Machine Learning
Xiaojin Zhu, Department of Computer Sciences, University of Wisconsin-Madison
[email protected]
2013


Page 1:

Three Assertions about Interactive Machine Learning

Xiaojin Zhu

Department of Computer Sciences, University of Wisconsin-Madison

[email protected]

Page 2:

Assertion 1: Humans can be modeled with statistical learning theory

- Unifying math behind cognitive science and machine learning

Page 3:

Example 1a: Human Rademacher Complexity
(grenade, A), (meadow, A), (skull, B), (conflict, B), (queen, B)

- "learning random labels" $(x_1, \sigma_1), \ldots, (x_n, \sigma_n)$
- Rademacher complexity (similar to VC dimension):

  $\mathrm{Rad}_n(F) \approx \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right|$ . . . of our mind!

- Larger Rademacher complexity → worse generalization error bound (overfitting) [ZRG NIPS09]

[Figure: two panels of estimated Rademacher complexity (y-axis, 0–2) vs. n (x-axis, 0–40); word stimuli ranged from "rape, killer, funeral" to "fun, laughter, joy"]


Page 5:

Example 1b: Human Semi-Supervised Learning

- Humans learn supervised first, then
- . . . the decision boundary shifts to the distribution trough in the test data
- Can be explained by a variety of semi-supervised machine learning models [GRZ ToCS13]

[Figure: left, a left-shifted Gaussian mixture p(x) over x ∈ [−3, 3] with range examples and test examples; right, percent class-2 responses vs. x ∈ [−1, 1] for test-1 (all subjects) and test-2 (L-subjects vs. R-subjects)]
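The trough-seeking boundary shift is easy to reproduce in simulation. Below is a minimal self-training sketch (an illustrative model with made-up parameters, not one of the specific models evaluated in [GRZ ToCS13]): two labeled points put the initial boundary at 0, but re-estimating class centers on left-shifted unlabeled data pulls the boundary into the density trough.

```python
import random
import statistics

random.seed(0)

# Two labeled examples put the initial decision boundary at their midpoint, 0.0.
labeled_neg, labeled_pos = -1.0, 1.0

# Unlabeled test items come from a LEFT-shifted mixture whose trough is near -0.5.
unlabeled = ([random.gauss(-1.5, 0.4) for _ in range(200)] +
             [random.gauss(0.5, 0.4) for _ in range(200)])

# Self-training / 2-means: assign each unlabeled point to the nearer class
# center, re-estimate the centers (seeded by the labeled points), repeat.
m0, m1 = labeled_neg, labeled_pos
for _ in range(20):
    c0 = [x for x in unlabeled if abs(x - m0) < abs(x - m1)] + [labeled_neg]
    c1 = [x for x in unlabeled if abs(x - m0) >= abs(x - m1)] + [labeled_pos]
    m0, m1 = statistics.mean(c0), statistics.mean(c1)

boundary = (m0 + m1) / 2
print(f"boundary moved from 0.00 to {boundary:.2f}")  # toward the trough near -0.5
```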

Page 6:

Example 1c: Human Active Learning

[CKNQRZ NIPS08]

Passive learning: $\inf_{\hat\theta_n} \sup_{\theta \in [0,1]} \mathbb{E}[|\hat\theta_n - \theta|] \ge \frac{1}{4}\left(\frac{1+2\epsilon}{1-2\epsilon}\right)^{2\epsilon} \frac{1}{n+1}$

Active learning: $\sup_{\theta \in [0,1]} \mathbb{E}[|\hat\theta_n - \theta|] \le 2\left(\frac{1}{2} + \sqrt{\epsilon(1-\epsilon)}\right)^{n}$

[Figure: human passive (top row) and human active (bottom row) learning trajectories, one panel per noise level ε ∈ {0, 0.05, 0.1, 0.2, 0.4}; x-axis 10–40 queries, y-axis −1 to 1]

Page 7:

Assertion 2: There is a theoretically optimal way to teach

Human teaches machine (interactive ML)
Machine teaches human (education)

Page 8:

Example 2: 1D threshold function

- Passive learning: $(x_i, y_i) \overset{iid}{\sim} p$, risk $\approx O(\frac{1}{n})$
- Active learning: risk $\approx \frac{1}{2^n}$
- Minimum teaching: n = 2 (teaching dimension)
- Alternatively: easy to hard (curriculum learning, fading, parentese)
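These rates can be checked with a quick simulation of the noiseless 1D threshold problem (a sketch with illustrative values, not code from the talk): passive learning brackets the threshold with iid samples, while active learning bisects the interval.

```python
import random

random.seed(1)
THETA = 0.637  # the unknown threshold in [0, 1]; label(x) = 1 iff x >= THETA

def passive_error(n):
    # Passive: n i.i.d. Uniform[0,1] queries; estimate is the midpoint of the
    # tightest bracket [largest negative x, smallest positive x] observed.
    xs = [random.random() for _ in range(n)]
    lo = max([x for x in xs if x < THETA], default=0.0)
    hi = min([x for x in xs if x >= THETA], default=1.0)
    return abs((lo + hi) / 2 - THETA)

def active_error(n):
    # Active: binary search; every query halves the interval containing THETA.
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if mid >= THETA:
            hi = mid
        else:
            lo = mid
    return abs((lo + hi) / 2 - THETA)

n = 20
print(f"passive error (~1/n):   {passive_error(n):.2e}")
print(f"active  error (~1/2^n): {active_error(n):.2e}")
```

With the same query budget, the active learner's bracket is exponentially tighter, which is the gap the slide's rates describe.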

Page 9:

A formula for optimal teaching

1. World: $p(x, y \mid \theta^*)$, loss function $\ell(f(x), y)$

2. Learner: makes prediction $f(x \mid \text{data})$

3. Teacher:
   - clairvoyant, knows everything above
   - can only teach by examples (x, y)
   - goal: choose the least-effort teaching set $D = (x, y)_{1:n}$ to minimize the learner's future loss (risk):

     $\min_D \; \mathbb{E}_{\theta^*}[\ell(f(x \mid D), y)] + \mathrm{effort}(D)$

   - if the future loss approaches the Bayes risk, D is a teaching set and n is the (generalized) teaching dimension

[KZM NIPS11, Z arXiv13]
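The objective above can be instantiated by brute force for the 1D threshold world. In the sketch below, the learner, the candidate pool, and the effort model are all illustrative assumptions for exposition, not the construction in [KZM NIPS11]:

```python
from itertools import combinations

# World: 1-D threshold at theta_star; label y = 1 iff x >= theta_star.
theta_star = 0.6
pool = [i / 10 for i in range(11)]            # candidate teaching items (assumed grid)
effort = lambda D: 0.01 * len(D)              # assumed effort: small cost per example

def learner(D):
    # The assumed learner: threshold midway between the innermost labeled pair.
    neg = [x for x, y in D if y == 0]
    pos = [x for x, y in D if y == 1]
    if not neg or not pos:
        return 0.5                            # uninformed default
    return (max(neg) + min(pos)) / 2

def risk(D):
    # Future 0/1 loss under uniform p(x) = |learned threshold - theta_star|.
    return abs(learner(D) - theta_star)

candidates = (tuple((x, int(x >= theta_star)) for x in c)
              for n in range(1, 4) for c in combinations(pool, n))
best = min(candidates, key=lambda D: risk(D) + effort(D))
print("optimal teaching set:", best)   # n = 2 examples straddling theta_star
```

The minimizer is a pair of oppositely labeled examples straddling the true threshold: any single example leaves the learner at its default, and a third example only adds effort, matching teaching dimension n = 2.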

Page 10:

Assertion 3: Even when human teachers are not optimal, they are not iid

. . . and machine learners should take advantage of that non-iidness.

Page 11:

Example 3: Feature Volunteering (Interactive ML)

[JZSR ICML13]

Page 12:

Example 3: Sampling with Reduced Replacement
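The slide's illustrations do not survive extraction, but the sampling scheme itself can be sketched. One plausible reading of the model in [JZSR ICML13]: each item has a size, items are drawn with probability proportional to size, and a drawn item's size is multiplied by a discount α ∈ [0, 1], so α = 0 recovers sampling without replacement and α = 1 ordinary sampling with replacement. The function and vocabulary below are illustrative:

```python
import random

random.seed(0)

def sample_reduced_replacement(sizes, n, alpha):
    """Draw n items proportional to size; multiply a drawn item's size by alpha.

    alpha = 0 -> sampling without replacement;
    alpha = 1 -> ordinary sampling with replacement.
    """
    sizes = dict(sizes)                     # work on a copy: item -> size
    drawn = []
    for _ in range(n):
        items, weights = zip(*sizes.items())
        pick = random.choices(items, weights=weights)[0]
        drawn.append(pick)
        sizes[pick] *= alpha                # "reduced replacement"
    return drawn

vocab = {"dog": 5.0, "cat": 4.0, "fish": 1.0, "newt": 0.5}
print(sample_reduced_replacement(vocab, 6, alpha=0.1))
```

With a small α, high-probability items still tend to come first but repeats are strongly suppressed, which is exactly the kind of non-iid structure in human-generated lists that Assertion 3 points at.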


Page 17:

References

R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, and X. Zhu. Human active learning. In Advances in Neural Information Processing Systems (NIPS) 22, 2008.

B. R. Gibson, T. T. Rogers, and X. Zhu. Human semi-supervised learning. Topics in Cognitive Science, 5(1):132–172, 2013.

K.-S. Jun, X. Zhu, B. Settles, and T. Rogers. Learning from human-generated lists. In The 30th International Conference on Machine Learning (ICML), 2013.

F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems (NIPS) 25, 2011.

X. Zhu, T. T. Rogers, and B. Gibson. Human Rademacher complexity. In Advances in Neural Information Processing Systems (NIPS) 23, 2009.

Page 18:

Three Assertions

1. Humans can be modeled with statistical learning theory.

2. There is a theoretically optimal way to teach.

3. Even when human teachers are not optimal, they are not iid.

Page 19:

Capacity

VC-dimension

- $F$: a family of binary classifiers
- VC-dimension $VC(F)$: size of the largest set that $F$ can shatter
- With probability at least $1 - \delta$,

  $\sup_{f \in F} R(f) - R_n(f) \le 2\sqrt{\frac{2\,VC(F)\log n + VC(F)\log\frac{2e}{VC(F)} + \log\frac{2}{\delta}}{n}}.$

- $R(f)$: error of $f$ in the future
- $R_n(f)$: error of $f$ on a training set of size $n$

Page 20:

Capacity

Rademacher complexity

- $\sigma_1, \ldots, \sigma_n$: $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$
- Rademacher complexity:

  $\mathrm{Rad}_n(F) = \mathbb{E}_{\sigma,x}\left( \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \right).$

- With probability at least $1 - \delta$,

  $\sup_{f \in F} |R_n(f) - R(f)| \le 2\,\mathrm{Rad}_n(F) + \sqrt{\frac{\log(2/\delta)}{2n}}.$
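Rademacher complexity can be estimated directly by Monte Carlo, following the expectation in its definition. A minimal sketch with an illustrative finite class F of 1-D threshold classifiers (the class and stimulus space are assumptions, not the human experiment's):

```python
import random

random.seed(0)

# Monte Carlo estimate of Rad_n(F) = E_{sigma,x} sup_{f in F} |(1/n) sum_i sigma_i f(x_i)|
# for an illustrative finite class F of 1-D threshold classifiers mapping to +/-1.
F = [lambda x, t=t: 1.0 if x >= t else -1.0 for t in (0.25, 0.5, 0.75)]

def rademacher(n, trials=2000):
    total = 0.0
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]           # x ~ Uniform[0, 1]
        sigmas = [random.choice((-1, 1)) for _ in range(n)]
        total += max(abs(sum(s * f(x) for s, x in zip(sigmas, xs))) / n
                     for f in F)
    return total / trials

for n in (5, 10, 20, 40):
    print(f"n = {n:2d}  Rad_n(F) ~ {rademacher(n):.3f}")
# The estimate shrinks as n grows, mirroring the human curves in Example 1a.
```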

Page 21:

Machine learning → human learning

- $f$: you categorize $x$ by $f(x)$
- $F$: all the classifiers in your mind
- $R_n(f)$: how you did in class
- $R(f)$: how well you can do outside class
- Capacity: can we measure it in humans?
  - $VC(F)$: too brittle (find one dataset of size $n$) and combinatorial (verify shattering)
  - Others may behave better, e.g., $\mathrm{Rad}_n(F)$
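The "combinatorial" complaint about VC(F) can be made concrete: verifying shattering means checking every ± labeling of a candidate set against every classifier. A minimal sketch for 1-D thresholds over an illustrative grid of classifiers:

```python
from itertools import product

# Brute-force shattering check for 1-D threshold classifiers
# f_t(x) = 1 iff x >= t: every labeling of the point set must be realized.

def shatters(points, thresholds):
    for labels in product([0, 1], repeat=len(points)):
        realized = any(all((x >= t) == bool(y) for x, y in zip(points, labels))
                       for t in thresholds)
        if not realized:
            return False
    return True

thresholds = [i / 10 for i in range(-10, 21)]   # illustrative grid of classifiers
print(shatters([0.5], thresholds))        # True: a single point is shattered
print(shatters([0.3, 0.7], thresholds))   # False: labeling (1, 0) is impossible
# Hence the VC-dimension of 1-D thresholds is 1.
```

Even in this toy case the check enumerates $2^n$ labelings per candidate set, which is the brittleness the slide refers to.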

Page 22:

Overfitting indicator

[Figure: test-minus-training error (e − ê) vs. Rademacher complexity for the Shape and Word conditions at n = 5 and n = 40, with the theoretical bound and the observed values]

- $e$: test set error; $\hat e$: training set error
- the generalization error bound holds
- actual overfitting tracks the bound (nice, but not predicted by theory)

The study of capacity may
- constrain cognitive models
- help understand how groups differ in age, health, education, etc.

Page 23:

Human semi-supervised learning, the other way around
Human unsupervised learning first . . .

[Figure: four unsupervised-exposure conditions (trough, peak, uniform, converge), each shown as a density p(x) over x ∈ [−8, 8] and as the sequence of examples presented over time]

. . . influences a subsequent (identical) supervised learning task

[Figure: mean accuracy ± standard error on the supervised task by condition: trough, converge, uniform, peak]

Page 24:

Human teacher behaviors

strategy                  boundary   curriculum   linear   positive
"graspability" (n = 31)       0%         48%        42%        10%
"lines" (n = 32)             56%         19%        25%         0%