Mathematical Theories of Interaction with Oracles
Liu Yang, Carnegie Mellon University
© Liu Yang 2013


Thesis Committee

• Avrim Blum (co-chair)
• Jaime Carbonell (co-chair)
• Manuel Blum
• Sanjoy Dasgupta (UC, San Diego)
• Yishay Mansour (Tel Aviv University)
• Joel Spencer (Courant Institute, NYU)

Outline

• Active Property Testing

- Do we need to imitate humans to advance AI?
- I see airplanes can fly without flapping their wings.


Property Testing

• Given access to a massive dataset: want to quickly determine if a given fn f has some given property P or is far from having it

• Goal: test from a very small num. of queries.

• One motivation: preprocessing step before learning


• Instance space X = R^n (distribution D over X)

• Tested function f : X -> {-1,1}

• A property P of Boolean fns is a subset of all Boolean fns h : X -> {-1,1} (e.g., LTFs)

• $\mathrm{dist}_D(f, P) := \min_{g \in P} \Pr_{x \sim D}[f(x) \neq g(x)]$

• Standard type of query: membership query (ask for f(x) at an arbitrary point x)

Property Testing: An Example

• E.g. Union of d Intervals 0----++++----+++++++++-----++---+++--------1

- UINT_4? Accept! UINT_3? Depends on ε

- Model selection: testing can tell us how big d needs to be to be close to the target (double and guess, d = 2, 4, 8, 16, …)
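The "double and guess" idea can be sketched as a short loop. This is only an illustration: `tester` is a hypothetical callback standing in for a UINT_d property tester, and `toy_tester` is a made-up stand-in that accepts once d reaches an assumed true complexity of 11.

```python
from typing import Callable


def smallest_accepted_d(tester: Callable[[int], bool], d_max: int = 1 << 20) -> int:
    """Try d = 2, 4, 8, ... and return the first d that the tester accepts."""
    d = 2
    while d <= d_max:
        if tester(d):
            return d
        d *= 2
    raise ValueError("no d up to d_max was accepted")


def toy_tester(d: int, true_d: int = 11) -> bool:
    """Hypothetical stand-in: accepts iff d is at least the true number of intervals."""
    return d >= true_d


print(smallest_accepted_d(toy_tester))  # prints 16
```

Because d doubles each round, the loop makes only about log2(d) tester calls before it stops.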

If f ∈ P, should accept w/ prob. at least 2/3

If dist(f, P) > ε, should reject w/ prob. at least 2/3


Property Testing and Learning: Motivation

• What is Property Testing for?
 - Quickly tell if this is the right fn class to use
 - Estimate complexity of fn without actually learning

• Want to do it with fewer queries than learning


Standard Model uses Membership Query

• Results of testing basic Boolean fns using MQ:
 - Constant QC for UINT_d, dictator, LTF, …

However …


Membership Query is Unrealistic for ML Problems: An Object Recognition Example

Recognizing cat/dog ? MQ gives …

Is this a dog or a cat?


An example: movie reviews

Is this a positive or negative review?

Typical representation in ML (bag-of-words):

• {fell, holding, interest, movie, my, of, short, this}

The original review (human labelers see):

• “This movie fell short of holding my interest.”

- Object a human expert labels has more structure than the internal representation used by alg.
- MQs construct ex.s in the internal representation.
- Can be very difficult to order a constructed example's words so a human can label the example (esp. for long reviews)

Passive: Wastes Too Many Queries

• ML people move on

• Can we SAVE #queries?

• Passive Model (sample from D): query samples exist in NATURE, but quite wasteful (many examples uninformative)

Active Testing

• Alg can ask for labels, but only for pts in the pool
• Goal: small #queries

Pool of unlabeled data (poly-size)

Property Tester

• Definition. An s-sample, q-query ε-tester for P over the distribution D is a randomized algorithm A that draws s samples from D (cheap), sequentially queries for the value of f on q of those samples (expensive), and then
 1. Accepts w.p. at least 2/3 when f ∈ P
 2. Rejects w.p. at least 2/3 when dist_D(f, P) > ε

• Active tester: s = poly(n)
• Passive tester: s = q
• MQ tester: s = ∞ (D = Unif)

Active Property Testing

• Testing as preprocessing step of learning
• Need an example where active testing
 - gets same QC saving as MQ
 - is better in QC than passive
 - needs fewer queries than learning

• Union of d Intervals: active testing helps!
  0----++++----+++++++++-----++---+++--------1

 - Testing tells how big d needs to be to be close to target
 - #Labels: active testing needs O(1), passive testing needs Θ(√d), active learning needs Θ(d)

Our Results

Property               Active Testing   Passive Testing   Active Learning
Union of d Intervals   O(1)             Θ(d^{1/2})        Θ(d)
Dictator               Θ(log n)         Θ(log n)          Θ(log n)
Linear Threshold Fn    O(n^{1/2})       ~Θ(n^{1/2})       Θ(n)
Cluster Assumption     O(1)             Ω(N^{1/2})        Θ(N)

Active testing is MQ-like on testing UINT_d and passive-like on testing Dictator.

NEW !!

Testing Unions of Intervals

0----++++----+++++++++-----++---+++--------1

• Theorem. Testing UINT_d in the active testing model can be done using O(1) queries.

Recall: Learning requires Ω(d) examples.

Testing Unions of Intervals (cont.)

• Suppose uniform distribution

• Definition: Fix δ > 0. The local δ-noise sensitivity of fn f: [0, 1] -> {0, 1} at x ∈ [0, 1] is $NS_\delta(f, x) = \Pr_{y}[f(x) \neq f(y)]$, with y drawn uniformly from the δ-neighborhood of x. The noise sensitivity of f is $NS_\delta(f) = \Pr_{x, y}[f(x) \neq f(y)]$, with x uniform on [0, 1] and y as above.

• Proposition (easy): Fix δ > 0. Let f: [0, 1] -> {0, 1} be a union of d intervals. Then NS_δ(f) ≤ dδ.

• Lemma (hard): Fix δ = ε^2/(32d). Let f: [0, 1] -> {0, 1} be a fn with noise sensitivity bounded by NS_δ(f) ≤ dδ(1 + ε/4). Then f is ε-close to a union of d intervals.

Easy Lemma

• Lemma. If f is a union of ≤ d intervals, NS_δ(f) ≤ dδ.

Proof sketch:
- The probability that x lands within distance δ of any of the boundaries is at most 2d·2δ.
- The probability that y crosses a boundary given that x is within distance δ of it is 1/4.
- P(f(x) ≠ f(y) | |x-y| < δ) ≤ (2d·2δ)·(1/4) = dδ.
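The bound in the easy lemma can be checked numerically with a small Monte Carlo sketch. The assumptions here are mine, not the slides': x is uniform on [0, 1], y is uniform on [x-δ, x+δ] without clipping to [0, 1], and f is taken to be 0 outside [0, 1]. The helper names are hypothetical. When the 2d boundaries are interior and more than 2δ apart, each contributes δ/2, so the estimate should sit close to dδ.

```python
import random


def in_union(x, intervals):
    """f(x) = 1 iff x lies in one of the closed intervals."""
    return any(a <= x <= b for a, b in intervals)


def estimate_noise_sensitivity(intervals, delta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Pr[f(x) != f(y)] for x ~ U[0,1], y ~ U[x-delta, x+delta]."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_samples):
        x = rng.random()
        y = rng.uniform(x - delta, x + delta)
        if in_union(x, intervals) != in_union(y, intervals):
            disagreements += 1
    return disagreements / n_samples


intervals = [(0.1, 0.3), (0.45, 0.6), (0.8, 0.9)]  # d = 3 intervals
delta = 0.01
estimate = estimate_noise_sensitivity(intervals, delta)
print(f"estimate = {estimate:.4f}, bound d*delta = {len(intervals) * delta:.4f}")
```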

Hard Lemma

• Lemma. Fix δ = ε^2/(32d). If f is ε-far from a union of d intervals, then NS_δ(f) > (1+ε/4)dδ.

Proof strategy: If NS_δ(f) is small, do "self-correction":

g(x) = E[f(y) | y ← [x-δ, x+δ]], f'(x) = round g(x) to 0 if g(x) ≤ τ, or to 1 if g(x) ≥ 1-τ

Proof strategy (cont.):
- Argue dist(f, f') ≤ ε/2.
- Show f' is a union of ≤ d(1 + ε/2) intervals.
- Implies dist(f', P) ≤ ε/2.

[Figure: δ-neighborhoods marked on a union-of-intervals example ----++++----+++++++++-----++---+++--------, under the uniform distribution.]

Testing Unions of Intervals

• Theorem. Testing UINT_d in the active testing model can be done using O(1) queries.

• If the distribution is non-uniform, use unlabeled data to stretch/squash the axis, making the distribution near-uniform

• Total num. of unlabeled samples: O(d^{1/2}).

Testing Linear Threshold Fns

• Linear Threshold Functions (LTF):

f(x) = sign(<w,x>), for w, x ∈ R^n

• Theorem. We can efficiently test LTFs under the Gaussian distribution with Õ(n^{1/2}) labeled examples in both active and passive testing models.

• We have lower bounds of ~Ω(n^{1/3}) for active testing and ~Ω(n^{1/2}) for passive testing.

• Learning LTFs needs Ω(n) under Gaussian. So testing is better than learning in this case.


• [MORS’10] => suffices to estimate E[f(x) f(y) <x,y>] up to ± poly(ε).

• Intuition: LTF is characterized by a nice linear relation between angle (<x,y>) and probability of having same label (f(x)f(y)=1).

• Could take m random pairs and use the empirical average.
 - But most pairs x, y would have <x,y> ≈ n^{1/2} (CLT), so would need m = Ω(n) to get within ± poly(ε).
• Solution: take O(n^{1/2}) random points and average f(x)f(y)<x,y> over all O(n) pairs x, y.
 - Concentration inequalities for U-statistics [Arcones, 95] imply this works.
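The all-pairs average can be sketched as follows. This is a minimal illustration, not the tester from the thesis: the function names, the sample sizes, and the use of the identity Σ_{i<j} f_i f_j <x_i,x_j> = (||Σ_i f_i x_i||² − Σ_i ||x_i||²)/2 (valid because f_i² = 1) are my choices. For an LTF with unit-norm w under the standard Gaussian, independence of x and y gives E[f(x)f(y)<x,y>] = ||E[f(x)x]||² = 2/π, while for random labels the expectation is 0.

```python
import math
import random


def ltf(w, x):
    """Linear threshold function sign(<w, x>) with values in {-1, +1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def pairwise_statistic(f, points):
    """Average of f(x) f(y) <x, y> over all unordered pairs of distinct points.

    Computed in O(m n) time via the identity in the lead-in; f is called once per point.
    """
    m, n = len(points), len(points[0])
    weighted_sum = [0.0] * n
    squared_norms = 0.0
    for x in points:
        label = f(x)
        for k in range(n):
            weighted_sum[k] += label * x[k]
        squared_norms += sum(v * v for v in x)
    pair_total = (sum(v * v for v in weighted_sum) - squared_norms) / 2.0
    return pair_total / (m * (m - 1) / 2)


rng = random.Random(1)
n, m = 30, 2000  # illustrative sizes only
w = [1.0] + [0.0] * (n - 1)
points = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]

ltf_value = pairwise_statistic(lambda x: ltf(w, x), points)
random_value = pairwise_statistic(lambda x: rng.choice((-1, 1)), points)
print(f"LTF: {ltf_value:.3f} (2/pi = {2 / math.pi:.3f}), random labels: {random_value:.3f}")
```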

General Testing Dimension

• Testing dim characterizes (up to constant factors) the intrinsic #label requests needed to test the given property w.r.t. the given distribution

• All our lower bounds are proved via testing dim

Minimax Argument

• $\min_{\mathrm{Alg}} \max_{f} P(\text{Alg mistaken}) = \max_{\pi_0} \min_{\mathrm{Alg}} P(\text{Alg mistaken})$

• WLOG, $\pi_0 = \alpha \pi + (1-\alpha) \pi'$, with $\pi \in \Pi_0$, $\pi' \in \Pi_\varepsilon$

• Let $\pi_S$, $\pi'_S$ be the induced distributions on labels of S.

• For a given $\pi_0$: $\min_{\mathrm{Alg}} P(\text{Alg makes mistake} \mid S) \le 1 - d_S(\pi, \pi')$

Passive Testing Dim

• Define dpassive largest q in N, s.t.

• Theorem: Sample Complexity of passive testing is Θ(dpassive).

Compare with VC dimension: want a set S to exist s.t. all labelings occur at least once.

Active Testing Dim

• Fair(π, π', U): distri. of labeled (y, l): w.p. ½ choose y ~ π_U, l = 1; w.p. ½ choose y ~ π'_U, l = 0.

• err*(H; P): err of optimal fn in H w.r.t. data drawn from distri. P over labeled egs.

• Given u = poly(n) unlabeled egs, d_active(u): largest q in N s.t.

• Theorem: Active testing w/ failure prob 1/8 using u unlabeled egs needs Ω(d_active(u)) label queries; can be done w/ O(u) unlabeled egs and O(d_active(u)) label queries

Application: Dictator fns

• Theorem: For dictator functions under the uniform distribution, dactive(u)=Θ(log n) (for any large-enough u=poly(n)).

• Corollary: Any class that contains dictator functions requires Ω(log n) queries to test in the active model, including poly-size decision trees, functions of low Fourier degree, juntas, DNFs, etc.

Distributions used in the lower bound:
• π = unif over dictator fns
• π' = unif over all Boolean fns

Application: LTFs

• Theorem. For LTFs under the standard n-dim Gaussian distrib, d_passive = Ω((n/log n)^{1/2}) and d_active(u) = Ω((n/log n)^{1/3}) (for any u = poly(n)).

- π: distrib over LTFs obtained by choosing w ~ N(0, I_{n×n}) and outputting f(x) = sgn(w·x).
- π': uniform distrib over all functions.

- Obtain d_passive: bound tvd(distrib of Xw/√n, N(0, I_{q×q})).
- Obtain d_active: similar to dictator LB, but relies on strong concentration bounds on the spectrum of random matrices

Open Problems

• Matching lb/ub for active testing LTF: √n?

• Tolerant testing ε/2 vs. ε (UINT_d, LTF)
• Testing LTF under general distrib.

Outline

• Learnability of DNF with Representation Specific Queries

- Liu: We do statistical learning for …

- Marvin: but we haven't done well at the fundamentals, e.g. knowledge representation.

Learning DNF formulas

• Poly-sized DNF: #terms = n^{O(1)}

e.g. f = (x1∧x2)∨(x1∧x4)

- Natural form of knowledge representation
- PAC-learning DNF appears to be very hard.

Your ticket:
- n: number of var.s
- Concept space C: collection of fns h: {0, 1}^n -> {0, 1}
- Unknown target fn f*: the true labeling fn
- Err(h) = P_{x~D}[h(x) ≠ f*(x)] (distri. D over X)

Best known alg in the standard model is exponential over arbitrary distri.; over Unif, no known poly-time alg

New Models: Interaction with Oracles

- Boolean queries: K(x, y) = 1 if x and y share some term
- Numerical queries: K(x, y) = #terms x and y share

Hi, Tim, do x and y have some term in common ?

Yes!

Imagine …

Query: Similarity about TYPE

Fraud Detection

Type of query: given a pair of POSITIVE ex.s from a random dataset, teacher says YES if they share some term, or reports how many terms they share.

Question: can we efficiently learn DNF with this type of query?

What if we have similarity info about TYPE? Fraud detection: are x and y fraudulent of the same type? YES! x and y share a term.

[Figure: fraud types (identity theft, stolen cards, BIN attack, skimming) shown as terms satisfied by examples x and y; Term 1, Term 2, Term 3 of x.]

Warm Up: Disjoint DNF w/Boolean Queries

• Use similarity queries to partition positive ex.s into t buckets, one per term.

• Separately learn a conjunction for each bucket (intersect the pos ex.s in it)

• OR the results
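The three warm-up steps can be sketched in a few lines. Everything named here is hypothetical: a term is modeled as a dict from variable index to required bit, `make_oracle` simulates the Boolean query K from a known target, and the learner only sees positive examples and K. Because the DNF is disjoint, each positive example satisfies exactly one term, so comparing against one representative per bucket suffices.

```python
from itertools import product


def satisfies(x, term):
    """term maps variable index -> required bit (a conjunction of literals)."""
    return all(x[i] == bit for i, bit in term.items())


def make_oracle(terms):
    """Simulated Boolean query: K(x, y) = 1 iff x and y satisfy a common term."""
    def K(x, y):
        return any(satisfies(x, t) and satisfies(y, t) for t in terms)
    return K


def learn_disjoint_dnf(positives, K):
    # Step 1: partition the positive examples into buckets, one per term.
    buckets = []
    for x in positives:
        for bucket in buckets:
            if K(bucket[0], x):
                bucket.append(x)
                break
        else:
            buckets.append([x])
    # Step 2: per bucket, keep exactly the literals on which all its examples agree.
    learned = []
    for bucket in buckets:
        first = bucket[0]
        learned.append({i: first[i] for i in range(len(first))
                        if all(x[i] == first[i] for x in bucket)})
    return learned


def predict(x, terms):
    # Step 3: OR the learned conjunctions.
    return any(satisfies(x, t) for t in terms)


target = [{0: 1, 1: 1}, {0: 0, 2: 1}]  # (x0 AND x1) OR (NOT x0 AND x2): disjoint terms
cube = list(product((0, 1), repeat=4))
positives = [x for x in cube if predict(x, target)]
hypothesis = learn_disjoint_dnf(positives, make_oracle(target))
print(hypothesis)
```

With only a random sample of positives, the intersected conjunctions may keep extra literals, so the hypothesis would then be approximately rather than exactly correct.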


Pos Result 1: Weakly Disjoint DNF w/Boolean Queries

- Distinguishing ex for T1: ex. sat. T1 & no other term

- Weakly disjoint: for each term, at least a 1/poly(n, 1/ε) fraction of rand. ex.s sat. it & no other term.

- Neighbor-method: get all its neighbors in the graph and learn a conjunc.

- Neighbor-method w.p. 1-δ, produce an ε-accu. DNF if weakly disjoint.

Graph:
- Nodes: pos examples
- Edge exists if K(·,·) = 1

Hardness Results: Boolean Queries

Thm. Learning DNF from random data under arb. distri. w/ Boolean queries is as hard as learning DNF from random data under arb. distri. w/ only labels (no queries).

- Group-learn: tell whether data is from D+ or D-
- Reduction from group-learning DNF in the std. model to our model
- How to use our alg A to group-learn?
- Simulate the oracle by always saying yes whenever a query is made on two pos ex.s; given the output of A, we give a group-learn alg for the original problem

[Figure: giant examples built from blocks of n var.s each; K(giant 1, giant 2) = 1.]

Hardness Results: Approx Numerical Queries

Thm. Learning DNF from random data under arb. distri. w/ approx-numerical queries is as hard as learning DNF from random data under arb. distri. w/ only labels.

Approx-numerical query: if C is the #terms xi and xj sat. in common, the oracle returns a value in [(1 – τ)C, (1 + τ)C].

Pos Result 3: learn DNF w/ Numerical Queries

- Sample m = O((t/ε) log(t/(εδ))) landmark points

- Landmark Fi(x) is sum-of-monotone-terms fn (rm terms not sat by pos xi). Fi(·) = K(xi, ·), K is numerical query

- Use subroutine to learn hypo. hi(x) ε/(2m)-accu w.r.t. Fi.

• Subroutine: learn a sum of monotone t terms over unif., using time & samples poly(t, n, 1/ε).

- Combine all hypo.s hi to h: h(x) = 0 if hi(x) = 0 for all i, else h(x) = 1.


Thm. Under unif distri., w/ numerical queries, can learn any poly(n)-term DNF.

f(x) = T1(x)+T2(x)+ … +Tt(x)

Learn Sum of Monotone Terms

[Figure: greedy search over variable sets S for x1, …, x9. Starting from S = {x1}, each candidate extension (e.g. {x1, x2}, {x1, x3}) has its Fourier coeffi. estimated; the inclusion check keeps S if the mag. is ≥ ε/(16t) (YES) and discards it otherwise. Example path: {x1} → {x1, x3} → {x1, x3, x4} → candidates {x1, x3, x4, x5}, …, {x1, x3, x4, x9}; output x1∧x3∧x4∧x9.]

Learn Sum of Monotone Terms: Greedy Alg

• Set θ = ε/(8t). Examine each parity fn of size 1 & est. its Fourier coeffi. (up to θ/4 accu.).
• Place all coeffi. of mag. ≥ θ/2 into a list L1.
• For j = 2, 3, ... repeat:
 - For each parity fn Φ_S in list L_{j-1} and each x_i not in S, est. Fourier coeffi. of Φ_{S ∪ {x_i}}
 - If mag. of est. is ≥ θ/2, add it to list L_j (if not already in)
 - maintain list L_j: size-j parity fns w/ coeffi. mag. ≥ θ.
• Construct fn g: weighted sum of parities for identified coeff.
• Output fn h(x) = [g(x)]
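The greedy level-by-level search can be illustrated on a tiny instance. This sketch departs from the slides in ways worth stating: Fourier coefficients are computed exactly by enumerating {0,1}^n instead of being estimated from samples, the constant coefficient (S = ∅) is added to g, [·] is taken to mean rounding to the nearest integer, and all function names are hypothetical. It relies on the fact that for a monotone term over variable set A, the coefficient of Φ_S is nonzero only when S ⊆ A, so every nonzero coefficient is reachable by growing S one variable at a time.

```python
from itertools import product


def sum_of_terms(terms):
    """f(x) = number of monotone terms (sets of variable indices) satisfied by x."""
    return lambda x: sum(all(x[i] for i in term) for term in terms)


def parity(S, x):
    """Phi_S(x) = (-1)^(sum of x_i for i in S)."""
    return -1 if sum(x[i] for i in S) % 2 else 1


def fourier_coefficient(f, S, n):
    """Exact E[f(x) Phi_S(x)] under the uniform distribution (small n only)."""
    cube = list(product((0, 1), repeat=n))
    return sum(f(x) * parity(S, x) for x in cube) / len(cube)


def greedy_fourier(f, n, theta):
    """Grow sets one variable at a time, keeping coefficients of magnitude >= theta/2."""
    coeffs = {frozenset(): fourier_coefficient(f, frozenset(), n)}
    level = []
    for i in range(n):
        S = frozenset([i])
        c = fourier_coefficient(f, S, n)
        if abs(c) >= theta / 2:
            coeffs[S] = c
            level.append(S)
    while level:
        next_level = []
        for S in level:
            for i in range(n):
                T = S | {i}
                if i in S or T in coeffs:
                    continue
                c = fourier_coefficient(f, T, n)
                if abs(c) >= theta / 2:
                    coeffs[T] = c
                    next_level.append(T)
        level = next_level
    return coeffs


def hypothesis(coeffs):
    """h(x) = g(x) rounded to the nearest integer, g = weighted sum of parities."""
    return lambda x: round(sum(c * parity(S, x) for S, c in coeffs.items()))


n, t, eps = 4, 2, 0.5
f = sum_of_terms([{0, 1}, {2}])  # f(x) = (x0 AND x1) + x2
coeffs = greedy_fourier(f, n, theta=eps / (8 * t))
h = hypothesis(coeffs)
print(sorted(sorted(S) for S in coeffs))
```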


Other Positive Results (query types: Binary, Numeric)

• O(log(n))-term DNF (any distrib.): ✔
• 2-term DNF (any distrib.): ✔ ✔
• DNF w/ each var in at most O(log(n)) terms (Unif): ✔ ✔
• log(n)-Junta (Unif): ✔ ✔
• log(n)-Junta (any distrib): ✔
• DNF having ≤ 2^{O(√log(n))} terms (Unif): ✔ ✔

Open problems:
- learn arbitrary DNF (unif, Boolean queries)?
- learn arbitrary DNF (any distri., numerical queries)?

Outline

• Active Learning with a Drifting Distribution

- If not every poem has a proof, can we at least try to make every theorem proved beautiful like a poem?

Active Learning with a Drifting Distrib: Model

• Scenario:
 - Unobservable seq. of distrib.s D_1, D_2, …, with each D_t in the distribution space
 - Unobservable time-indep. regular cond. distrib., represented by a fn η: X -> [0, 1]
 - (X_1, Y_1), (X_2, Y_2), …: an infinite seq. of indep. r.v.s, s.t. X_t ~ D_t, and the cond. distrib. of Y_t given X_t satisfies P(Y_t = +1 | X_t = x) = η(x)

• Active learning protocol: at each time t, alg is presented with X_t and is required to predict a label Ŷ_t; then it may optionally request to see the true label value Y_t

• Interested in cumulative #mistakes up to time T and total #labels requested up to time T

[Figure: points x1, x2, x3, x4, …, xt in the data space, each drawn from the corresponding distribution D1, D2, D3, D4, …, Dt in the distrib. space.]

Definition and Notations

• Instance space X = R^n
• Distribution space: a set of distributions on X
• Concept space C of classifiers h: X -> {-1,1}
 - Assume C has VC dimension vc < ∞

• Dt: data distrib. on X at time t

• Unknown target fn h*: true labeling fn

• err_t(h) = P_{x~Dt}[h(x) ≠ h*(x)]

• In realizable case, h* in C and errt(h*) = 0.

• For ,


Def: disagreement coefficient, tvd

• The disagreement coefficient of h* under a distri. P on X is defined as, (r > 0)

• Total variation distance of probability measures P and Q on a sigma-algebra of subsets of the sample space is defined via


Assumptions

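• Independence of the Xt variables

• VC-dim < ∞

• Assumption 1 (totally bounded): the distribution space is totally bounded, i.e. for each ε > 0 it has a finite ε-cover: a finite subset s.t. every distribution in the space is within ε of some element of the subset. For each ε > 0, fix a minimal ε-cover.

• Assumption 2 (poly-covers): the size of a minimal ε-cover is at most c·ε^{-m}, where c, m ≥ 0 are constants.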

Realizable-case Active Learning CAL
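The slide's algorithm box did not survive extraction, so the following is only a sketch of the standard CAL idea (request a label exactly when the version space disagrees on the new point), specialized to threshold classifiers on [0, 1] in the realizable case. The drifting stream, the midpoint prediction rule inside the disagreement region, and all names are assumptions made for illustration.

```python
import random


def run_cal_thresholds(stream, target):
    """CAL for h_t(x) = +1 iff x >= t, with the true threshold `target` in (0, 1].

    The version space is the set of thresholds in (lo, hi]; the region of
    disagreement is the open interval (lo, hi).
    """
    lo, hi = 0.0, 1.0
    mistakes = queries = 0
    for x in stream:
        truth = 1 if x >= target else -1
        if x >= hi:          # every consistent threshold labels x as +1
            prediction, ask = 1, False
        elif x <= lo:        # every consistent threshold labels x as -1
            prediction, ask = -1, False
        else:                # disagreement region: guess, then request the label
            prediction, ask = (1 if x >= (lo + hi) / 2 else -1), True
        if prediction != truth:
            mistakes += 1
        if ask:
            queries += 1
            if truth == 1:
                hi = x
            else:
                lo = x
    return mistakes, queries, (lo, hi)


def drifting_stream(T, seed=0):
    """Hypothetical drift: X_t is uniform on a window whose center slowly wanders."""
    rng = random.Random(seed)
    center = 0.3
    for _ in range(T):
        center = min(0.9, max(0.1, center + rng.uniform(-0.01, 0.01)))
        yield min(1.0, max(0.0, center + rng.uniform(-0.2, 0.2)))


mistakes, queries, version_space = run_cal_thresholds(drifting_stream(2000), target=0.5)
print(mistakes, queries, version_space)
```

In the realizable case a mistake can only happen on a queried point, so #mistakes ≤ #queries in this sketch regardless of how the distribution drifts.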

Sublinear Result: Realizable Case

Theorem. If is totally bounded, then CAL achieves an expected mistake bound
And if , then CAL makes an E[#queries]

[Proof Sketch]: Partition D into buckets of diam < ε. Pick a time T_ε past all indices from finite buckets and all the infinite bucket has at least

Number of Mistakes

• Alternative scenario:
 - Let P_i be in bucket i
 - Swap the L(ε) samples for bucket i with L(ε) samples from P_i
 - L(ε) large enough so E[diam(V)]_alternative < √ε.
 - Note: E[diam(V)] ≤ E[diam(V)]_alternative + Σ_{L(ε) t values} ||P_i – D_t|| < √ε + L(ε)·ε.

So E[diam] -> 0 as T -> ∞
 - E[#mistake]
 - Since

Number of Queries

• E[#queries]

• P(make query) = E[P(DIS(V_{t-1}))]

• Let then and E[#queries]
• =>

Explicit Bound: Realizable Case

Theorem. If poly-covers assumption is satisfied ( ), then CAL achieves an expected mistake bound and E[#queries] such that

where

[Proof Sketch] Fix any ε > 0, and enumerate
For t in N, let K(t) be the index k of the closest to Dt.
Alternative data sequence: Let be indep., with
This way, all samples corresp. to distrib.s in a given bucket come from the same distri. Let V't be the corresponding version spaces.

E[#mistakes]

Classic PAC bound =>

(#previous distrib.s in Dt's bucket)
So

(each bucket has at most T samples)
So E[#mistakes]
Take to get the stated theorem.

To bound E[#queries], again it is

just showed this is
So

Again, taking gives the stated result.

Learning with Noise: Noise Conditions

• Strictly benign noise condition:

• Special case: Tsybakov's noise conditions
 - η satisfies strictly benign noise condition and for some c > 0 and α ≥ 0,

• Unif Tsybakov assumption: Tsybakov assumption is satisfied for all with the same c and α values.

Agnostic CAL [DHM]

Based on subroutine:

Tsybakov Noise: Sublinear Results & Explicit Bound

Theorem. If is totally bounded and η satisfies strictly benign noise condition, then ACAL achieves an excess expected mistake bound and if additionally , then ACAL makes an expected number of queries

Theorem. If poly-covers assumption and Unif Tsybakov assumption are satisfied, then ACAL achieves an expected excess number of mistakes and expected #queries such that, for

Outline

• Transfer Learning

- Do not ask what Bayesians can do for Machine Learning, ask what Machine Learning can do for Bayesians

Transfer Learning

• Principle: solving a new learning problem is easier given that we've solved several already!
• How does it help?
 - New task directly "related" to previous task [e.g., Ben-David & Schuller 03; Evgeniou, Micchelli, & Pontil 2005]
 - Previous tasks give us useful sub-concepts [e.g., Thrun 96]
 - Can gather statistical info on the variety of concepts [e.g., Baxter 97; Ando & Zhang 04]
• Example: Speech Recognition
 - After training a few times, figured out the dialects.
 - Next time, just identify the dialect.
 - Much easier than training a recognizer from scratch

Model of Transfer Learning

Motivation: Learners often Not Too Altruistic

• Layer 1: draw task i.i.d. from unknown prior
• Layer 2: per task, draw data i.i.d. from target
• Better Estimate of Prior !!

[Figure: the prior generates targets h1*, h2*, …, hT*; Task t has labeled data (x^t_1, y^t_1), …, (x^t_k, y^t_k).]

- Marvin: so you assume learning French is similar to learning English?

- Liu: It indeed seems many English words have a French counterpart …

Identifiability of priors from joint distribs

• Let prior π be any distribution on C - example: (w, b) ~ multivariate normal

• Target h*π ~ π

• Data X = (X1, X2, …) i.i.d. D, indep. of h*π

• Z(π) = ((X1, h*π(X1)), (X2, h*π(X2)), …).

• Let [m] = {1, …, m}.

• Denote X_I = {Xi}_{i ∈ I} (I: subset of natural numbers)

• Z_I(π) = {(Xi, h*π(Xi))}_{i ∈ I}

Theorem: Z_[VC](π1) =_d Z_[VC](π2) iff π1 = π2.

Identifiability of priors by VC-dim joint distri.

• Threshold:

 - for two points x1, x2, if x1 < x2, then Pr(+,+) = Pr(+,·), Pr(-,-) = Pr(·,-), Pr(+,-) = 0. So Pr(-,+) = Pr(·,+) - Pr(+,+) = Pr(·,+) - Pr(+,·)
 - for any k > 1 points, can directly reduce the number of labels in the joint prob from k to 1:
   P(-----------(-+)+++++++++++++++++)
   = P( (-+) )
   = P( (·+) ) - P( (++) )
   = P( (·+) ) - P( (+·) ) + P( (+-) )   (P( (+-) ) = 0: unrealized labeling !!)
   = P( (·+) ) - P( (+·) )

[Figure: a threshold on [0, 1], labeled - to the left and + to the right.]
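The two-point identity for thresholds can be verified exactly on a small discrete prior. The prior, the points, and the helper names below are made up for illustration; the convention assumed is h_t(x) = + iff x ≥ t, so that x1 < x2 rules out the labeling (+,-).

```python
from fractions import Fraction


def label(threshold, x):
    """Threshold classifier: '+' iff x >= threshold."""
    return '+' if x >= threshold else '-'


def joint(prior, xs):
    """Distribution of the label string on the points xs when the threshold ~ prior."""
    dist = {}
    for threshold, prob in prior.items():
        key = ''.join(label(threshold, x) for x in xs)
        dist[key] = dist.get(key, Fraction(0)) + prob
    return dist


# Hypothetical prior over three thresholds (probabilities sum to 1).
prior = {Fraction(1, 10): Fraction(1, 5),
         Fraction(3, 10): Fraction(1, 2),
         Fraction(7, 10): Fraction(3, 10)}
x1, x2 = Fraction(2, 10), Fraction(5, 10)  # x1 < x2

pair = joint(prior, [x1, x2])
pr_dot_plus = joint(prior, [x2]).get('+', Fraction(0))   # Pr(.,+): marginal on x2
pr_plus_dot = joint(prior, [x1]).get('+', Fraction(0))   # Pr(+,.): marginal on x1

print(pair.get('-+'), pr_dot_plus - pr_plus_dot)  # both equal 1/2
```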

• Theorem: Z[VC] (π1) =d Z[VC] (π2) iff π1 = π2.

Proof Sketch

• Let $\rho_m(h, g) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(X_i) \neq g(X_i)]$. Then vc < ∞ implies w.p.1, for all h, g ∈ C with h ≠ g, lim_{m -> ∞} ρ_m(h, g) = ρ(h, g) > 0
• ρ is a metric on C by assumption, so w.p.1 each h in C labels the ∞-seq (X1, X2, …) distinctly: (h(X1), h(X2), …)
• => w.p.1 the conditional distribution of the label seq Z(π) | X identifies π => distrib of Z(π) identifies π, i.e. Z_∞(π1) =_d Z_∞(π2) implies π1 = π2

Identifiability of Priors from Joint Distributions

[Figure: lower-dim cond. distrib; y' closer to ỹ.]



Transfer Learning Setting

• Collection Π of distribs on C. (known)
• Target distrib π* ∈ Π. (unknown)
• Indep target fns h_1*, …, h_T* ~ π* (unknown)
• Indep i.i.d. D data sets X^{(t)} = (X_1^{(t)}, X_2^{(t)}, …), t ∈ [T].
• Define Z^{(t)} = ((X_1^{(t)}, h_t*(X_1^{(t)})), (X_2^{(t)}, h_t*(X_2^{(t)})), …).
• Learning alg. "gets" Z^{(1)}, then produces ĥ_1, then "gets" Z^{(2)}, then produces ĥ_2, etc. in sequence.
• Interested in: values of ρ(ĥ_t, h_t*), and the number of h_t*(X_j^{(t)}) values alg. needs to access.

Estimating the prior

• Principle: learning would be easier if we knew π*
• Fact: π* is identifiable by distrib of Z^{(t)}_{[VC]}
• Strategy: take samples Z^{(i)}_{[VC]} from past tasks 1, …, t-1, use them to estimate distrib of Z^{(i)}_{[VC]}, convert that into an estimate π'_{t-1} of π*
• Use π'_{t-1} in a prior-dependent learning alg for new task h_t*
• Assume Π is totally bounded in total variation
• Can estimate π* at a bounded rate: ||π* - π'_t|| < δ_t, where δ_t converges to 0 (holds whp)

Transfer Learning

• Given a prior-dependent learning alg. A(ε, π), with E[#labels accessed] = Λ(ε, π) and producing ĥ with E[ρ(ĥ, h*)] ≤ ε

For t = 1, …, T
  If δ_{t-1} > ε/4,
    run prior-indep learning on Z^{(t)}_{[VC/ε]} to get ĥ_t
  Else
    let π''_t = argmin_{π ∈ B(π'_{t-1}, δ_{t-1})} Λ(ε/2, π) and run A(ε/2, π''_t) on Z^{(t)} to get ĥ_t

Theorem: For all t, E[ρ(ĥ_t, h_t*)] ≤ ε, and limsup_{T -> ∞} E[#labels accessed]/T ≤ Λ(ε/2, π*) + vc.

- Yonatan: I'll send you an email to summarize what we just discussed.

- Liu: Thank you, but I now invented a model to transfer knowledge with provable guarantees; so I use that all the time.

- Yonatan: But that's an asymptotic guarantee. My life span is finite. So I'm still gonna send you an email.

Outline

• Online Allocation and Pricing with Economies of Scale

- Jamie Dimon: Economies of scale are a good thing. If we didn't have them, we'd still be living in tents and eating buffalo.

Setting

• Christmas season

- Nov: customer survey

- Dec: purchasing and selling

• Buyers arrive online one at a time w/ val.s on items sampled iid from some unknown distri.

Thrifty Santa Claus

• Each shopper wants only one item, though it might prefer some items to others

• Minimize total cost to seller

• Buyers: binary valuation
• Goal of seller: sat. everyone

Hardness: Set-Cover

• If costs decrease much more rapidly, then even if all customers' val.s were known up front, this would be (roughly) a set-cover problem and one could not hope to achieve cost o(log n) times optimal.

• Natural case: for each good, cost (to the seller) for ordering T copies is sublinear in T.

[Figure: production cost and marginal cost vs. #copies, for α = 0, α in (0, 1), and α = 1.]

Thrifty Santa Claus: Results

• Mar-cost non-increasing: exists optimal strategy of a simple form?
 - order items by some perm.; give new buyer earliest item it desires in the perm.

• What if n (#buyers) >> k (#items) AND mar-cost decreases not too rapidly? (rate 1/T^α for 0 ≤ α < 1)

 - can efficiently perform allocation w/ cost ≤ a const. factor greater than OPT
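The permutation-based allocation rule is easy to sketch. The assumptions are mine: buyers have binary valuations given as sets of desired items, each buyer desires at least one item, the per-item cost of x copies is Σ_{i=1..x} 1/i^α, and `best_order` brute-forces all permutations purely for illustration (it is not the efficient algorithm from the slides).

```python
from itertools import permutations


def production_cost(copies, alpha):
    """Cost of ordering `copies` units when the i-th copy has marginal cost 1/i^alpha."""
    return sum(1.0 / (i ** alpha) for i in range(1, copies + 1))


def allocate(buyers, order):
    """Give each buyer the earliest item in `order` that it desires; return item counts."""
    counts = {}
    for desired in buyers:
        item = next(it for it in order if it in desired)
        counts[item] = counts.get(item, 0) + 1
    return counts


def total_cost(buyers, order, alpha):
    return sum(production_cost(c, alpha) for c in allocate(buyers, order).values())


def best_order(buyers, items, alpha):
    """Brute force over permutations; only feasible for a handful of items."""
    return min(permutations(items), key=lambda order: total_cost(buyers, order, alpha))


buyers = [{0, 1}, {1}, {1, 2}, {2}]  # hypothetical desire sets
order = best_order(buyers, (0, 1, 2), alpha=0.5)
print(order, round(total_cost(buyers, order, 0.5), 4))
```

Putting the widely desired item first concentrates purchases on it, which is where decreasing marginal cost pays off.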

Algorithm

• Alg: use initial buyers to learn about distri., then determine how best to allocate to new buyers.

• If cost fn $c(x) = \sum_{i=1}^{x} 1/i^{\alpha}$, for α in [0,1)

 - run greedy weighted set cover => total cost ≤ 1/(1-α) · OPT.

• Essentially smooth variant of set-cover
• If ave-cost within some factor of mar-cost, have a greedy alg w/ const. approx ratio

Sample Complexity Analysis

• How complicated the allocation rule needs to be to achieve good perf.?

Theorem

Outline

• Factor Models for Correlated Auctions

The Problem

• Auctioneer sells good to a group of n buyers.

• Seller wants to maximize his revenue.
• Each buyer maximizes his utility of getting the good: val. - price
• Seller doesn't know exact val.s of players
• He knows distri. D from which vec. of val.s (v1, …, vn) is drawn.

Our Contribution

• When D is a product distri.,
 - Myerson gives dominant strategy truthful auction
• General correlated distri.s: not known
 - how to create truthful auctions
 - how to use player j's bid to capture info about player i.
• What if correlation between buyer val.s is driven by common factors?

Example

• Two firms produce same type of good
• Each firm's "value": production cost
• need to hire workers (W) & rent capital (Z)

• li: #workers firm i needs to produce one unit

• ki: amount of capital firm i needs

• εi: fixed costs unique to firm i.

• firm's costs: Ci = liW + kiZ + εi

• firms' costs correlated: hire workers & rent capital from the same pool.

The Factor Model

• Factor model as V = λF + U where
 - V: vec. of observations
 - λ: matrix of coefficients
 - F: vec. of factors
 - U: vec. of idiosyncratic components, ind. of each other & ind. of the factors
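A short simulation shows how shared factors induce correlation between bidders' values. The numbers are invented for illustration: two bidders, two standard-normal factors, the loading matrix λ = [[1.0, 0.5], [0.8, 1.0]], and idiosyncratic noise with standard deviation 0.5. Under these assumptions Cov(V1, V2) = 1.0·0.8 + 0.5·1.0 = 1.3, which gives a correlation of about 0.77.

```python
import math
import random


def sample_values(loadings, noise_sd, rng):
    """Draw one value vector V = lambda F + U with standard-normal factors F."""
    n_factors = len(loadings[0])
    factors = [rng.gauss(0, 1) for _ in range(n_factors)]
    return [sum(l * f for l, f in zip(row, factors)) + rng.gauss(0, noise_sd)
            for row in loadings]


def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
loadings = [[1.0, 0.5], [0.8, 1.0]]  # hypothetical lambda: row i = bidder i
draws = [sample_values(loadings, noise_sd=0.5, rng=rng) for _ in range(20_000)]
v1 = [d[0] for d in draws]
v2 = [d[1] for d in draws]
print(f"sample correlation = {correlation(v1, v2):.3f}")
```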

Discussions

• Possible that:
 - Designer & bidders might not know common factors
 - Bidders might only know their val.
 - Seller only knows joint distri. of bidders' val.s

• Seller RECOVER factor model by making inferences over observed bids.

• Aggregate info.: common factors inferred from collective knowledge of all players.

The Auction

The Auction (cont.)

• Thm: When correlation follows this factor model, this auction is dominant strategy truthful, ex-post individually rational, and asymptotically optimal.

Dominant Strategy Truthfulness

• Toss a coin & choose between:
 - 2nd price auction: truthful
 - mechanism M, which estimates factors from a random set of bidders S: bidders in S receive utility 0 regardless of allocation & price output by M
• Players in S are incentivized to be truthful by the small incentive they get from participating in the 2nd price auction.
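The truthfulness of the 2nd price component can be checked by brute force on a grid of bids. This sketch covers only the second-price auction, not mechanism M; the tie-breaking rule (lowest index wins) and all names are assumptions.

```python
def second_price(bids):
    """Highest bidder wins (ties go to the lowest index) and pays the highest other bid."""
    winner = max(range(len(bids)), key=lambda i: bids[i])
    price = max(b for i, b in enumerate(bids) if i != winner)
    return winner, price


def utility(value, my_bid, other_bids):
    """Utility of bidder 0 with private value `value` when bidding `my_bid`."""
    winner, price = second_price([my_bid] + list(other_bids))
    return value - price if winner == 0 else 0


def truthful_is_best(value, other_bids, bid_grid):
    """True iff bidding the true value does at least as well as every bid on the grid."""
    truthful = utility(value, value, other_bids)
    return all(truthful >= utility(value, b, other_bids) for b in bid_grid)


grid = range(0, 11)
print(all(truthful_is_best(v, (a, b), grid) for v in grid for a in grid for b in grid))
```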

Dominant Strategy Truthfulness (Cont.)

• Remaining bidders set R = {1, …, n} \ S receive incentives from both 2nd price auction and mechanism M.

• M offers them allocation and price vec.s x(b_R), p(b_R) by running Myerson(b_R, V_R | f̂) on players' bids, and on cond. distri.s estimated for these players.

• No player in R can influence the estimated conditional distri. V_R | f̂, and Myerson's optimal auction is truthful.

Thanks !


Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
