
Page 1: Learning intersections and thresholds of halfspaces

Learning intersections and thresholds of halfspaces

Adam Klivans (MIT/Harvard)
Ryan O’Donnell (MIT)

Rocco Servedio (Harvard)

Page 2: Learning intersections and thresholds of halfspaces

Learning

We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means:

- a function f in C is selected, and also a probability distribution D over {+1,−1}^n
- the learning algorithm gets access to random examples ⟨x, f(x)⟩, where the x’s are drawn from D
- goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.

Page 3: Learning intersections and thresholds of halfspaces

Learning example

Example: C is the class of all conjunctions of variables. Perhaps the concept selected is:

x1 AND x2 AND x4.

One might see examples:

⟨ (+ + − + − +), + ⟩
⟨ (− + − + − −), − ⟩
⟨ (+ + + − − +), − ⟩

What is a learning algorithm for this class?
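The slide leaves the question open; the standard answer is the classic elimination algorithm: start with the conjunction of all n variables and delete any variable set to −1 in some positive example. Here is a minimal Python sketch (my own illustration, not from the talk; names are invented for the demo):

```python
# Elimination algorithm for learning monotone conjunctions.
# examples: list of (x, label), x a tuple over {+1,-1}, label in {+1,-1}.
def learn_conjunction(examples, n):
    relevant = set(range(n))              # start with all n variables
    for x, label in examples:
        if label == +1:                   # only positive examples eliminate
            relevant = {i for i in relevant if x[i] == +1}
    def h(x):                             # hypothesis: AND of surviving variables
        return +1 if all(x[i] == +1 for i in relevant) else -1
    return relevant, h

# The three examples from the slide (variables indexed from 0):
examples = [((+1, +1, -1, +1, -1, +1), +1),
            ((-1, +1, -1, +1, -1, -1), -1),
            ((+1, +1, +1, -1, -1, +1), -1)]
relevant, h = learn_conjunction(examples, 6)
print(sorted(i + 1 for i in relevant))    # [1, 2, 4, 6], consistent with x1 AND x2 AND x4
```

Since the surviving set always contains every variable of the target conjunction, the hypothesis is at least as restrictive as the target and never errs on negative examples; its one-sided error shrinks with more positive examples, and Occam’s Razor gives the PAC guarantee.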

Page 4: Learning intersections and thresholds of halfspaces

Halfspaces

Let h be a hyperplane in R^n:

h = {x : ∑_{i=1}^n w_i x_i = θ}.

h naturally induces a boolean function f : {+1,−1}^n → {+1,−1},

f(x) = sgn(∑_{i=1}^n w_i x_i − θ).

We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (w_i ≡ 1, θ = 0).

Page 5: Learning intersections and thresholds of halfspaces

Learning halfspaces

Learning halfspaces is a very old problem; it dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62].

The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89]. Indeed, this works over any distribution on R^n, including distributions supported on {+1,−1}^n.

Page 6: Learning intersections and thresholds of halfspaces

Learning halfspaces

Basic idea: given a bunch of examples, find a halfspace which classifies them correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm.

Consider the coefficients of a hypothesis halfspace to be unknowns a_1, …, a_n, θ. Each example then induces a linear constraint: e.g., ⟨ (+ + − + − −), + ⟩ induces a_1 + a_2 − a_3 + a_4 − a_5 − a_6 > θ. Solve the resulting LP, as in the sketch below.
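As a concrete illustration (my own, not from the talk), here is a minimal sketch using SciPy’s linprog. Replacing the strict inequality by a margin of 1 is harmless for a separable finite sample, since a strictly separating hyperplane can be rescaled.

```python
# LP-based learning of a halfspace: treat a_1..a_n, theta as unknowns and
# ask the LP solver for any point satisfying all example-induced constraints.
import numpy as np
from scipy.optimize import linprog

def fit_halfspace(X, y):
    """X: (m, n) array over {+1,-1}; y: (m,) labels in {+1,-1}.
    Returns (a, theta) with y * (X @ a - theta) >= 1 on every example."""
    m, n = X.shape
    # y_j * (a . x_j - theta) >= 1  rewritten as  -y_j*(a . x_j) + y_j*theta <= -1.
    A_ub = np.hstack([-y[:, None] * X, y[:, None]])
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))   # all variables free
    if not res.success:
        raise ValueError("no consistent halfspace found")
    return res.x[:n], res.x[n]

# Labels from the halfspace sgn(x1 + x2 + x3) on random boolean examples:
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(50, 6))
y = np.sign(X[:, :3].sum(axis=1))        # sum of 3 odd terms, never 0
a, theta = fit_halfspace(X, y)
assert np.all(np.sign(X @ a - theta) == y)
```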

Page 7: Learning intersections and thresholds of halfspaces

Learning intersections of halfspaces

The next logical extension of this, and a very important one, is learning intersections of halfspaces.

Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas…

Learning them is also an important problem for computer vision and the study of perceptrons. But very little is known.

Page 8: Learning intersections and thresholds of halfspaces

Prior work

- [Baum91]: poly time algorithm for an intersection of two halfspaces through the origin, under symmetric distributions (those satisfying D(x) = D(−x)).
- [BlumKannan, Vempala97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere (not relevant for boolean halfspaces).
- [KwekPitt98]: a polynomial time algorithm, but it requires membership queries (also not relevant for boolean halfspaces).

Page 9: Learning intersections and thresholds of halfspaces

Our results

Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}^n is learnable under the uniform distribution to accuracy 1−ε in time n^{O(k²/ε²)}.

This is polynomial time if k = O(1) and ε = Ω(1). (Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

Page 10: Learning intersections and thresholds of halfspaces

Our results

Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time n^{O(k log k log W)}/ε.

So if the weights are polynomially bounded, one can learn an intersection of logarithmically many halfspaces in quasipolynomial time.

Page 11: Learning intersections and thresholds of halfspaces

More results

Function                                | Halfspaces | Distrib. | Time
any fcn. of k                           | weight W   | any      | n^{O(k² log k log W)}/ε
weight-k threshold (e.g., inters. of k) | weight W   | any      | n^{O(k log k log W)}/ε
intersection of k                       | weight W   | any      | n^{O(√W log k)}/ε
read-once intersection of k             | arbitrary  | uniform  | n^{O((log(k)/ε)²)}
read-once majority of k                 | arbitrary  | uniform  | n^{Õ((log(k)/ε)⁴)}

Page 12: Learning intersections and thresholds of halfspaces

Sketch of techniques

For the arbitrary distribution results: show that functions of low-weight halfspaces have low-degree polynomial threshold representations.

For the uniform distribution results: show that functions of halfspaces have low noise sensitivity.

Both conclusions imply learning results generically.

Page 13: Learning intersections and thresholds of halfspaces

Talk outline

Plan for the rest of the talk:

1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch the other arbitrary-distribution results.)
2. Prove the n^{O(k²/ε²)} bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch the other uniform-distribution results.)

Page 14: Learning intersections and thresholds of halfspaces

Polynomial threshold functions

A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f:

f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.

- every boolean halfspace is a degree-1 PTF for itself
- every boolean function has a degree-n PTF

By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^{O(d)}/ε.

Page 15: Learning intersections and thresholds of halfspaces

PTFs for intersections of halfspaces

Suppose f and g are the linear forms of two halfspaces,

f(x) = ∑ w_i x_i − θ,  g(x) = ∑ w_i′ x_i − θ′.

We would like a PTF for sgn(f) ∧ sgn(g).

Failed attempt 1: try f(x)g(x):
- is > 0 if f(x) > 0 and g(x) > 0
- but is also > 0 if f(x) < 0 and g(x) < 0, so it cannot tell “both true” from “both false”

Failed attempt 2: try f(x) + g(x):
- is > 0 if f(x) > 0 and g(x) > 0
- is < 0 if f(x) < 0 and g(x) < 0
- is ?? if f(x) > 0 and g(x) < 0

Page 16: Learning intersections and thresholds of halfspaces

PTFs for intersections of halfspaces

The solution: apply a (polynomial? in fact, rational) function to f and g to make them look more like their sign.

Assume the weights are integers with ∑|w_i| < W (and that f, g never vanish on the cube). Then for all x ∈ {+1,−1}^n,

f(x), g(x) ∈ [−W, −1] ∪ [1, W].

Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W, −1] ∪ [1, W].

Page 17: Learning intersections and thresholds of halfspaces

BRS’s sgn-approximator

p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)²

Q(x) = [p(−x) − p(x)] / [p(−x) + p(x)]

Q is a rational function of degree O(log k log W) such that:

Q(x) ∈ [1, 1+1/k] for x ∈ [1, W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W, −1].
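A quick numeric sanity check (my own sketch, taking W = 32 as in the displayed p; the accuracy 1/k is whatever this particular p achieves):

```python
# Evaluate the BRS rational approximator and report how far it strays
# from +1 on [1, W] and from -1 on [-W, -1].
import numpy as np

def p(x):
    return (x - 1) * (x - 2)**2 * (x - 4)**2 * (x - 8)**2 * (x - 16)**2 * (x - 32)**2

def Q(x):
    return (p(-x) - p(x)) / (p(-x) + p(x))

xs = np.linspace(1.0, 32.0, 100001)
print("Q on [  1, 32] lies in [%.5f, %.5f]" % (Q(xs).min(), Q(xs).max()))    # ~[1, 1 + 1/k]
print("Q on [-32, -1] lies in [%.5f, %.5f]" % (Q(-xs).min(), Q(-xs).max()))  # ~[-1 - 1/k, -1]
```

On [1, 32] both numerator and denominator are dominated by p(−x): each factor (x + 2^i) outweighs the matching factor |x − 2^i| of p(x), which is why Q hugs 1 there.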

Page 18: Learning intersections and thresholds of halfspaces

PTFs for intersections of halfspaces

Now given weight-W halfspaces h_1, …, h_k,

sgn( Q(h_1(x)) + … + Q(h_k(x)) − (k − ½) )

sign-represents the intersection, and the expression inside is a rational function which, once taken to a common denominator, has degree O(k log k log W).

It is easy to get a polynomial from this: sgn(p/q) = sgn(pq). So we have a PTF of degree O(k log k log W) for the intersection of k weight-W halfspaces, and hence a learning algorithm running in time n^{O(k log k log W)}.
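Continuing the sketch above, one can verify the sign-representation claim empirically for a small intersection (again my own check; the three majority-style halfspaces are invented for the demo, with odd support sizes so their values on the cube are nonzero integers of magnitude at most 7 ≤ W = 32):

```python
# Check that sgn(Q(h1(x)) + ... + Q(hk(x)) - (k - 1/2)) computes the AND
# of k weight-bounded halfspaces on random boolean points.
import numpy as np

def p(x):
    return (x - 1) * (x - 2)**2 * (x - 4)**2 * (x - 8)**2 * (x - 16)**2 * (x - 32)**2

def Q(x):
    return (p(-x) - p(x)) / (p(-x) + p(x))

rng = np.random.default_rng(1)
n, k = 10, 3
subsets = [range(0, 5), range(3, 10), (1, 3, 5)]       # odd sizes: block sums never vanish

for _ in range(2000):
    x = rng.choice([-1, 1], size=n)
    h = [sum(x[i] for i in S) for S in subsets]        # each value in [-32,-1] U [1,32]
    truth = +1 if all(v > 0 for v in h) else -1        # the intersection (AND)
    ptf = +1 if sum(Q(v) for v in h) - (k - 0.5) > 0 else -1
    assert ptf == truth
print("sign-representation agrees with the AND on 2000 random points")
```

If every h_j(x) ≥ 1, each Q term is at least 1 and the sum exceeds k − ½; if some h_j(x) ≤ −1, that term is at most −1 and, using the 1 + 1/k accuracy of the other terms, the sum falls below k − ½.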

Page 19: Learning intersections and thresholds of halfspaces

Talk outline

Plan for the talk:

1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.

Page 20: Learning intersections and thresholds of halfspaces

Noise sensitivity

Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε.

defn: The noise sensitivity of f is

NS_ε(f) = Pr[f(x) ≠ f(y)].
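The definition is easy to estimate by simulation; here is a small Monte Carlo sketch (mine, for intuition only):

```python
# Estimate NS_eps(f): sample x uniformly from {+1,-1}^n, flip each bit
# independently with probability eps, and count disagreements of f.
import numpy as np

def noise_sensitivity(f, n, eps, trials=50_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(trials, n))
    flips = rng.random((trials, n)) < eps       # flip mask, each bit w.p. eps
    y = np.where(flips, -x, x)                  # the eps-corruption of x
    fx = np.apply_along_axis(f, 1, x)
    fy = np.apply_along_axis(f, 1, y)
    return float(np.mean(fx != fy))

# Checks against the examples on the next slide:
eps, n = 0.05, 15
print(noise_sensitivity(lambda z: z[0], n, eps))        # dictator: ~eps = 0.05
print(noise_sensitivity(lambda z: np.prod(z), n, eps))  # parity: ~1/2 - (1-2*eps)**15/2 = 0.397
```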

Page 21: Learning intersections and thresholds of halfspaces

Noise sensitivity examples

• Let f be a projection to one bit, f(x_1, …, x_n) = x_1. Then NS_ε(f) = ε.

• Suppose f depends on only k bits. Then NS_ε(f) ≤ kε.

• PARITY is the most noise-sensitive function: NS_ε(PARITY_n) = ½ − ½(1−2ε)^n.

Page 22: Learning intersections and thresholds of halfspaces

Noise sensitivity – study and apps.

• [Benjamini-Kalai-Schramm-98]: percolation, low-level circuit complexity
• [Kahn-Kalai-Linial-88]: random walks on the hypercube
• [Håstad-97]: probabilistically checkable proofs
• [Bshouty-Jackson-Tamon-99]: learning theory under noise
• [O-02]: Yao’s XOR Lemma, average-case hardness of NP
• [Bourgain-02, Kindler-Safra-02, FKRSS-02]: study of juntas, Fourier analysis of boolean fcns.

Page 23: Learning intersections and thresholds of halfspaces

Low noise sens. ⇒ fast learning

We show that if the noise sensitivity of every f in C is uniformly bounded,

NS_ε(f) ≤ α(ε),

then C is learnable under the uniform distribution in time

n^{O(1)/α^{−1}(ε/3)}.

Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.

Page 24: Learning intersections and thresholds of halfspaces

Proof of NS-learning connection

Actually, the intuition is wrong. Here is the proper proof sketch:

Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula

NS_ε(f) = ½ − ½ ∑_S (1−2ε)^{|S|} f̂(S)²

and a Markov-type inequality.

Low-level Fourier concentration ⇒ efficient uniform distribution learning; this is the “Low-Degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93], sketched below.
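Here is a minimal rendering (mine) of the Low-Degree algorithm’s idea, with brute-force enumeration of subsets, so it is only sensible for small n and d:

```python
# "Low-Degree" algorithm: estimate every Fourier coefficient
# f^(S) = E[f(x) chi_S(x)] with |S| <= d from uniform examples, then
# predict with the sign of the resulting degree-d polynomial.
import numpy as np
from itertools import combinations

def low_degree_learn(X, y, d):
    n = X.shape[1]
    coeffs = {}
    for size in range(d + 1):
        for S in combinations(range(n), size):
            chi = X[:, list(S)].prod(axis=1)      # chi_S(x); empty product is 1
            coeffs[S] = float(np.mean(y * chi))   # empirical estimate of f^(S)
    def h(x):
        total = sum(c * np.prod(x[list(S)]) for S, c in coeffs.items())
        return 1 if total >= 0 else -1
    return h

rng = np.random.default_rng(2)
n, m, w = 8, 4000, np.array([3, 2, 2, 1, 1, 1, 1, 1])
X = rng.choice([-1, 1], size=(m, n))
y = np.sign(X @ w + 0.5)                          # a halfspace; +0.5 breaks ties
h = low_degree_learn(X, y, d=3)
Xt = rng.choice([-1, 1], size=(2000, n))
preds = np.array([h(x) for x in Xt])
print("test error:", np.mean(preds != np.sign(Xt @ w + 0.5)))  # small: halfspaces concentrate at low degree
```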

Page 25: Learning intersections and thresholds of halfspaces

Noise sensitivity of halfspaces

Function                               | NS_ε               | Proof
one boolean halfspace                  | O(√ε)              | Y. Peres, ’98
any function of k halfspaces           | O(k√ε)             | union bound
read-once intersection of k halfspaces | O(√ε · log k)      | difficult probabilistic analysis
read-once majority of k halfspaces     | Õ((ε log k)^{1/4}) | difficult probabilistic analysis

Page 26: Learning intersections and thresholds of halfspaces

Consequences

Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so every f ∈ C has NS_ε(f) ≤ α(ε). Solving k√δ = ε/3 for δ gives

α^{−1}(ε/3) = O(ε²/k²).

Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^{O(k²/ε²)}.

Page 27: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε).

Suppose the halfspace is f = sgn(∑ w_i x_i − θ). Without (much) loss of generality, one can assume θ = 0. Recall the experiment: the x_i’s are selected randomly from {+1,−1} and the sum is formed; then each x_i is flipped independently with probability ε and the sum is formed again. We want to show that the probability that the two sums land on opposite sides of 0 (call this event a “flop”, and its probability P) is O(√ε).

Page 28: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.)

We now model the problem thus: pick signs x_i at random; randomly permute the weights; divide the weights into n/k blocks of size k; form the n/k block sums

X_1 = ∑_{i=1…k} w_i x_i,  X_2 = ∑_{i=k+1…2k} w_i x_i,  etc.

Page 29: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

Write S = X_1 + … + X_{n/k} for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S′ = S − X_1, so the sum before flipping is S′ + X_1, and the sum after flipping is S′ − X_1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S′| < |X_1|.

Page 30: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

sgn(X_1) and S′ are independent, so:

Pr[sgn(X_1) ≠ sgn(S′)] = ½.

sgn(X_1) is also independent of |X_1|, so conditioning changes nothing:

Pr[sgn(X_1) ≠ sgn(S′) | |S′| > |X_1|] = ½.

When |S′| > |X_1|, sgn(S) = sgn(S′ + X_1) = sgn(S′), and “no flop” is exactly the event |S′| > |X_1|, so:

Pr[sgn(X_1) ≠ sgn(S) | no flop] = ½, hence Pr[sgn(X_1) ≠ sgn(S) & no flop] = ½(1−P).

A flop (|S′| < |X_1|) forces sgn(S) = sgn(X_1), so the event sgn(X_1) ≠ sgn(S) never happens together with a flop. Therefore:

Pr[sgn(X_1) ≠ sgn(S)] = ½(1−P), i.e., P = 2 E[½ − I[sgn(X_1) ≠ sgn(S)]].

Page 31: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

Of course, there was nothing special about block X_1 as opposed to any other block. So in fact

P = 2 E[½ − I[sgn(X_i) ≠ sgn(S)]]

for all i = 1…n/k. Write τ = sgn(S) and σ_i = sgn(X_i), and average over i:

P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]].

Page 32: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]]

The quantity inside the expectation is some random variable: a number which is either ½ − (k/n) ∑_i I[+1 ≠ σ_i] or ½ − (k/n) ∑_i I[−1 ≠ σ_i], depending on whether τ = +1 or τ = −1.

If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this pointwise inside the expectation:

P ≤ 2 E[ |½ − (k/n) ∑_i I[σ_i = 1]| + |½ − (k/n) ∑_i I[σ_i = −1]| ].

Page 33: Learning intersections and thresholds of halfspaces

Noise sensitivity of a halfspace

P ≤ 2 E[ |½ − ε ∑_{i=1…1/ε} I[σ_i = 1]| + |½ − ε ∑_{i=1…1/ε} I[σ_i = −1]| ]

(here k/n = ε, and there are n/k = 1/ε blocks). But the σ_i’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean of 1/ε samples of an unbiased 0/1 random variable, i.e., O(√ε).
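The final O(√ε) is just the familiar square-root scaling of deviations for m = 1/ε coin flips; a quick numeric sketch (mine):

```python
# Expected absolute deviation of a fair-coin average over m = 1/eps tosses
# is Theta(1/sqrt(m)) = Theta(sqrt(eps)): the printed ratios are constant.
import numpy as np

rng = np.random.default_rng(3)
for eps in (0.04, 0.01, 0.0025):
    m = int(1 / eps)                                          # number of blocks
    tosses = rng.integers(0, 2, size=(100_000, m), dtype=np.uint8)
    dev = np.abs(tosses.mean(axis=1) - 0.5).mean()            # E|fraction of heads - 1/2|
    print(f"eps={eps}:  E|dev| = {dev:.4f},  dev/sqrt(eps) = {dev / np.sqrt(eps):.3f}")
```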

Page 34: Learning intersections and thresholds of halfspaces

Extensions

This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows.

To get the extended learning algorithms, one must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased, say the probability of +1 is p < ½, then

NS_ε(h) ≤ min{ 2p, C·p·(ε log(1/p))^{1/2} }.

Page 35: Learning intersections and thresholds of halfspaces

Talk outline

Plan for the talk:

1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.

Page 36: Learning intersections and thresholds of halfspaces

Open technical challenges

• Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!)

• Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^{1/2})?

Page 37: Learning intersections and thresholds of halfspaces

The huge open problem

It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!