Learning intersections and thresholds of halfspaces


Adam Klivans (MIT/Harvard)

Ryan O’Donnell (MIT)

Rocco Servedio (Harvard)

Learning

We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means:

- a function f in C is selected, along with a probability distribution D over {+1,−1}^n

- the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D

- goal: efficiently output a hypothesis h such that, w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.

Learning example

Example: C is the class of all conjunctions of variables. Perhaps the concept selected is:

x_1 AND x_2 AND x_4.

One might see examples:

< (+ + − + − +), + >
< (− + − + − −), − >
< (+ + + − − +), − >

What is a learning algorithm for this class?
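One standard answer to the question above is the elimination algorithm: start with the conjunction of all variables and delete every variable that is false in some positive example. The sketch below assumes that + means "true" and that only unnegated variables appear; the function names are illustrative, not from the talk.

```python
def learn_conjunction(examples):
    """Elimination learner for conjunctions of (unnegated) variables.

    examples: list of (x, label), x a tuple of +1/-1 values, +1 read as "true".
    Returns the sorted 1-based indices of the variables kept in the hypothesis.
    """
    n = len(examples[0][0])
    kept = set(range(1, n + 1))  # start with the AND of every variable
    for x, label in examples:
        if label == +1:
            # A variable that is false in a positive example cannot be in the target.
            kept -= {i for i in kept if x[i - 1] == -1}
    return sorted(kept)


def predict(kept, x):
    """Evaluate the conjunction of the kept variables on x."""
    return +1 if all(x[i - 1] == +1 for i in kept) else -1


# The three examples from the slide.
slide_examples = [
    ((+1, +1, -1, +1, -1, +1), +1),
    ((-1, +1, -1, +1, -1, -1), -1),
    ((+1, +1, +1, -1, -1, +1), -1),
]
hypothesis = learn_conjunction(slide_examples)
print(hypothesis)
```

With only one positive example the hypothesis still contains x_6 in addition to x_1, x_2, x_4; further positive examples would be needed to eliminate it.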

Halfspaces

Let h be a hyperplane in R^n:

h = {x : ∑_{i=1}^n w_i x_i = θ}.

h naturally induces a boolean function f : {+1,−1}^n → {+1,−1},

f(x) = sgn(∑_{i=1}^n w_i x_i − θ).

We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (w_i ≡ 1, θ = 0).

Learning halfspaces

Learning halfspaces is a very old problem; it dates back to models of the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62].

The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89].

Indeed, this works over any distribution on R^n, including those supported on {+1,−1}^n.

Learning halfspaces

Basic idea: given a set of examples, find a halfspace which classifies them all correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm.

Consider the coefficients of a hypothesis halfspace to be unknowns, a_1, …, a_n, θ.

Each example induces a linear constraint: e.g., < (+ + − + − −), + > induces a_1 + a_2 − a_3 + a_4 − a_5 − a_6 > θ. Solve the LP.
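The consistency step can be sketched without an LP solver. The code below swaps in the perceptron update rule in place of linear programming, purely to stay dependency-free: it also finds some (a, θ) satisfying every example's constraint when the examples are strictly separable, but it does not carry the polynomial-time guarantee of the LP approach. The target weights and the helper names are made up for illustration.

```python
from itertools import product


def sgn(t):
    return 1 if t > 0 else -1


def consistent_halfspace(examples, max_epochs=1000):
    """Return (a, theta) with sgn(a.x - theta) == label on every example, or None."""
    n = len(examples[0][0])
    a, theta = [0] * n, 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in examples:
            if sgn(sum(ai * xi for ai, xi in zip(a, x)) - theta) != label:
                # Perceptron update: move (a, theta) toward satisfying this constraint.
                a = [ai + label * xi for ai, xi in zip(a, x)]
                theta -= label
                mistakes += 1
        if mistakes == 0:
            return a, theta
    return None


# Hypothetical target halfspace; these weights make the sum odd, hence never 0.
target_w = (2, 1, -1, 3)
examples = [
    (x, sgn(sum(w * xi for w, xi in zip(target_w, x))))
    for x in product((1, -1), repeat=4)
]
result = consistent_halfspace(examples)
```

The returned halfspace need not equal the target; it only has to agree with it on the examples, which is all the Occam argument requires.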

Learning intersections of halfspaces

The next logical extension of this, and a very important one, is learning intersections of halfspaces.

Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas…

Learning them is also an important problem in computer vision and in the study of perceptrons.

But very little is known.

Prior work

- [Baum91]: poly time algorithm for an intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).

- [BlumKannan,Vempala97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere
  - not relevant for boolean halfspaces

- [KwekPitt98]: gave a polynomial time algorithm, but it requires membership queries
  - also not relevant for boolean halfspaces

Our results

Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}^n is learnable under the uniform distribution to accuracy 1−ε in time:

n^{O(k²/ε²)}.

This is polynomial time if k = O(1), ε = Ω(1).

(Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

Our results

Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time:

n^{O(k log k log W)}/ε.

So if the weights are polynomially bounded, one can learn an intersection of logarithmically many halfspaces in quasipolynomial time.

More results

Function                                  Halfspaces   Distrib.   Time
any fcn. of k                             weight W     any        n^{O(k² log k log W)}/ε
weight k threshold (e.g., inters. of k)   weight W     any        n^{O(k log k log W)}/ε
intersection of k                         weight W     any        n^{O(√W log k)}/ε
read-once intersection of k               arbitrary    uniform    n^{O((log(k)/ε)²)}
read-once majority of k                   arbitrary    uniform    n^{Õ((log(k)/ε)⁴)}

Sketch of techniques

For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations.

For uniform distribution results: show that functions of halfspaces have low noise sensitivity.

Both conclusions imply learning results generically.

Talk outline

Plan for the rest of the talk:

1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch other arbitrary distribution results.)

2. Prove the n^{O(k²/ε²)} bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch other uniform distribution results.)

Polynomial threshold functions

A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f: f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.

- every boolean halfspace has a degree 1 PTF (its own linear form)

- every boolean function has a degree n PTF

By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^{O(d)}/ε.

PTFs for intersections of halfspaces

Suppose f and g are the linear forms of two halfspaces,

f(x) = ∑ w_i x_i − θ,   g(x) = ∑ w_i' x_i − θ'.

We would like a PTF for sgn(f) ∧ sgn(g).

Failed attempt 1:
- try f(x)g(x): it is >0 if f(x)>0 and g(x)>0, but it is also >0 if f(x)<0 and g(x)<0

Failed attempt 2:
- try f(x)+g(x): it is >0 if f(x)>0 and g(x)>0, and <0 if f(x)<0 and g(x)<0, but its sign is undetermined (??) if f(x)>0 and g(x)<0
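Both failures are easy to see on concrete numbers. The values of f and g below are arbitrary illustrations, and `target_and` is a hypothetical helper standing for the AND of the two halfspaces.

```python
def sgn(t):
    return 1 if t > 0 else -1


def target_and(f, g):
    """+1 exactly when both linear forms are positive."""
    return 1 if f > 0 and g > 0 else -1


# Attempt 1: the product is positive when both forms are negative.
f, g = -2, -3
product_is_wrong = sgn(f * g) != target_and(f, g)

# Attempt 2: the sum follows whichever form has the larger magnitude.
f, g = 5, -1
sum_is_wrong = sgn(f + g) != target_and(f, g)
```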

PTFs for intersections of halfspaces

The solution: apply a (polynomial?) function to f and g to make them look more like their sign.

Assume ∑|w_i| < W. Then for all x ∈ {+1,−1}^n, f(x), g(x) ∈ [−W,−1] ∪ [1,W].

Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].

BRS’s sgn-approximator

p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)²

q(x) = (p(−x) − p(x)) / (p(−x) + p(x))

Q is a rational function of degree O(log k log W) such that:
Q(x) ∈ [1, 1+1/k] for x ∈ [1,W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W,−1].
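The behaviour of the displayed q can be checked exactly with rational arithmetic. This is only a sanity check of the specific p and q written on the slide, over a grid of points in [1, 32]; it says nothing about how the general Q is built, and the grid spacing is an arbitrary choice.

```python
from fractions import Fraction


def p(x):
    """p(x) = (x-1)(x-2)^2(x-4)^2(x-8)^2(x-16)^2(x-32)^2, as on the slide."""
    value = x - 1
    for root in (2, 4, 8, 16, 32):
        value *= (x - root) ** 2
    return value


def q(x):
    """q(x) = (p(-x) - p(x)) / (p(-x) + p(x)), computed exactly."""
    x = Fraction(x)
    return (p(-x) - p(x)) / (p(-x) + p(x))


grid = [Fraction(j, 8) for j in range(8, 32 * 8 + 1)]  # 1, 1.125, ..., 32
largest = max(q(x) for x in grid)
print(float(largest))
```

On this grid q stays at or just above 1 for positive inputs, and since q is odd it stays at or just below −1 on the mirrored points.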

PTFs for intersections of halfspaces

Now given weight W halfspaces h_1, …, h_k,

Q(h_1(x)) + … + Q(h_k(x)) − (k−½)

is a rational function which sign-represents the intersection: if every h_i(x) ≥ 1 the sum is at least k, and if some h_i(x) ≤ −1 the sum is at most (k−1)(1+1/k) − 1 < k − 1. Once taken to a common denominator, it has degree O(k log k log W).

It is easy to get a polynomial: sgn(p/q) = sgn(pq). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time n^{O(k log k log W)}.
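As a small check of the construction, the sketch below uses the slide's q (built from p with roots at powers of 2 up to 32) as a stand-in for Q, which is an assumption that is only reasonable here because k = 3 and all values of the h_i lie in [−32,−1] ∪ [1,32]. The three halfspaces are invented for the example, with weights chosen so that no h_i ever evaluates to 0.

```python
from fractions import Fraction
from itertools import product


def p(x):
    value = x - 1
    for root in (2, 4, 8, 16, 32):
        value *= (x - root) ** 2
    return value


def Q_standin(t):
    """The slide's q, standing in for Q (valid here only for small k and W = 32)."""
    t = Fraction(t)
    return (p(-t) - p(t)) / (p(-t) + p(t))


# Hypothetical halfspaces (weights, theta); every value w.x - theta is an odd integer.
halfspaces = [((3, 1, -2, 1, 2), 0), ((1, -1, 1, -1, 1), -2), ((2, 2, 1, 2, 2), 4)]
k = len(halfspaces)


def h(w, theta, x):
    return sum(wi * xi for wi, xi in zip(w, x)) - theta


def ptf_sign(x):
    total = sum(Q_standin(h(w, theta, x)) for w, theta in halfspaces) - (k - Fraction(1, 2))
    return 1 if total > 0 else -1


def intersection(x):
    return 1 if all(h(w, theta, x) > 0 for w, theta in halfspaces) else -1


cube = list(product((1, -1), repeat=5))
agree = all(ptf_sign(x) == intersection(x) for x in cube)
```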

Talk outline

Plan for the talk:

1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.

Noise sensitivity

Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε.

defn: The noise sensitivity of f is: NS_ε(f) = Pr[f(x) ≠ f(y)].

Noise sensitivity examples

• Let f be a projection to one bit, f(x_1, …, x_n) = x_1. Then NS_ε(f) = ε.

• Suppose f depends on only k bits. Then NS_ε(f) ≤ kε.

• PARITY is the most noise-sensitive function: NS_ε(PARITY_n) = ½ − ½(1−2ε)^n.
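For small n the definition can be evaluated exactly by summing over every input and every flip pattern. The sketch below does this with rational arithmetic and can be used to confirm the three examples above; n = 4 and ε = 1/10 are arbitrary choices, and `and_of_two` is just one sample function that depends on k = 2 bits.

```python
from fractions import Fraction
from itertools import product


def noise_sensitivity(f, n, eps):
    """Exact NS_eps(f) = Pr[f(x) != f(y)], summing over all x and all flip patterns."""
    eps = Fraction(eps)
    cube = list(product((1, -1), repeat=n))
    total = Fraction(0)
    for x in cube:
        for flips in product((0, 1), repeat=n):
            y = tuple(-xi if b else xi for xi, b in zip(x, flips))
            if f(x) != f(y):
                weight = Fraction(1)
                for b in flips:
                    weight *= eps if b else 1 - eps  # probability of this flip pattern
                total += weight
    return total / len(cube)


def dictator(x):
    return x[0]


def parity(x):
    out = 1
    for xi in x:
        out *= xi
    return out


def and_of_two(x):
    return 1 if x[0] == 1 and x[1] == 1 else -1


n, eps = 4, Fraction(1, 10)
ns_dictator = noise_sensitivity(dictator, n, eps)
ns_parity = noise_sensitivity(parity, n, eps)
ns_and = noise_sensitivity(and_of_two, n, eps)
```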

Noise sensitivity – study and apps.

• [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity
• [Kahn-Kalai-Linial-88] – random walks on the hypercube
• [Håstad-97] – probabilistically checkable proofs
• [Bshouty-Jackson-Tamon-99] – learning theory under noise
• [O-02] – Yao’s XOR Lemma, average case hardness of NP
• [Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.

Low noise sens. ⇒ fast learning

We show that if the noise sensitivity of all f in C is uniformly bounded:

NS_ε(f) ≤ α(ε),

then C is learnable under the uniform distribution in time:

n^{O(1)/α^{−1}(ε/3)}.

Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.

Proof of NS-learning connection

Actually, the intuition is wrong. Here is the proper proof sketch:

Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NS_ε(f) = ½ − ½ ∑_S (1−2ε)^{|S|} f̂(S)² and a Markovish inequality.

Low level Fourier concentration ⇒ efficient uniform distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
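The second step can be sketched in a few lines: estimate every Fourier coefficient of degree at most d from labelled examples, then predict with the sign of the truncated expansion. This is a minimal illustration of the low-degree idea, not the tuned algorithm from [Linial-Mansour-Nisan-93]; for reproducibility it is run on the full truth table of the 3-bit majority rather than on random uniform examples, and all names are illustrative.

```python
from itertools import combinations, product
from math import prod


def chi(S, x):
    """Parity character chi_S(x): the product of x_i over i in S."""
    return prod(x[i] for i in S)


def low_degree_hypothesis(examples, n, d):
    """Estimate the Fourier coefficients of degree <= d; return (hypothesis, coefficients)."""
    subsets = [S for r in range(d + 1) for S in combinations(range(n), r)]
    m = len(examples)
    # Empirical estimate of f^(S) = E[f(x) chi_S(x)].
    coeff = {S: sum(y * chi(S, x) for x, y in examples) / m for S in subsets}

    def hypothesis(x):
        return 1 if sum(c * chi(S, x) for S, c in coeff.items()) > 0 else -1

    return hypothesis, coeff


def maj3(x):
    return 1 if sum(x) > 0 else -1


truth_table = [(x, maj3(x)) for x in product((1, -1), repeat=3)]
hyp, coeff = low_degree_hypothesis(truth_table, n=3, d=1)
```

Here the degree-1 part of the majority already determines its sign everywhere, so the hypothesis makes no errors; in general the error is controlled by the Fourier weight above degree d.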

Noise sensitivity of halfspaces

Function                                 NS_ε              proof
one boolean halfspace                    O(√ε)             Y. Peres, ’98
any function of k halfspaces             O(k√ε)            union bound
read-once intersection of k halfspaces   O(√ε log k)       difficult probabilistic analysis
read-once majority of k halfspaces       Õ((ε log k)^{¼})  difficult probabilistic analysis

Consequences

Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NS_ε(f) ≤ α(ε).

α^{−1}(ε/3) = Θ(ε²/k²).

Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^{O(k²/ε²)}.
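Spelled out, with c standing for the unspecified constant hidden in the O(k√ε) bound (c is notation introduced here, not from the talk), the inversion goes as follows:

```latex
\alpha(\varepsilon) = c\,k\sqrt{\varepsilon}
\;\Longrightarrow\;
\alpha^{-1}(\delta) = \frac{\delta^{2}}{c^{2}k^{2}},
\qquad
\alpha^{-1}(\varepsilon/3) = \frac{\varepsilon^{2}}{9\,c^{2}k^{2}},
\qquad
n^{O(1)/\alpha^{-1}(\varepsilon/3)} = n^{O(k^{2}/\varepsilon^{2})}.
```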

Noise sensitivity of a halfspace

We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε).

Suppose the halfspace is f = sgn(∑ w_i x_i − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the x_i’s are selected randomly from {+1,−1} and the sum is formed; then each x_i is flipped independently with probability ε. We want to show that the probability that the two sums land on opposite sides of 0 (call this a “flop”, with probability P) is O(√ε).

Noise sensitivity of a halfspace

With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.)

We now model the problem thus: Pick signs x_i at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X_1 = ∑_{i=1…k} w_i x_i, X_2 = ∑_{i=k+1…2k} w_i x_i, etc.

Noise sensitivity of a halfspace

Write S = X_1 + … + X_{n/k} for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X_1, so the sum before flipping is S'+X_1, and the sum after flipping is S'−X_1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X_1|.

Noise sensitivity of a halfspace

sgn(X_1) and S' are independent, so:

Pr[sgn(X_1) ≠ sgn(S')] = ½.

sgn(X_1) and |X_1| are independent, so:

Pr[sgn(X_1) ≠ sgn(S') | |S'| > |X_1|] = ½

⇒ Pr[sgn(X_1) ≠ sgn(S) | |S'| > |X_1|] = ½   (when |S'| > |X_1|, sgn(S) = sgn(S'))

⇒ Pr[sgn(X_1) ≠ sgn(S) & no flop] = ½(1−P)

⇒ Pr[sgn(X_1) ≠ sgn(S)] = ½(1−P)   (on a flop, |X_1| > |S'|, so sgn(S) = sgn(X_1))

⇒ P = 2 E[½ − I[sgn(X_1) ≠ sgn(S)]].
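The identity Pr[sgn(X_1) ≠ sgn(S)] = ½(1−P) can be confirmed exactly on a small instance by enumerating every choice of first block and every sign pattern. The weights below (powers of 2, so that no signed sum is ever 0 and |S'| never equals |X_1|) and the block size are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def sgn(t):
    return 1 if t > 0 else -1


def flop_and_disagreement(weights, k):
    """Exact P = Pr[|S'| < |X_1|] and Pr[sgn(X_1) != sgn(S)].

    The randomness is a uniformly random block of k weights (the effect of the
    random permutation) together with uniformly random signs.
    """
    n = len(weights)
    flops = disagreements = cases = 0
    for block in combinations(range(n), k):      # which weights land in block 1
        for x in product((1, -1), repeat=n):     # the random signs
            X1 = sum(weights[i] * x[i] for i in block)
            S = sum(w * xi for w, xi in zip(weights, x))
            S_rest = S - X1                      # S' on the slide
            flops += abs(S_rest) < abs(X1)
            disagreements += sgn(X1) != sgn(S)
            cases += 1
    return Fraction(flops, cases), Fraction(disagreements, cases)


P, D = flop_and_disagreement([1, 2, 4, 8, 16, 32], k=2)
print(P, D)
```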

Noise sensitivity of a halfspace

Of course, there was nothing special about block X_1 as opposed to any other block. So in fact,

P = 2 E[½ − I[sgn(X_i) ≠ sgn(S)]]

for all i = 1…n/k. Write τ = sgn(S), σ_i = sgn(X_i), and average over i:

P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]].

Noise sensitivity of a halfspace

P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]]

The quantity inside the expectation is a random variable, a number which is either ½ − (k/n) ∑_i I[1 ≠ σ_i] or ½ − (k/n) ∑_i I[−1 ≠ σ_i].

If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise:

P ≤ 2 E[|½ − (k/n) ∑_i I[σ_i=1]| + |½ − (k/n) ∑_i I[σ_i=−1]|].

Noise sensitivity of a halfspace

P ≤ 2 E[ |½ − ε ∑_{i=1…1/ε} I[σ_i=1]| + |½ − ε ∑_{i=1…1/ε} I[σ_i=−1]| ]   (using k/n = ε)

But the σ_i’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable, i.e., O(√ε).
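The last claim can be checked numerically: for m = 1/ε fair coin flips, the expected absolute deviation of the sample mean from ½ is at most its standard deviation, which is ½√ε. The sketch below computes the expectation from the binomial distribution; the sample sizes are arbitrary.

```python
from math import comb, sqrt


def mean_abs_deviation(m):
    """E|1/2 - (1/m) * Binomial(m, 1/2)|, summed over the binomial pmf."""
    return sum(comb(m, j) * abs(0.5 - j / m) for j in range(m + 1)) / 2 ** m


sizes = (4, 16, 64, 256)
for m in sizes:
    eps = 1 / m
    # Left: the expected absolute deviation; right: the bound (1/2) * sqrt(eps).
    print(m, mean_abs_deviation(m), 0.5 * sqrt(eps))
```

Plugging the bound into the inequality above gives P ≤ 2(½√ε + ½√ε) = 2√ε in this idealized model.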

Extensions

This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows.

To get the extended learning algorithms, one must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased (say, the probability of + is p < ½), then: NS_ε(h) ≤ min{2p, C p (ε log(1/p))^{½}}.

Talk outline

Plan for the talk:

1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.

Open technical challenges

• Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!)

• Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^{½})?

The huge open problem

It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!
