Learning intersections and thresholds of halfspaces
Adam Klivans (MIT/Harvard)
Ryan O’Donnell (MIT)
Rocco Servedio (Harvard)
Learning
We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means:
- a function f in C is selected, and also a probability distribution D over {+1,−1}^n
- the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D
- goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.
Learning example
Example: C is the class of all conjunctions of variables. Perhaps the concept selected is:
x_1 AND x_2 AND x_4.
One might see examples:
< (+ + − + − +), + >
< (− + − + − −), − >
< (+ + + − − +), − >
What is a learning algorithm for this class?
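One standard answer to the question above is the elimination algorithm: start with the conjunction of all variables and drop every variable that is false in some positive example. The slides do not name an algorithm, so the sketch below is an assumption about the intended answer; the function names are illustrative, and +1 is taken to mean "true".

```python
def learn_conjunction(examples, n):
    """Elimination algorithm for conjunctions of variables over {+1,-1}^n.

    examples: list of (x, label) pairs, x a tuple of +1/-1, label +1/-1.
    Returns the sorted 1-based indices of the variables kept in the hypothesis.
    """
    kept = set(range(1, n + 1))  # start with x1 AND x2 AND ... AND xn
    for x, label in examples:
        if label == +1:
            # A variable that is false in a positive example cannot be in the target.
            kept -= {i for i in kept if x[i - 1] == -1}
    return sorted(kept)


def predict(kept, x):
    """Evaluate the hypothesis conjunction on x."""
    return +1 if all(x[i - 1] == +1 for i in kept) else -1


# The three examples from the slide (target: x1 AND x2 AND x4).
examples = [
    ((+1, +1, -1, +1, -1, +1), +1),
    ((-1, +1, -1, +1, -1, -1), -1),
    ((+1, +1, +1, -1, -1, +1), -1),
]
hypothesis = learn_conjunction(examples, n=6)
```

With only one positive example the hypothesis still contains x6; further positive examples would remove any variable that is not in the target.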
Halfspaces
Let h be a hyperplane in R^n:
h = {x : ∑_{i=1}^n w_i x_i = θ}.
h naturally induces a boolean function f : {+1,−1}^n → {+1,−1},
f(x) = sgn(∑_{i=1}^n w_i x_i − θ).
We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (w_i ≡ 1, θ = 0).
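As a small illustration of the definition, a boolean halfspace can be written as a closure over its weights and threshold. The helper name `halfspace` and the convention sgn(0) = +1 are assumptions made for this sketch; the slides do not fix a tie-breaking rule.

```python
def halfspace(weights, theta):
    """Return f(x) = sgn(sum_i w_i x_i - theta) as a function on {+1,-1}^n.

    Assumption: sgn(0) is taken to be +1.
    """
    def f(x):
        s = sum(w * xi for w, xi in zip(weights, x)) - theta
        return +1 if s >= 0 else -1
    return f


# Majority on 5 bits: all weights 1, threshold 0 (no ties occur for odd n).
majority5 = halfspace([1, 1, 1, 1, 1], 0)
```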
Learning halfspaces
Learning halfspaces is a very old problem; dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62].
The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89].
Indeed, this works over any distribution on R^n, including those singling out {+1,−1}^n.
Learning halfspaces
Basic idea: given a bunch of examples, find a halfspace which classifies them correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm.
Consider the coefficients of a hypothesis halfspace to be unknowns, a_1, …, a_n, θ.
Each example induces some linear constraints: e.g., < (+ + − + − −), + > induces a_1+a_2−a_3+a_4−a_5−a_6 > θ. Solve LP.
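The linear program described above can be sketched as a feasibility problem over m labeled examples (x^{(j)}, y^{(j)}). Replacing the strict inequality by a margin of 1 is an assumption made here for convenience (a finite sample that is strictly separated can be rescaled to satisfy it); the slides state only the strict form.

```latex
\text{find } a_1,\dots,a_n,\theta \in \mathbb{R} \quad \text{such that} \quad
y^{(j)} \Bigl( \sum_{i=1}^{n} a_i x^{(j)}_i - \theta \Bigr) \ge 1
\qquad \text{for } j = 1,\dots,m .
```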
Learning intersections of halfspaces
The next logical extension of this, and a very important one, is learning intersections of halfspaces.
Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas…
Learning them is also an important problem for computer vision, study of perceptrons.
But very little is known.
Prior work
- [Baum91]: poly time algorithm for intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).
- [BlumKannan,Vempala97] learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere:
  - not relevant for boolean halfspaces
- [KwekPitt98] gave a polynomial time alg., but requires membership queries
  - also not relevant for boolean halfspaces
Our results
Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}^n is learnable under the uniform distribution to accuracy 1−ε in time:
n^{O(k²/ε²)}.
This is polynomial time if k = O(1), ε = Ω(1).
(Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)
Our results
Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time:
n^{O(k log k log W)}/ε.
So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.
More results
Function | Halfspaces | Distrib. | Time
any fcn. of k | weight W | any | n^{O(k² log k log W)}/ε
weight k threshold (e.g., inters. of k) | weight W | any | n^{O(k log k log W)}/ε
intersection of k | weight W | any | n^{O(√W log k)}/ε
read-once intersection of k | arbitrary | uniform | n^{O((log(k)/ε)²)}
read-once majority of k | arbitrary | uniform | n^{Õ((log(k)/ε)^4)}
Sketch of techniques
For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations.
For uniform distribution results: show that functions of halfspaces have low noise sensitivity.
Both conclusions imply learning results generically.
Talk outline
Plan for the rest of the talk:
1. Prove n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch other arbit. dist. results.)
2. Prove n^{O(k²/ε²)} bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch other unif. dist. results.)
Polynomial threshold functions
A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f:
f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.
- every boolean halfspace is a degree 1 PTF for itself
- every boolean function has a degree n PTF
By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^{O(d)}/ε.
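One way to read the n^{O(d)} bound above: a degree-d PTF is a halfspace over the multilinear monomials of degree at most d, and there are at most n^{O(d)} of those, so the halfspace LP can be run in the expanded feature space. The sketch below only performs the expansion; that this is how [KS01] proceeds is an assumption, and the function names are illustrative.

```python
from itertools import combinations
from math import comb, prod


def monomial_features(x, d):
    """All multilinear monomials of degree <= d evaluated at x in {+1,-1}^n.

    Ordered by degree; the empty monomial (value 1) comes first.
    """
    n = len(x)
    return [prod(x[i] for i in S)
            for size in range(d + 1)
            for S in combinations(range(n), size)]


def num_monomials(n, d):
    """Number of multilinear monomials of degree <= d in n variables."""
    return sum(comb(n, i) for i in range(d + 1))


x = (+1, +1, -1, +1, -1, -1)
features = monomial_features(x, d=2)
```

A linear threshold over `features` is then exactly a degree-2 polynomial threshold function of x.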
PTFs for intersections of halfspaces
Suppose f and g are hyperplanes,
f(x) = ∑ w_i x_i − θ, g(x) = ∑ w_i' x_i − θ'.
We would like a PTF for sgn(f) ∧ sgn(g).
Failed attempt 1:
- try f(x)g(x): is >0 if f(x)>0 and g(x)>0, but is also >0 if f(x)<0 and g(x)<0
Failed attempt 2:
- try f(x)+g(x): is >0 if f(x)>0 and g(x)>0, is <0 if f(x)<0 and g(x)<0, but is ?? if f(x)>0 and g(x)<0
PTFs for intersections of halfspaces
The solution: apply a (polynomial?) function to f and g to make them look more like their sign.
Assume ∑|w_i| < W. Then for all x ∈ {+1,−1}^n,
f(x), g(x) ∈ [−W,−1] ∪ [1,W].
Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].
BRS’s sgn-approximator
p(x) = (x−1)(x−2)^2(x−4)^2(x−8)^2(x−16)^2(x−32)^2
q(x) = (p(−x) − p(x)) / (p(−x) + p(x))
Q is a rational function of degree O(log k log W) such that:
Q(x) ∈ [1, 1+1/k] for x ∈ [1,W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W,−1].
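The base approximator q can be checked numerically. The sketch below evaluates p and q exactly with rational arithmetic for the W = 32 instance written on the slide; it covers only q, not the sharper function Q (how Q is obtained from q is not shown in the slides and is not attempted here). The parameter name `ell` is an assumption for the number of squared factors.

```python
from fractions import Fraction


def p(x, ell=5):
    """p(x) = (x - 1) * prod_{i=1..ell} (x - 2^i)^2; ell = 5 matches the slide."""
    value = Fraction(x) - 1
    for i in range(1, ell + 1):
        value *= (Fraction(x) - 2 ** i) ** 2
    return value


def q(x, ell=5):
    """q(x) = (p(-x) - p(x)) / (p(-x) + p(x)), evaluated exactly."""
    return (p(-x, ell) - p(x, ell)) / (p(-x, ell) + p(x, ell))


# Values of q on the integer points of [1, 32].
values = {x: q(x) for x in range(1, 33)}
```

On these points q stays between 1 and 5/4, and q is odd, so the mirrored values on [−32, −1] lie between −5/4 and −1.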
PTFs for intersections of halfspaces
Now given weight W halfspaces h_1, …, h_k, the expression
Q(h_1(x)) + … + Q(h_k(x)) − (k−½)
is a rational function which sign-represents the intersection. Once taken to a common denominator, it has degree O(k log k log W).
Easy to get a polynomial: sgn(p/q) = sgn(pq).
So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time n^{O(k log k log W)}.
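The choice of threshold k−½ can be checked from the two guarantees on Q (values in [1, 1+1/k] on positive inputs, at most −1 on negative inputs). A short worked inequality, using only those guarantees:

```latex
\begin{align*}
\text{all } h_i(x) > 0: &\quad \sum_{i=1}^{k} Q(h_i(x)) \;\ge\; k \;>\; k - \tfrac12, \\
\text{some } h_j(x) < 0: &\quad \sum_{i=1}^{k} Q(h_i(x)) \;\le\; (k-1)\Bigl(1 + \tfrac1k\Bigr) - 1
  \;=\; k - 1 - \tfrac1k \;<\; k - \tfrac12 .
\end{align*}
```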
Talk outline
Plan for the talk:
1. Prove n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.
Noise sensitivity
Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε.
defn: The noise sensitivity of f is:
NS_ε(f) = Pr[f(x) ≠ f(y)].
Noise sensitivity examples
• Let f be a projection to one bit, f(x_1, …, x_n) = x_1. Then NS_ε(f) = ε.
• Suppose f depends on only k bits. Then NS_ε(f) ≤ kε.
• PARITY is the most noise-sensitive function: NS_ε(PARITY_n) = ½ − ½(1−2ε)^n.
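For small n the definition can be evaluated exactly by enumerating all pairs (x, y), which gives a quick check of the examples above. This is a brute-force sketch with illustrative function names, not something taken from the talk.

```python
from itertools import product


def noise_sensitivity(f, n, eps):
    """Exact NS_eps(f) for f on {+1,-1}^n by enumerating all pairs (x, y)."""
    points = list(product((+1, -1), repeat=n))
    total = 0.0
    for x in points:
        for y in points:
            if f(x) != f(y):
                flips = sum(a != b for a, b in zip(x, y))
                # Probability that an eps-corruption of x equals y.
                total += eps ** flips * (1 - eps) ** (n - flips)
    return total / len(points)


def dictator(x):
    """Projection to the first bit."""
    return x[0]


def parity(x):
    """Product of all bits."""
    result = 1
    for xi in x:
        result *= xi
    return result


n, eps = 4, 0.1
ns_dictator = noise_sensitivity(dictator, n, eps)
ns_parity = noise_sensitivity(parity, n, eps)
```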
Noise sensitivity – study and apps.
• [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity
• [Kahn-Kalai-Linial-88] – random walks on the hypercube
• [Håstad-97] – probabilistically checkable proofs
• [Bshouty-Jackson-Tamon-99] – learning theory under noise
• [O-02] – Yao’s XOR Lemma, average case hardness of NP
• [Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.
Low noise sens. ⇒ fast learning
We show that if the noise sensitivity of all f in C is uniformly bounded:
NS_ε(f) ≤ α(ε),
then C is learnable under the uniform distribution in time:
n^{O(1)/α^{−1}(ε/3)}.
Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.
Proof of NS-learning connection
Actually, the intuition is wrong. Here is the proper proof sketch:
Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NS_ε(f) = ½ − ½ Σ_S (1−2ε)^{|S|} f̂(S)² and a Markovish inequality.
Low level Fourier concentration ⇒ efficient uniform distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
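The Fourier formula for noise sensitivity can be sanity-checked on a tiny function by computing both sides exactly. The sketch below does this for majority on 3 bits; the choice of function and all helper names are assumptions made for illustration.

```python
from itertools import combinations, product
from math import prod


def majority3(x):
    """Majority of three +1/-1 bits."""
    return 1 if sum(x) > 0 else -1


def fourier_coefficients(f, n):
    """Return {S: f_hat(S)} with f_hat(S) = E_x[f(x) * prod_{i in S} x_i]."""
    points = list(product((+1, -1), repeat=n))
    return {
        S: sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)
        for size in range(n + 1)
        for S in combinations(range(n), size)
    }


def ns_from_spectrum(coeffs, eps):
    """Right-hand side: 1/2 - 1/2 * sum_S (1 - 2 eps)^|S| * f_hat(S)^2."""
    return 0.5 - 0.5 * sum((1 - 2 * eps) ** len(S) * c * c for S, c in coeffs.items())


def ns_by_enumeration(f, n, eps):
    """Left-hand side: Pr[f(x) != f(y)] computed over all pairs (x, y)."""
    points = list(product((+1, -1), repeat=n))
    total = 0.0
    for x in points:
        for y in points:
            if f(x) != f(y):
                flips = sum(a != b for a, b in zip(x, y))
                total += eps ** flips * (1 - eps) ** (n - flips)
    return total / len(points)


eps = 0.1
coeffs = fourier_coefficients(majority3, 3)
lhs = ns_by_enumeration(majority3, 3, eps)
rhs = ns_from_spectrum(coeffs, eps)
```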
Noise sensitivity of halfspaces
Function | NS_ε | proof
one boolean halfspace | O(√ε) | Y. Peres, ’98
any function of k halfspaces | O(k√ε) | union bound
read-once intersection of k halfspaces | O(√ε log k) | difficult probabilistic analysis
read-once majority of k halfspaces | Õ((ε log k)^{¼}) | difficult probabilistic analysis
Consequences
Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NS_ε(f) ≤ α(ε).
α^{−1}(ε/3) = Θ(ε²/k²).
Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^{O(k²/ε²)}.
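Spelling out the inversion, with c standing for the constant hidden in the O(k√ε) bound (the symbols c and δ are introduced here for illustration only):

```latex
\alpha(\delta) = c\,k\sqrt{\delta} = \frac{\varepsilon}{3}
\;\Longrightarrow\;
\delta = \alpha^{-1}\!\Bigl(\frac{\varepsilon}{3}\Bigr) = \frac{\varepsilon^{2}}{9c^{2}k^{2}},
\qquad
n^{O(1)/\alpha^{-1}(\varepsilon/3)} = n^{O(9c^{2}k^{2}/\varepsilon^{2})} = n^{O(k^{2}/\varepsilon^{2})}.
```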
Noise sensitivity of a halfspace
We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε).
Suppose the halfspace is f = sgn(∑ w_i x_i − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the x_i’s are selected randomly from {+1,−1} and the sum is formed; then each x_i is flipped indep. with prob. ε. We want to show that the prob. the sums land on opposite sides of 0 – call this a “flop”, prob. P – is O(√ε).
Noise sensitivity of a halfspace
With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.)
We now model the problem thus: Pick signs x_i at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X_1 = ∑_{i=1…k} w_i x_i, X_2 = ∑_{i=k+1…2k} w_i x_i, etc.
Noise sensitivity of a halfspace
Write S = X_1 + … + X_{n/k} for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X_1, so the sum before flipping is S'+X_1, and the sum after flipping is S'−X_1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X_1|.
Noise sensitivity of a halfspace
sgn(X_1) and S' are independent, so:
Pr[sgn(X_1) ≠ sgn(S')] = ½.
sgn(X_1) and |X_1| are independent, so:
Pr[sgn(X_1) ≠ sgn(S') | |S'| > |X_1|] = ½
Pr[sgn(X_1) ≠ sgn(S) | |S'| > |X_1|] = ½ (when |S'| > |X_1|, sgn(S) = sgn(S'))
Pr[sgn(X_1) ≠ sgn(S) & no flop] = ½(1−P)
Pr[sgn(X_1) ≠ sgn(S)] = ½(1−P) (in a flop, |X_1| > |S'|, so sgn(S) = sgn(X_1))
P = 2 E[½ – I[sgn(X_1) ≠ sgn(S)]].
Noise sensitivity of a halfspace
Of course, there was nothing special about block X_1 as opposed to any other block. So in fact,
P = 2 E[½ – I[sgn(X_i) ≠ sgn(S)]]
for all i = 1…n/k. Write τ = sgn(S), σ_i = sgn(X_i), and average:
P = 2 E[½ – (k/n) ∑_i I[τ ≠ σ_i]].
Noise sensitivity of a halfspace
P = 2 E[½ – (k/n) ∑_i I[τ ≠ σ_i]]
The quantity inside the expectation is some random variable, a number which is either ½ – (k/n) ∑_i I[1 ≠ σ_i] or ½ – (k/n) ∑_i I[−1 ≠ σ_i].
If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise:
P ≤ 2 E[|½ – (k/n) ∑_i I[σ_i=1]| + |½ – (k/n) ∑_i I[σ_i=−1]|].
Noise sensitivity of a halfspace
P ≤ 2 E[ |½ – ε ∑_{i=1…1/ε} I[σ_i=1]| + |½ – ε ∑_{i=1…1/ε} I[σ_i=−1]| ]
But the σ_i’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable – i.e., O(√ε).
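The last step can be checked numerically: with m = 1/ε fair 0/1 samples, the expected absolute deviation of the sample mean from ½ is at most the standard deviation 1/(2√m) = ½√ε. The sketch below computes the expectation exactly from the binomial distribution; the value m = 100 is an arbitrary choice for illustration.

```python
from math import comb, sqrt


def expected_abs_deviation(m):
    """E| 1/2 - (1/m) * Binomial(m, 1/2) |, computed exactly from the pmf."""
    return sum(comb(m, j) * abs(0.5 - j / m) for j in range(m + 1)) / 2 ** m


m = 100          # number of blocks, i.e. 1/eps
eps = 1 / m
deviation = expected_abs_deviation(m)
bound = 0.5 * sqrt(eps)   # standard deviation of the sample mean
```

Under the simplifying assumptions made on the slides, plugging this bound into both terms of the inequality would give P ≤ 2√ε.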
Extensions
This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows.
To get the extended learning algorithms, must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased – say, the probability of + is p < ½ – then: NS_ε(h) ≤ min{2p, C p (ε log(1/p))^{½}}.
Talk outline
Plan for the talk:
1. Prove n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.
Open technical challenges
• Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!)
• Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^{½})?
The huge open problem
It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!