Efficient Learning in High Dimensions with
Trees and Mixtures
Marina Meila, Carnegie Mellon University
· Multidimensional (noisy) data
· Learning tasks - intelligent data analysis
  · categorization (clustering)
  · classification
  · novelty detection
  · probabilistic reasoning
· Data is changing and growing; tasks change
=> need to make learning automatic, efficient
Combining probability and algorithms
· Multidimensional data -> learning
· Automatic: probability and statistics
· Efficient: algorithms
· This talk: the tree statistical model
Talk overview
· Introduction: statistical models
· The tree model
· Mixtures of trees: learning, experiments
· Accelerated learning
· Bayesian learning
· Perspective: generative models and decision tasks
A multivariate domain
· Data: records (Patient 1, Patient 2, . . .) over variables such as Smoker, Cough, X ray, Bronchitis, Lung cancer
· Queries
  · Diagnose a new patient
  · Is smoking related to lung cancer?
  · Understand the “laws” of the domain
[Figure: candidate dependency structures relating Smoker, Cough, X ray, Bronchitis, Lung cancer]
Statistical model
Probabilistic approach
· Smoker, Bronchitis, . . . are (discrete) random variables
· A statistical model (joint distribution)
  P( Smoker, Bronchitis, Lung cancer, Cough, X ray )
  summarizes knowledge about the domain
· Queries:
  · inference, e.g. P( Lung cancer = true | Smoker = true, Cough = false )
  · structure of the model: discovering relationships, categorization
Probability table representation
· Joint probability table over three binary variables v1, v2, v3:

    v1 v2:   00    01    11    10
    v3 = 0   .01   .14   .22   .01
    v3 = 1   .23   .03   .33   .03

· Query (computed in the sketch below):
  P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1) = (.14 + .03) / (.14 + .03 + .22 + .33) ≈ .23
· Curse of dimensionality: for binary variables v1, v2, . . . vn, the table P(V1, V2, . . . Vn) has 2^n entries!
· How to represent? How to query? How to learn from data? Structure?
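A minimal Python/NumPy sketch of how such a query is answered from the full table (the array layout is illustrative; the values are the ones above):

    import numpy as np

    # Joint table P[v1, v2, v3] for the three binary variables above.
    P = np.zeros((2, 2, 2))
    P[0, 0, 0], P[0, 1, 0], P[1, 1, 0], P[1, 0, 0] = .01, .14, .22, .01
    P[0, 0, 1], P[0, 1, 1], P[1, 1, 1], P[1, 0, 1] = .23, .03, .33, .03

    # P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1): sum out v3, then also v1.
    num = P[0, 1, :].sum()     # .14 + .03 = .17
    den = P[:, 1, :].sum()     # .14 + .03 + .22 + .33 = .72
    print(num / den)           # ~0.236, the .23 above

For n binary variables the array would need 2**n entries, which is exactly the curse of dimensionality noted above.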
Graphical models
· Structure: vertices = variables, edges = “direct dependencies”
· Parametrization: by local probability tables
· Properties
  · compact parametric representation
  · efficient computation
  · learning parameters by a simple formula
  · learning structure is NP-hard
[Figure: example graphical model for galaxy data, with variables Galaxy type, distance, size, dust, spectrum, Z (red-shift), observed size, observed spectrum, photometric measurement]
The tree statistical model
· Structure: a tree (graph with no cycles) over the variables
· Parameters: probability tables associated to the edges
[Figure: example tree over vertices 1-5, with tables such as T3, T34, T4|3 attached to its edges]
· T(x) factors over the tree edges; two equivalent parametrizations (see the evaluation sketch below):

    T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v − 1)

    T(x) = T_r(x_r) ∏_{uv∈E} T_v|u(x_v | x_u)    (directed form, after rooting the tree at r)
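A minimal sketch of evaluating the undirected factorization (function and argument names are illustrative, not from the talk):

    import numpy as np

    def tree_log_prob(x, edges, pair_marginals, node_marginals):
        # log T(x) = sum_{uv in E} log T_uv(x_u, x_v) - sum_v (deg v - 1) log T_v(x_v)
        deg = {v: 0 for v in node_marginals}
        logp = 0.0
        for (u, v) in edges:
            logp += np.log(pair_marginals[(u, v)][x[u], x[v]])
            deg[u] += 1
            deg[v] += 1
        for v, Tv in node_marginals.items():
            logp += (1 - deg[v]) * np.log(Tv[x[v]])
        return logp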
Examples
· Splice junction domain (variables: junction type and sequence positions −7 . . . −1, +1 . . . +8)
· Premature babies’ Broncho-Pulmonary Disease (BPD) domain (variables include Weight, Temperature, Gestation, Thrombocyt., Neutropenia, Suspect Lipid, Acidosis, HyperNa, PulmHemorrh, Coag, Hypertension, BPD)
[Figure: tree structures learned for the two domains]
Trees - basic operations
· T is a simple model:  T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v − 1),  |V| = n
· Querying the model
  · computing the likelihood T(x)                      ~ n
  · conditioning T_{V−A|A} (junction tree algorithm)   ~ n
  · marginalization T_uv for arbitrary u, v            ~ n
  · sampling                                           ~ n
· Estimating the model
  · fitting to a given distribution                    ~ n²
  · learning from data                                 ~ n² N_data
The mixture of trees
· h = “hidden” variable,  P( h = k ) = λ_k,  k = 1, 2, . . . m

    Q(x) = Σ_{k=1..m} λ_k T^k(x)

· NOT a graphical model
· computational efficiency preserved
(Meila 97)
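A sketch of evaluating the mixture likelihood, assuming each component T^k is available as a log-probability function such as tree_log_prob above and the λ_k are the mixing weights (names illustrative):

    import numpy as np

    def mixture_log_prob(x, lambdas, tree_log_probs):
        # log Q(x) = log sum_k lambda_k T^k(x), computed stably in log space
        logs = np.array([np.log(lk) + f(x) for lk, f in zip(lambdas, tree_log_probs)])
        m = logs.max()
        return m + np.log(np.exp(logs - m).sum())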
Learning - problem formulation
· Maximum likelihood learning
  · given a data set D = { x1, . . . xN }
  · find the model that best predicts the data:  T^opt = argmax_T T(D)
· Fitting a tree to a distribution
  · given a data set D = { x1, . . . xN } and a distribution P that weights each data point
  · find  T^opt = argmin_T KL( P || T )
· KL is the Kullback-Leibler divergence; this includes maximum likelihood learning as a special case
Fitting a tree to a distribution
T^opt = argmin_T KL( P || T )
· optimization over structure + parameters
· sufficient statistics
  · probability tables  P_uv = N_uv / N,  u, v ∈ V
  · mutual informations  I_uv = Σ_{x_u, x_v} P_uv log [ P_uv / (P_u P_v) ]
(Chow & Liu 68)
Fitting a tree to a distribution - solution
· Structure:  E^opt = argmax_E Σ_{uv∈E} I_uv
  · found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv
· Parameters: copy the marginals of P,  T_uv = P_uv for uv ∈ E
[Figure: maximum weight spanning tree chosen among edges weighted by I_12, I_23, I_34, I_45, I_56, I_61, I_63]
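A sketch of the Chow & Liu procedure (not the talk’s implementation): empirical pairwise tables, mutual informations, then a maximum weight spanning tree; the optional weights argument covers fitting to a weighted distribution P, as needed later in the EM M step.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    def chow_liu(data, weights=None, n_values=2):
        # data: N x n array of values in {0, ..., n_values-1}; weights define P.
        N, n = data.shape
        w = np.full(N, 1.0 / N) if weights is None else weights / weights.sum()
        marg = [np.bincount(data[:, v], weights=w, minlength=n_values) for v in range(n)]
        I = np.zeros((n, n))
        for u in range(n):
            for v in range(u + 1, n):
                Puv = np.zeros((n_values, n_values))
                np.add.at(Puv, (data[:, u], data[:, v]), w)   # weighted pairwise counts
                with np.errstate(divide='ignore', invalid='ignore'):
                    terms = np.where(Puv > 0,
                                     Puv * np.log(Puv / np.outer(marg[u], marg[v])),
                                     0.0)
                I[u, v] = I[v, u] = terms.sum()
        # Maximum weight spanning tree = minimum spanning tree on C - I
        # (all entries positive, because csgraph treats zeros as missing edges).
        W = I.max() + 1.0 - I
        np.fill_diagonal(W, 0.0)
        mst = minimum_spanning_tree(W)
        edges = [(int(u), int(v)) for u, v in zip(*mst.nonzero())]
        return edges, I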
Learning mixtures by the EM algorithm
· E step: which x_i come from T^k?  -> distribution P^k(x)
· M step: fit T^k to this set of points,  min KL( P^k || T^k )
· Initialize randomly; converges to a local maximum of the likelihood
(Meila & Jordan ‘97)
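A sketch of one EM iteration under these assumptions: tree_log_probs[k] returns log T^k(x) (e.g. tree_log_prob above) and fit_tree fits a tree to weighted data (e.g. chow_liu with its weights argument); all names are illustrative.

    import numpy as np

    def em_step(data, lambdas, tree_log_probs, fit_tree):
        N, m = len(data), len(lambdas)
        # E step: responsibilities gamma[i, k] = P(h = k | x_i), i.e. "which x_i come from T^k"
        log_r = np.array([[np.log(lambdas[k]) + tree_log_probs[k](x) for k in range(m)]
                          for x in data])
        log_r -= np.logaddexp.reduce(log_r, axis=1)[:, None]
        gamma = np.exp(log_r)
        # M step: new mixing weights, and each T^k refit to the data weighted by gamma[:, k]
        # (this minimizes KL(P^k || T^k) for the induced distribution P^k).
        new_lambdas = gamma.mean(axis=0)
        new_trees = [fit_tree(data, gamma[:, k]) for k in range(m)]
        return new_lambdas, new_trees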
Remarks
· Learning a tree
  · solution is globally optimal over structures and parameters
  · tractable: running time ~ n²N
· Learning a mixture by the EM algorithm
  · both E and M steps are exact and tractable
  · running time: E step ~ mnN, M step ~ mn²N
  · assumes m known; converges to a local optimum
Finding structure - the bars problem
[Figure: bars data (n = 25) and the learned structure]
· Structure recovery: 19 out of 20 trials
· Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
· Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95
Experiments - density estimation
· Digits and digit pairs: Ntrain = 6000, Nvalid = 2000, Ntest = 5000
  · digits: n = 64 variables ( m = 16 trees )
  · digit pairs: n = 128 variables ( m = 32 trees )
[Figure: test likelihood of mixtures of trees (“MixTrees”) versus competing models on both tasks]
DNA splice junction classification
· n = 61 variables
· class = Intron/Exon, Exon/Intron, Neither
[Figure: classification accuracy of the Tree model versus TANB and NB, and supervised methods from DELVE]
Discovering structure
IE junction (Intron | Exon)

    position  15  16  ...  25  26  27  28  29  30  31
    Tree      -   CT  CT   CT  -   -   CT  A   G   G
    True      CT  CT  CT   CT  -   -   CT  A   G   G

EI junction (Exon | Intron)

    position  28  29  30  31  32  33  34  35  36
    Tree      CA  A   G   G   T   AG  A   G   -
    True      CA  A   G   G   T   AG  A   G   T

(Watson, “The molecular biology of the gene”, ‘87)
[Figure: learned tree adjacency matrix and the class variable]
Irrelevant variables
· 61 original variables + 60 “noise” variables
[Figure: adjacency matrices for the original data and for the data augmented with irrelevant variables]
Accelerated tree learning
· Running time for the tree learning algorithm ~ n²N
· Quadratic running time may be too slow. Example: document classification
  · document = data point  -->  N = 10³-10⁴
  · word = variable  -->  n = 10³-10⁴
  · sparse data  -->  #words in a document ≤ s, with s << n, N
· Can sparsity be exploited to create faster algorithms?
Meila ‘99
Sparsity
· assume a special value “0” that occurs frequently
· sparsity s: the number of non-zero variables in each data point is ≤ s, with s << n, N
· Idea: “do not represent / count zeros”
[Figure: sparse binary data matrix stored as one linked list of non-zero positions (length ≤ s) per data point]
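A sketch of the counting step under this representation (hypothetical helper, binary variables): each data point is stored as the list of its non-zero variables, so co-occurrence counting costs ~s² per data point instead of ~n².

    from collections import defaultdict

    def sparse_counts(data_lists):
        # data_lists: one list of non-zero variable indices per data point (length <= s).
        Nv = defaultdict(int)     # Nv[v]      = number of points with v != 0
        Nuv = defaultdict(int)    # Nuv[(u,v)] = number of points with u != 0 and v != 0
        for nonzeros in data_lists:
            for i, u in enumerate(nonzeros):
                Nv[u] += 1
                for v in nonzeros[i + 1:]:          # ~s^2 pairs per point, not n^2
                    Nuv[tuple(sorted((u, v)))] += 1
        return Nv, Nuv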
Presort mutual informations
Theorem (Meila ‘99) If v, v’ are variables that do not co-occur with u (i.e. N_uv = N_uv’ = 0), then
    N_v > N_v’  ==>  I_uv > I_uv’
· Consequences (see the sketch below)
  · sort the N_v => all edges uv with N_uv = 0 are implicitly sorted by I_uv
  · these edges need not be represented explicitly
  · construct a “black box” that outputs the next “largest” edge
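A sketch of this consequence, using the hypothetical counts returned by sparse_counts above: the zero-cooccurrence edges at a fixed u can be produced in decreasing I_uv order without computing any mutual information.

    def zero_cooccurrence_edges(u, Nv, Nuv):
        # By the theorem, for fixed u the edges uv with N_uv = 0 are ordered by I_uv
        # exactly as the candidates v are ordered by N_v (decreasing).
        for v in sorted(Nv, key=lambda w: -Nv[w]):
            if v != u and tuple(sorted((u, v))) not in Nuv:
                yield (u, v)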
The black box data structure
· for each variable v: a list of the u with N_uv > 0, sorted by I_uv, and a (virtual) list of the u with N_uv = 0, sorted by N_v
· an F-heap of size ~n over the variables v_1, v_2, . . . v_n outputs the next edge uv
· Total running time: n log n + s²N + nK log n
  (standard algorithm running time: n²N)
Remarks
· Realistic assumption
· Exact algorithm, with provably efficient time bounds
· Degrades gracefully to the standard algorithm if the data are not sparse
· General: handles non-integer counts and multi-valued discrete variables
Bayesian learning of trees (Meila & Jaakkola ‘00)
· Problem
  · given a prior distribution over trees P0(T) and data D = { x1, . . . xN }
  · find the posterior distribution P(T|D)
· Advantages: incorporates prior knowledge, regularization
· Solution: Bayes’ formula
    P(T|D) = (1/Z) P0(T) ∏_{i=1..N} T(x_i)
· practically hard:
  · a distribution over structures E and parameters θ_E is hard to represent
  · computing Z is intractable in general
  · exception: conjugate priors
Decomposable priors
· want priors that factor over the tree edges:  P0(T) = ∏_{uv∈E} f( θ_u, θ_v, θ_u|v )
· prior for the structure E:
    P0(E) ∝ ∏_{uv∈E} β_uv
· prior for the tree parameters:
    P0(θ_E) ∝ ∏_{uv∈E} D( θ_u|v ; N’_uv )
  a (hyper-)Dirichlet with hyper-parameters N’_uv(x_u, x_v), u, v ∈ V
· the posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N’_uv(x_u, x_v), u, v ∈ V
Decomposable posterior
· Posterior distribution
    P(T|D) ∝ ∏_{uv∈E} W_uv,   with  W_uv = β_uv D( θ_u|v ; N’_uv + N_uv )
· factored over edges, same form as the prior
· Remains to compute the normalization constant
The Matrix tree theorem
· Matrix tree theorem: if
    P0(E) = (1/Z) ∏_{uv∈E} β_uv,   β_uv ≥ 0,
  then  Z = det M(β),  where  M_vv(β) = Σ_{v’≠v} β_vv’  and  M_uv(β) = −β_uv for u ≠ v
  (the Laplacian of the edge-weighted graph, with one row and column removed)
· Discrete case: classical graph theory; continuous case: Meila & Jaakkola ‘99
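A minimal numerical sketch of the theorem (illustration only, not the talk’s code): Z, the weighted sum over all spanning trees, is the determinant of the reduced Laplacian of the symmetric weight matrix β.

    import numpy as np

    def tree_partition_function(beta):
        # Z = sum over spanning trees T of prod_{uv in T} beta_uv = det M(beta),
        # with M(beta) the graph Laplacian of beta minus one row and one column.
        L = np.diag(beta.sum(axis=1)) - beta      # beta symmetric, zero diagonal
        return np.linalg.det(L[1:, 1:])

    # Example: 3 vertices, all beta_uv = 1  ->  3 spanning trees.
    beta = np.ones((3, 3)) - np.eye(3)
    print(tree_partition_function(beta))          # 3.0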
Remarks on the decomposable prior
· Is a conjugate prior for the tree distribution
· Is tractable
  · defined by ~ n² parameters
  · computed exactly in ~ n³ operations
  · posterior obtained in ~ n²N + n³ operations
  · derivatives w.r.t. parameters, averaging, . . . ~ n³
· Mixtures of trees with decomposable priors: MAP estimation with the EM algorithm is tractable
· Other applications: ensembles of trees, maximum entropy distributions on trees
So far . .
· Trees and mixtures of trees are structured statistical models
· Algorithmic techniques enable efficient learning
  · mixture of trees
  · accelerated algorithm
  · matrix tree theorem & Bayesian learning
· Examples of usage: structure learning, compression, classification
Generative models and discrimination
· Trees are generative models
  · descriptive
  · can perform many tasks, but suboptimally
· Maximum Entropy discrimination (Jaakkola,Meila,Jebara,’99)
  · optimize for specific tasks
  · use generative models
  · combine simple models into ensembles
  · complexity control by an information-theoretic principle
· Discrimination tasks: detecting novelty, diagnosis, classification