
Multivariate Analysis: Past, Present and Future

Harrison B. Prosper, Florida State University

PHYSTAT 2003, 10 September 2003


Outline

Introduction
Historical Note
Current Practice
Issues
Summary


Introduction

Data are invariably multivariate

Particle physics (η, φ, E, f)

Astrophysics (θ, φ, E, t)


Introduction – II: A Textbook Example

Objects
Jet 1 (b)    3
Jet 2        3
Jet 3        3
Jet 4 (b)    3
Positron     3
Neutrino     2
Total       17


Introduction – III

Astrophysics/Particle physics: Similarities
  Events
  Interesting events occur at random
  Poisson processes
  Backgrounds are important
  Experimental response functions
  Huge datasets


Introduction – IV

Differences
  In particle physics we control when events occur and under what conditions
  We have detailed predictions of the relative frequency of various outcomes


Introduction – V: All we do is Count!

Our experiments are ideal Bernoulli trials

At Fermilab, each collision, that is, trial, is conducted the same way every 400 ns

de Finetti’s analysis of exchangeable trials is an accurate model of what we do

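$$P(e_1, \ldots, e_n) = \int_0^1 \mathrm{Binomial}(k, n, \theta)\, f(\theta)\, d\theta$$

$$\mathrm{Binomial}(k, n, p) \to \mathrm{Poisson}(k, np), \qquad f(\theta) = \delta(\theta - p), \quad k \ll n$$

[Figure: sequence of trials along a time axis (Time →)]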
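The step from binomial counting to Poisson counting can be checked numerically. The sketch below is only an illustration: the values of n and p are invented for the example (many trials, tiny success probability), and the function names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n Bernoulli trials with success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k, mean):
    """Probability of observing k counts from a Poisson process with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

# Illustrative numbers (assumed, not from the talk): mean count n * p = 3.
n, p = 1_000_000, 3.0e-6
max_diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(15))
print(f"largest |Binomial - Poisson| for k < 15: {max_diff:.2e}")
```

With many trials and a small success probability per trial, the two distributions agree closely, which is why counting experiments are routinely modeled as Poisson processes.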


Introduction – VI

Typical analysis tasks
  Data Compression
  Clustering and cluster characterization
  Classification/Discrimination
  Estimation
  Model selection/Hypothesis testing
  Optimization


Historical Note

Karl Pearson (1857 – 1936)

P.C. Mahalanobis (1893 – 1972)

R.A. Fisher (1890 – 1962)


Historical Note – Iris Data

Iris Setosa

Iris Versicolor

R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, v. 7, p. 179-188 (1936)


Iris Data

Variables
  X1 Sepal length
  X2 Sepal width
  X3 Petal length
  X4 Petal width

“What linear function of the four measurements will maximize the ratio of the difference between the specific means to the standard deviations within species?” R.A. Fisher


Fisher Linear Discriminant (1936)

Solution:

$$y = (\mu_A - \mu_B)^T \Sigma^{-1} x$$

$$y = x_1 + 5.9037\, x_2 - 7.1299\, x_3 - 10.1036\, x_4$$

Which is the same, within a constant, as

$$\log \frac{\mathrm{Gaussian}(x \mid \mu_A, \Sigma)}{\mathrm{Gaussian}(x \mid \mu_B, \Sigma)} = w \cdot x + b$$
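To make the linear discriminant concrete, here is a minimal sketch in two dimensions that computes a direction proportional to the inverse pooled covariance times the difference of the class means. It does not use the iris measurements: the two Gaussian samples are synthetic, and all function names are hypothetical.

```python
import random

def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, mu):
    """Sum of outer products of the deviations from the class mean (2x2)."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    return ((sxx, sxy), (sxy, syy))

def fisher_direction(class_a, class_b):
    """Return w = C^{-1} (mu_A - mu_B), with C the pooled within-class covariance."""
    mu_a, mu_b = mean2(class_a), mean2(class_b)
    sa, sb = scatter2(class_a, mu_a), scatter2(class_b, mu_b)
    dof = len(class_a) + len(class_b) - 2
    c = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = ((c[1][1] / det, -c[0][1] / det), (-c[1][0] / det, c[0][0] / det))
    d = (mu_a[0] - mu_b[0], mu_a[1] - mu_b[1])
    return (inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1])

# Synthetic classes (assumed): unit-width Gaussians centred at (1, 1) and (-1, -1).
random.seed(1)
class_a = [(random.gauss(1.0, 1.0), random.gauss(1.0, 1.0)) for _ in range(500)]
class_b = [(random.gauss(-1.0, 1.0), random.gauss(-1.0, 1.0)) for _ in range(500)]
w = fisher_direction(class_a, class_b)
y_a = [w[0] * x + w[1] * y for x, y in class_a]
y_b = [w[0] * x + w[1] * y for x, y in class_b]
print("w =", w)
print("mean y (A) =", sum(y_a) / len(y_a), " mean y (B) =", sum(y_b) / len(y_b))
```

Projecting each event onto w reduces the two measurements to a single number y, on which the two classes are well separated.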


Current Practice in Particle Physics

Reducing number of variables
  Principal Component Analysis (PCA)

Discrimination/Classification
  Fisher Linear Discriminant (FLD)
  Random Grid Search (RGS)
  Feedforward Neural Network (FNN)
  Kernel Density Estimation (KDE)


Current Practice – II

Parameter Estimation
  Maximum Likelihood (ML)
  Bayesian (KDE and analytical methods)
  e.g., see talk by Florencia Canelli (12A)

Weighting
  Usually 0, 1, referred to as “cuts”
  Sometimes use the R. Barlow method


Cuts (0, 1 weights)

Points that lie below the cuts are “cut out”

We refer to (x_0, y_0) as a cut-point

[Figure: signal (S) and background (B) events in the (x, y) plane with the cuts x > x_0, y > y_0; weights 0 and 1]


Grid Search

Apply cuts at each grid point (x_i, y_i): x > x_i, y > y_i

Compute some measure of their effectiveness and choose the most effective cuts

Curse of dimensionality: number of cut-points ~ N_bin^N_dim

[Figure: signal (S) and background (B) events in the (x, y) plane with a grid of cut-points]
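A grid search over cut-points can be sketched as follows. Everything specific here is an assumption made for illustration: the synthetic Gaussian samples, the grid spacing, and the figure of merit s/sqrt(s + b), which is one common choice and not necessarily the measure used in the talk.

```python
import random

def count_passing(events, x0, y0):
    """Number of events that survive the cuts x > x0 and y > y0."""
    return sum(1 for x, y in events if x > x0 and y > y0)

def grid_search(signal, background, grid_x, grid_y):
    """Try every grid point as a cut-point; return the best (merit, x0, y0)."""
    best = None
    for x0 in grid_x:
        for y0 in grid_y:
            s = count_passing(signal, x0, y0)
            b = count_passing(background, x0, y0)
            if s + b == 0:
                continue
            merit = s / (s + b) ** 0.5   # assumed figure of merit
            if best is None or merit > best[0]:
                best = (merit, x0, y0)
    return best

random.seed(2)
signal = [(random.gauss(2.0, 1.0), random.gauss(2.0, 1.0)) for _ in range(300)]
background = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(3000)]
grid = [-2.0 + 0.5 * i for i in range(13)]   # 13 values per dimension, 13**2 cut-points
merit, x0, y0 = grid_search(signal, background, grid, grid)
print(f"best cut-point: x0 = {x0}, y0 = {y0}, s/sqrt(s+b) = {merit:.2f}")
```

The nested loops make the cost explicit: each extra variable multiplies the number of cut-points by the number of grid values, which is the curse of dimensionality noted above.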


Random Grid Search

Take each point of the signal class as a cut-point (x_i, y_i): x > x_i, y > y_i

n = # events in sample
k = # events after cuts
fraction = k/n

[Figure: signal fraction versus background fraction, both axes running from 0 to 1]

H.B.P. et al., Proceedings, CHEP 1995
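The random grid search replaces the regular grid with cut-points drawn from the signal sample itself. The sketch below assumes synthetic Gaussian samples and an invented selection rule (largest signal fraction with background fraction below 5%); the function names are hypothetical.

```python
import random

def fraction_passing(events, x0, y0):
    """Fraction of events that survive the cuts x > x0 and y > y0."""
    k = sum(1 for x, y in events if x > x0 and y > y0)
    return k / len(events)

def random_grid_search(signal, background, n_cut_points):
    """Use the first n_cut_points signal events as cut-points.

    Returns a list of (background fraction, signal fraction, x0, y0)."""
    results = []
    for x0, y0 in signal[:n_cut_points]:
        results.append((fraction_passing(background, x0, y0),
                        fraction_passing(signal, x0, y0), x0, y0))
    return results

random.seed(3)
signal = [(random.gauss(2.0, 1.0), random.gauss(2.0, 1.0)) for _ in range(500)]
background = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(500)]
points = random_grid_search(signal, background, n_cut_points=200)

# Example selection rule (assumed): best signal fraction with under 5% background.
allowed = [p for p in points if p[0] < 0.05]
best = max(allowed, key=lambda p: p[1])
print(f"background fraction = {best[0]:.3f}, signal fraction = {best[1]:.3f}")
```

Because the cut-points follow the signal density, the number of cuts tried is set by the sample size, not by the number of dimensions.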


Example: DØ Top Discovery (1995)


Optimal Discrimination

Bayes Discriminant

$$r(x, y) = \frac{p(x, y \mid s)\, p(s)}{p(x, y \mid b)\, p(b)}$$

r(x, y) = constant defines the optimal decision boundary

[Figure: decision boundary in the (x, y) plane]
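As an illustration of a discriminant built from the ratio of signal to background densities weighted by their priors, the sketch below evaluates it for two assumed 2-D Gaussian densities. The means, widths, and prior probabilities are invented, and the function names are hypothetical.

```python
import math

def gauss2(x, y, mx, my, sigma):
    """Density of an uncorrelated 2-D Gaussian with equal widths (illustrative model)."""
    r2 = ((x - mx) ** 2 + (y - my) ** 2) / sigma ** 2
    return math.exp(-0.5 * r2) / (2.0 * math.pi * sigma ** 2)

def bayes_discriminant(x, y, p_s=0.1, p_b=0.9):
    """r(x, y) = p(x, y | s) p(s) / [p(x, y | b) p(b)] for the assumed densities."""
    signal = gauss2(x, y, 2.0, 2.0, 1.0) * p_s
    background = gauss2(x, y, 0.0, 0.0, 1.0) * p_b
    return signal / background

def signal_probability(x, y):
    """p(s | x, y) = r / (1 + r)."""
    r = bayes_discriminant(x, y)
    return r / (1.0 + r)

for point in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]:
    r = bayes_discriminant(*point)
    print(point, f"r = {r:.4f}", f"p(s|x,y) = {signal_probability(*point):.4f}")
```

For this particular model (equal covariances) the contours of constant r are the straight lines x + y = constant, so the optimal boundary reduces to a linear discriminant.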


FeedForward Neural Networks

Applications
  Discrimination
  Parameter estimation
  Function and density estimation

Basic Idea
  Encode the mapping $f: U^N \to U$, $f(x) = F[\phi_1, \ldots, \phi_K]$, using a set of 1-D functions (Kolmogorov, 1950s).
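One common way to build a function of several variables from 1-D functions is a network with a single hidden layer, in which each hidden node applies a 1-D function to a weighted sum of the inputs. The sketch below shows only the forward pass; the tanh activation, the layer sizes, and the random weights are assumptions for illustration, and the names are hypothetical.

```python
import math
import random

def feedforward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer: n(x) = b + sum_j v_j * tanh(a_j + sum_i w_ji * x_i)."""
    total = output_bias
    for w_j, a_j, v_j in zip(hidden_weights, hidden_biases, output_weights):
        z = a_j + sum(w * xi for w, xi in zip(w_j, x))   # weighted sum: a single number
        total += v_j * math.tanh(z)                      # 1-D function of that number
    return total

# Untrained network with random parameters (assumed sizes: 3 inputs, 5 hidden nodes).
random.seed(4)
n_inputs, n_hidden = 3, 5
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [random.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [random.uniform(-1, 1) for _ in range(n_hidden)]
output = feedforward([0.2, -0.4, 0.9], hidden_weights, hidden_biases, output_weights, 0.0)
print("network output:", output)
```

Training would adjust the weights and biases to minimize a risk function; that step is omitted here.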


Example: DØ Search for LeptoQuarks

[Figure: Feynman diagram with lines labeled q, g, LQ and l]


Issues

Method choice
  Life is short and data finite; so how should one choose a method?

Model complexity
  How to reduce dimensionality of data, while minimizing loss of “information”?
  How many model parameters?
  How should one avoid over-fitting?


Issues – II

Model robustness
  Is a cut on a multivariate discriminant necessarily more sensitive to modeling errors than a cut on each of its input variables?
  What is a practical, but useful, way to assess sensitivity to modeling errors and robustness with respect to assumptions?


Issues – III

Accuracy of predictions
  How should one place “error bars” on multivariate-based results?
  Is a Bayesian approach useful?

Goodness of fit
  How can this be done in multiple dimensions?


Summary

After ~ 80 years of effort we have many powerful methods of analysis

A few of these are now used routinely in physics analyses

The most pressing need is to understand some issues better so that when the data tsunami strikes we can respond sensibly


FNN – Probabilistic Interpretation

Minimize the empirical risk function with respect to ω

$$R(\omega) = \frac{1}{N} \sum_i \left[ t_i - n(x_i, \omega) \right]^2$$

Solution (for large N)

$$n(x, \omega) = \int t(x)\, p(t \mid x)\, dt$$

If t(x) = kδ[1 − I(x)], where I(x) = 1 if x is of class k, 0 otherwise

$$n(x, \omega) = p(k \mid x) = \frac{p(x \mid k)\, p(k)}{\sum_k p(x \mid k)\, p(k)}$$

D.W. Ruck et al., IEEE Trans. Neural Networks 1(4), 296-298 (1990)
E.A. Wan, IEEE Trans. Neural Networks 1(4), 303-305 (1990)
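The claim that a squared-error fit with targets 1 (class k) and 0 (otherwise) approximates the class probability can be checked without a neural network. The sketch below substitutes a piecewise-constant function (one free parameter per bin) for the network, which is a deliberate simplification; the two Gaussian classes with equal priors are also assumed for illustration.

```python
import math
import random

random.seed(5)
# Assumed toy model: class k ~ Gauss(+1, 1) with target 1, the rest ~ Gauss(-1, 1) with target 0.
events = [(random.gauss(1.0, 1.0), 1.0) for _ in range(20000)]
events += [(random.gauss(-1.0, 1.0), 0.0) for _ in range(20000)]

# Piecewise-constant n(x): one free parameter per bin of width 0.5 on [-3, 3).
n_bins, lo, width = 12, -3.0, 0.5
sum_t = [0.0] * n_bins
count = [0] * n_bins
for x, t in events:
    b = int((x - lo) // width)
    if 0 <= b < n_bins:
        sum_t[b] += t
        count[b] += 1

# Minimizing sum_i [t_i - n(x_i)]^2 bin by bin gives the mean target in each bin.
fitted = [s / c for s, c in zip(sum_t, count)]

def posterior(x):
    """Exact p(k | x) for the assumed model: 1 / (1 + exp(-2x))."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))

max_diff = 0.0
for b, value in enumerate(fitted):
    centre = lo + (b + 0.5) * width
    max_diff = max(max_diff, abs(value - posterior(centre)))
    print(f"x = {centre:+.2f}  fitted n = {value:.3f}  p(k|x) = {posterior(centre):.3f}")
```

The fitted values track the exact class probability within statistical fluctuations, which is the large-N result quoted above; a flexible enough network trained on the same risk function behaves the same way.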


Self Organizing Map

Basic Idea (Kohonen, 1988)
  Map each of K feature vectors X = (x_1,..,x_N)^T into one of M regions of interest defined by the vector w_m so that all X mapped to a given w_m are closer to it than to all remaining w_m.
  Basically, perform a coarse-graining of the feature space.
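The coarse-graining idea can be sketched with a very small map: a 1-D chain of M = 8 reference vectors trained on 2-D feature vectors. The learning-rate and neighbourhood schedules, the uniform toy data, and the function names are all assumptions for illustration, not a tuned implementation.

```python
import random

def nearest(weights, x):
    """Index m of the reference vector w_m closest to the feature vector x."""
    return min(range(len(weights)),
               key=lambda m: sum((wi - xi) ** 2 for wi, xi in zip(weights[m], x)))

def mean_distance2(weights, data):
    """Average squared distance from each feature vector to its nearest w_m."""
    total = 0.0
    for x in data:
        w = weights[nearest(weights, x)]
        total += sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return total / len(data)

def train_som(data, weights, n_epochs=20, rate0=0.5, radius0=2.0):
    """Move the winning node and its chain neighbours toward each sample."""
    weights = [list(w) for w in weights]
    for epoch in range(n_epochs):
        frac = 1.0 - epoch / n_epochs
        rate, radius = rate0 * frac, max(radius0 * frac, 0.5)
        for x in data:
            win = nearest(weights, x)
            for m, w in enumerate(weights):
                if abs(m - win) <= radius:
                    for i in range(len(w)):
                        w[i] += rate * (x[i] - w[i])
    return weights

random.seed(6)
data = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(400)]
initial = [(random.uniform(0.4, 0.6), random.uniform(0.4, 0.6)) for _ in range(8)]
trained = train_som(data, initial)
print("mean squared distance before:", mean_distance2(initial, data))
print("mean squared distance after: ", mean_distance2(trained, data))
```

After training, each feature vector is assigned to its nearest reference vector, so the 400 points are summarized by 8 regions.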


Support Vector Machines

Basic Idea
  Data that are non-separable in N dimensions have a higher chance of being separable if mapped into a space of higher dimension

$$\phi: \mathbb{R}^N \to \mathbb{R}^{\mathrm{Huge}}$$

  Use a linear discriminant to partition the high-dimensional feature space.

$$D(x) = w \cdot \phi(x) + b$$
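The effect of mapping to a higher dimension can be seen in a toy case. In the sketch below, 1-D data that no single threshold can separate become linearly separable after the assumed map x -> (x, x^2). Note that the linear discriminant is found with a plain perceptron, used here as a simple stand-in; it is not the maximum-margin solution that a support vector machine would compute.

```python
import random

def phi(x):
    """Assumed feature map from 1 to 2 dimensions: x -> (x, x^2)."""
    return (x, x * x)

def train_perceptron(features, labels, max_epochs=1000):
    """Look for w, b with sign(w . f + b) equal to the label of every example."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for f, t in zip(features, labels):
            if t * (sum(wi * fi for wi, fi in zip(w, f)) + b) <= 0.0:
                w = [wi + t * fi for wi, fi in zip(w, f)]
                b += t
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def accuracy(w, b, features, labels):
    """Fraction of examples on the correct side of the hyperplane."""
    good = sum(1 for f, t in zip(features, labels)
               if t * (sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.0)
    return good / len(labels)

random.seed(7)
inner = [random.uniform(-1.0, 1.0) for _ in range(100)]                           # class +1
outer = [random.choice((-1, 1)) * random.uniform(2.0, 3.0) for _ in range(100)]   # class -1
xs, labels = inner + outer, [1] * 100 + [-1] * 100

flat = [(x,) for x in xs]          # original 1-D space
mapped = [phi(x) for x in xs]      # mapped 2-D space
w1, b1 = train_perceptron(flat, labels)
w2, b2 = train_perceptron(mapped, labels)
acc_1d = accuracy(w1, b1, flat, labels)
acc_2d = accuracy(w2, b2, mapped, labels)
print("accuracy in the original space:", acc_1d)
print("accuracy after the mapping:   ", acc_2d)
```

In the mapped space the discriminant has the form D(x) = w · φ(x) + b, and a straight line in (x, x^2) corresponds to a pair of thresholds on x.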


Independent Component Analysis

Basic Idea
  Assume X = (x_1,..,x_N)^T is a linear sum X = AS of independent sources S = (s_1,..,s_N)^T. Both A, the mixing matrix, and S are unknown.
  Find a de-mixing matrix T such that the components of U = TX are statistically independent
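A two-source toy case shows what finding a de-mixing matrix involves. The sketch below whitens the mixed data and then scans rotation angles for the one that makes the components most non-Gaussian, using kurtosis as the contrast. This is one simple approach chosen for illustration; the mixing matrix, the uniform sources, and the function names are all assumptions.

```python
import math
import random

def whiten(xs, ys):
    """Centre the data and rescale along the principal axes to unit variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    xs, ys = [x - mx for x in xs], [y - my for y in ys]
    cxx = sum(x * x for x in xs) / n
    cyy = sum(y * y for y in ys) / n
    cxy = sum(x * y for x, y in zip(xs, ys)) / n
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)   # principal-axis direction
    c, s = math.cos(angle), math.sin(angle)
    p = [c * x + s * y for x, y in zip(xs, ys)]
    q = [-s * x + c * y for x, y in zip(xs, ys)]
    sp = math.sqrt(sum(v * v for v in p) / n)
    sq = math.sqrt(sum(v * v for v in q) / n)
    return [v / sp for v in p], [v / sq for v in q]

def kurtosis(values):
    """Excess kurtosis of zero-mean values (0 for a Gaussian)."""
    n = len(values)
    m2 = sum(v * v for v in values) / n
    m4 = sum(v ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

def demix(p, q, n_angles=90):
    """Scan rotations in [0, 90 degrees) and keep the most non-Gaussian pair."""
    best = None
    for i in range(n_angles):
        a = 0.5 * math.pi * i / n_angles
        c, s = math.cos(a), math.sin(a)
        u1 = [c * x + s * y for x, y in zip(p, q)]
        u2 = [-s * x + c * y for x, y in zip(p, q)]
        score = abs(kurtosis(u1)) + abs(kurtosis(u2))
        if best is None or score > best[0]:
            best = (score, u1, u2)
    return best[1], best[2]

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

random.seed(8)
s1 = [random.uniform(-1.0, 1.0) for _ in range(2000)]   # independent sources (assumed uniform)
s2 = [random.uniform(-1.0, 1.0) for _ in range(2000)]
x1 = [1.0 * a + 0.6 * b for a, b in zip(s1, s2)]        # X = A S with an assumed mixing matrix
x2 = [0.4 * a + 1.0 * b for a, b in zip(s1, s2)]
u1, u2 = demix(*whiten(x1, x2))
match1 = max(abs(correlation(s1, u1)), abs(correlation(s1, u2)))
match2 = max(abs(correlation(s2, u1)), abs(correlation(s2, u2)))
print(f"best |correlation| with a recovered component: s1 {match1:.3f}, s2 {match2:.3f}")
```

The sources are recovered only up to order, sign, and scale, which is why the comparison uses the absolute correlation with either recovered component.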