Neural networks in data analysis
Michal Kreps
ISAPP Summer Institute 2009
Outline
What neural networks are
Basic principles of NN
What they are useful for
Training of NN
Available packages
Comparison to other methods
Some examples of NN usage
I'm not an expert on neural networks, only an advanced user who understands the principles. Goal: help you understand the principles so that:
→ neural networks don't look like a black box to you
→ you have a starting point to use neural networks yourself
[Figure: feed-forward network with an 8-node input layer, one hidden layer and a single output node]
Tasks to solve
Different tasks exist in nature: some are easy to solve algorithmically, some are difficult or impossible
Example: reading handwritten text
Lots of personal specifics
Recognition often needs to work with only partial information
Practically impossible in an algorithmic way
But our brains can solve this task extremely fast
→ Use the brain as inspiration
Tasks to solve
Why would an (astro-)particle physicist care about recognition of handwritten text?
Because it is a classification task on not very regular data: do we see the letter A or H?
→ Classification tasks occur very often in our analyses: is this event signal or background?
→ Neural networks have some advantages in this task:
Can learn from known examples (data or simulation)
Can evolve to complex non-linear models
Can easily deal with complicated correlations among inputs
Going to more inputs does not cost additional examples
Down to principles
Zoom in: a detailed look shows a huge number of neurons
Neurons are interconnected, each having many inputs from other neurons
To build an artificial neural network:
Build a model of the neuron
Connect many neurons together
Model of neuron
Get inputs
Combine inputs into a single quantity: S = ∑_i w_i x_i
Pass S through an activation function; most common: A(x) = 2/(1 + e^(−x)) − 1
Give output
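A minimal sketch of this neuron model in Python/NumPy (my own illustration, not code from the lecture; the function names are arbitrary):

```python
import numpy as np

def activation(x):
    # Symmetric sigmoid A(x) = 2/(1 + e^-x) - 1, output in (-1, 1)
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def neuron_output(weights, inputs):
    # Combine inputs into the single quantity S = sum_i w_i x_i, then activate
    S = np.dot(weights, inputs)
    return activation(S)

print(neuron_output(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.3, -0.2])))
```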
Building neural network
Connecting many neurons can give complicated structures, even more complicated ones, or simple structures such as the network shown earlier: an 8-node input layer, one hidden layer and a single output node.
Output of feed-forward NN
Output of node k in the hidden layer: o_k = A(∑_i w^{1→2}_{ik} x_i)
Output of node j in the output layer: o_j = A(∑_k w^{2→3}_{kj} o_k)
Putting both together: o_j = A[∑_k w^{2→3}_{kj} A(∑_i w^{1→2}_{ik} x_i)]
More layers are usually not needed
If more layers are involved, the output follows the same logic
In the following, we stick to three layers as here
Knowledge is stored in the weights
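To make the formulas concrete, here is a sketch of the three-layer forward pass (an assumed illustration with random placeholder weights, not the lecture's code; w1to2 and w2to3 stand for w^{1→2} and w^{2→3}):

```python
import numpy as np

def activation(x):
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def forward(x, w1to2, w2to3):
    # Hidden layer: o_k = A(sum_i w^{1->2}_{ik} x_i)
    hidden = activation(w1to2.T @ x)
    # Output layer: o_j = A(sum_k w^{2->3}_{kj} o_k)
    return activation(w2to3.T @ hidden)

rng = np.random.default_rng(0)
x = rng.normal(size=8)              # 8 input nodes
w1to2 = rng.normal(size=(8, 8))     # input -> hidden weights
w2to3 = rng.normal(size=(8, 1))     # hidden -> output weights
print(forward(x, w1to2, w2to3))     # single output in (-1, 1)
```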
Tasks we can solve
Classification: binary decision whether an event belongs to one of two classes
Is this particle a kaon or not?
Is this event a candidate for the Higgs?
Will Germany become soccer world champion in 2010?
→ Output layer contains a single node
Conditional probability density: get a probability density for each event
What is the decay time of a decay?
What is the energy of a particle given the observed shower?
What is the probability that a customer will cause a car accident?
→ Output layer contains many nodes
NN work flow
[Figure: work flow of a neural-network analysis]
Neural network training
After assembling a neural network, the weights w^{1→2}_{ik} and w^{2→3}_{kj} are unknown; they are initialised to default or random values
Before using the neural network for the planned task, we need to find the best set of weights
We need some measure which tells us which set of weights performs best ⇒ loss function
Training is the process in which, based on known example events, we search for the set of weights that minimises the loss function
It is the most important part of using neural networks
Bad training can result in bad or even wrong solutions
There are many traps in the process, so one needs to be careful
Loss function
Practically any function which allows you to decide which neural network is better
→ In particle physics it can be the one which results in the best measurement
In practical applications, two functions are commonly used (T_ji is the true value, −1 for one class and +1 for the other; o_ji is the actual neural network output):
Quadratic loss function: χ² = ∑_j w_j χ²_j, with χ²_j = (1/2) ∑_i (T_ji − o_ji)²
→ χ² = 0 for correct decisions and χ² = 2 for wrong ones
Entropy loss function: E_D = ∑_j w_j E_jD, with E_jD = ∑_i −log((1/2)(1 + T_ji o_ji))
→ E_D = 0 for correct decisions and E_D = ∞ for wrong ones
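A sketch of the two loss functions for a single output node (my illustration; T is the true class in {−1, +1}, o the network output, w optional event weights):

```python
import numpy as np

def quadratic_loss(T, o, w=None):
    w = np.ones_like(T) if w is None else w
    # chi^2_j = 1/2 (T_j - o_j)^2: 0 for a correct decision, 2 for a fully wrong one
    return np.sum(w * 0.5 * (T - o) ** 2)

def entropy_loss(T, o, w=None):
    w = np.ones_like(T) if w is None else w
    # E_jD = -log(1/2 (1 + T_j o_j)): 0 for correct, infinity for fully wrong
    return np.sum(w * -np.log(0.5 * (1.0 + T * o)))

T = np.array([1.0, -1.0, 1.0])
o = np.array([0.9, -0.8, 0.1])
print(quadratic_loss(T, o), entropy_loss(T, o))
```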
Conditional probability density
Classification is the easy case: we need a single output node, and if the output is above a chosen value the event is class 1 (signal), otherwise class 2 (background)
Now the question is how to obtain a probability density
Use many nodes in the output layer, each answering the question whether the true value is larger than some threshold
As an example, take n = 10 nodes and a true value of 0.63
Vector of true values for the output nodes: (1, 1, 1, 1, 1, 1, −1, −1, −1, −1)
Node j answers the question whether the value is larger than j/10
The output vector then represents the integrated probability density
After differentiation we obtain the desired probability density
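A sketch of this target-vector construction (assumed illustration; the true value is taken to lie in [0, 1]):

```python
import numpy as np

def target_vector(true_value, n_nodes=10):
    # Node j (1-based) answers: is the true value larger than j/n_nodes?
    thresholds = np.arange(1, n_nodes + 1) / n_nodes
    return np.where(true_value > thresholds, 1.0, -1.0)

print(target_vector(0.63))  # six +1 entries followed by four -1 entries
```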
Interpretation of NN output
In classification tasks, in general a higher NN output means a more signal-like event
→ The NN output rescaled to the interval [0, 1] can be interpreted as a probability, provided that:
A quadratic/entropy loss function is used
The NN is well trained ⇔ the loss function is minimised
→ The purity P(o) = N_selected signal(o) / N_selected(o) should then be a linear function of the NN output
Looking at the expression for the output, o_j = A[∑_k w^{2→3}_{kj} A(∑_i w^{1→2}_{ik} x_i)]
→ the NN is a function which maps the n-dimensional input space to a 1-dimensional space
[Figure: purity as a function of network output]
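A sketch of this linearity check, binning the output and computing the purity per bin (my illustration; labels are True for signal, False for background):

```python
import numpy as np

def purity_vs_output(output, is_signal, n_bins=20):
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    purity = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (output >= lo) & (output < hi)
        # Purity in this bin: fraction of selected events that are signal
        purity.append(is_signal[sel].mean() if sel.any() else np.nan)
    return edges, np.array(purity)

rng = np.random.default_rng(2)
output = np.append(np.tanh(rng.normal(0.5, 1.0, 5000)),    # toy signal
                   np.tanh(rng.normal(-0.5, 1.0, 5000)))   # toy background
labels = np.append(np.ones(5000), np.zeros(5000)).astype(bool)
edges, purity = purity_vs_output(output, labels)
print(purity)
```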
How to check quality of training
Define quantities:
Signal efficiency: ε_s = N(selected signal) / N(all signal)
Efficiency: ε = N(selected) / N(all)
Purity: P = N(selected signal) / N(selected)
[Figure: distribution of the NN output for the two classes] → How good is the separation?
[Figure: purity vs. network output] → Are we at the minimum of the loss function?
[Figure: signal purity vs. signal efficiency; each point corresponds to a requirement out > cut (or out < cut); the curve depends on the training S/B; the best point is (1, 1)]
[Figure: signal efficiency vs. efficiency; each point corresponds to a requirement out > cut; shown together with the best achievable curve and the random-decision diagonal]
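The same quantities as code, scanning a cut on the network output (my sketch of how such performance curves can be produced):

```python
import numpy as np

def performance_curves(output, is_signal, cuts):
    n_sig_all, n_all = is_signal.sum(), len(output)
    rows = []
    for cut in cuts:
        sel = output > cut                      # requirement out > cut
        n_sel, n_sel_sig = sel.sum(), is_signal[sel].sum()
        eff_s = n_sel_sig / n_sig_all           # signal efficiency
        eff = n_sel / n_all                     # efficiency
        purity = n_sel_sig / n_sel if n_sel else float("nan")
        rows.append((cut, eff_s, eff, purity))
    return rows
```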
Issues in training
Main issue: finding the minimum in a high-dimensional space
5 input nodes with 5 nodes in the hidden layer and one output node give 5×5 + 5 = 30 weights
Imagine the task of finding the deepest valley in the Alps (only 2 dimensions):
Put at some random place, it is easy to find some minimum
But what is in the next valley?
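One standard remedy for the "next valley" problem (an assumed illustration on a 1-D toy loss, not the lecture's training algorithm) is gradient descent with several random restarts, keeping the deepest minimum found:

```python
import numpy as np

def toy_loss(w):
    # Two valleys of different depth; the deeper one lies near w = 2.5
    return np.sin(3.0 * w) + 0.1 * (w - 2.5) ** 2

def gradient(w, eps=1e-5):
    return (toy_loss(w + eps) - toy_loss(w - eps)) / (2.0 * eps)

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * gradient(w)
    return w                                   # ends in the nearest valley

rng = np.random.default_rng(1)
starts = rng.uniform(-3.0, 6.0, size=10)       # random starting points
minima = [descend(w) for w in starts]
best = min(minima, key=toy_loss)               # keep the deepest valley found
print(best, toy_loss(best))
```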
Issues in training
In training, we start with:
A large number of weights which are far from their optimal values
Input values in unknown ranges
→ Large chance of numerical issues in the calculation of the loss function
Statistical issues coming from low statistics in training:
In many places in physics not that much of an issue, as the training statistics can be large
In industry it can be quite a big issue to obtain a large enough training dataset
But also in physics it can sometimes be very costly to obtain a large enough training sample
Learning by heart (overtraining), when the NN has enough weights to do that
How to treat issues
All issues mentioned can in principle be treated by the neural network package:
Regularisation for numerical issues at the beginning of training
Weight decay ("forgetting"): helps against overtraining on statistical fluctuations; the idea is that effects which are not significant are not shown often enough to keep the weights that remember them
Pruning of connections: remove connections which don't carry significant information
Preprocessing: against numerical issues; helps the search for a good minimum through a good starting point and decorrelation
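A sketch of the weight-decay idea (the decay rate is an assumed illustrative value): weights are pulled slightly toward zero after every step, so connections that the data does not keep reinforcing are gradually forgotten:

```python
import numpy as np

def decayed_update(weights, grad, lr=0.01, decay=1e-4):
    weights = weights - lr * grad      # usual gradient step
    return weights * (1.0 - decay)     # forget a tiny fraction each step

w = np.array([0.5, -0.2, 1.3])
g = np.array([0.1, 0.0, -0.3])
print(decayed_update(w, g))
```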
Neurobayes preprocessing
[Figure: input node 5 flattened into bins of equal statistics; purity per bin with a spline fit; final preprocessed distribution]
Bin the variable into bins of equal statistics
Calculate the purity for each bin and do a spline fit
Use the purity instead of the original value, plus transform to µ = 0 and RMS = 1
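A sketch of these preprocessing steps (my illustration; the spline fit over the bin purities is skipped for brevity):

```python
import numpy as np

def preprocess(x, is_signal, n_bins=100):
    # Bin the variable into bins of (approximately) equal statistics
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    # Purity of each bin; a spline fit would smooth these values
    purity = np.array([is_signal[idx == b].mean() for b in range(n_bins)])
    out = purity[idx]                        # replace value by its bin purity
    return (out - out.mean()) / out.std()    # transform to mu = 0, RMS = 1

rng = np.random.default_rng(3)
x = np.append(rng.exponential(2.0, 5000), rng.exponential(1.0, 5000))
labels = np.append(np.ones(5000, bool), np.zeros(5000, bool))
print(preprocess(x, labels)[:5])
```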
Electron ID at CDF
Comparing two different NN packages with exactly the same inputs
Preprocessing is important and can improve the result significantly
Available packages
Not an exhaustive list:
NeuroBayes (http://www.phi-t.de)
The best I know
Commercial, but the fee is small for scientific use
Originally developed in a high energy physics experiment
ROOT/TMVA
Part of the ROOT package
Can be used for playing, but I would advise using something else for serious applications
SNNS: Stuttgart Neural Network Simulator (http://www.ra.cs.uni-tuebingen.de/SNNS/)
Written in C, but a ROOT interface exists
The best open-source package I know about
NN vs. other multivariate techniques
Several times I have seen statements of the type "my favourite method is better than other multivariate techniques"
Usually a lot of time is spent understanding and properly training the favourite method
Other methods are tried just for a quick comparison, without understanding or tuning
Often you will not even learn any details of how the other methods were set up
With TMVA it is now easy to compare different methods; often I saw the Fisher discriminant win in those comparisons
From some tests we found that the NN implemented in TMVA is relatively bad, so no wonder it is usually worse than the other methods
It is usually useful to understand the method you are using; none of them is a black box
Practical tips
No need for several hidden layers; one should be sufficient for you
Number of nodes in hidden layer should not be very large
→ Risk of learning by heart
Number of nodes in hidden layer should not be very small
→ Not enough capacity to learn your problem
→ Usually O(number of inputs) in hidden layer
Understand your problem: the NN gets numbers and can dig information out of them, but you know the meaning of those numbers
Practical tips
Don't throw away knowledge you acquired about the problem before starting to use a NN
→ A NN can do a lot, but you don't need it to discover what you already know
If you already know that a given event is background, throw it away
If you know about good inputs, use them rather than the raw data from which they are calculated
Use the neural network to learn what you don't know yet
Physics examples
Only mentioning work done by the Karlsruhe HEP group:
"Stable" particle identification at DELPHI, CDF and now Belle
Properties (E, φ, θ, Q-value) of inclusively reconstructed B mesons
Decay time in semileptonic B_s decays
Heavy quark resonance studies: X(3872), B**, B**_s, Λ*_c, Σ_c
B flavour tagging and B_s mixing (DELPHI, CDF, Belle)
CPV in B_s → J/ψφ decays (CDF)
b tagging for high-pT physics (CDF, CMS)
Single top and Higgs searches (CDF)
B hadron full reconstruction (Belle)
Optimisation of the KEKB accelerator
B**_s at CDF
[Figure: neural network output for signal and background, and the mass difference M(B⁺K⁻) − M(K⁻) − M(B⁺) for B**_s → B⁺K⁻, B⁺ → J/ψK⁺ candidates, showing the B_s1 and B*_s2 peaks; CDF Run 2, 1.0 fb⁻¹]
Background from data, signal from MC
First observation of the B_s1
Most precise masses
X(3872) mass at CDF
[Figure: network output for signal and background, and the J/ψππ mass distribution showing total, rejected and selected candidates; CDF II preliminary, 2.4 fb⁻¹]
X(3872) was the first of the charmonium-like states
A lot of effort to understand its nature
Powerful selection at CDF of the largest sample, leading to the best mass measurement
Single top search
[Figure: KIT flavour separator output for single top, tt̄, Wbb̄, Wcc̄, Wc, Wqq̄, diboson, Z+jets and QCD, normalised to unit area (TLC 2 jets, 1 tag), and the final NN output for candidate events with MC normalised to the SM prediction, all channels; CDF II preliminary, 3.2 fb⁻¹]
Complicated analysis with many different multivariate techniques
Tagging whether a jet is a b-quark jet or not using a NN
Final data plot shown above
Part of a big program leading to the discovery of single top production
B flavour tagging - CDF
Flavour tagging at hadron colliders is a difficult task
Lots of potential for improvements
In the plot, concentrate on the continuous lines
Example of not-that-good separation
Training with weights
[Figure: mass difference M(Λ_c π) − M(Λ_c) [GeV/c²], candidates per 0.4 MeV; ≈ 23300 signal events, ≈ 472200 background events; CDF Run II preliminary, L = 2.4 fb⁻¹]
One knows statistically whether an event is signal or background
Or one has a sample with a mixture of signal and background and a pure sample for one class
Can use data only; the sketch below gives an example of the application
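One way to realise such weighted training (my assumed sketch, not necessarily the procedure used in the analysis): give each data event a signal probability p from the mass fit and let it enter the loss twice, as signal with weight p and as background with weight 1 − p:

```python
import numpy as np

def weighted_quadratic_loss(outputs, p_signal):
    # Each event: target +1 with weight p, target -1 with weight 1 - p
    loss_sig = p_signal * 0.5 * (1.0 - outputs) ** 2
    loss_bkg = (1.0 - p_signal) * 0.5 * (-1.0 - outputs) ** 2
    return np.sum(loss_sig + loss_bkg)

outputs = np.array([0.8, -0.4, 0.1])
p_signal = np.array([0.9, 0.2, 0.5])   # signal probabilities from a mass fit
print(weighted_quadratic_loss(outputs, p_signal))
```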
Training with weights
[Figure: mass difference M(Λ_c⁺π⁺) − M(Λ_c⁺) [GeV/c²], candidates per 1.30 MeV, with data, fit function, Σ_c(2455)⁺⁺ → Λ_c⁺π⁺ and Σ_c(2520)⁺⁺ → Λ_c⁺π⁺ components, plus fit residuals; CDF Run II preliminary, L = 2.4 fb⁻¹]
→ Use only data in the selection
→ Can compare data to simulation
→ Can reweight simulation to match data
Examples outside physics
Medicine and pharma research: e.g. effects and undesirable effects of drugs, early tumour recognition
Banks: e.g. credit scoring (Basel II), financial time series prediction, valuation of derivatives, risk-minimised trading strategies, client valuation
Insurances: e.g. risk and cost prediction for individual clients, probability of contract cancellation, fraud recognition, fairness in tariffs
Retail chain stores: turnover prognosis for individual articles/stores
Data Mining Cup
World's largest students' competition: the Data-Mining-Cup
Very successful in competition with other data-mining methods:
2005: fraud detection in internet trading
2006: price prediction in eBay auctions
2007: coupon redemption prediction
2008: lottery customer behaviour prediction
NeuroBayes turnover prognosis
[Figure: NeuroBayes turnover prognosis results]
Individual health costs
Pilot project for a large private health insurance
Prognosis of the costs in the following year for each insured person, with confidence intervals
4 years of training, test on the following year
Results: probability density for each customer/tariff combination
Very good test results!
Potential for an objective cost reduction in health management
Prognosis of sport events
Results: Probabilities for home - tie - guest
Conclusions
Neural networks are a powerful tool for your analysis
They are not geniuses, but they can do marvellous things for one
A lecture of this kind cannot really make you an expert
I hope that I was successful in:
Giving you enough insight so that a neural network is no longer a black box to you
Teaching you enough not to be afraid to try
Feel free to talk to me, I’m at Campus Süd, building 30.23,room 9-11