
Data Handling/Statistics

There is no substitute for books: you need professional help!

My personal favorites, from which this lecture is drawn:

•The Cartoon Guide to Statistics, L. Gonick & W. Smith
•Data Reduction in the Physical Sciences, P. R. Bevington
•Workshop Statistics, A. J. Rossman & B. L. Chance
•Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling
•Origin 6.1 Users Manual, MicroCal Corporation

Outline

•Our motto
•What those books look like
•Stuff you need to be able to look up
  •Samples & Populations
  •Mean, Standard Deviation, Standard Error
  •Probability
  •Random Variables
  •Propagation of Errors
•Stuff you must be able to do on a daily basis
  •Plot
  •Fit
  •Interpret

Our Motto

That which can be taught can be learned.

The “progress” of civilization relies on being able to do more and more things while thinking less and less about them.

An opposing, non-CMC IGERT viewpoint

What those books look like

The Cartoon Guide to Statistics

In this example, the author provides step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc.

The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.

An Introduction to Error Analysis

A very readable text, but with enough math to be rigorous. The cover says it all: the book’s emphasis is how statistics and error analysis are important in the everyday.

Author John Taylor is known as “Mr. Wizard” at Univ. of Colorado, for his popular science lectures aimed at youngsters.

Bevington

Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.

“Workshop Statistics”

This book has a website full of data that it tells you how to analyze. The test cases are often pretty interesting, too.

Many little shadow boxes provide info.

“Numerical Recipes”

A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). Includes advanced topics like digital filtering, but harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, filter practically anything.

Stuff you need to be able to look up

Samples vs. Populations

The world as we understand it, based on science.

The world as God understands it, based on omniscience.

Statistics is not art but artifice–a bridge to help us understand phenomena, based on limited observations.

Our problem

Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)?

Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?

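Sample View vs. Population View

Average: sample $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population $\mu = E(x)$

Variance: sample $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$; population $\sigma^2 = E[(x - \mu)^2]$

Standard deviation: sample $s = \sqrt{s^2}$; population $\sigma = \sqrt{\sigma^2}$

Standard error of mean: sample $\mathrm{SEM} = \frac{s}{\sqrt{n}}$; population $\sigma_{\bar{x}} \approx \frac{\sigma}{\sqrt{n}}$ (the purple equation)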

Sample View: direct, experimental, tangible

The single most important thing about this is the reduction in standard deviation or standard error of the mean according to inverse root n:

$\mathrm{SEM} = \sqrt{s^2/n} = s/\sqrt{n} \sim 1/\sqrt{n}$ (for large $n$)

Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remembered nothing else from this lecture, it would be a success!
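The sample-view quantities are quick to compute. Here is a minimal sketch using only the Python standard library; the replicate values in `runs` are made up for illustration, and `summarize` is a hypothetical helper name, not something from the books above.

```python
import math
import statistics


def summarize(sample):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, n - 1 denominator
    sem = s / math.sqrt(n)
    return mean, s, sem


# Hypothetical replicate measurements (invented for this sketch).
runs = [10.1, 9.8, 10.3, 9.9, 10.0]
mean, s, sem = summarize(runs)
print(f"mean = {mean:.3f}, s = {s:.3f}, SEM = {sem:.3f}")

# For a fixed s, nine times as many points shrinks the SEM by a factor of three.
s_fixed = 1.0
ratio = (s_fixed / math.sqrt(5)) / (s_fixed / math.sqrt(45))
print(f"SEM(n=5) / SEM(n=45) = {ratio:.1f}")
```

The second part is just the inverse-root-n rule: going from 5 to 45 points is 9 times the work for 3 times the precision.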

Population View: conceptual, layered with arcana!

The purple equation in the table is an expression of the central limit theorem. If we measure many averages, we do not always get the same average:

$\bar{x}$ is itself a random variable!

“If one takes random samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, then (for large $n$) $\bar{x}$ itself approaches a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$” (from “Cartoon...”).

Huh? It means…if you want to estimate $\sigma$, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by $n^{1/2}$. This is hard work!

A lot of times, $\sigma$ is approximated by s.

If you wanted to estimate the population average $\mu$, the best you can do is to measure many averages and average those.

A lot of times, $\mu$ is approximated by $\bar{x}$.

IT’S HARD TO KNOW WHAT GOD DOES.

I think the $\sigma$ in the purple equation should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.
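The "measure many averages" recipe is easy to try on a computer, where we get to play God and pick the population ourselves. The following is a simulation sketch, not a real experiment: `MU`, `SIGMA`, `N`, and `N_AVERAGES` are arbitrary values chosen for illustration, and the population is assumed to be normal.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

MU, SIGMA = 50.0, 4.0  # hypothetical population values, known only to the simulation
N = 25                 # data points in each average
N_AVERAGES = 2000      # how many averages we "measure"

# Each entry is the average of N draws from the population.
averages = [
    statistics.mean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(N_AVERAGES)
]

# The spread of the averages is about sigma / sqrt(N), so multiplying by sqrt(N)
# recovers an estimate of sigma; the average of the averages estimates mu.
sigma_estimate = statistics.stdev(averages) * math.sqrt(N)
mu_estimate = statistics.mean(averages)
print(f"sigma estimate = {sigma_estimate:.2f}, mu estimate = {mu_estimate:.2f}")
```

Both estimates should land close to the values that were put in, which is the central limit theorem doing its job. In the lab, 2000 averages of 25 points each is the hard work the slide warns about.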

You got to compromise, fool!

The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student,” and his t-distribution is known as Student’s t.

The Student’s t distribution helps us assign confidence in our imperfect experiments on small samples.

Input: desired confidence level, estimate of population mean (or estimated probability), estimated error of the mean (or probability).

Output: ± something
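As a rough sketch of that input/output picture, the snippet below turns a small sample into "mean ± something" at 95% confidence. The numbers in `T_95` are the standard two-sided 95% critical values of Student's t from any printed table; the data in `runs` and the function name `confidence_interval_95` are invented for illustration.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def confidence_interval_95(sample):
    """Return (mean, half-width) of the 95% confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # estimated error of the mean
    half_width = T_95[n - 1] * sem
    return mean, half_width


# Hypothetical replicate runs (made up for this sketch).
runs = [10.1, 9.8, 10.3, 9.9, 10.0]
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f} +/- {half_width:.2f} (95% confidence, n = {len(runs)})")
```

With only five runs the multiplier is 2.776 rather than the 1.96 of a normal distribution; that extra width is the price of a small sample.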

Probability…is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is given as the probability of each mass times that mass, summed over all masses. The standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) and P is a probability density. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x − dx/2 and x + dx/2.

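Discrete system:

$\mu = \sum_{\text{all } x} x\,F(x)$

$\sigma^2 = \sum_{\text{all } x} (x - \mu)^2 F(x)$

Continuous system:

$\mu = \int x\,P(x)\,dx$

$\sigma^2 = \int (x - \mu)^2 P(x)\,dx$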
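For the discrete case, the sums are one-liners. This is a small sketch with a made-up three-component mixture; the `masses` and `fractions` values are assumptions for illustration, with the fractions playing the role of F(x).

```python
import math

# Hypothetical mixture: three species with these masses and mole fractions.
masses = [100.0, 200.0, 300.0]
fractions = [0.2, 0.5, 0.3]  # normalized frequencies F(x); they sum to 1

# mu = sum of x * F(x) over all x
mu = sum(x * f for x, f in zip(masses, fractions))

# sigma^2 = sum of (x - mu)^2 * F(x) over all x
variance = sum((x - mu) ** 2 * f for x, f in zip(masses, fractions))
sigma = math.sqrt(variance)

print(f"mu = {mu:.1f}, sigma = {sigma:.1f}")
```

The continuous version replaces the sums with integrals over P(x)dx, but the bookkeeping is the same.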

Here’s a normal probability density distribution from “Workshop…,” where you use actual data to discover.

What it means

Although you don’t usually know the distribution (either $\mu$ or $\sigma$), about 68% of your measurements will fall within 1$\sigma$ of $\mu$…if the distribution is a “normal,” bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample size, with some average, $\bar{x}$, and standard deviation, s (inferior to $\mu$ and $\sigma$, respectively), how far away do we think the true $\mu$ is?

Details

No way I could do it better than “Cartoon…” or “Workshop…”

Remember…this is the part of the lecture entitled “things you must be able to look up.”

Propagation of errors

Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be L and W, and standard deviations $s_L$ and $s_W$. Now, you want to buy carpet, so you need area A = L·W.

What is the uncertainty in A due to the measurement errors in L and W?

Answer! There is no telling….but you have several options to estimate it.

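A = L·W example

Here are your measured data:

$L = 30 \pm 1$ ft

$W = 19 \pm 2$ ft

You can consider “most” and “least” cases:

$A_{\max} = L_{\max} W_{\max} = 31\ \text{ft} \times 20\ \text{ft} = 620\ \text{ft}^2$

$A_{\min} = L_{\min} W_{\min} = 29\ \text{ft} \times 17\ \text{ft} = 493\ \text{ft}^2 \approx 490\ \text{ft}^2$

$A_{\text{average}} = \frac{A_{\max} + A_{\min}}{2} \approx 557\ \text{ft}^2$

estimated uncertainty: $\frac{620 - 490}{2}\ \text{ft}^2 = 65\ \text{ft}^2$

reported area: $(560 \pm 65)\ \text{ft}^2$

Note that with $W = 19 \pm 2$ ft the largest width is 21 ft, which would give 31 ft × 21 ft = 651 ft² and a wider range; the figures here use 20 ft.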

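Another way

We can use a formula for how $\sigma$ propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then the variance in y follows this rule:

$\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$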

Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.

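Translation in our case, where A = L·W:

$\sigma_A^2 = \left(\frac{\partial A}{\partial L}\right)^2 \sigma_L^2 + \left(\frac{\partial A}{\partial W}\right)^2 \sigma_W^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$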

Problem: we don’t know W, L, $\sigma_L$ or $\sigma_W$! These are population numbers we could only get if we had the entire planet measure this particular room. We therefore assume that our measurement set is large enough (n = 30) that we can use our measured averages for W and L and our standard deviations for $\sigma_L$ and $\sigma_W$.

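$\sigma_A^2 \approx (19\ \text{ft})^2 (1\ \text{ft})^2 + (30\ \text{ft})^2 (2\ \text{ft})^2 = 3961\ \text{ft}^4$

$\sigma_A \approx 63\ \text{ft}^2$

So the value to report is:

$A = (19)(30) \pm 63\ \text{ft}^2$

or....

$A = (570 \pm 63)\ \text{ft}^2$

Compare this to our empirical, most/least calculation:

$A = (560 \pm 65)\ \text{ft}^2$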
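The carpet calculation can be checked in a few lines. This sketch hard-codes the partial derivatives for A = L·W (dA/dL = W and dA/dW = L) and uses the measured averages and standard deviations in place of the unknown population values; `area_with_uncertainty` is a hypothetical helper name.

```python
import math


def area_with_uncertainty(length, s_length, width, s_width):
    """Return (area, propagated standard deviation) for A = L * W."""
    area = length * width
    # variance of A = (dA/dL)^2 * s_L^2 + (dA/dW)^2 * s_W^2, with dA/dL = W, dA/dW = L
    variance = (width * s_length) ** 2 + (length * s_width) ** 2
    return area, math.sqrt(variance)


# L = 30 +/- 1 ft and W = 19 +/- 2 ft, as in the example.
area, s_area = area_with_uncertainty(30.0, 1.0, 19.0, 2.0)
print(f"A = {area:.0f} +/- {s_area:.0f} ft^2")
```

For a different function of L and W, only the two derivative terms would change.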

Error propagation caveats

The equation, $\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$, assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good knowledge a priori about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then $\sigma_N = N^{1/2}$. Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.

Stuff you must know how to do on a daily basis

Plot!!!

[Plot: $\Gamma$/Hz vs. $q^2$/$10^{10}$ cm$^{-2}$ for “Larger Particle, 30.9 g/ml,” with a linear fit.]

Parameter   Value          Error
------------------------------------------------------------
A           -0.00267       44.94619
B           2.25237E-7     8.46749E-10
------------------------------------------------------------

R           SD           N     P
------------------------------------------------------------
0.99987     118.8859     21    <0.0001
------------------------------------------------------------

r = 0.99987, $r^2$ = 0.9997

99.97% of the trend can be explained by the fitted relation.

Intercept = −0.003 ± 45 (i.e., zero!)

The same data

[Plot: $D_{app}$/cm$^2$ s$^{-1}$ vs. $q^2$/$10^{10}$ cm$^{-2}$ for “Larger Particle, 30.9 g/ml,” with a linear fit.]

How to find this file: twilight users rcueto e739

Parameter   Value           Error
------------------------------------------------------------
A           2.2725E-7       7.62107E-10
B           -3.09723E-20    1.43575E-20
------------------------------------------------------------

R           SD             N     P
------------------------------------------------------------
-0.44355    2.01583E-9     21    0.044
------------------------------------------------------------

r = −0.444, $r^2$ = 0.20

Only 20% of the variation in the data can be explained by the line! While $\Gamma$ depended on $q^2$, $D_{app}$ does not!

What does the famous “$r^2$” really tell us?

Suppose you invented a new polymer that you hoped was more stable over time than its predecessor…

So you check.

time    melting point
2       110.2
4       110.9
8       108.8
12      109.1
16      109.0
24      108.5
36      110.0
48      109.2

Question:

What describes the data better:

A simple average (meaning things aren’t really changing over time: it is stable)

OR

A trend (meaning melting point might be dropping over time)?

How well does the mean describe the data?

These are called ‘residuals.’

The sum of the squares of all the residuals characterizes how well the data fit the mean.

$S_t = \sum_i (T_i - T_{\text{mean}})^2$ (= 4.6788)

How much better is a fit (i.e., a regression in this case)?

The regression also has residuals.

The sum of their squares is smaller than $S_t$.

$S_r = \sum_i (T_i - T_{\text{fit},i})^2$ (= 4.3079)

The $r^2$ value simply compares the fit to the mean, by comparing the sums of the squares:

$r^2 = \frac{S_t - S_r}{S_t}$

$r^2 = \frac{4.6788 - 4.3079}{4.6788} = 0.0793$

In our case, the fit was NOT a dramatic improvement, explaining only 7.9% of the variability of the data!
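The melting-point numbers can be reproduced directly from the definitions. The sketch below fits an ordinary least-squares straight line (an assumption about what the regression was, though it does reproduce the quoted sums) and then forms the two sums of squares.

```python
times = [2, 4, 8, 12, 16, 24, 36, 48]
melting_points = [110.2, 110.9, 108.8, 109.1, 109.0, 108.5, 110.0, 109.2]

n = len(times)
x_mean = sum(times) / n
y_mean = sum(melting_points) / n

# Ordinary least-squares line: T_fit = intercept + slope * time
sxx = sum((x - x_mean) ** 2 for x in times)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(times, melting_points))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# S_t: squared residuals about the mean. S_r: squared residuals about the fit.
S_t = sum((y - y_mean) ** 2 for y in melting_points)
S_r = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(times, melting_points))

r_squared = (S_t - S_r) / S_t
print(f"S_t = {S_t:.4f}, S_r = {S_r:.4f}, r^2 = {r_squared:.4f}")
```

Seen this way, $r^2$ is just the fraction of the scatter about the mean that the line removes.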

[Plot: $R_h$/nm vs. c/g-ml$^{-1}$, with a linear fit. Annotations on the plot: “Range of Rg values observed in MALLS” and “$(3/5)^{1/2} R_h$.”]

[6/7/01 13:44 "/Rhapp" (2452067)]
Linear Regression for BigSilk_Ravgnm:
Y = A + B * X

Parameter   Value       Error
------------------------------------------------------------
A           20.88925    0.19213
B           0.01762     0.01105
------------------------------------------------------------

R          SD         N    P
------------------------------------------------------------
0.62332    0.28434    6    0.18611
------------------------------------------------------------

Plot showing 95% confidence limits. Excel doesn’t excel at this!

Interpreting data: Life on the bleeding edge of cutting technology. Or is that bleating edge?

[Plot: $R_g$/nm vs. M, with M axis ticks at 3E6, 1E7 and 2E7. Annotations on the plot: 0.324 +/- 0.04 and $d_f$ = 3.12 +/- 0.44.]

The noise level in individual runs is much less than the run-to-run variation. That’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!

Correlation Caveat!

Correlation ≠ Cause. No, Correlation = Association.

[Chart: Life Expectancy vs. TV's per person, with fitted line y = 35.441x + 57.996, R2 = 0.5782.]

Country          Life Expectancy   People per TV   TV's per person
Angola           44                200             0.0050
Australia        76.5              2               0.5000
Cambodia         49.5              177             0.0056
Canada           76.5              1.7             0.5882
China            70                8               0.1250
Egypt            60.5              15              0.0667
France           78                2.6             0.3846
Haiti            53.5              234             0.0043
Iraq             67                18              0.0556
Japan            79                1.8             0.5556
Madagascar       52.5              92              0.0109
Mexico           72                6.6             0.1515
Morocco          64.5              21              0.0476
Pakistan         56.5              73              0.0137
Russia           69                3.2             0.3125
South Africa     64                11              0.0909
Sri Lanka        71.5              28              0.0357
Uganda           51                191             0.0052
United Kingdom   76                3               0.3333
United States    75.5              1.3             0.7692
Vietnam          65                29              0.0345
Yemen            50                38              0.0263

58% of the variation in life expectancy is associated with TVs per person. Would we save lives by sending TVs to Uganda?

Excel does not automatically provide estimates!

Linearize it!

Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do!

[Chart: Life Expectancy vs. People per TV, with fitted line y = -0.1156x + 70.717, R2 = 0.6461.]

Linearity is improved by plotting Life vs. people per TV rather than TVs per person.
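Both trendlines can be reproduced without Excel. The sketch below copies the life expectancy and people-per-TV columns from the country table, computes TVs per person as the reciprocal (the table's own column is the same thing rounded to four places), and fits a straight line each way; `linear_fit` is a hypothetical helper written for this example.

```python
def linear_fit(xs, ys):
    """Least-squares line y = intercept + slope * x; returns (slope, intercept, r^2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept, sxy ** 2 / (sxx * syy)


# (life expectancy, people per TV) for the 22 countries in the table.
DATA = [
    (44, 200), (76.5, 2), (49.5, 177), (76.5, 1.7), (70, 8), (60.5, 15),
    (78, 2.6), (53.5, 234), (67, 18), (79, 1.8), (52.5, 92), (72, 6.6),
    (64.5, 21), (56.5, 73), (69, 3.2), (64, 11), (71.5, 28), (51, 191),
    (76, 3), (75.5, 1.3), (65, 29), (50, 38),
]
life = [row[0] for row in DATA]
people_per_tv = [row[1] for row in DATA]
tvs_per_person = [1.0 / p for p in people_per_tv]

slope_tv, intercept_tv, r2_tv = linear_fit(tvs_per_person, life)
slope_ppl, intercept_ppl, r2_ppl = linear_fit(people_per_tv, life)

print(f"Life vs TVs per person: y = {slope_tv:.3f}x + {intercept_tv:.3f}, r^2 = {r2_tv:.4f}")
print(f"Life vs people per TV:  y = {slope_ppl:.4f}x + {intercept_ppl:.3f}, r^2 = {r2_ppl:.4f}")
```

Neither version says anything about cause; the second is simply the straighter plot.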

These 4 plots all have the same slopes, intercepts and r values!

Plots are pictures of science, worth thousands of words in boring tables.

From whence do those lines come? Least squares fitting.

“Linear fits”: the fitted coefficients appear linearly in the expression, e.g.,

$y = a + bx + cx^2 + dx^3$

An analytical “best fit” exists!

“Nonlinear fits”: at least some of the fitted coefficients appear in transcendental arguments, e.g.,

$y = a + b e^{-cx} + d\cos(ex)$

Best fit found by trial & error. Beware false solutions! Try several initial guesses!
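To make "an analytical best fit exists" concrete for the linear case, here is a sketch that fits a polynomial by solving the least-squares normal equations directly, with no initial guesses and no iteration. It uses a hand-written Gaussian elimination to stay within the standard library; the function names and the test data (generated from y = 1 + 2x + 3x², with no noise) are assumptions for illustration. A library routine would be the sensible choice for real work.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_polynomial(xs, ys, degree):
    """Return least-squares coefficients [a, b, c, ...] of y = a + b*x + c*x^2 + ..."""
    size = degree + 1
    # Normal equations: sum_k coeff_k * sum(x^(j+k)) = sum(y * x^j) for each j.
    a = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(size)]
    return solve_linear_system(a, b)


# Hypothetical noise-free data from y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
coefficients = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coefficients])
```

The fit is "linear" because the unknowns a, b, c enter linearly, even though the curve itself is a parabola. A nonlinear form has no such closed-form solution, hence the trial and error.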

CURVE FITTING: Fit the trend or fit the points?

Earth’s mean annual temp has natural fluctuations year to year.

To capture a long term trend, we don’t want to fit the points, so use a low-order polynomial regression.

BUT,

The bumps and jiggles in the U.S. population data are ‘real.’

We don’t want to lose them in a simple trend.

REGRESSION: We lost the baby boom!

SINGLE POLYNOMIAL: Does funny things (see 1905).

SPLINE: YES: Lots of individual polynomials give us a smooth fit (especially good for interpolation).

All data points are not created equal.

Since that one point has so much error (or noise), should we really worry about minimizing its square? No.

We should minimize “chi-squared.”

$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - y_{\text{fit}})^2}{\sigma_i^2}$

$\chi^2_{\text{reduced}} = \frac{1}{\nu} \sum_{i=1}^{n} \frac{(y_i - y_{\text{fit}})^2}{\sigma_i^2}$

$\nu$ is the # of degrees of freedom: n − # of parameters fitted.

Goodness of fit parameter that should be unity for a “fit within error.”
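A small numerical sketch shows why the weighting matters. All of the numbers below are invented: five points, a hypothetical fitted line y = 2x with one fitted parameter (the slope), and a last point that is far off but also very noisy.

```python
def chi_squared(ys, fits, sigmas):
    """Sum of squared residuals, each weighted by 1 / sigma_i^2."""
    return sum(((y - f) / s) ** 2 for y, f, s in zip(ys, fits, sigmas))


def reduced_chi_squared(ys, fits, sigmas, n_parameters):
    """Chi-squared divided by the degrees of freedom, n - (# of parameters fitted)."""
    nu = len(ys) - n_parameters
    return chi_squared(ys, fits, sigmas) / nu


# Hypothetical data; the last point is far from the line but has a large sigma.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 14.0]
sigmas = [0.1, 0.1, 0.2, 0.2, 4.0]
fits = [2.0 * x for x in xs]  # assumed fitted line y = 2x

unweighted = sum((y - f) ** 2 for y, f in zip(ys, fits))
chi2 = chi_squared(ys, fits, sigmas)
chi2_reduced = reduced_chi_squared(ys, fits, sigmas, n_parameters=1)

print(f"unweighted sum of squares = {unweighted:.2f}")
print(f"chi-squared = {chi2:.2f}, reduced chi-squared = {chi2_reduced:.2f}")
```

In the unweighted sum the noisy point contributes almost everything; in chi-squared every point sits one sigma from the line and counts equally, so the reduced value comes out near unity.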

Why is a fit based on chisquared so special?

Based on chi: these two curves fit equally well!

Based on |chi| (absolute value): these three curves fit equally well!

Based on max(chi): outliers exert too strong an influence!

$\chi^2$ caveats

•Chi-square lower than unity is meaningless…if you trust your $\sigma^2$ estimates in the first place.
•Fitting too many parameters will lower $\chi^2$, but this may be just doing a better and better job of fitting the noise!
•A fit should go smoothly THROUGH the noise, not follow it!
•There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than $\chi^2$. This is done when you have a priori information that the fitted line must be “smooth.”

Achtung! Warning!

This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience….um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics would firm up your knowledge greatly.

AND BUY THOSE BOOKS! YOU WILL NEED THEM!

Cool Excel/Origin Demo
