Data Handling/Statistics
There is no substitute for books: you need professional help!
My personal favorites, from which this lecture is drawn:
•The Cartoon Guide to Statistics, L. Gonick & W. Smith
•Data Reduction and Error Analysis for the Physical Sciences, P. R. Bevington
•Workshop Statistics, A. J. Rossman & B. L. Chance
•Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling
•Origin 6.1 Users Manual, MicroCal Corporation
Outline
•Our motto
•What those books look like
•Stuff you need to be able to look up
  •Samples & Populations
  •Mean, Standard Deviation, Standard Error
  •Probability
  •Random Variables
  •Propagation of Errors
•Stuff you must be able to do on a daily basis
  •Plot
  •Fit
  •Interpret
Our Motto
That which can be taught can be learned.
The “progress” of civilization relies on being able to do more and more things while thinking less and less about them.
An opposing, non-CMC IGERT viewpoint
What those books look like
The Cartoon Guide to Statistics
In this example, the author provides a step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc.
The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.
An Introduction to Error Analysis
A very readable text, but with enough math to be rigorous. The
cover says it all – the book’s emphasis is how statistics and
error analysis are important in the everyday.
Author John Taylor is known as “Mr. Wizard” at Univ. of Colorado, for his popular
science lectures aimed at youngsters.
Bevington
Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.
“Workshop Statistics”
This book has a website full of data that it tells you how to analyze. The test cases are often pretty interesting, too.
Many little shadow boxes provide info.
“Numerical Recipes”
A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). Includes advanced topics like digital filtering, but harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, filter practically anything.
Stuff you need to be able to look up
Samples vs. Populations
The world as we understand it, based on science.
The world as God understands it, based on omniscience.
Statistics is not art but artifice–a bridge to help us understand phenomena, based on limited observations.
Our problem
Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)?
Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?
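Sample View vs. Population View
Average: sample $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population $\mu = E(x)$
Variance: sample $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$; population $\sigma^2$
Standard deviation: sample $s = \sqrt{s^2}$; population $\sigma = \sqrt{\sigma^2}$
Standard error of mean: sample $SEM = \frac{s}{\sqrt{n}}$; population $\sigma(\bar{x}) = \frac{\sigma}{\sqrt{n}}$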
Sample View: direct, experimental, tangible
The single most important thing about this is the reduction in the standard deviation of the mean (the standard error of the mean) according to the inverse square root of n.
$SEM = \frac{s}{\sqrt{n}} \sim \frac{1}{\sqrt{n}}$ (for large $n$)
Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remembered nothing else from this lecture, it would be a success!
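A quick way to convince yourself of the inverse-root-n rule is to simulate it. The sketch below is only an illustration: the "instrument" (true value 100, noise standard deviation 5), the seed and the sample sizes are made-up assumptions, not data from this lecture.

```python
import random
import statistics


def sem(values):
    """Standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical instrument: true value 100, noise standard deviation 5.
small = [random.gauss(100, 5) for _ in range(100)]
large = [random.gauss(100, 5) for _ in range(900)]  # 9 times more data

ratio = sem(small) / sem(large)
print(f"SEM with n=100: {sem(small):.3f}")
print(f"SEM with n=900: {sem(large):.3f}")
print(f"improvement factor: {ratio:.2f} (expect roughly 3)")
```

Nine times the data buys roughly a factor of three in the standard error; the factor is not exactly 3 because s itself is only an estimate.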
Population View: conceptual, layered with arcana!
The purple equation in the table is an expression of the central limit theorem. If we measure many averages, we do not always get the same average:
$\bar{x}$ is itself a random variable!
“If one takes random samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, then (for large $n$) $\bar{x}$ itself approaches a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$” (from “Cartoon...”).
Huh? It means…if you want to estimate $\sigma$, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by $n^{1/2}$. This is hard work!
A lot of times, $\sigma$ is approximated by s.
If you wanted to estimate the population average $\mu$, the best you can do is to measure many averages and average those.
A lot of times, $\mu$ is approximated by $\bar{x}$.
IT’S HARD TO KNOW WHAT GOD DOES.
I think the $\sigma$ in the purple equation should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.
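The "measure many averages" recipe can be tried on a computer, where the hard work is free. This is a minimal sketch assuming a made-up normal population (mean 10, standard deviation 2); the seed, n and the number of averages are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

MU_TRUE = 10.0     # assumed population mean (only "God" knows this in real life)
SIGMA_TRUE = 2.0   # assumed population standard deviation
n = 25             # data points per average
n_averages = 2000  # how many averages we measure

# Each entry is the average of n fresh measurements.
averages = [
    statistics.fmean([random.gauss(MU_TRUE, SIGMA_TRUE) for _ in range(n)])
    for _ in range(n_averages)
]

# Standard deviation of the averages, multiplied by sqrt(n), estimates sigma.
sigma_est = statistics.stdev(averages) * n ** 0.5
# The average of the averages estimates mu.
mu_est = statistics.fmean(averages)

print(f"estimated sigma: {sigma_est:.3f}")
print(f"estimated mu:    {mu_est:.3f}")
```

In a real experiment you rarely get 2000 averages, which is why s and the sample average usually stand in for the population values.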
You got to compromise, fool!
The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student” and his t-distribution is known as Student’s t.
The Student’s t distribution helps us assign confidence in our imperfect experiments on small samples.
Input: desired confidence level, estimate of population mean (or estimated probability), estimated error of the mean (or probability).
Output: ± something
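Here is a sketch of that input/output recipe for a small sample. The five replicate values are invented for illustration, and the critical values are the standard two-sided 95% entries from a Student's t table (hard-coded here to avoid depending on any statistics package).

```python
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def confidence_interval_95(values):
    """Return (mean, half_width) so that the result reads mean ± half_width."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / n ** 0.5  # estimated error of the mean
    half_width = T_95[n - 1] * sem             # n - 1 degrees of freedom
    return mean, half_width


# Hypothetical data: five replicate runs of the same measurement.
runs = [10.1, 9.8, 10.4, 10.0, 9.7]
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f} ± {half_width:.2f} (95% confidence, n = {len(runs)})")
```

With only five runs the multiplier is 2.776 rather than the familiar large-sample value near 2, which is the price of a small sample.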
Probability
…is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is given as the probability of each mass times that mass, summed over all masses. The standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) and P is a probability density. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x − dx/2 and x + dx/2.
Discrete system:
$\mu = \sum_{\mathrm{all}\ x} x F(x)$
$\sigma^2 = \sum_{\mathrm{all}\ x} (x - \mu)^2 F(x)$
Continuous system:
$\mu = \int x P(x)\,dx$
$\sigma^2 = \int (x - \mu)^2 P(x)\,dx$
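For the discrete case, the sums are easy to carry out by hand or in a few lines of code. The three-component mass distribution below is hypothetical, chosen only so the arithmetic comes out round.

```python
# Hypothetical discrete distribution: mass -> normalized frequency F (think mole fraction).
F = {100.0: 0.2, 200.0: 0.5, 300.0: 0.3}

# A normalized frequency must sum to 1.
total = sum(F.values())

# Mean: each mass weighted by its frequency.
mu = sum(x * f for x, f in F.items())

# Variance: each squared deviation from the mean weighted by its frequency.
variance = sum((x - mu) ** 2 * f for x, f in F.items())
sigma = variance ** 0.5

print(f"sum of F = {total:.1f}")
print(f"mean = {mu:.1f}, variance = {variance:.1f}, standard deviation = {sigma:.1f}")
```

The continuous case is the same idea with the sum replaced by an integral over P(x)dx.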
Here’s a normal probability density distribution from “Workshop…”, where you use actual data to discover it.
What it means
Although you don’t usually know the distribution (either $\mu$ or $\sigma$), about 68% of your measurements will fall within 1$\sigma$ of $\mu$….if the distribution is a “normal”, bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample size, with some average, $\bar{x}$, and standard deviation, s (inferior to $\mu$ and $\sigma$, respectively), how far away do we think the true $\mu$ is?
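The 68% figure can be checked directly from the normal cumulative distribution. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `fraction_within` is just an illustrative choice.

```python
from statistics import NormalDist


def fraction_within(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1 = fraction_within(1)
within_2 = fraction_within(2)
print(f"within 1 sigma: {within_1:.4f}")
print(f"within 2 sigma: {within_2:.4f}")
```

These fractions hold only for a normal distribution with known mean and standard deviation; for small samples the t-distribution widens the interval.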
Details
No way I could do it better than “Cartoon…” or “Workshop…”
Remember…this is the part of the lecture entitled “things you must be able to look up.”
Propagation of errors
Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be L and W, and your standard deviations $s_L$ and $s_W$. Now, you want to buy carpet, so you need the area A = L·W.
What is the uncertainty in A due to the measurement errors in L and W?
Answer! There is no telling….but you have several options to estimate it.
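A = L·W example
Here are your measured data:
$L = 30 \pm 1\ \mathrm{ft}$
$W = 19 \pm 2\ \mathrm{ft}$
You can consider “most” and “least” cases:
$A_{max} = W_{max} L_{max} = 20\ \mathrm{ft} \times 31\ \mathrm{ft} = 620\ \mathrm{ft}^2$
$A_{min} = W_{min} L_{min} = 17\ \mathrm{ft} \times 29\ \mathrm{ft} = 493\ \mathrm{ft}^2 \approx 490\ \mathrm{ft}^2$
$A_{average} = \frac{493 + 620}{2}\ \mathrm{ft}^2 \approx 557\ \mathrm{ft}^2$
estimated uncertainty: $\frac{620 - 490}{2}\ \mathrm{ft}^2 = 65\ \mathrm{ft}^2$
reported area: $560\ (\pm 65)\ \mathrm{ft}^2$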
Another way
We can use a formula for how $\sigma$ propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then the variance in y follows this rule:
$\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$
Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.
Translation in our case, where A = L·W:
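$\sigma_A^2 = \left(\frac{\partial A}{\partial L}\right)^2 \sigma_L^2 + \left(\frac{\partial A}{\partial W}\right)^2 \sigma_W^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$
Problem: we don’t know W, L, $\sigma_L$ or $\sigma_W$! These are population numbers we could only get if we had the entire planet measure this particular room. We therefore assume that our measurement set is large enough (n = 30) that we can use our measured averages for W and L and our standard deviations for $\sigma_L$ and $\sigma_W$.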
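$\sigma_A^2 = (19\ \mathrm{ft})^2 (1\ \mathrm{ft})^2 + (30\ \mathrm{ft})^2 (2\ \mathrm{ft})^2 = 3961\ \mathrm{ft}^4$
$\sigma_A = 63\ \mathrm{ft}^2$
So the value to report is:
$A = 19(30) \pm 63\ \mathrm{ft}^2$
or....
$A = (570 \pm 63)\ \mathrm{ft}^2$
Compare this to our empirical, most/least calculation:
$A = (560 \pm 65)\ \mathrm{ft}^2$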
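If you would rather not take the partial derivatives by hand, they can be approximated numerically. The sketch below is one possible way to do it, not the lecture's own code: `propagate` is a hypothetical helper that uses central differences and assumes the errors are independent. It is applied to the room example with L = 30 ± 1 ft and W = 19 ± 2 ft.

```python
import math


def propagate(f, values, sigmas, rel_step=1e-6):
    """First-order propagation of independent errors through f(*values).

    Adds up (partial derivative * sigma)^2 for each input and returns the
    square root, i.e. the estimated standard deviation of f.
    """
    variance = 0.0
    for i, (v, s) in enumerate(zip(values, sigmas)):
        h = abs(v) * rel_step or rel_step      # step size; falls back if v == 0
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        dfdv = (f(*up) - f(*down)) / (2 * h)   # central-difference derivative
        variance += (dfdv * s) ** 2
    return math.sqrt(variance)


def area(length, width):
    return length * width


A = area(30.0, 19.0)
sigma_A = propagate(area, [30.0, 19.0], [1.0, 2.0])
print(f"A = {A:.0f} ± {sigma_A:.0f} ft^2")
```

The same helper works for any smooth function of the measured quantities, which is handy when the algebra gets tedious.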
Error propagation caveats
The equation, $\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$, assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good knowledge a priori about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then $\sigma_N = N^{1/2}$. Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.
Stuff you must know how to do on a daily basis
Plot!!!
[Plot: $\Gamma$/Hz vs. $q^2/10^{10}\ \mathrm{cm}^{-2}$, Larger Particle, 30.9 g/ml, with linear fit]
Parameter   Value         Error
A           -0.00267      44.94619
B           2.25237E-7    8.46749E-10
R           SD            N     P
0.99987     118.8859      21    <0.0001
r = 0.99987, $r^2$ = 0.9997
99.97% of the trend can be explained by the fitted relation.
Intercept = -0.003 ± 45 (i.e., zero!)
The same data
[Plot: $D_{app}$/cm$^2$ s$^{-1}$ vs. $q^2/10^{10}\ \mathrm{cm}^{-2}$, Larger Particle, 30.9 g/ml, with linear fit]
How to find this file! twilight users rcueto e739
Parameter   Value           Error
A           2.2725E-7       7.62107E-10
B           -3.09723E-20    1.43575E-20
R           SD              N     P
-0.44355    2.01583E-9      21    0.044
r = -0.444, $r^2$ = 0.20
Only 20% of the variability in the data can be explained by the line! While $\Gamma$ depended on $q^2$, $D_{app}$ does not!
What does the famous “ r2 ” really tell us?
Suppose you invented a new polymer that you hoped was more stable over time than its predecessor…
So you check.
time    melting point
2       110.2
4       110.9
8       108.8
12      109.1
16      109.0
24      108.5
36      110.0
48      109.2
Question:
What describes the data better:
A simple average (meaning things aren’t really changing over time: it is stable)
OR
A trend (meaning melting point might be dropping over time)?
How well does the mean describe the data?
These are called ‘residuals.’
The sum of the squares of all the residuals characterizes how well the data fit the mean.
$S_t = \sum_i (T_i - T_{mean})^2$ (= 4.6788)
How much better is a fit (i.e., a regression in this case)?
The regression also has residuals.
The sum of their squares is smaller than $S_t$.
$S_r = \sum_i (T_i - T_{i,fit})^2$ (= 4.3079)
The $r^2$ value simply compares the fit to the mean, by comparing the sums of the squares:
$r^2 = \frac{S_t - S_r}{S_t}$
$r^2 = \frac{4.6788 - 4.3079}{4.6788} = 0.0793$
In our case, the fit was NOT a dramatic improvement,
explaining only 7.9% of the variability of the data!
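The melting-point example is small enough to reproduce in a few lines. This sketch fits a straight line by ordinary least squares (the closed-form textbook formulas, written out by hand rather than taken from any package) and compares the two sums of squared residuals; the variable names are illustrative.

```python
# Melting point vs. time, from the table in this example.
time = [2, 4, 8, 12, 16, 24, 36, 48]
melting_point = [110.2, 110.9, 108.8, 109.1, 109.0, 108.5, 110.0, 109.2]

n = len(time)
t_mean = sum(time) / n
T_mean = sum(melting_point) / n

# Ordinary least-squares straight line: T_fit = intercept + slope * t
S_xx = sum((t - t_mean) ** 2 for t in time)
S_xy = sum((t - t_mean) * (T - T_mean) for t, T in zip(time, melting_point))
slope = S_xy / S_xx
intercept = T_mean - slope * t_mean

# S_t: squared residuals about the mean.  S_r: squared residuals about the fit.
S_t = sum((T - T_mean) ** 2 for T in melting_point)
S_r = sum((T - (intercept + slope * t)) ** 2 for t, T in zip(time, melting_point))

r2 = (S_t - S_r) / S_t
print(f"S_t = {S_t:.4f}, S_r = {S_r:.4f}, r^2 = {r2:.4f}")
```

Because the line barely reduces the sum of squares, the simple average describes these data almost as well as the trend does.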
[Plot: $R_h$/nm vs. c/g-ml$^{-1}$ with linear regression; also marked are the range of $R_g$ values observed in MALLS and $(3/5)^{1/2} R_h$]
[6/7/01 13:44 "/Rhapp" (2452067)]
Linear Regression for BigSilk_Ravgnm: Y = A + B * X
Parameter   Value       Error
A           20.88925    0.19213
B           0.01762     0.01105
R           SD          N    P
0.62332     0.28434     6    0.18611
Plot showing 95% confidence limits. Excel doesn’t excel at this!
Interpreting data: Life on the bleeding edge of cutting technology. Or is that bleating edge?
[Plot: $R_g$/nm vs. M; fitted exponent = 0.324 ± 0.04, $d_f$ = 3.12 ± 0.44]
The noise level in individual runs is much less than the run-to-run variation. That’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!
Correlation Caveat!
Correlation ≠ Cause. No, Correlation = Association.
[Chart: Life Expectancy vs. TV's per person; fitted line y = 35.441x + 57.996, $R^2$ = 0.5782]
Country          Life Expectancy   People per TV   TV's per person
Angola           44                200             0.0050
Australia        76.5              2               0.5000
Cambodia         49.5              177             0.0056
Canada           76.5              1.7             0.5882
China            70                8               0.1250
Egypt            60.5              15              0.0667
France           78                2.6             0.3846
Haiti            53.5              234             0.0043
Iraq             67                18              0.0556
Japan            79                1.8             0.5556
Madagascar       52.5              92              0.0109
Mexico           72                6.6             0.1515
Morocco          64.5              21              0.0476
Pakistan         56.5              73              0.0137
Russia           69                3.2             0.3125
South Africa     64                11              0.0909
Sri Lanka        71.5              28              0.0357
Uganda           51                191             0.0052
United Kingdom   76                3               0.3333
United States    75.5              1.3             0.7692
Vietnam          65                29              0.0345
Yemen            50                38              0.0263
58% of life expectancy is associated with TV’s. Would we save lives by sending TV’s to Uganda?
Excel does not automatically provide estimates!
Linearize it!
Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do!
[Chart: Life Expectancy vs. People per TV; fitted line y = -0.1156x + 70.717, $R^2$ = 0.6461]
Linearity is improved by plotting Life vs. people per TV rather than TV’s per person.
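To see how the choice of x variable changes the quality of a straight-line fit, you can fit the same life-expectancy data both ways. The sketch below uses the country data from this lecture; `linear_fit` is a hypothetical helper implementing the ordinary least-squares formulas, and TV's per person is computed as 1 / (people per TV).

```python
# (life expectancy, people per TV) for the 22 countries in the table.
data = [
    (44, 200), (76.5, 2), (49.5, 177), (76.5, 1.7), (70, 8), (60.5, 15),
    (78, 2.6), (53.5, 234), (67, 18), (79, 1.8), (52.5, 92), (72, 6.6),
    (64.5, 21), (56.5, 73), (69, 3.2), (64, 11), (71.5, 28), (51, 191),
    (76, 3), (75.5, 1.3), (65, 29), (50, 38),
]


def linear_fit(x, y):
    """Ordinary least-squares line y = intercept + slope * x, plus r^2."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    r2 = s_xy ** 2 / (s_xx * s_yy)
    return slope, intercept, r2


life = [row[0] for row in data]
people_per_tv = [row[1] for row in data]
tvs_per_person = [1 / p for p in people_per_tv]

slope_tv, intercept_tv, r2_tv = linear_fit(tvs_per_person, life)
slope_people, intercept_people, r2_people = linear_fit(people_per_tv, life)

print(f"Life vs. TV's per person: y = {slope_tv:.3f}x + {intercept_tv:.3f}, r^2 = {r2_tv:.4f}")
print(f"Life vs. people per TV:   y = {slope_people:.4f}x + {intercept_people:.3f}, r^2 = {r2_people:.4f}")
```

Neither version says anything about cause; the comparison only shows that the same data can look more or less linear depending on how you plot them.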
These 4 plots all have the same slopes, intercepts and r values!
Plots are pictures of science, worth thousands of words in boring tables.
From whence do those lines come? Least squares fitting.
“Linear Fits”
The fitted coefficients appear linearly in the expression, e.g.,
$y = a + bx + cx^2 + dx^3$
An analytical “best fit” exists!
“Nonlinear fits”
At least some of the fitted coefficients appear in transcendental arguments, e.g.,
$y = a + b e^{-cx} + d\cos(ex)$
Best fit found by trial & error. Beware false solutions! Try several initial guesses!
CURVE FITTING:
Fit the trend or fit the points?
Earth’s mean annual temp has natural fluctuations year to year.
To capture a long term trend, we don’t want to fit the points, so use a low-order polynomial regression.
BUT,
The bumps and jiggles in the U.S. population data are ‘real.’
We don’t want to lose them in a simple trend.
REGRESSION: We lost the baby boom!
SINGLE POLYNOMIAL: Does funny things (see 1905).
SPLINE: YES: Lots of individual polynomials give us a smooth fit (especially good for interpolation).
All data points are not created equal.
Since that one point has so much error (or noise), should we really worry about minimizing its square? No.
We should minimize “chi squared.”
$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - y_{fit})^2}{\sigma_i^2}$
Goodness of fit parameter that should be unity for a “fit within error”:
$\chi^2_{reduced} = \frac{1}{\nu} \sum_{i=1}^{n} \frac{(y_i - y_{fit})^2}{\sigma_i^2}$
$\nu$ is the # of degrees of freedom: n − # of parameters fitted
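A small sketch shows what weighting by the error bars does. The five data points and their uncertainties are invented for illustration (the last point is deliberately noisy), and `fit_line` is a hypothetical helper using the standard closed-form solution for a straight line that minimizes chi squared.

```python
def fit_line(x, y, sigma):
    """Straight-line fit minimizing chi squared; returns (intercept, slope)."""
    w = [1.0 / s ** 2 for s in sigma]  # weight each point by 1 / sigma^2
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx ** 2
    intercept = (Sxx * Sy - Sx * Sxy) / delta
    slope = (S * Sxy - Sx * Sy) / delta
    return intercept, slope


def reduced_chi_squared(x, y, sigma, intercept, slope, n_params=2):
    """Chi squared divided by the degrees of freedom (n - parameters fitted)."""
    chi2 = sum(((yi - (intercept + slope * xi)) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))
    return chi2 / (len(x) - n_params)


# Hypothetical data: the last point has a huge error bar.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 14.0]
sigma = [0.2, 0.2, 0.2, 0.2, 5.0]

a_w, slope_w = fit_line(x, y, sigma)         # weighted by the real error bars
a_u, slope_u = fit_line(x, y, [1.0] * 5)     # all points treated equally
chi2_red = reduced_chi_squared(x, y, sigma, a_w, slope_w)

print(f"weighted slope:   {slope_w:.3f}")
print(f"unweighted slope: {slope_u:.3f}")
print(f"reduced chi squared of weighted fit: {chi2_red:.2f}")
```

The unweighted fit chases the noisy last point, while the weighted fit lets the four well-measured points set the slope.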
Why is a fit based on chisquared so special?
Based on chi: these two curves fit equally well!
Based on |chi| (absolute value): these three curves fit equally well!
Based on max(chi): outliers exert too strong an influence!
$\chi^2$ caveats
•Chi-square lower than unity is meaningless…if you trust your $\sigma^2$ estimates in the first place.
•Fitting too many parameters will lower $\chi^2$, but this may be just doing a better and better job of fitting the noise!
•A fit should go smoothly THROUGH the noise, not follow it!
•There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than $\chi^2$. This is done when you have a priori information that the fitted line must be “smooth”.
Achtung! Warning!
This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience….um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics would firm up your knowledge greatly.
AND BUY THOSE BOOKS! YOU WILL NEED THEM!
Cool Excel/Origin Demo