Statistics Overview
Biologists say, “If you need to use statistics, you don’t have enough data.”
Engineers say, “If you know enough about statistics, you don’t need much data.”
• Probability distribution
• Problems with statisticians’ notation
• Hypothesis testing
• Regression analysis
• Model fitting
• Outlier rejection
• Data presentation
• Experimental design
Sir Ronald Aylmer Fisher (1890-1962)
Probability Distribution Functions
If I make a measurement of a variable, how do I know how that sample relates to the mean?
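[Figure: a measurement y(x) plotted against x, and its probability distribution p[y(x)] plotted against y(x) with the mean µ marked]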
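Probability density p(x) that a value selected at random from a Gaussian distribution with mean µ and variance σ² will have value x
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
µ is the mean of the distribution, given by
$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$$
σ is the standard deviation of the distribution, given by
$$\sigma = \sqrt{\int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx}$$
Probability P that random variable X will fall between a and b
$$P(a \le X \le b) = \int_a^b p(x)\,dx$$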
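A minimal Python sketch of the Gaussian density and of the probability that X falls between a and b, using only the standard library. The function names are hypothetical, and the cumulative probability is written in terms of the error function rather than integrated numerically.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian probability density at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that X <= x, expressed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(a, b, mu=0.0, sigma=1.0):
    """Probability that X falls between a and b (area under the density)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Probability of landing within one standard deviation of the mean (about 0.683)
within_one_sigma = prob_between(-1.0, 1.0)
```

For µ = 0 and σ = 1, roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two.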
Normal (Gaussian) Probability
[Figure: normal probability density curves for µ = 0, σ = 1; µ = 0, σ = 2; µ = 0, σ = 3; and µ = 4, σ = 1]
Central Limit Theorem
σ² is called the variance
µ and σ² are the first two moments of the PDF (the mean and the second central moment)
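The central limit theorem can be illustrated with a short simulation. This is a sketch under stated assumptions: the draws come from a uniform distribution on [0, 1) (mean 1/2, variance 1/12), and the sample size, number of samples, and seed are arbitrary choices. The sample means should cluster around 1/2 with a standard deviation close to sqrt(1/12)/sqrt(n), even though the underlying distribution is not Gaussian.

```python
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Return the means of n_samples samples, each of n_per_sample uniform draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_samples):
        draws = [rng.random() for _ in range(n_per_sample)]
        means.append(statistics.fmean(draws))
    return means

n = 30
means = sample_means(n_per_sample=n, n_samples=5000)

observed_mean = statistics.fmean(means)      # close to 0.5
observed_spread = statistics.stdev(means)    # close to expected_spread
expected_spread = (1.0 / 12.0) ** 0.5 / n ** 0.5
```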
Other Distributions
Continuous
• Normal (Gaussian), Cauchy, Chi-square, exponential, F, gamma, Laplace, log-normal, Pareto, Student’s t, uniform, Weibull, Beta
• Von Mises distribution: the independent variable θ is an angle, with -π ≤ θ ≤ π
Discrete
• Bernoulli, binomial, discrete uniform, geometric, hypergeometric, negative binomial, Poisson
[Figure: oriented muscle cells]
Statistics Notation
What is written
Random variable x
Probability p(x)
Problem: x is really a dependent variable
How you should read it
Random variable y(x)
Probability p[y(x)]
Probability distributions are valid for one value of the independent variable only
Example
Measure reaction rate constant k at temperatures T1 and T2
Statisticians would write this as measuring a random variable k, then computing the probability p(k)
In reality, the measurements at T1 are of k(T1), so the implied probabilities p[k(T1)] apply only at T = T1
Example
[Figure: births in 2003 by birth weight (lb), United States and Germany. Upper panel: number of births. Lower panel: normalized births, p[w(country, year)]]
Data from data.un.org
Hypothesis Testing
Null Hypothesis H0
Assume that two dependent variables are drawn from distributions with the same mean μ
Test this hypothesis with t-test
• The t-test gives a p value: the probability of observing a difference in sample means at least this large if H0 were true
• If p is small enough, the means are “different” and H0 is rejected
H0 : μ1 = μ2
[Figure: distribution with µ1 = 0, σ1 = 2 compared with distributions having µ2 = 0, σ2 = 1; µ2 = 1, σ2 = 1; and µ2 = -5, σ2 = 1]
Testing the Hypothesis
(Student’s) T-test
Determine whether two sets of data come from “different” distributions
• “Different” = “there exists a statistically significant difference between the two”
• Statistical significance based on p value
– significantly different if p < α
– Usually, α = 0.05
• TTEST() in Excel (T.TEST() in Excel 2010 and later)
• ttest() or ttest2() in MATLAB
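A minimal Python sketch of the pooled-variance (Student's) two-sample t-test, using only the standard library. The function names and the two data sets are hypothetical. The p value is obtained by integrating the t distribution's density with Simpson's rule, which is an illustrative shortcut and not a description of what Excel or MATLAB do internally.

```python
import math
from statistics import mean, variance

def t_pdf(t, dof):
    """Density of Student's t distribution (fine for small dof; math.gamma overflows for very large dof)."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + t * t / dof) ** (-(dof + 1) / 2)

def two_tailed_p(t_stat, dof, steps=2000):
    """Two-tailed p value: 1 minus the area between -|t| and +|t| (Simpson's rule, steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_area = 2 * total * h / 3  # density is symmetric about 0
    return max(0.0, 1.0 - central_area)

def pooled_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-tailed p value), assuming equal variances."""
    na, nb = len(a), len(b)
    dof = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    t_stat = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t_stat, dof, two_tailed_p(t_stat, dof)

# Hypothetical measurements from two conditions
group_1 = [5.1, 4.9, 5.3, 5.0, 5.2]
group_2 = [5.8, 6.1, 5.9, 6.0, 6.2]
t_stat, dof, p_value = pooled_t_test(group_1, group_2)

alpha = 0.05
significantly_different = p_value < alpha
```

With α = 0.05, these two hypothetical groups come out significantly different, so H0 would be rejected.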
ANalysis Of VAriance (ANOVA)
t-test for the case of more than one independent variable (e.g. y(x,t))
Result is again a p value telling you whether the independent variable makes a statistically significant difference in the dependent variables
Available in the Analysis ToolPak (Data Analysis) add-in in Excel
T-Tests
http://www.socialresearchmethods.net/kb/stat_t.php
ANalysis Of VAriance (ANOVA)
See course manual p. 10-11 for a full description
Essentially a t-test for more than two samples
ANOVA table columns: Sum of squares, Mean squares
ANOVA table rows: Total, Error/Residual, Treatment
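The sums of squares and mean squares in a one-way ANOVA table can be sketched in a few lines of Python. This assumes the standard one-way layout (one treatment factor, several groups); the function name, the dictionary keys, and the data are hypothetical. Converting the resulting F statistic into a p value needs the F distribution and is not shown.

```python
def one_way_anova(groups):
    """Return sums of squares, mean squares, and the F statistic for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Treatment: spread of the group means around the grand mean
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Error/Residual: spread of each value around its own group mean
    ss_error = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    # Total: spread of every value around the grand mean
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    df_treatment = len(groups) - 1
    df_error = n_total - len(groups)
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error
    return {
        "ss_treatment": ss_treatment,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "F": ms_treatment / ms_error,
    }

# Hypothetical data: three treatment groups of three measurements each
table = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

The total sum of squares equals the treatment sum of squares plus the error sum of squares, which is a useful check on any hand calculation.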
Regression Analysis
Linear Regression
Fit a line to data containing noise using the least squares method
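• Minimize the sum of squared residuals
$$S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• Model with one independent variable
$$y = \beta_0 + \beta_1 x + \varepsilon$$
• Model with p-1 independent variables
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1} + \varepsilon$$
• Goodness of fit
– Fraction of variance in data which is explained by model
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$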
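A minimal Python sketch of a least squares line fit with one independent variable, using the closed-form solution. The function name and the data are hypothetical; the goodness of fit is computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares (what the fit minimizes) and total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot  # fraction of variance explained by the model
    return slope, intercept, r_squared

# Hypothetical noisy data scattered around y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
slope, intercept, r_squared = fit_line(x, y)
```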
Nonlinear Regression
Fit an arbitrary function to data containing noise, again using least squares method
R² isn’t necessarily a good measure of goodness of fit
• L2 norm (Euclidean distance)
• Relative error in L2 norm
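The two error measures can be sketched as follows. This assumes the L2 norm is taken over the residuals (measured minus predicted) and that the relative error divides it by the L2 norm of the measured values; the function names and numbers are hypothetical.

```python
import math

def l2_norm(values):
    """Euclidean length of a list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def l2_error(measured, predicted):
    """L2 norm of the residuals between data and model."""
    return l2_norm([m - p for m, p in zip(measured, predicted)])

def relative_l2_error(measured, predicted):
    """Residual L2 norm divided by the L2 norm of the measured data."""
    return l2_error(measured, predicted) / l2_norm(measured)

# Hypothetical data and model predictions
measured = [3.0, 4.0]
predicted = [3.0, 3.0]
absolute_error = l2_error(measured, predicted)
relative_error = relative_l2_error(measured, predicted)
```

The relative form is dimensionless, so it can be compared across data sets with different units or magnitudes.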
Regression Analysis
Qualitative Verification
All of these methods assume that the error ε is normally distributed
• Check by looking at plot of residuals
• Residuals should be randomly distributed around axis r = 0
Model Fitting
In Excel
Add trendline – Excel does everything for you
• Only works if you want to use an available function
Goal seek
• Only works for unconstrained, one parameter models
Solver
• Can use for constrained, multiple parameter models
• Uses Quasi-Newton or conjugate gradient method
In MATLAB
Built-in functions
• Root finding by a combination of bisection, secant, and inverse quadratic interpolation (fzero)
• Nelder-Mead simplex (fminsearch)
Optimization toolboxes
• Levenberg-Marquardt/Quasi-Newton (fminunc or fmincon)
• Simulated annealing
• Genetic algorithm (GA)
Curve fitting toolbox
Custom algorithm
All methods work by minimizing some error
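As an illustration of fitting by minimizing an error, the sketch below fits a one-parameter model y = exp(-k t) by minimizing the sum of squared errors with a golden-section search. The model, the search interval, the synthetic data, and the choice of golden-section search are all assumptions made for this example; it is not the algorithm used by Excel's Solver or by the MATLAB functions listed above.

```python
import math

def sum_squared_error(k, t_data, y_data):
    """Error to minimize: squared difference between data and the model exp(-k t)."""
    return sum((y - math.exp(-k * t)) ** 2 for t, y in zip(t_data, y_data))

def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        # Keep the sub-interval that must contain the minimum
        if f(c) < f(d):
            hi = d
        else:
            lo = c
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2.0

# Synthetic, noise-free data generated with k = 0.7
t_data = [0.0, 1.0, 2.0, 3.0, 4.0]
y_data = [math.exp(-0.7 * t) for t in t_data]

k_fit = golden_section_minimize(
    lambda k: sum_squared_error(k, t_data, y_data), 0.0, 5.0
)
```

For clarity this version re-evaluates the error at both interior points on every pass; a production implementation would reuse one of the two values.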
Model Fitting in Excel
Model Fitting in MATLAB
Outlier Rejection
What is an outlier?
An outlier is a data point which disagrees with the other data and cannot be reproduced
Caused by measurement error, incorrect value of independent variable (i.e. user error), noise, chance, or lack of control or understanding of the process
Example:
y(x) = [1.2, 1.3, 5.0, 1.1, 1.2]^T
μ = 1.96; σ = 1.70
When is a point an outlier?
Dixon’s Q Test
• Very simple – just look up a value in a table to see if it’s an outlier
Chauvenet’s Criterion
• Simple, less rigorous
• If p(xi)<1/(2n), throw it out
Grubbs’ Test; Peirce’s Criterion
• Both utilize more rigorous methods
• See paper
Without outlier: μ = 1.20; σ = 0.08
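Chauvenet's criterion can be applied to the example data in a few lines of Python. This sketch assumes that p(xi) means the two-tailed probability of a deviation at least as large as that of xi under a normal distribution, with the mean and the sample standard deviation computed from all the data; the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def chauvenet_outliers(data):
    """Return the points whose two-tailed normal probability is below 1/(2n)."""
    n = len(data)
    mu = mean(data)
    sigma = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    outliers = []
    for value in data:
        z = abs(value - mu) / sigma
        p = math.erfc(z / math.sqrt(2.0))  # P(|Z| >= z) for a standard normal
        if p < 1.0 / (2.0 * n):
            outliers.append(value)
    return outliers

y = [1.2, 1.3, 5.0, 1.1, 1.2]
outliers = chauvenet_outliers(y)
kept = [v for v in y if v not in outliers]

mu_all, sigma_all = mean(y), stdev(y)          # statistics with the outlier included
mu_kept, sigma_kept = mean(kept), stdev(kept)  # statistics after rejecting it
```

Only the value 5.0 is rejected for this data set, and the statistics of the remaining points reproduce the "without outlier" values quoted above.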
What makes a good figure?
Clearly relates independent and dependent variables using axes and trend lines
• Units!!
• Proper scaling
– Use log scales if variable(s) vary over orders of magnitude
Symbols and text are large and different
Resolution is sufficiently high
Error bars (if applicable)
Efficient use of space
Utilizes significant figures appropriately
Compares data with applicable model predictions
Contains enough information to get the point(s) across, but not so much that the message is lost or confused
Captioned such that it is understood without reading the text
Reilly et al., Experimental Eye Research, 2008.
Presentation of Data
Which of these figures is better?
[Figure: example from a journal article which was rejected, annotated with its problems: ambiguous, wasted space, significant figures, fuzzy text, error bars, goodness-of-fit, legend, units]
Presentation of Data
Reilly et al., Biomacromolecules, 2008.
Tiffany and Koretz, International Journal of Biological Molecules, 2002.
Statistical Experimental Design
Design an experiment using statistical methods to minimize the number of data points required to get the desired information.
Analyze an experiment using statistical methods to maximize the information yield from any set of experiments