Statistics Overview - Engineering School Class Web Sites · Statistics Overview Biologists say,...

Statistics Overview

Biologists say, “If you need to use statistics, you don’t have enough data.”

Engineers say, “If you know enough about statistics, you don’t need much data.”

• Probability distribution

• Problems with statisticians’ notation

• Hypothesis testing

• Regression analysis

• Model fitting

• Outlier rejection

• Data presentation

• Experimental design

Sir Ronald Aylmer Fisher (1890-1962)

Probability Distribution Functions

If I make a measurement of a variable, how do I know how that sample relates to the mean?

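Probability p(x) (strictly, a probability density) that a value selected at random from a Gaussian distribution with mean μ and variance σ² will have value x

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$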

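µ is the mean of the distribution, given by

$$\mu = \int_{-\infty}^{\infty} x \, p(x) \, dx$$

σ is the standard deviation of the distribution, given by

$$\sigma = \sqrt{\int_{-\infty}^{\infty} (x-\mu)^2 \, p(x) \, dx}$$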

Probability P that random variable X will fall between a and b, where p(x) is the probability density

$$P(a \le X \le b) = \int_a^b p(x) \, dx$$
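The Gaussian density and the interval probability can be evaluated numerically. The following is a minimal Python sketch, assuming the standard Gaussian density and the closed-form cumulative distribution written with the error function. The function names are illustrative and do not come from the course material.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gaussian_prob_between(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b), the integral of the density from a to b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

if __name__ == "__main__":
    # Probability of falling within one standard deviation of the mean
    print(gaussian_prob_between(-1.0, 1.0))
```

For µ = 0 and σ = 1, the probability of falling between -1 and 1 is about 0.683, the familiar "one sigma" figure.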

[Figure: Normal (Gaussian) probability curves for µ = 0, σ = 1; µ = 0, σ = 2; µ = 0, σ = 3; and µ = 4, σ = 1]

Central Limit Theorem

σ² is called the variance

µ and σ² are the first two moments of the PDF
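The slide gives only the heading for the Central Limit Theorem. As a reminder, the theorem says that the mean of many independent draws from a distribution with finite variance tends toward a normal distribution, with the same mean and with variance equal to the parent variance divided by the number of draws. The sketch below checks the mean and variance numerically. The choice of a uniform parent distribution, the sample sizes, and the seed are assumptions made for illustration.

```python
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Means of n_samples independent samples of n_per_sample uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.random() for _ in range(n_per_sample)) / n_per_sample
        for _ in range(n_samples)
    ]

means = sample_means(n_per_sample=30, n_samples=5000)

# uniform(0, 1) has mean 1/2 and variance 1/12, so the sample means
# should cluster around 0.5 with variance near (1/12) / 30.
print(statistics.mean(means))
print(statistics.variance(means))
```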

Other Distributions

Continuous

• Normal (Gaussian), Cauchy, Chi-square, exponential, F, gamma, Laplace, log-normal, Pareto, Student’s t, uniform, Weibull, Beta

• Von Mises distribution - the independent variable varies from -π ≤ θ ≤ π (i.e. θ is an angle)

Discrete

• Bernoulli, binomial, discrete uniform, geometric, hypergeometric, negative binomial, Poisson

Oriented muscle cells

Statistics Notation

What is written

Random variable x

Probability p(x)

Problem: x is really a dependent variable

How you should read it

Random variable y(x)

Probability p[y(x)]

Probability distributions are valid for one value of the independent variable only

Example

Measure reaction rate constant k at temperatures T1 and T2

Statisticians would note this as measuring random variable k, then compute the probability p(k)

Really, your measurements measured k(T1), so the implied probabilities p[k(T1)] only apply at T=T1

Example

[Figure: two panels with horizontal axis Birth Weight (lb), comparing the United States and Germany. Data from data.un.org]

Hypothesis Testing

Null Hypothesis H0

Assume that two dependent variables are drawn from distributions with the same mean μ

Test this hypothesis with t-test

• t-test gives a p value: the probability of observing a difference in means at least this large if H0 is true

• If p is small, the means are “different,” and H0 is rejected

H0 : μ1 = μ2

[Figure legend: µ1 = 0, σ1 = 2; µ2 = 0, σ2 = 1; µ2 = 1, σ2 = 1; µ2 = -5, σ2 = 1]

Testing the Hypothesis

(Student’s) T-test

Determine whether two sets of data come from “different” distributions

• “Different” = “there exists a statistically significant difference between the two”

• Statistical significance based on p value

– significantly different if p < α

– Usually, α = 0.05

• ttest() in Excel

• ttest() or ttest2() in MATLAB
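For readers without Excel or MATLAB, the test can be sketched in plain Python. This is a sketch under stated assumptions, not what ttest() or ttest2() do internally: it uses Welch's unequal-variance form of the two-sample test, and it gets the two-sided p value by integrating the t density numerically with Simpson's rule. The function names and the sample data are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, dof) for Welch's unequal-variance two-sample t-test."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

def t_pdf(x, dof):
    """Density of Student's t distribution (fine for small-sample dof)."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_sided_p(t, dof, steps=2000):
    """Two-sided p value via Simpson integration of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

# Hypothetical measurements from two conditions
a = [5.1, 4.9, 5.0, 5.2, 4.8]
b = [5.9, 6.1, 6.0, 6.2, 5.8]
t, dof = welch_t(a, b)
p = two_sided_p(t, dof)
print(t, dof, p, p < 0.05)
```

With these made-up samples the means differ by ten standard errors, so p falls far below α = 0.05 and the two sets would be called significantly different.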

ANalysis Of VAriance (ANOVA)

t-test for the case of more than one independent variable (e.g. y(x,t))

Result is again a p value telling you whether the independent variable makes a statistically significant difference in the dependent variables

Available in the Analysis ToolPak (Data Analysis) add-in in Excel

T-Tests

http://www.socialresearchmethods.net/kb/stat_t.php

ANalysis Of VAriance (ANOVA)

See course manual p. 10-11 for a full description

Essentially a t-test for more than two samples

[ANOVA table: rows Treatment and Error/Residual; columns Sum of squares and Mean squares]
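The treatment and error/residual sums of squares and mean squares can be computed directly. The following is a minimal sketch of a one-way ANOVA in Python. The function name, the dictionary keys, and the three small groups are hypothetical. Converting the F ratio to a p value requires the F distribution and is left out.

```python
def one_way_anova(groups):
    """Sums of squares, mean squares, and F ratio for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Treatment: spread of the group means around the grand mean
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error/residual: spread of each point around its own group mean
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_treatment = len(groups) - 1
    df_error = n_total - len(groups)
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error
    return {
        "ss_treatment": ss_treatment,
        "ss_error": ss_error,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "F": ms_treatment / ms_error,
    }

result = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(result)
```

A large F means the variation between groups is large compared with the scatter within groups.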

Regression Analysis

Linear Regression

Fit a line to data containing noise using the least squares method

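• Minimize the sum of squared residuals, $S = \sum_i (y_i - \hat{y}_i)^2$

• Model with one independent variable: $y = \beta_0 + \beta_1 x + \varepsilon$

• Model with p-1 independent variables: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1} + \varepsilon$

• Goodness of fit: $R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

– Fraction of variance in data which is explained by model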
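For the one-variable model, the least squares line has a closed-form solution. The sketch below computes the slope, the intercept, and the fraction of variance explained. The function name and the sample points are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Sum of squared residuals and total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

slope, intercept, r_squared = fit_line([0, 1, 2], [0, 1, 1])
print(slope, intercept, r_squared)
```

For these three points the fit is y = 1/6 + 0.5x, and the line explains 75% of the variance in y.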

Nonlinear Regression

Fit an arbitrary function to data containing noise, again using least squares method

R² isn’t necessarily a good measure of goodness of fit

• L2 norm (Euclidean distance)

• Relative error in L2 norm
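Both error measures are short to compute once a model's predictions are in hand. This is a minimal sketch. Normalizing the relative error by the norm of the measured data is an assumption, since the slide's formula did not survive extraction, and the function names are illustrative.

```python
import math

def l2_norm(values):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in values))

def l2_error(measured, predicted):
    """L2 norm of the residual vector (measured minus predicted)."""
    return l2_norm([m - p for m, p in zip(measured, predicted)])

def relative_l2_error(measured, predicted):
    """L2 error divided by the L2 norm of the measured data (assumed normalization)."""
    return l2_error(measured, predicted) / l2_norm(measured)

print(l2_error([3.0, 4.0], [3.0, 1.0]))           # residual is (0, 3)
print(relative_l2_error([3.0, 4.0], [3.0, 1.0]))  # 3 / 5
```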

Regression Analysis

Qualitative Verification

All of these methods assume that the error ε is normally distributed

• Check by looking at plot of residuals

• Residuals should be randomly distributed around axis r = 0


Model Fitting

In Excel

Add trendline – Excel does everything for you

• Only works if you want to use an available function

Goal seek

• Only works for unconstrained, one parameter models

Solver

• Can use for constrained, multiple parameter models

• Uses Quasi-Newton or conjugate gradient method

In MATLAB

Built-in functions

• Root finding by bisection, secant, and inverse quadratic interpolation (fzero)

• Nelder-Mead simplex (fminsearch)

Optimization toolboxes

• Levenberg-Marquardt/Quasi-Newton (fminunc or fmincon)

• Simulated annealing

• Genetic algorithm (GA)

Curve fitting toolbox

Custom algorithm

All methods work by minimizing some error
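To illustrate fitting by error minimization, the sketch below fits a single decay constant k in the model y = exp(-k t) by minimizing the sum of squared errors. It uses a golden-section search, a simple one-parameter method chosen here for brevity. It is not one of the Excel or MATLAB algorithms listed above, and the model, data, and function names are hypothetical.

```python
import math

def sum_squared_error(k, t, y):
    """Error to minimize: squared misfit of the model y = exp(-k * t)."""
    return sum((yi - math.exp(-k * ti)) ** 2 for ti, yi in zip(t, y))

def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)  # lower interior point
        d = lo + inv_phi * (hi - lo)  # upper interior point
        if f(c) < f(d):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
    return (lo + hi) / 2.0

# Noise-free synthetic data generated with k = 0.5
t_data = [0.0, 1.0, 2.0, 3.0, 4.0]
y_data = [math.exp(-0.5 * ti) for ti in t_data]

k_fit = golden_section_minimize(lambda k: sum_squared_error(k, t_data, y_data), 0.0, 5.0)
print(k_fit)
```

Multi-parameter problems need one of the methods named above, but the structure is the same: define an error function, then hand it to a minimizer.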

Model Fitting in Excel

Model Fitting in MATLAB

Outlier Rejection

What is an outlier?

An outlier is a data point which disagrees with the other data and cannot be reproduced

Caused by measurement error, incorrect value of independent variable (i.e. user error), noise, chance, or lack of control or understanding of the process

Example:

y(x)=[1.2, 1.3, 5.0, 1.1, 1.2]T

μ = 1.96; σ = 1.70

When is a point an outlier?

Dixon’s Q Test

• Very simple – just look up a value in a table to see if it’s an outlier

Chauvenet’s Criterion

• Simple, less rigorous

• If p(x_i) < 1/(2n), throw it out

Grubbs’ Test; Peirce’s Criterion

• Both utilize more rigorous methods

• See paper

Without outlier: μ = 1.20; σ = 0.08
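Chauvenet's criterion can be applied to the example data above in a few lines. This sketch assumes that p(x_i) is the two-sided normal tail probability of a deviation at least as large as that of x_i, computed from the sample mean and sample standard deviation. The function name is illustrative.

```python
import math
import statistics

def chauvenet_outliers(data):
    """Return the points whose two-sided normal tail probability is below 1/(2n)."""
    n = len(data)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation
    flagged = []
    for x in data:
        z = abs(x - mu) / sigma
        p = math.erfc(z / math.sqrt(2.0))  # P(|Z| >= z) for a standard normal
        if p < 1.0 / (2 * n):
            flagged.append(x)
    return flagged

y = [1.2, 1.3, 5.0, 1.1, 1.2]
outliers = chauvenet_outliers(y)
kept = [v for v in y if v not in outliers]

print(outliers)
print(round(statistics.mean(y), 2), round(statistics.stdev(y), 2))
print(round(statistics.mean(kept), 2), round(statistics.stdev(kept), 2))
```

Only the value 5.0 falls below the 1/(2n) = 0.1 threshold. Removing it moves the mean from 1.96 to 1.20 and the standard deviation from 1.70 to 0.08, matching the figures quoted above.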

What makes a good figure?

Clearly relates independent and dependent variables using axes and trend

• Units!!

• Proper scaling

– Use log scales if variable(s) vary over orders of magnitude

Symbols and text are large and different

Resolution is sufficiently high

Error bars (if applicable)

Efficient use of space

Utilizes significant figures appropriately

Compares data with applicable model predictions

Contains enough information to get the point(s) across, but not so much that the message is lost or confused

Captioned such that it is understood without reading the text

Reilly et al., Experimental Eye Research, 2008.

Presentation of Data

Which of these figures is better?

[Figure annotations: ambiguous; significant figures; fuzzy text; error bars; goodness-of-fit; legend]

From a journal article which was rejected.

Presentation of Data

Reilly et al., Biomacromolecules, 2008.

Tiffany and Koretz, International Journal of Biological Molecules, 2002.

Statistical Experimental Design

Design an experiment using statistical methods to minimize the number of data points required to get the desired information.

Analyze an experiment using statistical methods to maximize the information yield from any set of experiments
