
Course in Statistics and Data analysis

Course B, September 2009

Stephan Frickenhaus

Outline theses

My experience is:

Many young researchers lack knowledge of analysis tools. Producing or sampling data is not the problem, but analysing it becomes a problem right before publication.

Once appropriate tools are known (and Excel is not appropriate for analysis), knowledge of methods and concepts may still be missing.

This course tries to tackle both …

Schedule

Day 1: 8.9., 10:00 - 16:00, Room E4005
The probability distribution, the p-value concept, statistical tests in R

Day 2: 9.9., 10:00 - 16:00, Room E4005
Multivariate analysis, correlation tests, ANOVA, ordination with factors and environmental data, cluster analysis (maybe as start of Day 3)

Day 3: 10.9., 10:00 - 16:00, Glaskasten F
User-driven, interactive: bring your project data and we work on it

Contents / Setup

• Tool-based (program „R“) course
– Install „R“ from www.r-project.org

• Exploring data analysis
– Graphically
– Numerically

• Exploring what significance really is
– Statistical tests no longer as black boxes

http://www.r-project.org/

DAY1 – Lecture part I

• Each type of data calls for different methods of analysis. Give examples!

Data

Type of data: examples
Numerical (metric) data: linear (length in cm), circular (angle in degrees)
Nominal (class) data: sex, colour, species
Ordinal (ranked) data: age group, school class, phase in cell division

First steps from data …

• Metric data (linear: length in cm; circular: angle in degrees): plot in a co-ordinate system (scatter plot), histogram, boxplot

• Nominal data (sex, colour, species): count in a table, barplot, pie chart

• Ordinal data (age group, school class, phase in cell division): count in a table, with an axis, barplot
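The first steps listed on this slide can be tried directly in R. The following is a minimal sketch; the data values and the variable names `len_cm`, `sex` and `age_group` are made up for illustration and are not part of the course material.

```r
# Hypothetical example data (assumed values, for illustration only)
len_cm    <- c(12.1, 13.4, 11.8, 15.2, 14.0, 12.9, 13.7, 16.1)   # metric
sex       <- factor(c("f", "m", "m", "f", "f", "m", "f", "f"))    # nominal
age_group <- factor(c("young", "adult", "adult", "old",
                      "young", "adult", "old", "adult"),
                    levels = c("young", "adult", "old"),
                    ordered = TRUE)                                # ordinal

# Metric data: look at the distribution
hist(len_cm)
boxplot(len_cm)

# Nominal data: count in a table, then plot the counts
sex_counts <- table(sex)
barplot(sex_counts)
pie(sex_counts)

# Ordinal data: the counts keep the order of the levels along the axis
age_counts <- table(age_group)
barplot(age_counts)
```

Declaring the ordinal variable as an ordered factor is what keeps the bars in the intended order rather than in alphabetical order.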

… to methods

• Metric: plot in a co-ordinate system (scatter plot), histogram, boxplot; check for groups, trends, correlations

• Nominal: count in a table, barplot, pie chart; check for differences, ratios

• Ordinal: count in a table, with an axis, barplot; check for differences, ratios, relation to order

…to combinations of data

• Metric with metric: X-Y plots

• Metric with metric, plus a class: X-Y plot with colours = class (class = colour in scatter plot)

• Check for groups/clusters

…towards models: multivariate data

• Organize data in tables

• Keep data of same measurement in ONE row

• Distinguish groups in extra column by nominal data

• Before discussing what we can do with such a table, let's take our first steps in the tool R!
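As a first step in R, the table layout described above (one row per measurement, groups in an extra nominal column) maps onto a data frame. This is only a sketch: the column names `length_cm`, `weight_g` and `species` and all values are hypothetical.

```r
# Hypothetical measurements: one row per measured individual
d <- data.frame(
  length_cm = c(10.2, 11.5, 9.8, 14.1, 15.3, 13.9),    # metric
  weight_g  = c(20.1, 23.4, 19.0, 35.2, 38.8, 34.0),   # metric
  species   = factor(c("A", "A", "A", "B", "B", "B"))  # group column (nominal)
)

str(d)       # structure: column types and first values
summary(d)   # numerical summary per column

# X-Y plot of two metric columns, colour = class
plot(d$length_cm, d$weight_g,
     col = as.integer(d$species), pch = 19,
     xlab = "length (cm)", ylab = "weight (g)")
legend("topleft", legend = levels(d$species),
       col = seq_along(levels(d$species)), pch = 19)

# Mean of each metric column per group
group_means <- aggregate(cbind(length_cm, weight_g) ~ species,
                         data = d, FUN = mean)
group_means
```

Keeping the group label in its own column, instead of using one column per group, is what allows the colouring and the group-wise summary to be written in one line each.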

Start Practice with R

www.r-project.org

http://ftp5.gwdg.de/pub/misc/cran/

Lecture part II

• What if the summary of data is not enough? E.g., we want to say whether an observed mean value is probably greater than 0.5.

• It is not enough to conclude „We clearly find mean(x)<mean(y)“, because this outcome may be due to small sample sizes; in reality the means may be equal and there may be no effect at all.

• We must define some terms to learn how to be more quantitative about such statements, like „with 1% error we can exclude that x and y are from the same population“.

Some terms…

• Population:
– All individuals of the kind measured
– If we measure them all, we know exactly the mean value etc., the true mean
– Sometimes we do not have it accessible
– Sometimes we think it has infinitely many individuals

• Sample:
– A subset of individuals from a population
– It has, e.g., a sample mean that is in general not equal to the true mean (the mean of the population)
– Sample size: number of individuals picked

…more terms, for real-valued variables X

Probability density function p(x): its integral over the interval [a,b] is the probability to pick samples x_i from X in that interval:

$$\int_a^b p(X)\,dX = \Pr(a \le x \le b)$$

Cumulative distribution function cdf(x): the probability to pick an x below a:

$$\mathrm{cdf}(a) = \int_{-\infty}^{a} p(X)\,dX = \Pr(x \le a)$$
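The relation between density and cumulative distribution can be checked numerically in R. The sketch below assumes the standard normal distribution as an example for p(x); the interval bounds `a` and `b` are arbitrary illustrative values.

```r
# Assumption for illustration: X is standard normal, so p(x) = dnorm(x)
a <- -1
b <- 2

# Pr(a <= x <= b) as the integral of the density over [a, b]
area <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
via_cdf <- pnorm(b) - pnorm(a)

# cdf(a) as the integral of the density from -Inf up to a
cdf_a <- integrate(dnorm, lower = -Inf, upper = a)$value

c(area = area, via_cdf = via_cdf, cdf_a = cdf_a, pnorm_a = pnorm(a))
```

In R, the prefix `d` denotes a density and `p` the corresponding cumulative distribution function, so `dnorm`/`pnorm` play the roles of p(x) and cdf(x) here.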

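p(x): probability density function

[Figure: curve p(x) over x; the area under the curve between a and b is $\Pr(a \le x \le b)$.]

$p(x) \ge 0$, $\int_{-\infty}^{\infty} p(x)\,dx = 1$

Need not be symmetric!

Full range of X makes 100%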

Cumulative distribution function

[Figure: curve cdf(x) over x, rising from 0 at min(X) to 1 at max(X).]

cdf starts from 0 at the minimal possible value of X and reaches 1 at the maximal possible value of X. There, p drops to 0.

cdf is monotonically increasing, because it integrates a p≥0.

Mean E and Standard deviation S

[Figure: curve p(x) over x with E(X) marked.]

E(X) need not be at the maximum of p(x).

S(X) measures the width of p(x), i.e., the scattering of x around E(X).

Long-tail distributions

[Figure: long-tailed curve p(x) over x.]

Some rare samples will have very large values x!

When we have only a few samples, we may pick none of these rare values!
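How likely it is to miss the rare values can be made concrete. If a value counts as "rare" when it occurs with probability p_rare, then n independent samples contain none of them with probability (1 - p_rare)^n. The sketch below assumes p_rare = 0.01 and uses the log-normal distribution as a hypothetical long-tailed example; neither choice comes from the slides.

```r
# Assumption: "rare" means the largest 1% of the population
p_rare <- 0.01
n      <- c(5, 20, 100)

# Probability that a sample of size n contains no rare value at all
p_none <- (1 - p_rare)^n
data.frame(n = n, p_none = p_none)

# Check by simulation for n = 5, with a log-normal population (assumed)
set.seed(42)
threshold <- qlnorm(1 - p_rare)   # values above this are "rare"
sim_none  <- mean(replicate(10000, all(rlnorm(5) <= threshold)))
sim_none
```

With small samples, the long tail is therefore mostly invisible, and summaries such as the sample mean can be far from the population value.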

What is a statistical test?

• Example: We have a sample x of size 6.

• How probable is it that the mean of the sample x is between 2 and 2.5, although E(X)=0?

• To answer this:
– 1) we repeatedly take samples of size 6 and count how often this happens (may be too expensive), or
– 2) we need an assumption about the probability density of X and then integrate a statistical distribution of mean(x) to measure Pr(2<mean(x)<2.5). (LATER: Can I check what the pdf of X is?)
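Both routes can be sketched in R. The slides do not state the distribution of X, so the example assumes a normal population with E(X)=0 and a hypothetical standard deviation `sigma = 3`. Route 1 is replaced by a computer simulation of repeated sampling; route 2 uses the fact that the mean of n normal values is again normal with standard deviation sigma/sqrt(n).

```r
set.seed(123)
n     <- 6    # sample size from the example
sigma <- 3    # hypothetical sd of X (assumption, not from the slides)

# Route 1: repeat the sampling many times and count
means <- replicate(100000, mean(rnorm(n, mean = 0, sd = sigma)))
p_sim <- mean(means > 2 & means < 2.5)

# Route 2: integrate the assumed distribution of mean(x)
se      <- sigma / sqrt(n)
p_exact <- pnorm(2.5, mean = 0, sd = se) - pnorm(2, mean = 0, sd = se)

c(simulated = p_sim, integrated = p_exact)
```

The two numbers should agree closely; the simulation is cheap on a computer, whereas repeating a real experiment this often usually is not.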

…influence of sample size on the mean

• Repeat a sampling from X with sd(X)=1.0 at different sizes N
• Take sample means
• How do repeated means vary (standard deviation)?
• Result: for high N, sd(mean) goes as $\mathrm{sd}(X)/\sqrt{N}$ (central limit theorem)

How for low N? This is described by the t-statistic

t = mean(x)/(sd(x)/sqrt(N)),

whose distribution depends on the sample size N.
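The experiment described above can be run as a small simulation. This is a sketch assuming a normal population with mean 0 and sd(X)=1.0; the set of sample sizes and the number of repetitions are arbitrary choices.

```r
set.seed(1)
sizes <- c(2, 5, 10, 50, 100)   # sample sizes N to compare (assumed)
reps  <- 5000                   # repetitions per sample size (assumed)

# For each N: draw many samples, take their means, measure the sd of the means
sd_means <- sapply(sizes, function(n) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = 1))))
})

# Compare with sd(X)/sqrt(N)
data.frame(N = sizes, simulated = sd_means, theory = 1 / sqrt(sizes))
```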

A first test: test the influence of sample size

• How do I know how many samples I need to make a correct statement about the mean, like E(X)≥0.89?

• „Correct“ is to be quantified as the „type-I error“: how probable is it that I see the same or a more extreme value by chance alone, i.e., although the population mean is 0?

Concept of the Null-Hypothesis

How sure can I be in excluding that the population mean is zero when I find a sample mean of m=0.89?

So, we evaluate how probable such an outcome is when X follows a certain pdf(X), e.g., the normal distribution, with E(X)=0.

To evaluate this Pr, we need a test statistic t for it and a distribution pdf(t) to integrate for Pr.

T-statistics

• T has a complicated mathematical form; its graph is similar to a bell-shaped curve.

• For small sample size N it has longer tails (green curve in the figure).

[Figure: pdf of T; the blue area is Pr(T<3), the remaining area is Pr(T>=3).]
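The longer tails for small samples can be seen by comparing the tail probability Pr(T>=3) for several degrees of freedom with the corresponding tail of the normal distribution. A minimal sketch; the chosen degrees of freedom are arbitrary illustrative values.

```r
# Degrees of freedom to compare (assumed values); df = sample size - 1
df <- c(1, 2, 5, 30)

# Upper tail Pr(T >= 3) of the t-distribution for each df
tail_t <- 1 - pt(3, df = df)

# Same tail for the standard normal distribution
tail_norm <- 1 - pnorm(3)

data.frame(df = df, tail_t = tail_t, tail_norm = tail_norm)
```

The tail shrinks as df grows and approaches the normal value, which is why small samples need much larger t-values to reach the same error level.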

T is known in R

Test for sample x=c(1,2)

Pr(t<3), for n=2

Why the upper boundary 3?

t = mean(x)/sd(x)*sqrt(2) = 1.5/0.707*1.414 = 3.0

$$\Pr(t < 3) = \int_{-\infty}^{3} \mathrm{dt}(t', \mathrm{df}=1)\,dt' = \mathrm{pt}(3, \mathrm{df}=1)$$

where df = sample size - 1.

So, ~90 of 100 repeated samples will give a t-value below 3.

1-pt(3,df=1) = 0.1024164

is the chance of a t-value of 3 or more, as obtained here from mean(x)=1.5 (remember, N=2), under the assumption that x is drawn from a population with mean 0!
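The numbers on this slide can be reproduced step by step in R. The variable names `t_obs` and `p_upper` are chosen here for illustration only.

```r
x <- c(1, 2)
n <- length(x)

# t-statistic for the null value 0: mean divided by its estimated standard error
t_obs <- mean(x) / (sd(x) / sqrt(n))
t_obs

# Pr(t < 3) from the cumulative t-distribution with df = n - 1
pt(t_obs, df = n - 1)

# Chance of a t-value at least this large if the population mean is 0
p_upper <- 1 - pt(t_obs, df = n - 1)
p_upper
```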

Now the test itself:

• We have a sample of size 2

The Null-Hypothesis

Our sample is from a population with mean 0.

The test that checks this is in R…
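A sketch of how this looks in R, assuming the test meant here is the one-sample t-test provided by `t.test`. The one-sided call corresponds to the upper-tail probability computed by hand for x=c(1,2); the default call is two-sided and counts both tails.

```r
x <- c(1, 2)

# One-sided test of the Null-Hypothesis "population mean is 0"
res_one <- t.test(x, mu = 0, alternative = "greater")
res_one$statistic   # t-value
res_one$parameter   # degrees of freedom (sample size - 1)
res_one$p.value     # upper-tail probability

# Default: two-sided test, counting extreme values in both tails
res_two <- t.test(x, mu = 0)
res_two$p.value
```

With only two observations the p-value stays well above common thresholds such as 0.05, so the Null-Hypothesis cannot be rejected despite a sample mean of 1.5.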
