Course in Statistics and Data Analysis
Course B, September 2009
Stephan Frickenhaus
Outline theses
My experience is:
Many young researchers lack knowledge of analysis tools, so producing/sampling data is not the problem, but analysing them becomes a problem right before publication.
Once appropriate tools are known (and Excel is not appropriate for analysis), knowledge of methods/concepts may still be missing.
This course tries to tackle both …
Schedule
Day 1: 8.9., 10:00 - 16:00, Room E4005
The probability distribution, the p-value concept, statistical tests in R
Day 2: 9.9., 10:00 - 16:00, Room E4005
Multivariate analysis, correlation tests, ANOVA, ordination with factors and environmental data, cluster analysis (maybe as start of Day 3)
Day 3: 10.9., 10:00 - 16:00, Glaskasten F
User-driven, interactive: bring your project data and we work on it
Contents / Setup
• Tool-based course (program „R“)
– Install „R“ from www.r-project.org
• Exploring data analysis
– Graphically
– Numerically
• Exploring what significance really is
– Statistical tests no longer as black boxes
DAY1 – Lecture part I
• Each type of data calls for different methods of analysis. Give examples!
Data
type of data | examples
Numerical (metric) data | Linear: length in cm; circular: angle in degrees
Nominal (class) data | Sex, colour, species
Ordinal (ranked) data | Age group, school class, phase in cell division
First steps from data …
• Numerical (e.g. length in cm, angle in degrees): plot in a co-ordinate system (scatter plot), histogram, boxplot
• Nominal (e.g. sex, colour, species): count in a table, barplot, pie chart
• Ordinal (e.g. age group, school class, phase in cell division): count in a table, with an axis, barplot
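The first steps for the three types of data can be tried in R roughly as follows. This is a minimal sketch; the vectors `len`, `sex` and `age` are made-up illustration data, not course data.

```r
# Hypothetical example data (made up for illustration)
len <- c(12.1, 13.4, 11.8, 15.2, 14.7, 12.9, 13.3, 16.0)          # numerical (metric)
sex <- factor(c("f", "m", "m", "f", "f", "m", "f", "f"))          # nominal
age <- factor(c("young", "old", "mid", "young", "mid", "mid", "old", "young"),
              levels = c("young", "mid", "old"), ordered = TRUE)  # ordinal

# Numerical data: histogram and boxplot
hist(len)
boxplot(len)

# Nominal data: count in a table, then barplot or pie chart
sex_counts <- table(sex)
barplot(sex_counts)
pie(sex_counts)

# Ordinal data: count in a table; the bars follow the order of the levels
age_counts <- table(age)
barplot(age_counts)
```

Declaring the ordinal variable as an ordered factor is what keeps "young", "mid", "old" in their natural order on the axis instead of alphabetical order.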
… to methods
• Metric: plot in a co-ordinate system (scatter plot), histogram, boxplot; then check for groups, trends, correlations
• Nominal: count in a table, barplot, pie chart; then check for differences, ratios
• Ordinal: count in a table, with an axis, barplot; then check for differences, ratios, relation to order
…to combinations of data
• X-Y plots
[Figure: grid of data-type combinations, axes labelled metric, nominal, ordinal]
• Metric vs. metric: X-Y plot with colours = class
• Class = colour in scatter plot
• Check for groups/clusters
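A scatter plot of two metric variables with the class shown as colour might look like this in R. The sketch uses the `iris` data set that ships with R purely as a stand-in; the choice of columns is an assumption for illustration.

```r
# iris is a built-in R data set: four metric columns plus the nominal column Species
cls  <- iris$Species
cols <- as.integer(cls)            # one colour number per class

# X-Y plot of two metric variables, colour = class
plot(iris$Sepal.Length, iris$Petal.Length,
     col = cols, xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
legend("topleft", legend = levels(cls),
       col = seq_along(levels(cls)), pch = 1)
```

Groups or clusters then show up as separate clouds of points with the same colour.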
…towards models: multivariate data
• Organize data in tables
• Keep data of same measurement in ONE row
• Distinguish groups in an extra column by nominal data
• Before discussing what we can do with such a table, let's take first steps in the tool R!
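Such a table corresponds to a data frame in R. The following is a minimal sketch with invented values; the column names `length_cm`, `weight_g` and `site` are hypothetical.

```r
# Hypothetical measurements: all data of one measured individual in ONE row
d <- data.frame(
  length_cm = c(12.1, 13.4, 11.8, 15.2, 14.7, 12.9),
  weight_g  = c(30.5, 34.1, 29.9, 41.0, 39.2, 33.0),
  site      = factor(c("A", "A", "A", "B", "B", "B"))  # nominal grouping column
)

str(d)       # structure: column types and first values
summary(d)   # numerical summary per column

# Group-wise means, using the extra nominal column to distinguish groups
group_means <- aggregate(length_cm ~ site, data = d, FUN = mean)
print(group_means)
```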
Start Practice with R
www.r-project.org
http://ftp5.gwdg.de/pub/misc/cran/
Lecture part II
• What if the summary of data is not enough? E.g., we want to say whether an observed mean value is probably greater than 0.5.
• It is not enough to conclude „We clearly find mean(x)<mean(y)“, because this may be an outcome due to small sample sizes; in reality the means may be equal, and there may be no effect at all.
• We must define some terms to learn how to be more quantitative about such statements, like „with 1% error we can exclude that x and y are from the same population“
Some terms…
• Population:
– All individuals of the kind measured
– If we measure them all, we know exactly the mean value etc., the true mean
– Sometimes we do not have it accessible
– Sometimes we think it has infinitely many individuals
• Sample:
– A subset of individuals from a population
– It has, e.g., a sample mean that is in general not equal to the true mean (the mean of the population)
– Sample size: number of individuals picked
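The difference between the true mean and a sample mean can be seen in a small simulation. This is only a sketch: the "population" is an artificial set of normally distributed numbers, and the seed, mean, sd and sizes are arbitrary assumptions.

```r
set.seed(1)                                      # arbitrary seed, for reproducibility
population <- rnorm(100000, mean = 10, sd = 2)   # hypothetical finite population
true_mean  <- mean(population)                   # known only because we "measured" all

x <- sample(population, size = 20)               # a sample: subset of the population
sample_mean <- mean(x)

c(true_mean = true_mean, sample_mean = sample_mean, sample_size = length(x))
```

Running the last three lines repeatedly without resetting the seed gives a different sample mean each time, while the true mean stays fixed.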
…more terms, for real-valued variables X
Probability density function p(x): gives the probability to pick samples x_i from X in the interval [a,b]:
$\Pr(a \le x \le b) = \int_a^b p(X)\,dX$
Cumulative distribution function cdf(x): the probability to pick an x below a:
$\mathrm{cdf}(a) = \int_{-\infty}^{a} p(X)\,dX = \Pr(x \le a)$
[Figure: p(x), prob. density function, plotted over x; the area under the curve between a and b is Pr(a ≤ x ≤ b)]
p(x) ≥ 0 and $\int_{-\infty}^{\infty} p(x)\,dx = 1$
Need not be symmetric!
Full range of X makes 100%
[Figure: cdf(x), cumulative distr. function, plotted over x from min(X) to max(X), rising from 0 to 1]
cdf starts from 0 at the minimal possible value of X and reaches 1 at the maximal possible value of X. There p drops to 0.
cdf is monotonically increasing, because it integrates a p ≥ 0.
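These relations can be checked numerically in R. The sketch below assumes, purely for illustration, that X follows the standard normal distribution (`dnorm` is its density, `pnorm` its cdf); the interval limits `a` and `b` are arbitrary.

```r
a <- -1
b <- 2

# Pr(a <= x <= b) as the integral of the density p(x) from a to b
pr_ab_integral <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cdf: cdf(b) - cdf(a)
pr_ab_cdf <- pnorm(b) - pnorm(a)

# The full range of X makes 100%
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# The cdf never decreases, because it integrates a p >= 0
xs <- seq(-4, 4, by = 0.1)
cdf_increasing <- all(diff(pnorm(xs)) >= 0)

c(integral = pr_ab_integral, from_cdf = pr_ab_cdf, total = total)
```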
Mean E and Standard deviation S
[Figure: p(x) plotted over x, with E(X) marked]
E(X) need not be at the maximum of p(x).
S(X) measures the width of p(x), i.e., the scattering of x around E(X).
Long-tail distributions
[Figure: p(x) plotted over x, with a long tail towards large x]
Some rare samples will have very large values x!
When we have few samples, we may pick none of these rare values!
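A small simulation illustrates the effect. The log-normal distribution is used here only as one example of a long-tailed distribution; its parameters, the seed and the "top 1%" threshold are assumptions made for this sketch.

```r
set.seed(2)                                       # arbitrary seed
big <- rlnorm(100000, meanlog = 0, sdlog = 1.5)   # many draws from a long-tailed distribution

# The rare large values pull the mean far above the median
c(mean = mean(big), median = median(big))

# Call the largest 1% of values "rare"
threshold <- quantile(big, 0.99)

# A small sample of 10 often contains none of them
small <- rlnorm(10, meanlog = 0, sdlog = 1.5)
sum(small > threshold)

# Probability that 10 independent draws all stay below the top 1%
p_none <- 0.99^10
p_none
```

With these assumptions a sample of size 10 misses the rare large values in about 90% of all cases, so its mean tends to underestimate E(X).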
What is a statistical test?
• Example: We have a sample x of size 6.
• How probable is it that the mean of the sample x is between 2 and 2.5, although E(X)=0?
• To answer this, either:
– 1) we repeat taking samples of size 6 many times and count how often this happens (may be too expensive), or
– 2) we need an assumption about the probability density of X and then integrate the distribution of the statistic mean(x) to measure Pr(2<mean(x)<2.5)
LATER: Can I check what the pdf of X is?
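Both routes can be sketched in R. The sketch assumes that X is normally distributed with mean 0 and a known standard deviation of 1; this assumption, the seed and the number of repetitions are not from the slides.

```r
set.seed(3)       # arbitrary seed
n     <- 6        # sample size
n_rep <- 100000   # number of repeated samples

# Route 1: repeat the sampling many times and count
samples <- matrix(rnorm(n_rep * n, mean = 0, sd = 1), ncol = n)  # one sample per row
means   <- rowMeans(samples)
p_sim   <- mean(means > 2 & means < 2.5)

# Route 2: use the assumed distribution. The mean of n independent N(0,1)
# values is again normal, with standard deviation 1/sqrt(n)
p_exact <- pnorm(2.5, mean = 0, sd = 1 / sqrt(n)) - pnorm(2, mean = 0, sd = 1 / sqrt(n))

c(simulated = p_sim, integrated = p_exact)
```

Under these assumptions the event is so rare that even 100000 repetitions will usually not contain a single case, which is one way in which the counting route becomes too expensive.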
…influence of sample size on the mean
• Repeat a sampling from X with sd(X)=1.0 at different sizes N
• Take sample means
• How do repeated means vary (standard deviation)?
• Result…
• For high N, sd(mean) goes as $\mathrm{sd}(X)/\sqrt{N}$ (central limit theorem)
How about low N???
It is given by the t-statistic
t = mean(x)/(sd(x)/sqrt(N)),
which depends on sample size N.
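The experiment described above can be sketched as follows, assuming a normally distributed X with mean 0 and sd(X)=1.0. The list of sample sizes, the number of repetitions and the seed are arbitrary choices.

```r
set.seed(4)                       # arbitrary seed
sizes <- c(2, 5, 10, 50, 200)     # sample sizes N to compare
n_rep <- 5000                     # repeated samples per size

# For each N: draw n_rep samples of size N, take their means,
# and measure how much these means vary
sd_of_means <- sapply(sizes, function(N) {
  samples <- matrix(rnorm(n_rep * N, mean = 0, sd = 1), ncol = N)
  sd(rowMeans(samples))
})

result <- data.frame(N = sizes,
                     simulated = sd_of_means,
                     expected  = 1 / sqrt(sizes))   # sd(X)/sqrt(N) with sd(X) = 1
print(result)
```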
A first test: Test the influence of sample size
• How do I know how many samples I need to make a correct statement about the mean like E(X)≥0.89?
• „Correct“ is to be quantified as the „type-I error“: How probable is it that I see the same or a more extreme value by chance alone, i.e., although the population mean is 0?
Concept of the Null-Hypothesis
How sure can I be when excluding that the population mean is zero, even when I find a sample mean of m=0.89?
So, we evaluate how probable such an outcome is when X follows a certain pdf(X), e.g., the normal distribution, which has E(X)=0.
To evaluate this Pr, we need a test statistic t for it and a distribution pdf(t) to integrate for Pr.
T-statistics
• T has a complicated mathematical form; its graph is similar to a bell-shaped curve.
• For small sample size N it has longer tails (green)
[Figure: density of T; blue area = Pr(T<3), remaining area = Pr(T>=3)]
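The longer tails for small N can be made visible with the functions `dt` and `pt` that R provides for the t-distribution. In this sketch the chosen degrees of freedom and the comparison with the normal distribution are illustrative choices.

```r
# Densities: normal curve and t-distribution with 1 degree of freedom (N = 2)
curve(dnorm(x), from = -6, to = 6, ylab = "density")
curve(dt(x, df = 1), add = TRUE, col = "green")

# Tail probability Pr(T >= 3) for increasing degrees of freedom (df = N - 1)
dfs    <- c(1, 2, 5, 30)
tail_t <- 1 - pt(3, df = dfs)

# The same tail for the normal distribution
tail_normal <- 1 - pnorm(3)

data.frame(df = dfs, Pr_T_ge_3 = tail_t)
tail_normal
```

The tail probability shrinks as df grows and approaches the value for the normal distribution.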
T is known in R
Test for sample x=c(1,2)
Pr(t<3), for n=2
Upper boundary 3?
t = mean(x)/sd(x)*sqrt(2) = 1.5/0.707*1.414 = 3.0
$\Pr(t<3) = \int_{-\infty}^{3} \mathrm{pdf}(t)\,dt = \mathrm{pt}(3, \mathrm{df}=1)$
df = sample size - 1
So, ~90 out of 100 repeated samples will give a mean below 1.5 (more precisely, a t-value below 3)
1-pt(3,df=1) = 0.1024164
is the chance to have mean(x) greater than 1.5 (remember, N=2), under the assumption that x is drawn from a population with mean 0!
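The numbers above can be reproduced step by step in R. The variable names in this sketch are chosen for illustration only.

```r
x <- c(1, 2)
N <- length(x)

# t-statistic for the hypothesis that the population mean is 0
t_value <- mean(x) / (sd(x) / sqrt(N))

# Pr(t < t_value): integral of the t-density up to t_value, with df = N - 1
p_below <- pt(t_value, df = N - 1)

# Chance of a value at least this large, if the population mean really is 0
p_above <- 1 - p_below

c(t = t_value, p_below = p_below, p_above = p_above)
```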
Now the test itself:
• We have a sample of size 2
The Null-Hypothesis
Our sample is from a population with mean 0.
The test that checks this is in R…
(Annotation on the R output shown on the slide: ignore this 0.)
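In R, a one-sample test of this null hypothesis is available as `t.test`. The sketch below assumes this is the call meant on the slide. Note that `t.test` is two-sided by default, so the one-sided chance of 0.1024164 computed before corresponds to `alternative = "greater"`.

```r
x <- c(1, 2)

# Null hypothesis: the sample is from a population with mean 0 (mu = 0)
res_two_sided <- t.test(x, mu = 0)
print(res_two_sided)

# One-sided version: is the population mean greater than 0?
res_greater <- t.test(x, mu = 0, alternative = "greater")
res_greater$p.value
```

The two-sided p-value is twice the one-sided one, because it also counts equally extreme negative values of t.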