7/28/2019 Lecture in R
1/13/20
Introduction to R
Descriptive Statistics
Some basic concepts
Populations vs. Samples
A population in statistics is the complete set of elements
under study
All counties in the United States
All graduate students at CU in a particular semester
All flow gauges along Boulder Creek
A sample is a subset of the elements of the population
25% of the counties from each state in the U.S.
10 graduate students from each department at CU
Often not practical to study/measure each element
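Drawing a simple random sample can be sketched in R with sample(). The population below (3000 numbered elements) and the 25% sampling fraction are illustrative assumptions, not data from the lecture.

```r
# Hypothetical population: 3000 numbered elements (an assumption for illustration)
population <- 1:3000

set.seed(42)  # make the random draw reproducible

# Draw a simple random sample of 25% of the elements, without replacement
n <- round(0.25 * length(population))
srs <- sample(population, size = n, replace = FALSE)

length(srs)       # number of sampled elements
head(sort(srs))   # first few sampled IDs
```

Sampling without replacement guarantees that no element is measured twice.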
Descriptive vs. Inferential Statistics
Descriptive
Describe and summarize the characteristics of the sample
Often used to draw preliminary conclusions
Inferential
Infer something about the population from the random sample
What is a random sample?
Observations drawn independently
More observations imply a sample that is more representative of the population
(in contrast to nonrandom or convenience samples such as snowball surveys,
people at a meeting, topographically dependent field samples)
Descriptive Analysis
Measures of extremes
minimum and maximum
Measures of location or central tendency
mean, median, mode
Measures of dispersion & variability
variance, standard deviation, range, interquartile range, skewness
Shape & visualization
tables, frequencies, histograms, stem & leaf plots, boxplot
Descriptives in R
Sample dataset: ChickWeight - weight vs. age of chicks on
different diets
> data(ChickWeight)
The easiest way is to use summary()
> summary(ChickWeight$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   35.0    63.0   103.0   121.8   163.8   373.0
What do you get?
Extremes (min, max)
Measures of central tendency (mean, median)
Measure of dispersion (1st and 3rd quartiles, which give the interquartile range)
More measures of dispersion or spread
The variance is a measure of spread or dispersion around the mean
> var(ChickWeight$weight)
[1] 5051.223
The standard deviation is the square root of the variance
> sd(ChickWeight$weight)
[1] 71.07196
The interquartile range is the difference between the first and third quartiles
> IQR(ChickWeight$weight)
[1] 100.75
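The relationships between var(), sd(), and IQR(), and the remaining measures listed earlier (range, mode, skewness), can be checked directly. This is a sketch: base R has no built-in function for the statistical mode, and the moment-based skewness formula below is one common definition, chosen here as an assumption.

```r
data(ChickWeight)
w <- ChickWeight$weight

# The standard deviation is the square root of the variance
all.equal(sd(w), sqrt(var(w)))

# The IQR is the 3rd quartile minus the 1st quartile
q <- quantile(w, c(0.25, 0.75))
all.equal(IQR(w), unname(q[2] - q[1]))

# Range: difference between the maximum and the minimum
diff(range(w))

# Mode: tabulate the values and keep the most frequent one(s)
freq <- table(w)
mode_w <- as.numeric(names(freq)[freq == max(freq)])
mode_w

# Moment-based sample skewness (> 0 indicates a longer right tail)
skew <- mean((w - mean(w))^3) / mean((w - mean(w))^2)^1.5
skew
```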
The normal probability distribution
Many statistical tests make the assumption that a variable
is normally distributed
Before you choose which statistics to use, you need to check the
distribution of your data
Evaluating the data
What happens to these measures if our data are not
normally distributed?
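One way to explore this question is to compare measures that are sensitive to extreme values (mean, standard deviation) with measures that are not (median, interquartile range). The sketch below uses the ChickWeight weights and adds a single artificial outlier; the value 2000 is an arbitrary assumption.

```r
data(ChickWeight)
w <- ChickWeight$weight

# In a right-skewed sample the mean is pulled above the median
c(mean = mean(w), median = median(w))

# Add one artificial extreme value (arbitrary, for illustration only)
w_out <- c(w, 2000)

# The mean and sd shift noticeably; the median and IQR change far less
c(mean = mean(w_out), median = median(w_out))
c(sd = sd(w_out), IQR = IQR(w_out))
```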
Assessing normality
There are several methods of assessing whether data are
normally distributed or not. They fall into two broad
categories: graphical and statistical. The most common
are:
Graphical
Q-Q probability plots
Histogram
Boxplot
Statistical
Kolmogorov-Smirnov test
Shapiro-Wilk test
Graphical methods: histogram
> hist(ChickWeight$weight, col="red",
       breaks=seq(0,400,by=20), xlab="Weight")
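To make the comparison with a normal distribution easier to see, the histogram can be drawn on the density scale with a normal curve on top. This is a sketch, not part of the original slides; the curve uses the sample mean and standard deviation, and the plot title is an illustrative choice.

```r
data(ChickWeight)
w <- ChickWeight$weight

# Histogram on the density scale so that a density curve can be overlaid
h <- hist(w, breaks = seq(0, 400, by = 20), freq = FALSE,
          col = "red", xlab = "Weight", main = "Chick weights")

# Normal curve with the sample mean and sd, for visual comparison
curve(dnorm(x, mean = mean(w), sd = sd(w)), add = TRUE, lwd = 2)
```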
Graphical methods: boxplot
> boxplot(ChickWeight$weight, col="red", ylab="Weight")
Especially useful for non-normally distributed data and for examining extreme values in the tails
[Figure: boxplot annotated with the median, 1st and 3rd quartiles, interquartile range, whiskers at +/-1.5 x interquartile range, and extreme values]
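The numbers behind a boxplot can be inspected with boxplot.stats(). The sketch below recomputes the 1.5 x interquartile range fences by hand; the variable names are illustrative. Note that R's boxplot uses hinges, which are close to but not always identical to the quartiles reported by summary().

```r
data(ChickWeight)
w <- ChickWeight$weight

# Five numbers drawn by the boxplot:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bs <- boxplot.stats(w, coef = 1.5)
bs$stats

# Fences: 1.5 x the hinge spread beyond the hinges
iqr <- bs$stats[4] - bs$stats[2]
upper_fence <- bs$stats[4] + 1.5 * iqr
lower_fence <- bs$stats[2] - 1.5 * iqr

# Points outside the fences are drawn individually as extreme values
bs$out
```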
Graphical methods: quantile-quantile (qq) plot
> qqnorm(ChickWeight$weight)
> qqline(ChickWeight$weight)
Statistical methods
> shapiro.test(ChickWeight$weight)
Shapiro-Wilk normality test
data: ChickWeight$weight
W = 0.9087, p-value < 2.2e-16
> ks.test(ChickWeight$weight,"pnorm")
One-sample Kolmogorov-Smirnov test
data: ChickWeight$weight
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
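Note that ks.test(x, "pnorm") with no further arguments compares the sample with a standard normal distribution (mean 0, sd 1), which is why D = 1 for weights that all lie far above 0. A sketch of a comparison against a normal distribution with the sample's own mean and standard deviation follows. Estimating these parameters from the same data makes the reported p-value only approximate, and the dataset contains ties, so R issues a warning that is suppressed here.

```r
data(ChickWeight)
w <- ChickWeight$weight

# Compare with a normal distribution that has the sample mean and sd;
# extra arguments to ks.test() are passed on to pnorm().
# suppressWarnings() hides the warning about tied values.
ks <- suppressWarnings(ks.test(w, "pnorm", mean = mean(w), sd = sd(w)))

ks$statistic   # D, the largest gap between the empirical and normal CDFs
ks$p.value
```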
What are we testing?
The Kolmogorov-Smirnov and Shapiro-Wilk tests calculate the probability of seeing a sample this far from normal if it was drawn from a normal population distribution
The hypotheses used are:
H0: The sample data are not significantly different than a normal population
Ha: The sample data are significantly different than a normal population
So when testing for normality:
p > 0.05 means we fail to reject H0: the data are treated as normal
p < 0.05 means we reject H0: the data are NOT normal
Skewed Data
> shapiro.test(swiss$Education)
Shapiro-Wilk normality test
data: swiss$Education
W = 0.7482, p-value = 1.312e-07
Normal Data
> shapiro.test(swiss$Infant.Mortality)
Shapiro-Wilk normality test
data: swiss$Infant.Mortality
W = 0.9776, p-value = 0.4978
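The decision rule for both variables can be applied in one step by extracting the p-values. This is a sketch; the alpha level of 0.05 and the variable names are assumptions for illustration.

```r
data(swiss)
alpha <- 0.05

# Apply the Shapiro-Wilk test to each variable and extract the p-value
p_values <- sapply(swiss[c("Education", "Infant.Mortality")],
                   function(x) shapiro.test(x)$p.value)

# TRUE means H0 (normality) is rejected at the chosen alpha level
reject_h0 <- p_values < alpha
data.frame(p_value = signif(p_values, 4), reject_h0)
```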
Final Words Concerning Normality Testing
1. Since it IS a test, state a null and alternate hypothesis.
2. If you perform a normality test, do not ignore the
results
3. If the data are not normal, use non-parametric tests (or
transform the data)
4. If the data are normal, use parametric tests
AND MOST IMPORTANTLY:
5. If you have groups of data, you MUST test each group for normality
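Points 3 and 5 can be sketched with tapply(), which runs the test once per group. Here the groups are the four diets in ChickWeight. The log transformation is only one possible choice and is an assumption; whether it helps has to be judged from the resulting p-values.

```r
data(ChickWeight)

# Shapiro-Wilk p-value for the weights within each diet group
p_by_diet <- tapply(ChickWeight$weight, ChickWeight$Diet,
                    function(x) shapiro.test(x)$p.value)
p_by_diet

# One possible transformation for right-skewed data: the natural log
p_log <- tapply(log(ChickWeight$weight), ChickWeight$Diet,
                function(x) shapiro.test(x)$p.value)
p_log
```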
Before we move on to inferential statistics,
let's discuss hypothesis testing
Why hypotheses?
Statistical procedures for addressing research questions
involve formulating a concise statement of the
hypothesis to be tested
The hypothesis to be tested is referred to as the null
hypothesis (abbreviated H0) because it is a statement of
no difference
Hypothesis testing starts with the assumption that the
null hypothesis is true, i.e., that there is/are no difference(s)
The alternative
Along with the null hypothesis we must also state an
alternate hypothesis (abbreviated Ha)
The alternate hypothesis is a statement that a difference
exists
If a null hypothesis is rejected, then we tentatively accept
the alternate hypothesis and conclude that there is a
difference
Hypothesis Example
H0: There is no difference in lung cancer between men
and women
Ha: There is a difference in lung cancer between men and
women
In ALL cases, if the calculated value is GREATER (in absolute value, for a two-sided test) than the
critical value then reject H0 and accept Ha
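As a sketch of this rule for a t test, the critical value can be looked up with qt(). The numbers come from the reporting example at the end of this lecture (t = 2.10 with 33 degrees of freedom); the choice of a two-sided test at an alpha level of 0.05 is an assumption.

```r
alpha  <- 0.05
t_calc <- 2.10   # calculated test statistic
df     <- 33     # degrees of freedom

# Two-sided critical value: alpha is split between the two tails
t_crit <- qt(1 - alpha / 2, df)
t_crit

# Reject H0 when the calculated value exceeds the critical value in absolute size
abs(t_calc) > t_crit
```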
Types of Error
A Type I error accepts an alternate hypothesis when the
results can be attributed to chance
So in effect we are stating that there is a difference when none
actually exists
A Type II error is only an error in the sense that we fail to
correctly reject a (false) null hypothesis
It is not an error in the sense that an incorrect conclusion was drawn since no conclusion is drawn
It is the lesser of two evils
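A Type I error can be illustrated by simulation: if both groups are drawn from the same population, every rejection of H0 is a false positive. The sketch below assumes normal data, 20 observations per group, and 2000 repetitions; all of these settings are arbitrary choices.

```r
set.seed(1)      # reproducible simulation
alpha  <- 0.05
n_sims <- 2000

# Both groups come from the same normal population, so H0 is true
p_values <- replicate(n_sims, t.test(rnorm(20), rnorm(20))$p.value)

# Share of simulations that reject a true H0: the Type I error rate
type1_rate <- mean(p_values < alpha)
type1_rate
```

The resulting rate should land close to the chosen alpha level.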
How do we evaluate our hypotheses?
When a statistical test is performed there are two distinct, but related results:
Test statistic: this is the value that is calculated (depends on what test you use: t, F, chi-squared, etc.)
Probability: this is the probability of obtaining a test statistic at least that extreme if the null hypothesis is true
Alpha Level (α)
The alpha level is the probability of incorrectly rejecting the null hypothesis
Also called alpha or significance level
By convention we typically use 0.05 or 0.01 (5% or 1%) as our alpha level
Alpha level of 0.05
An α of 0.05 corresponds to ±1.96 sd from the mean of a normal distribution (two-tailed)
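The link between an alpha level of 0.05 and 1.96 standard deviations can be checked with qnorm() and pnorm(). This is a sketch for the two-tailed case on the standard normal distribution.

```r
alpha <- 0.05

# Two-tailed critical value of the standard normal distribution
z_crit <- qnorm(1 - alpha / 2)
z_crit

# Probability mass outside +/- z_crit equals alpha
2 * pnorm(-z_crit)
```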
How do we present our hypotheses?
Every time a statistical test is performed, you need to include the
following:
1. A statement of the alpha level
2. Null and alternate hypotheses
3. A high power summary statement that contains:
The test performed
The calculated test statistic
The exact probability or probability range
Students taking statistics courses in geography at CU
Boulder reported studying more hours for tests (mean = 121, SD = 14.2) than did CU college students in psychology, t(33) = 2.10, p = 0.034
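The pieces of such a summary statement can be pulled from the object returned by t.test(). The sketch below uses R's built-in sleep data rather than the study quoted above. Treating its two groups as independent samples with equal variances is an assumption made only for illustration.

```r
data(sleep)   # built-in data: extra hours of sleep in two groups of 10

# Two-sample t test assuming equal variances
res <- t.test(sleep$extra[sleep$group == 1],
              sleep$extra[sleep$group == 2],
              var.equal = TRUE)

# Assemble the test, the test statistic, and the exact probability
report <- sprintf("t(%d) = %.2f, p = %.3f",
                  as.integer(res$parameter), res$statistic, res$p.value)
report
```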