Lecture in R

7/28/2019 Lecture in R

1/13

1/13/20

Introduction to R

Descriptive Statistics

Some basic concepts

Populations vs. Samples

A population in statistics in the complete set of elements

under study

All counties in the United States

All graduate students at CU in a particular semester

All flow gauges along Boulder Creek

A sample is a subset of the elements of the population 25% of the counties in from each State in the U.S.

10 graduate students from each department at CU

Often not practical to study/measure each element


2/13

1/13/20

Descriptive vs. Inferential Statistics

Descriptive

Describe and summarize the characteristics of the sample

Often used to draw preliminary conclusions

Inferential

Infer something about the population from the random sample

What is a random sample?

Observations drawn independently

More observations implies more representative of population

(in contrast to nonrandom or convenience samples such as snowball surveys,

people at a meeting, topographically dependent field samples)

Descriptive Analysis

Measures of extremes

minimum and maximum

Measures of location or central tendency

mean, median, mode

Measures of dispersion & variability

variance, standard deviation, range, interquartile range,skewness

Shape & visualization

tables, frequencies, histograms, stem & leaf plots, boxplot


3/13

1/13/20

Descriptives in R

Sample dataset: ChickWeight - weight vs. age of chicks on

different diets> data(ChickWeight)

The easiest way is to use summary()> summary(ChickWeight$weight)

Min. 1st Qu. Median Mean 3rd Qu. Max.

35.0 63.0 103.0 121.8 163.8 373.0

What do you get?

Extremes (min, max)

Measures of central tendency (mean, median)

Measure of dispersion (interquartile range)

More measures of dispersion or spread

The variance is a measure of spread or dispersion aroundthe mean

> var(ChickWeight$weight)

[1] 5051.223

The standard deviation is the square root of the variance> sd(ChickWeight$weight)

[1] 71.07196

The interquartile range is the difference between the firstand third quartiles

> IQR(ChickWeight$weigth)

[1] 100.75


4/13

1/13/20

The normal probability distribution

Many statistical tests make the assumption that a variable

is normally distributed

Before you choose which statistics to use, need to check the

distribution of your data

Evaluating the data

What happens to these measures if our data is not

normally distributed?


5/13

1/13/20

Assessing normality

There are several methods of assessing whether data are

normally distributed or not. They fall into two broad

categories: graphical and statistical. The most common

are:

Graphical

Q-Q probability plots

Histogram

Boxplot

Statistical

Kolmogorov-Smirnov test

Shapiro-Wilks test

Graphical methods: histogram> hist(ChickWeight$weight, col="red",

breaks=seq(0,400,by=20), xlab="Weight")


6/13

1/13/20

Graphical methods: boxplot> boxplot(ChickWeight$weight, col="red", ylab="Weight")

Especially useful for non-

normally distributed data

and for examining

extreme values in the

tails

Interquartile

Range

3rd quartile

1st quartile

median

+1.5 x interquartile range

-1.5 x interquartile range

extreme values

Graphical methods: quantile-quanitle (qq) plot

> qqnorm(ChickWeight$weight)

> qqline(ChickWeight$weight)


7/13

1/13/20

Statistical methods> shapiro.test(ChickWeight$weight)

Shapiro-Wilk normality test

data: ChickWeight$weight

W = 0.9087, p-value < 2.2e-16

> ks.test(ChickWeight$weight,"pnorm")

One-sample Kolmogorov-Smirnov test

data: ChickWeight$weight

D = 1, p-value < 2.2e-16

alternative hypothesis: two-sided

What are we testing?

The Kolmogorov-Smirnov and Shapiro-Wilks testscalculate the probability that the sample was drawn froma normal population distribution

The hypotheses used are:

H0: The sample data are not significantly different than a normalpopulation

Ha: The sample data are significantly different than a normal

population

So when testing for normality:

p > 0.05 mean the data are normal

p < 0.05 mean the data are NOT normal


8/13

1/13/20

Skewed Data

> shapiro.test(swiss$Education)


data: swiss$Education

W = 0.7482, p-value = 1.312e-07

Normal Data

> shapiro.test(swiss$Infant.Mortality)


data: swiss$Infant.Mortality

W = 0.9776, p-value = 0.4978


9/13

1/13/20

Final Words Concerning Normality Testing

1. Since it IS a test, state a null and alternate hypothesis.

2. If you perform a normality test, do not ignore the

results

3. If the data are not normal, use non-parametric tests (or

transform the data)

4. If the data are normal, use parametric tests

AND MOST IMPORTANTLY:

5. If you have groups of data, you MUST test each groupfor normality

Before we move on to inferential

statistics

lets discuss hypothesis testing


10/13

1/13/20

Why hypotheses?

Statistical procedures for addressing research questions

involves formulating a concise statement of the

hypothesis to be tested

The hypothesis to be tested is referred to as the null

hypothesis (abbreviated H0) because it is a statement of

no difference

Hypothesis testing starts with the assumption that the

null hypothesis is true that there is/are no difference(s)

The alternative

Along with the null hypothesis we must also state an

alternate hypothesis (abbreviated Ha)

The alternate hypothesis is a statement that a difference

exists

If a null hypothesis is rejected, then we tentatively accept

the alternate hypothesis and conclude that there is a

difference


11/13

1/13/20

Hypothesis Example

H0

: There is no difference in lung cancer between men

and women

Ha: There is a difference in lung cancer between ment and

women

In ALL cases, if the calculated value is GREATER than the

critical value then reject H0 and accept Ha

Types of Error

A Type I error accepts an alternate hypothesis when the

results can be attributed to chance

So in effect we are stating that there is a difference when none

actually exists

A Type II error is only an error in the sense that we fail to

correctly reject a (false) null hypothesis

It is not an error in the sense that an incorrect conclusion wasdrawn since no conclusion is drawn

It is the lesser of two evils


12/13

1/13/20

How do we evaluate our hypotheses?

When a statistical test is performed there are twodistinct, but related results: Test statistic: this is the value that is calculated (depends on

what test you uset, f, chi-squared, etc.)

Probability: this is the probability of having that particular teststatistic given the characteristics of the data set

Alpha Level ()

The alpha level is the probability of incorrectly rejectingthe null hypothesis

Also called alpha or significance level

By convention we typically use 0.05 or 0.01 (5% or 1%) as our alphalevel

Alpha level of 0.05

An of 0.05 is equal to 1.96 sd from the mean


13/13

1/13/20

How do we present out hypotheses?

Every time a statistical test is performed, you need to include the

following:

1. A statement of the alpha level

2. Null and alternate hypotheses

3. A high power summary statement that contains: The test performed

The calculated test statistic

The exact probability or probability range

Students taking statistics courses in geography at CU

Boulder reported studying more hours for tests (mean =121, SD = 14.2) than did CU college students in psychology,t(33) = 2.10, p = 0.034

Documents

Lecture in R