Lecture in R

Embed Size (px)

Citation preview

  • 7/28/2019 Lecture in R

    1/13

    1/13/20

    Introduction to R

    Descriptive Statistics

    Some basic concepts

    Populations vs. Samples

    A population in statistics in the complete set of elements

    under study

    All counties in the United States

    All graduate students at CU in a particular semester

    All flow gauges along Boulder Creek

    A sample is a subset of the elements of the population 25% of the counties in from each State in the U.S.

    10 graduate students from each department at CU

    Often not practical to study/measure each element

  • 7/28/2019 Lecture in R

    2/13

    1/13/20

    Descriptive vs. Inferential Statistics

    Descriptive

    Describe and summarize the characteristics of the sample

    Often used to draw preliminary conclusions

    Inferential

    Infer something about the population from the random sample

    What is a random sample?

    Observations drawn independently

    More observations implies more representative of population

    (in contrast to nonrandom or convenience samples such as snowball surveys,

    people at a meeting, topographically dependent field samples)

    Descriptive Analysis

    Measures of extremes

    minimum and maximum

    Measures of location or central tendency

    mean, median, mode

    Measures of dispersion & variability

    variance, standard deviation, range, interquartile range,skewness

    Shape & visualization

    tables, frequencies, histograms, stem & leaf plots, boxplot

  • 7/28/2019 Lecture in R

    3/13

    1/13/20

    Descriptives in R

    Sample dataset: ChickWeight - weight vs. age of chicks on

    different diets> data(ChickWeight)

    The easiest way is to use summary()> summary(ChickWeight$weight)

    Min. 1st Qu. Median Mean 3rd Qu. Max.

    35.0 63.0 103.0 121.8 163.8 373.0

    What do you get?

    Extremes (min, max)

    Measures of central tendency (mean, median)

    Measure of dispersion (interquartile range)

    More measures of dispersion or spread

    The variance is a measure of spread or dispersion aroundthe mean

    > var(ChickWeight$weight)

    [1] 5051.223

    The standard deviation is the square root of the variance> sd(ChickWeight$weight)

    [1] 71.07196

    The interquartile range is the difference between the firstand third quartiles

    > IQR(ChickWeight$weigth)

    [1] 100.75

  • 7/28/2019 Lecture in R

    4/13

    1/13/20

    The normal probability distribution

    Many statistical tests make the assumption that a variable

    is normally distributed

    Before you choose which statistics to use, need to check the

    distribution of your data

    Evaluating the data

    What happens to these measures if our data is not

    normally distributed?

  • 7/28/2019 Lecture in R

    5/13

    1/13/20

    Assessing normality

    There are several methods of assessing whether data are

    normally distributed or not. They fall into two broad

    categories: graphical and statistical. The most common

    are:

    Graphical

    Q-Q probability plots

    Histogram

    Boxplot

    Statistical

    Kolmogorov-Smirnov test

    Shapiro-Wilks test

    Graphical methods: histogram> hist(ChickWeight$weight, col="red",

    breaks=seq(0,400,by=20), xlab="Weight")

  • 7/28/2019 Lecture in R

    6/13

    1/13/20

    Graphical methods: boxplot> boxplot(ChickWeight$weight, col="red", ylab="Weight")

    Especially useful for non-

    normally distributed data

    and for examining

    extreme values in the

    tails

    Interquartile

    Range

    3rd quartile

    1st quartile

    median

    +1.5 x interquartile range

    -1.5 x interquartile range

    extreme values

    Graphical methods: quantile-quanitle (qq) plot

    > qqnorm(ChickWeight$weight)

    > qqline(ChickWeight$weight)

  • 7/28/2019 Lecture in R

    7/13

    1/13/20

    Statistical methods> shapiro.test(ChickWeight$weight)

    Shapiro-Wilk normality test

    data: ChickWeight$weight

    W = 0.9087, p-value < 2.2e-16

    > ks.test(ChickWeight$weight,"pnorm")

    One-sample Kolmogorov-Smirnov test

    data: ChickWeight$weight

    D = 1, p-value < 2.2e-16

    alternative hypothesis: two-sided

    What are we testing?

    The Kolmogorov-Smirnov and Shapiro-Wilks testscalculate the probability that the sample was drawn froma normal population distribution

    The hypotheses used are:

    H0: The sample data are not significantly different than a normalpopulation

    Ha: The sample data are significantly different than a normal

    population

    So when testing for normality:

    p > 0.05 mean the data are normal

    p < 0.05 mean the data are NOT normal

  • 7/28/2019 Lecture in R

    8/13

    1/13/20

    Skewed Data

    > shapiro.test(swiss$Education)

    Shapiro-Wilk normality test

    data: swiss$Education

    W = 0.7482, p-value = 1.312e-07

    Normal Data

    > shapiro.test(swiss$Infant.Mortality)

    Shapiro-Wilk normality test

    data: swiss$Infant.Mortality

    W = 0.9776, p-value = 0.4978

  • 7/28/2019 Lecture in R

    9/13

    1/13/20

    Final Words Concerning Normality Testing

    1. Since it IS a test, state a null and alternate hypothesis.

    2. If you perform a normality test, do not ignore the

    results

    3. If the data are not normal, use non-parametric tests (or

    transform the data)

    4. If the data are normal, use parametric tests

    AND MOST IMPORTANTLY:

    5. If you have groups of data, you MUST test each groupfor normality

    Before we move on to inferential

    statistics

    lets discuss hypothesis testing

  • 7/28/2019 Lecture in R

    10/13

    1/13/20

    Why hypotheses?

    Statistical procedures for addressing research questions

    involves formulating a concise statement of the

    hypothesis to be tested

    The hypothesis to be tested is referred to as the null

    hypothesis (abbreviated H0) because it is a statement of

    no difference

    Hypothesis testing starts with the assumption that the

    null hypothesis is true that there is/are no difference(s)

    The alternative

    Along with the null hypothesis we must also state an

    alternate hypothesis (abbreviated Ha)

    The alternate hypothesis is a statement that a difference

    exists

    If a null hypothesis is rejected, then we tentatively accept

    the alternate hypothesis and conclude that there is a

    difference

  • 7/28/2019 Lecture in R

    11/13

    1/13/20

    Hypothesis Example

    H0

    : There is no difference in lung cancer between men

    and women

    Ha: There is a difference in lung cancer between ment and

    women

    In ALL cases, if the calculated value is GREATER than the

    critical value then reject H0 and accept Ha

    Types of Error

    A Type I error accepts an alternate hypothesis when the

    results can be attributed to chance

    So in effect we are stating that there is a difference when none

    actually exists

    A Type II error is only an error in the sense that we fail to

    correctly reject a (false) null hypothesis

    It is not an error in the sense that an incorrect conclusion wasdrawn since no conclusion is drawn

    It is the lesser of two evils

  • 7/28/2019 Lecture in R

    12/13

    1/13/20

    How do we evaluate our hypotheses?

    When a statistical test is performed there are twodistinct, but related results: Test statistic: this is the value that is calculated (depends on

    what test you uset, f, chi-squared, etc.)

    Probability: this is the probability of having that particular teststatistic given the characteristics of the data set

    Alpha Level ()

    The alpha level is the probability of incorrectly rejectingthe null hypothesis

    Also called alpha or significance level

    By convention we typically use 0.05 or 0.01 (5% or 1%) as our alphalevel

    Alpha level of 0.05

    An of 0.05 is equal to 1.96 sd from the mean

  • 7/28/2019 Lecture in R

    13/13

    1/13/20

    How do we present out hypotheses?

    Every time a statistical test is performed, you need to include the

    following:

    1. A statement of the alpha level

    2. Null and alternate hypotheses

    3. A high power summary statement that contains: The test performed

    The calculated test statistic

    The exact probability or probability range

    Students taking statistics courses in geography at CU

    Boulder reported studying more hours for tests (mean =121, SD = 14.2) than did CU college students in psychology,t(33) = 2.10, p = 0.034