Learn to Use the
Kolmogorov–Smirnov Test in R
With Data From the Opinions and
Lifestyle Survey (Well-Being
Module) (2015)
© 2019 SAGE Publications, Ltd. All Rights Reserved.
This PDF has been generated from SAGE Research Methods Datasets.
How-to Guide for R
Introduction
In this guide, you will learn how to produce a one-sample and two-sample
Kolmogorov–Smirnov (K–S) test in R using a practical example to illustrate the
process. You will find links to the example dataset, and you are encouraged to
replicate this example. An additional practice example is suggested at the end of
this guide. The example assumes that you have already opened the data file in R.
Contents
1. Kolmogorov–Smirnov Test: One-Sample and Two-Sample
2. An Example in R: General Health and Happiness
2.1 The R Procedure
2.2 Exploring the R Output
3. Your Turn
1 Kolmogorov–Smirnov Test: One-Sample and Two-Sample
This example introduces the K–S test. The K–S test is a test of the equality of two
distributions, and there are two types of test. The one-sample K–S test compares
the distribution of a continuous variable with a reference distribution; here it
is used to test the normality of a selected variable. The two-sample K–S test is
used to test whether two samples have the same distribution or not.
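The difference between the two tests can be sketched with simulated data. This is a minimal illustration, not part of the original guide: the vectors x and y are invented, and the last lines recompute the one-sample D statistic by hand as the largest vertical gap between the sample's empirical distribution function and the reference normal distribution function.

```r
# Simulated data only; none of these values come from the survey.
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)   # drawn from a normal distribution
y <- rexp(200, rate = 1 / 5)        # drawn from a skewed distribution

# One-sample test: compare x with a fully specified normal distribution.
one_sample <- ks.test(x, "pnorm", mean = 5, sd = 2)

# Two-sample test: compare the empirical distributions of x and y.
two_sample <- ks.test(x, y)

# D by hand for the one-sample case: the largest gap between the
# empirical CDF (just before and just after each point) and the normal CDF.
xs <- sort(x)
n <- length(xs)
ref_cdf <- pnorm(xs, mean = 5, sd = 2)
d_manual <- max(seq_len(n) / n - ref_cdf, ref_cdf - (seq_len(n) - 1) / n)

one_sample$statistic
d_manual
two_sample$p.value
```

In both tests a larger D means a larger gap between the distributions being compared.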
2 An Example in R: General Health and Happiness
This example uses a subset of data from the 2015 Opinions and Lifestyle Survey
(Well-Being Module). This extract includes 2,048 respondents. Please note that
the original dataset is larger than this, but it has been “cleaned” to include only
those who have responded to our dependent variable. The two variables we
examine are:
• How is your health in general? (RECODEDHealth)
• Overall, how happy did you feel yesterday? (MCZ_3)
The first variable, RECODEDHealth, is coded 1 if a respondent reports their
health as “good” and 2 if “bad.” The second variable, MCZ_3, is not recoded,
as it is treated as an interval or continuous variable. It is measured on a scale
of 0–10, with higher values indicating greater happiness. Given that our
independent variable is dichotomous and categorical and our dependent variable
is interval or continuous, the K–S test is an appropriate test for these data.
Data Cleaning
Prior to opening the CSV file used in this example in R and performing the
analyses outlined in this guide, you will need to do a small amount of preliminary
cleaning to the data file. This is because the file uses two codes, 98 and 99,
to represent “missing” data. These need to be changed to “NA” (“Not
Available”) so that R treats them as missing and excludes them from your
analysis; otherwise they would be read as genuine scores and could skew your
results. To change the codes you need to open
the CSV file in a spreadsheet program (the steps here follow Excel) and highlight
all variable columns. Then on the “Home” toolbar, click on “Find & Select” and
choose “Replace” from the dropdown menu. In the “Find and Replace” dialog box
that opens, insert “98” into “Find what” and “NA” into “Replace with.” Click
“Options” and tick “Match entire cell contents” so that only cells equal to 98
are changed, then click “Replace All.” This replaces all the 98 values with “NA.”
You then repeat the process for the other missing code (i.e., 99). Once you have
done this, save the file and then proceed as outlined in this guide.
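The same recoding can also be done inside R after the file has been loaded, which leaves the CSV file untouched. The sketch below is an alternative to the spreadsheet steps, not the guide's own method. It uses a small invented data frame (toy) in place of the survey extract and recodes only the two variables analysed here, so that a genuine 98 or 99 in any other column would be kept.

```r
# Hypothetical stand-in for the survey extract; all values are invented.
toy <- data.frame(
  RECODEDHealth = c(1, 2, 1, 99, 1),
  MCZ_3         = c(8, 98, 10, 7, 99)
)

missing_codes <- c(98, 99)

# Recode only the named columns; other columns are left as they are.
for (v in c("RECODEDHealth", "MCZ_3")) {
  toy[[v]][toy[[v]] %in% missing_codes] <- NA
}

# Count the missing values per column as a check.
colSums(is.na(toy))
```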
2.1 The R Procedure
R is a free, open-source software and computing platform. Unlike many other
statistical software packages, it does not operate with drop-down menus. Instead,
users submit lines of code that execute commands and functions built into R. It is
a good idea to save your code in a simple text file that R users generally refer to
as a script file. We provide a script file with this example that executes all of the
operations described here. If you are new to using R, we suggest you start with the
introduction manual (http://cran.r-project.org/doc/manuals/r-release/R-intro.html).
Another useful introductory guide is Andy Field’s Discovering Statistics Using R
(2012, SAGE).
For this example, we must first load the dataset into R. The variables stored
inside the data file can then be accessed with the $ operator, as the code in
this guide does. It is possible to import data from a variety of software
packages, including SPSS, Stata, Minitab, and Excel. It is best to import data in
file formats that are R-friendly, such as tab-delimited text (.txt in Excel or
.dat in SPSS) or comma-separated files (.csv). This guide will use a CSV file
(dataset-ons-mcz-2015-subset3.csv). R code for importing other file types is
easy to find online.
If you are using the CSV file provided with this example, the code looks like this
(assuming the data file is already saved in your working directory):
ons_mcz <- read.csv("dataset-ons-mcz-2015-subset3.csv")
The code reads in the dataset and assigns it to an object named “ons_mcz.”
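After importing, it can help to confirm that the variables were read as numbers and that “NA” entries were recognised as missing. The following sketch assumes nothing about the real file: it writes a three-row invented CSV to a temporary location and reads it back, so the object name toy and its values are purely illustrative.

```r
# Write a tiny invented CSV to a temporary file, then read it back.
tmp <- tempfile(fileext = ".csv")
writeLines(c("RECODEDHealth,MCZ_3", "1,8", "2,NA", "1,10"), tmp)
toy <- read.csv(tmp)

str(toy)              # column types and first values
summary(toy$MCZ_3)    # range, quartiles, and number of NAs
colSums(is.na(toy))   # missing values per column
```

The same three checks can be run on ons_mcz once the real file has been loaded.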
Prior to running the K–S test or indeed any statistical test, it is good practice to
examine each variable on its own; this is univariate analysis. This gives us an
opportunity to describe the variable and get an initial “feel” for our data. We can
start by running descriptive statistics, including measures of central tendency,
for MCZ_3. To do this, we first need to load the pastecs package (you may have to
install it first if you have not used it before) and then use the following code:
library(pastecs)
stat.desc(ons_mcz$MCZ_3)
round(stat.desc(ons_mcz$MCZ_3), digits = 3)
The library() call loads the pastecs package. The stat.desc() call then
generates the descriptive statistics for the variable, and the final line wraps
that call in the round() function to set the number of decimal places that we
want. The code generates the following results (NB some of the output has been
removed to save space):
The mean is 7.62 and the median is 8, suggesting that the majority of the
respondents are “happy.” If the data were normally distributed, we would expect
roughly two thirds of scores to fall within one standard deviation of the mean,
that is, between 5.43 and 9.81, suggesting that the majority of respondents are
“happy” to “very happy.” We can also visually assess the data by plotting a
histogram of the variable; the code and results are shown here:
hist(ons_mcz$MCZ_3)
The data do not appear to be normally distributed.
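The interval quoted above is the mean plus or minus one standard deviation (5.43 and 9.81 imply a standard deviation of about 2.19). As a rough numerical companion to the histogram, the sketch below computes that interval in base R and the share of scores falling inside it. The scores vector is invented for illustration; under normality the share would be close to 0.68.

```r
# Invented scores on a 0-10 scale, skewed toward the top; NA is a missing case.
scores <- c(10, 9, 9, 8, 8, 8, 8, 7, 7, 6, 5, 3, 0, NA)

m <- mean(scores, na.rm = TRUE)
s <- sd(scores, na.rm = TRUE)
lower <- m - s
upper <- m + s

# Share of non-missing scores inside mean +/- 1 SD.
share_within <- mean(scores >= lower & scores <= upper, na.rm = TRUE)

round(c(mean = m, median = median(scores, na.rm = TRUE), sd = s,
        lower = lower, upper = upper, share = share_within), 2)
```

A share well away from 0.68, or a mean that sits clearly below the median, is a further hint of skew.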
We should then run a frequency table of RECODEDHealth; to do this, we need to
load the summarytools package (you may have to install it first if you have not
used it before). The code and results for our data are shown here:
library("summarytools")
freq(ons_mcz$RECODEDHealth, order = "freq")
The majority of the respondents (76.8%) rate their health as “good,” while only
23.2% of respondents rate their health as “bad.” It should be noted that there are
41 missing cases. This variable has two groups of unequal size.
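A similar frequency table can be produced without an additional package. This is a base R sketch on an invented vector of health codes, offered as an alternative rather than as what the guide ran; the labels “good” and “bad” follow the coding described earlier.

```r
# Invented codes: 1 = good, 2 = bad, NA = missing.
health <- c(1, 1, 2, 1, NA, 1, 2, 1)
health_f <- factor(health, levels = c(1, 2), labels = c("good", "bad"))

# Counts, with missing cases shown as their own category.
counts <- table(health_f, useNA = "ifany")

# Percentages of valid (non-missing) cases only.
valid_pct <- round(100 * prop.table(table(health_f)), 1)

counts
valid_pct
```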
Now that we have examined our variables individually, we can run K–S tests to
assess the normality of MCZ_3 and to compare distributions between samples.
One-Sample K–S Test
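To run a one-sample K–S test in R, use the ks.test() function. It is part of the
stats package, which is loaded with R by default, so no additional package needs
to be installed. Naming “pnorm” without further arguments sets the reference
distribution to the standard normal (mean 0, standard deviation 1). The code is: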
ks.test(ons_mcz$MCZ_3, "pnorm")
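To see how much the choice of reference distribution matters, the sketch below runs the test twice on invented 0–10 scores: once against the default standard normal and once against a normal distribution given the sample's own mean and standard deviation. The happy vector is simulated and is not the survey data. Because the scores are whole numbers, ks.test() warns about ties; the warning is suppressed here only to keep the output short. When the mean and standard deviation are estimated from the same sample, the p-value is approximate; a Lilliefors-corrected test such as lillie.test() in the nortest package is one commonly used alternative.

```r
set.seed(1)
# Invented stand-in for MCZ_3: whole-number scores clipped to the 0-10 range.
happy <- round(rnorm(500, mean = 7.6, sd = 2.2))
happy <- pmax(0, pmin(10, happy))

# Default reference: standard normal with mean 0 and sd 1.
default_ref <- suppressWarnings(ks.test(happy, "pnorm"))

# Reference normal matched to the sample's own mean and sd.
matched_ref <- suppressWarnings(
  ks.test(happy, "pnorm", mean = mean(happy), sd = sd(happy))
)

c(default = unname(default_ref$statistic),
  matched = unname(matched_ref$statistic))
```

A D close to 1 against the default reference mostly reflects that nearly all scores lie far above 0, rather than the shape of the distribution.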
Two-Sample K–S Test
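To run a two-sample K–S test in R, use ks.test() again, passing the two samples
to be compared as the first two arguments. Note that the call shown here passes
the RECODEDHealth codes themselves as the second sample; to compare happiness
between the two health groups, the two arguments would instead be the MCZ_3
scores of each group. The code is: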
ks.test(ons_mcz$MCZ_3, ons_mcz$RECODEDHealth)
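A two-sample comparison of happiness between the health groups can be set up by splitting MCZ_3 on the health code first. The sketch below uses an invented data frame (toy) with made-up scores, so its result says nothing about the survey; which() is used so that rows with a missing health code are dropped rather than carried along as NA.

```r
set.seed(2)
# Invented data: a 1/2 health code and 0-10 happiness scores.
toy <- data.frame(
  RECODEDHealth = rep(c(1, 2), times = c(150, 50)),
  MCZ_3 = c(
    sample(0:10, 150, replace = TRUE, prob = c(1, 1, 1, 1, 2, 3, 4, 6, 8, 6, 5)),
    sample(0:10, 50,  replace = TRUE, prob = c(3, 3, 3, 4, 5, 6, 5, 4, 3, 2, 1))
  )
)

# Happiness scores for each health group.
good <- toy$MCZ_3[which(toy$RECODEDHealth == 1)]
bad  <- toy$MCZ_3[which(toy$RECODEDHealth == 2)]

# Ties in whole-number scores trigger a warning in some R versions.
by_group <- suppressWarnings(ks.test(good, bad))
by_group$statistic
by_group$p.value
```

With the real data, the same split would use ons_mcz in place of toy.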
2.2 Exploring the R Output
One-Sample K–S Test
The R code outlined previously produces the following results:
The results show that D(2,040) = 0.96532, p < .01, meaning that there is a
statistically significant deviation from normality. Therefore, we can reject the null
hypothesis of no deviation from normality in relation to the variable MCZ_3.
People’s level of general happiness is not normally distributed, which confirms the
earlier interpretation of the histogram of MCZ_3.
Two-Sample K–S Test
The R code outlined previously produces the following results:
The results show that D = 0.9667, p < .01, meaning that there is a statistically
significant difference between the two distributions being compared; the guide
interprets this as the levels of general happiness in each group coming from
different distributions. Therefore, we can reject the null hypothesis of no
difference in distributions between those who rate their health as “good” and
those who rate their health as “bad.”
3 Your Turn
Download this sample dataset and see whether you can replicate these results.
The sample dataset also includes another variable called MCZ_13, which relates
to how satisfied the respondent was with the balance between the time spent on
their paid job and the time spent on other aspects of life. See whether you can
reproduce the results presented here for the MCZ_3 variable, and then try
producing your own two-sample K–S test, substituting MCZ_13 for MCZ_3 in the
analysis.