Learn to Use the Kolmogorov–Smirnov Test in R With Data From the Opinions and Lifestyle Survey (Well-Being Module) (2015)

© 2019 SAGE Publications, Ltd. All Rights Reserved.

This PDF has been generated from SAGE Research Methods Datasets.

How-to Guide for R

Introduction

In this guide, you will learn how to produce a one-sample and two-sample

Kolmogorov–Smirnov (K–S) test in R using a practical example to illustrate the

process. You will find links to the example dataset, and you are encouraged to

replicate this example. An additional practice example is suggested at the end of

this guide. The example assumes that you have already opened the data file in R.

Contents

1. Kolmogorov–Smirnov Test: One-Sample and Two-Sample

2. An Example in R: General Health and Happiness

2.1 The R Procedure

2.2 Exploring the R Output

3. Your Turn

1 Kolmogorov–Smirnov Test: One-Sample and Two-Sample

This example introduces the K–S test, a test of the equality of two distributions. There are two types of test. The one-sample K–S test compares a sample with a reference distribution; in this guide, it is used to test the normality of a selected continuous variable. The two-sample K–S test is used to test whether two samples have the same distribution. In both cases, the test statistic D is the largest vertical distance between the two cumulative distribution functions being compared.
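The statistic D reported by the K–S test is the largest vertical gap between two cumulative distribution functions. As a minimal sketch of that idea, the code below computes D by hand for two simulated samples and compares it with the value returned by ks.test(). The samples and the object names (x, y, D_manual, D_ks) are assumptions made for illustration and are not part of the survey data.

```r
# Hypothetical data: two simulated samples standing in for two groups.
set.seed(42)
x <- rnorm(200, mean = 0, sd = 1)
y <- rnorm(150, mean = 0.5, sd = 1)

# Empirical cumulative distribution functions of each sample.
Fx <- ecdf(x)
Fy <- ecdf(y)

# Evaluate both ECDFs at every observed value and take the largest gap.
grid <- sort(c(x, y))
D_manual <- max(abs(Fx(grid) - Fy(grid)))

# The same statistic as computed by ks.test() from the stats package.
D_ks <- unname(ks.test(x, y)$statistic)

print(c(manual = D_manual, ks.test = D_ks))
```

The two printed values should agree, which shows that D is simply a distance between two step functions and always lies between 0 and 1.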

2 An Example in R: General Health and Happiness

This example uses a subset of data from the 2015 Opinions and Lifestyle Survey

(Well-Being Module). This extract includes 2,048 respondents. Please note that

the original dataset is larger than this, but it has been “cleaned” to include only

those who have responded to our dependent variable. The two variables we

examine are:

• How is your health in general? (RECODEDHealth)

• Overall, how happy did you feel yesterday? (MCZ_3)

The first variable, RECODEDHealth, is coded 1 if a respondent reports their

health as “good” and 2 if “bad.” The second variable, MCZ_3, is not coded

as it is at the interval or continuous level. It is measured on a scale of 0–10,

increasing in happiness. Given that our independent variable is dichotomous and

categorical and our dependent variable is interval or continuous, the K–S test is

an appropriate test for these data.

Data Cleaning

Prior to opening the CSV file used in this example in R and performing the analyses outlined in this guide, you will need to do a small amount of preliminary cleaning to the data file. This is because the file contains two codes, 98 and 99, that represent “missing” data. You need to change these to “NA” (“Not Available”) so that R reads them as missing and does not include them in your analysis, which could skew your results. To change the codes, open the CSV file in a spreadsheet program such as Excel and highlight all variable columns. Then, on the “Home” toolbar, click on “Find & Select” and choose “Replace” from the dropdown menu. In the “Find and Replace” dialog box that opens, insert “98” into “Find what” and “NA” into “Replace with.” If the dialog offers a “Match entire cell contents” option, tick it so that only cells containing exactly 98 are changed. Then click “Replace All.” This replaces all the 98 values with “NA.” Repeat the process for the other missing code (i.e., 99). Once you have done this, save the file and then proceed as outlined in this guide.
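The same recoding can also be done inside R, which avoids editing the file by hand. The sketch below is one possible approach, not the procedure used to produce the results in this guide. It builds a small made-up file (the id column and all values are hypothetical) and shows two options: declaring 98 and 99 as missing while reading, or recoding selected columns afterwards.

```r
# Hypothetical miniature file standing in for the survey extract.
demo <- data.frame(
  id            = c(101, 198, 103, 104, 105),  # made-up identifier column
  RECODEDHealth = c(1, 2, 98, 1, 99),
  MCZ_3         = c(8, 98, 7, 10, 99)
)
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

# Option 1: treat 98 and 99 as missing while reading.
# This applies to every column, but only to whole fields (198 is untouched).
ons_demo <- read.csv(tmp, na.strings = c("NA", "98", "99"))

# Option 2: read the file as it is, then recode only the named columns.
ons_raw <- read.csv(tmp)
vars <- c("RECODEDHealth", "MCZ_3")
ons_raw[vars] <- lapply(ons_raw[vars],
                        function(v) replace(v, v %in% c(98, 99), NA))

print(ons_demo)
print(colSums(is.na(ons_raw)))
```

Option 2 is the more cautious choice when a file contains columns, such as identifiers, in which 98 or 99 could be a genuine value.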

2.1 The R Procedure

R is a free, open-source software and computing platform. Unlike many other

statistical software packages, it does not operate with drop-down menus. Instead,

users submit lines of code that execute commands and functions built into R. It is

a good idea to save your code in a simple text file that R users generally refer to

as a script file. We provide a script file with this example that executes all of the

operations described here. If you are new to using R, we suggest you start with the

introduction manual (http://cran.r-project.org/doc/manuals/r-release/R-intro.html).

Another useful introductory guide is Andy Field’s Discovering Statistics Using R

(2012, SAGE).

For this example, we must first load the dataset into R as a data frame so that we can access the variables stored inside the data file. It is possible to import data from a variety of software packages, including SPSS, Stata, Minitab, and Excel. It is best to import data in file formats that are R-friendly, such as tab-delimited text (.txt in Excel or .dat in SPSS) or comma-separated values (.csv). This guide uses a CSV file (dataset-ons-mcz-2015-subset3.csv). R code for importing other file types is widely documented online.

If you are using the CSV file provided with this example, the code looks like this

(assuming the data file is already saved in your working directory):

ons_mcz <- read.csv("dataset-ons-mcz-2015-subset3.csv")

The code reads in the dataset and assigns it to an object named “ons_mcz.”

Prior to running the K–S test, or indeed any statistical test, it is good practice to examine each variable on its own; this is univariate analysis. This gives us an opportunity to describe the variable and get an initial “feel” for our data. We can start by running descriptive statistics, including measures of central tendency, for MCZ_3. To do this, we first need to load the pastecs package (you may have to install it first if you have not used it before) and then use the following code:

library(pastecs)

stat.desc(ons_mcz$MCZ_3)

round(stat.desc(ons_mcz$MCZ_3), digits = 3)

The first line of code loads the package, the second generates the descriptive statistics for the variable, and the third (using the round() function) sets the number of decimal places that we want. The code generates the following results (NB some of the output has been removed to save space):

The mean is 7.62 and the median is 8, suggesting that the majority of the respondents are “happy.” If the data were normally distributed, then from the standard deviation we would expect about two thirds of scores to fall between 5.43 and 9.81, suggesting that the majority of respondents are “happy” to “very happy.” We can also visually assess the data by running a histogram of the variable; the code and results are shown here:

hist(ons_mcz$MCZ_3)

The data do not appear to be normally distributed.
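The interval of 5.43 to 9.81 quoted earlier is the mean plus or minus one standard deviation. The short sketch below retraces that arithmetic. The standard deviation is not stated in the text, so it is inferred here from the reported interval and should be treated as an assumption.

```r
# Values quoted in the text; the standard deviation is inferred from the
# reported interval, (9.81 - 5.43) / 2, and is therefore an assumption.
m <- 7.62
s <- (9.81 - 5.43) / 2

lower <- m - s
upper <- m + s

# Share of a normal distribution lying within one SD of the mean.
share <- pnorm(1) - pnorm(-1)

# Two SDs above the mean, for comparison with the scale maximum of 10.
two_sd_upper <- m + 2 * s

print(round(c(sd = s, lower = lower, upper = upper,
              share = share, two_sd_upper = two_sd_upper), 3))
```

Under these assumed values, two standard deviations above the mean is 12, beyond the top of the 0–10 scale. A normal curve with this mean and spread would therefore place some of its mass on scores that cannot occur, which is in line with the impression given by the histogram.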

We should then run a frequency table of RECODEDHealth; to do this, we need to load the summarytools package (you may have to install it first if you have not used it before). The code and results for our data are shown here:

library("summarytools")

freq(ons_mcz$RECODEDHealth, order = "freq")

The majority of the respondents (76.8%) rate their health as “good,” while only

23.2% of respondents rate their health as “bad.” It should be noted that there are

41 missing cases. This variable has two groups of unequal size.
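If the summarytools package is not available, a comparable table can be built with base R. The sketch below uses a short made-up vector of health codes (the values are hypothetical, not the survey data) to show counts including missing cases and percentages based on valid cases only.

```r
# Hypothetical health codes (1 = "good", 2 = "bad") with missing values.
health <- c(1, 1, 2, 1, NA, 1, 2, 1, 1, NA)

# Counts for each code, with missing cases shown as their own category.
counts <- table(health, useNA = "ifany")

# Percentages of valid (non-missing) cases, rounded to one decimal place.
valid_pct <- round(100 * prop.table(table(health)), 1)

print(counts)
print(valid_pct)
```

With the real data, ons_mcz$RECODEDHealth would take the place of the health vector.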

Now that we have examined our variables individually, we can run K–S tests to assess the normality of MCZ_3 and to compare the distribution of happiness between the two health groups.

One-Sample K–S Test

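To run a one-sample K–S test in R, use the ks.test() function. It is part of the stats package, which is loaded automatically when R starts, so no additional package is needed:

ks.test(ons_mcz$MCZ_3, "pnorm")

Note that when “pnorm” is given without a mean and standard deviation, the variable is compared with a standard normal distribution (mean 0, standard deviation 1).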
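The choice of reference distribution has a large effect on the one-sample test. As a hedged illustration, the sketch below runs ks.test() twice on simulated 0–10 scores (a hypothetical stand-in for MCZ_3, not the survey data): once against the standard normal, and once against a normal curve with the sample's own mean and standard deviation. Because those parameters are estimated from the same data, the p value from the second call is only approximate; a Lilliefors-type correction, such as lillie.test() in the nortest package, is often used in that situation.

```r
# Hypothetical 0-10 scores standing in for MCZ_3 (not the survey data).
set.seed(2015)
score <- pmin(pmax(round(rnorm(500, mean = 7.6, sd = 2.2)), 0), 10)

# Comparison with the standard normal, N(0, 1): the default for "pnorm".
# Warnings about ties are suppressed because the scores are whole numbers.
ks_std <- suppressWarnings(ks.test(score, "pnorm"))

# Comparison with a normal curve that has the sample's own mean and SD.
ks_fit <- suppressWarnings(
  ks.test(score, "pnorm", mean = mean(score), sd = sd(score))
)

print(c(D_standard = unname(ks_std$statistic),
        D_fitted   = unname(ks_fit$statistic)))
```

The first D is close to 1 simply because almost all scores lie well above zero, whereas the second D reflects how far the shape of the distribution departs from a bell curve.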

Two-Sample K–S Test

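To run a two-sample K–S test in R, use the same ks.test() function from the stats package, passing a second sample instead of a distribution name:

ks.test(ons_mcz$MCZ_3, ons_mcz$RECODEDHealth)

Note that ks.test() treats its first two arguments as the two samples to be compared. As written, this call compares the MCZ_3 scores with the RECODEDHealth codes themselves. To compare happiness between respondents in “good” and “bad” health, the two samples should be the MCZ_3 scores of each group.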
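One way to compare happiness between the two health groups is to split the happiness scores by health code and pass the two resulting vectors to ks.test(). The sketch below illustrates this with simulated data; the demo data frame and its values are hypothetical, and only the column names are borrowed from the guide.

```r
# Hypothetical stand-in for the survey extract (not the real data).
set.seed(99)
n <- 400
demo <- data.frame(
  RECODEDHealth = sample(c(1, 2, NA), n, replace = TRUE,
                         prob = c(0.75, 0.23, 0.02))
)
group_mean <- ifelse(demo$RECODEDHealth %in% 2, 6.5, 8)
demo$MCZ_3 <- pmin(pmax(round(rnorm(n, mean = group_mean, sd = 2)), 0), 10)

# Happiness scores for each health group; which() drops missing codes.
good <- demo$MCZ_3[which(demo$RECODEDHealth == 1)]
bad  <- demo$MCZ_3[which(demo$RECODEDHealth == 2)]

# Two-sample K-S test comparing the two groups' score distributions.
# Warnings about ties are suppressed because the scores are whole numbers.
ks_groups <- suppressWarnings(ks.test(good, bad))
print(ks_groups)
```

With the real data, ons_mcz would take the place of demo. Because the scale has only 11 possible values, ties are unavoidable and the p value should be read as approximate.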

2.2 Exploring the R Output

One-Sample K–S Test

The R code outlined previously produces the following results:

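The results show that D(2,040) = 0.96532, p < .01, meaning that there is a statistically significant deviation from the reference distribution. Therefore, we can reject the null hypothesis of no deviation from normality in relation to the variable MCZ_3. People’s level of general happiness is not normally distributed, which is consistent with the earlier interpretation of the histogram of MCZ_3. Bear in mind that the call compared MCZ_3 with a standard normal distribution (mean 0, standard deviation 1), so the very large value of D mainly reflects the fact that scores on a 0–10 scale lie far from zero, rather than the shape of the distribution alone.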

Two-Sample K–S Test

The R code outlined previously produces the following results:

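The results show that D = 0.9667, p < .01, meaning that there is a statistically significant difference between the two distributions that were compared. On this basis, we would reject the null hypothesis of no difference in distributions between those who rate their health as “good” and those who rate their health as “bad.” However, this conclusion should be treated with caution: the call passed MCZ_3 and RECODEDHealth as the two samples, so D measures the distance between the happiness scores (0–10) and the health codes (1 and 2), not between the happiness scores of the two health groups.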

3 Your Turn

Download this sample dataset and see whether you can replicate these results. The sample dataset also includes another variable called MCZ_13, which relates to how satisfied the respondent was with the balance between the time spent on their paid job and the time spent on other aspects of life. See whether you can reproduce the results presented here for the MCZ_3 variable, and then try producing your own two-sample K–S test, substituting MCZ_13 for MCZ_3 in the analysis.
