Introduction to R: Lesson 2 - Manipulating Data

Andrew Jaffe

9/13/10

Reminder

Here is the course website:

http://www.biostat.jhsph.edu/~ajaffe/rseminar.html

There is a running collection of functions that we have covered in class

Dataset

For the remaining sessions, we’re going to learn R by using data from the Baltimore Dog Study

Data collection is ongoing, and the dataset will be updated weekly

http://metrodog.blogspot.com/

Overview

Importing Data
Examining Data
Recoding Variables
Exporting Data

Importing Data

Here is a link to the data:

http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv

So how do we get it into R? Two options! Both involve read.table()

Importing Data

read.table(filename, header = F, sep = "", as.is = !stringsAsFactors, …)

In a function signature, "…" means that additional parameters can be passed.

These are only some of the options for this function – the full list can be seen by typing ?read.table in the console

Importing Data

filename: the path to your file, in quotes

If you give a full path (e.g. "C:\\Docs\\data.txt" on a PC or "/Users/Andrew/data.txt" on a Mac), R reads the file from that location. If you give only a file name (e.g. "data.txt"), R looks for the file in your working directory.

On PCs, backslashes in a path must be doubled (e.g. "C:\\Docs\\data.txt"); forward slashes ("C:/Docs/data.txt") also work.

This is because a single backslash is the 'escape' character inside a quoted string
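A minimal sketch of how the escaping works. The path is hypothetical and used only for illustration; no file is read.

```r
# Hypothetical Windows path, used only to show how escaping works.
win_path <- "C:\\Docs\\data.txt"

# cat() prints the string as R stores it: one backslash per separator.
cat(win_path, "\n")

# Each "\\" typed in the script counts as a single character.
n_chars <- nchar(win_path)

# Forward slashes need no escaping.
alt_path <- "C:/Docs/data.txt"

# file.path() builds a path from its pieces, joined with "/".
built_path <- file.path("C:", "Docs", "data.txt")
```

Building paths with file.path() avoids typing separators by hand.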

Importing Data

filename - you can:

Write out the full file path, using quotes and the correct syntax

Manually set your working directory to where your script and files are located [setwd()]

Or, if your script and files are in the same place, use Notepad++, which sets the script's location to be the working directory
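The working-directory option can be sketched as follows. Here tempdir() stands in for the folder that holds your script, and "example.csv" is a made-up file name.

```r
old_wd <- getwd()          # remember where we started
script_dir <- tempdir()    # stand-in for the folder holding your script
setwd(script_dir)

# A file written with a bare name lands in the working directory...
writeLines("a,b\n1,2", "example.csv")

# ...and can then be found by that bare name.
found <- file.exists("example.csv")

# Clean up and restore the original working directory.
invisible(file.remove("example.csv"))
setwd(old_wd)
```

In your own script you would replace tempdir() with the quoted path to your folder.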

Importing Data

header – default is FALSE

Does the first row of your file contain column names? If so, include 'header = T' in your read.table() call

Importing Data

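sep = "" – what character separates columns? The default, "", means any whitespace.

Whitespace delimiters are written with the escape character: Tab: "\t", Newline/Enter/Return: "\n"

Ordinary characters are written as they are, with no backslash: Comma: ",", Ampersand: "&", etc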

Importing Data

CSV is a special case

A wrapper around read.table() exists: read.csv(), which takes all of the same parameters but defaults to sep = "," and header = TRUE

Analogously, read.delim() defaults to sep = "\t"
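A small sketch comparing the three readers. The text= argument is used so the example needs no file on disk; the three data rows are invented for illustration and are not the course dataset.

```r
# A tiny comma-separated table held in a string (illustrative values).
csv_text <- "id,age,sex\n1,40,F\n2,36,M"

# read.table() needs the separator and header spelled out.
a <- read.table(text = csv_text, header = TRUE, sep = ",")

# read.csv() supplies sep = "," and header = TRUE by default.
b <- read.csv(text = csv_text)

# Both calls should give the same data frame.
same <- isTRUE(all.equal(a, b))

# The same table, tab-delimited, read with read.delim().
tab_text <- "id\tage\tsex\n1\t40\tF\n2\t36\tM"
d <- read.delim(text = tab_text)
dim(d)
```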

Importing data

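as.is = F (equivalent to stringsAsFactors = T): should character strings be treated as factors? This was the default before R 4.0.0; from 4.0.0 on, strings stay as characters unless you ask otherwise.

I prefer character strings as characters (i.e. as.is = T) and not factors

Easier to manipulate, search, and match

You can always change to factors later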
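To see the difference, here is a short sketch on made-up rows, read from a string rather than a file. Both arguments are set explicitly so the result does not depend on the R version.

```r
csv_text <- "id,sex\n1,F\n2,M\n3,F"

# Keep strings as characters.
chr <- read.csv(text = csv_text, as.is = TRUE)

# Convert strings to factors on import.
fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(chr$sex)
class(fac$sex)

# Changing to a factor later is a one-liner.
sex_factor <- factor(chr$sex)
levels(sex_factor)
```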

Importing Data

Let's open up a new script:

Notepad++: File > New

Mac: File > New Document

Save it somewhere you can find later

Write a header (using #)

If on a Mac, use setwd() and give it the folder where you put the script

Importing Data

Let's get our data into R

Option 1 (remember ‘scan’ from last session?): use the URL as the file name

file = "http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv"

Importing Data

Option 2: Right click on the link to the data on the webpage, and save it as a csv file in the same folder as your script

file = "lecture_2_data.csv"

Importing Data

Either way:

dat <- read.csv(file, header = T, as.is=T)

Examining Data

What are the dimensions of the dataset?

> dim(dat)
[1] 1000    7

The first value is the number of rows, the second the number of columns.

Examining Data

What variables are included? What are their names?

Examining Data


> head(dat)
  id age sex height weight dog dog_type
1  1  40   F   63.5  134.5  no     <NA>
2  2  36   M   65.6  191.6  no     <NA>
3  3  69   M   68.2  170.0  no     <NA>
4  4  56   F   62.9  134.5  no     <NA>
5  5  66   F   63.7  133.4  no     <NA>
6  6  84   M   70.8  200.6  no     <NA>

Examining Data


> names(dat)
[1] "id"       "age"      "sex"      "height"
[5] "weight"   "dog"      "dog_type"

Examining Data

What class of data is 'id'? 'dog_type'?

> class(dat$id)
[1] "integer"
> class(dat$dog_type)
[1] "character"

Examining Data

What class of data is 'id'? 'dog_type'?

> str(dat)
'data.frame':   1000 obs. of  7 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int  40 36 69 56 66 84 40 73 76 38 ...
 $ sex     : chr  "F" "M" "M" "F" ...
 $ height  : num  63.5 65.6 68.2 62.9 63.7 70.8 67 67 62.6 62.2 ...
 $ weight  : num  134 192 170 134 133 ...
 $ dog     : chr  "no" "no" "no" "no" ...
 $ dog_type: chr  NA NA NA NA ...
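If you do not have the dataset to hand, the same checks can be tried on a small stand-in. The sketch below rebuilds four columns from the first three rows printed by head(dat); the object name 'toy' is made up.

```r
# Stand-in data frame built from the first three rows shown by head(dat).
toy <- data.frame(
  id     = 1:3,
  age    = c(40L, 36L, 69L),
  sex    = c("F", "M", "M"),
  height = c(63.5, 65.6, 68.2),
  stringsAsFactors = FALSE
)

# dim() is just nrow() and ncol() together.
n_rows <- nrow(toy)
n_cols <- ncol(toy)

# class() of every column at once, as a named character vector.
col_classes <- sapply(toy, class)
col_classes
```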

Examining Data

How many total participants are there?

How many men and how many women?

Examining Data


> length(unique(dat$id))
[1] 1000

> unique(c(1,1,2,2,3))
[1] 1 2 3
> length(unique(c(1,1,2,2,3)))
[1] 3
> length(c(1,1,2,2,3))
[1] 5

Examining Data


> table(dat$sex)

  F   M
493 507

Examining Data

How many people have dogs?

> table(dat$dog)

 no yes
518 482

Examining Data

How many different types of dogs are there? How many of each?

> table(dat$dog_type)

    husky       lab    poodle retriever
      113       125       111       133
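The four counts add up to 482, the number of dog owners, because table() drops NA entries by default. A short sketch on a made-up vector shows how to count the missing values as well:

```r
# Made-up vector: two of the six entries are missing.
dog_type <- c("lab", NA, "poodle", "lab", NA, "husky")

# Default: NA entries are left out of the counts.
counts <- table(dog_type)
counts

# useNA = "ifany" adds a column for NA when any are present.
counts_na <- table(dog_type, useNA = "ifany")
counts_na
```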

Recoding Data

Missingness: represented by 'NA' [default]

read.table(…, na.strings = "NA", …) – you can change this based on your data

'NA' is NOT a character string:

> x = rep(NA,3)
> x
[1] NA NA NA
> class(x)
[1] "logical"

Recoding Data

NA values carry through element-wise operations, but summary functions such as mean() return NA unless you set na.rm = TRUE

> x = c(NA, 1, NA, 3, 4)
> x*2
[1] NA  2 NA  6  8
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 2.666667

Recoding Data

is.na() tests for missing entries

Returns TRUE or FALSE at each entry

> x = c(NA, 1, NA, 3, 4)
> x
[1] NA  1 NA  3  4
> class(x)
[1] "numeric"
> is.na(x)
[1]  TRUE FALSE  TRUE FALSE FALSE

Recoding Data

which() returns the indices for entries that are TRUE

> which(is.na(x))
[1] 1 3

Recoding Data

'!' means 'not':

> which(!is.na(x))
[1] 2 4 5
> x
[1] NA  1 NA  3  4
> Index = which(!is.na(x))
> x[Index]
[1] 1 3 4

Recoding Data

‘which’ is not needed when you subset using ‘is.na’ (or !is.na): a logical vector can index directly

# in one step
> x[!is.na(x)]
[1] 1 3 4
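Because TRUE counts as 1 and FALSE as 0, is.na() also gives a quick way to count missing values. A small sketch using the same x as above:

```r
x <- c(NA, 1, NA, 3, 4)

# Number and proportion of missing entries.
n_missing <- sum(is.na(x))
prop_missing <- mean(is.na(x))

# Subsetting with which() and with the logical vector gives the same result.
by_which <- x[which(!is.na(x))]
by_logical <- x[!is.na(x)]
identical(by_which, by_logical)

# Comparing with == does not find NAs: the result is itself NA.
x[1] == NA
```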

Recoding Data

Recoding binary variables – ex: change sex from M/F to 0/1

> head(dat$sex)
[1] "F" "M" "M" "F" "F" "M"
> bin.sex = ifelse(dat$sex=="F",1,0)
> head(bin.sex)
[1] 1 0 0 1 1 0

Recoding Data

?ifelse: ifelse(test, yes, no)

test - an object which can be coerced to logical mode (i.e. TRUE or FALSE)

yes - return values for true elements of test

no - return values for false elements of test
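One detail worth knowing: where test is NA, ifelse() returns NA rather than the 'no' value. A minimal sketch on a made-up vector:

```r
# Made-up vector with one missing value.
sex <- c("F", "M", NA, "F")

# NA == "F" is NA, so the third result is NA, not 0.
bin_sex <- ifelse(sex == "F", 1, 0)
bin_sex
```

This means missing values stay missing after recoding, instead of being silently counted in the 0 group.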

Recoding Data

Logical (comparison) operators: ==, !=, <, >, <=, >=

Also: is.[type] – i.e. is.na, is.character, is.data.frame, is.numeric, etc…

> x = c(1,3,7,9)
> x > 3
[1] FALSE FALSE  TRUE  TRUE
> x == 3
[1] FALSE  TRUE FALSE FALSE

Recoding Data

> bin.sex = ifelse(dat$sex=="F",1,0)
> head(dat$sex == "F")
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
> head(bin.sex)
[1] 1 0 0 1 1 0

Recoding Data

Analogously, creating a cut-point in continuous data:

> head(dat$age)
[1] 40 36 69 56 66 84
> bin.age = ifelse(dat$age < 50, 0, 1)
> head(bin.age)
[1] 0 0 1 1 1 1
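It is a good habit to check a recode against the original variable. The sketch below uses the six ages shown above; as.integer() on a logical vector is shown as one alternative way to get the same 0/1 coding, not as the method used in class.

```r
# The six ages printed by head(dat$age).
age <- c(40, 36, 69, 56, 66, 84)

bin_age <- ifelse(age < 50, 0, 1)

# Alternative: TRUE becomes 1 and FALSE becomes 0.
bin_age2 <- as.integer(age >= 50)
all(bin_age == bin_age2)

# Cross-tabulate the recode against the condition it came from;
# every observation should fall on one diagonal.
check <- table(bin_age, under_50 = age < 50)
check
```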

Exporting Data

write.table(data, filename, quote = T, row.names = T, col.names = T, sep = " ")

'data' is an R object – i.e. 'dat' in our case

'filename' works as in read.table – you should include '.txt' in the filename

'quote' puts character strings in quotes (I like setting that to FALSE [or F])

Exporting Data

row.names: include the row names in the output file? These are usually just a sequence from 1 to nrow(dat) – I prefer FALSE, as Excel already shows row indices

col.names: include the header names in the output file? Depending on the data, I usually use TRUE
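A round-trip sketch of these settings, under the assumption that a tab-delimited file is wanted. It writes a small made-up data frame to a temporary file and reads it back; 'toy' and 'out_file' are illustrative names.

```r
# Small made-up data frame to export.
toy <- data.frame(id = 1:3, sex = c("F", "M", "M"),
                  stringsAsFactors = FALSE)

# Temporary file so nothing is left behind in your folders.
out_file <- tempfile(fileext = ".txt")

# No quotes around strings, no row names, header included, tab-delimited.
write.table(toy, out_file, quote = FALSE, row.names = FALSE,
            col.names = TRUE, sep = "\t")

# Look at the raw lines that were written.
written <- readLines(out_file)
written

# Reading it back with matching settings recovers the data frame.
back <- read.table(out_file, header = TRUE, sep = "\t", as.is = TRUE)
```

Reading a file back in straight after writing it is a cheap way to confirm the export settings did what you intended.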

Practice

Make a 2 x 2 table of sex and dog

Create a 'BMI' variable using height and weight

Hint: BMI = weight[lbs]*703/(height[in])^2

Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise
