Introduction to R: Lesson 2 - Manipulating Data Andrew Jaffe 9/13/10


Page 1: Introduction to R:  Lesson 2 - Manipulating Data

Introduction to R: Lesson 2 - Manipulating Data

Andrew Jaffe

9/13/10

Page 2: Introduction to R:  Lesson 2 - Manipulating Data

Reminder

Here is the course website:

http://www.biostat.jhsph.edu/~ajaffe/rseminar.html

There is a running collection of functions that we have covered in class

Page 3: Introduction to R:  Lesson 2 - Manipulating Data

Dataset

For the remaining sessions, we’re going to learn R by using data from the Baltimore Dog Study

Data collection is ongoing, and dataset will be updated weekly

http://metrodog.blogspot.com/

Page 4: Introduction to R:  Lesson 2 - Manipulating Data

Overview

Importing Data
Examining Data
Recoding Variables
Exporting Data

Page 5: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

Here is a link to the data:

http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv

So how do we get it into R? Two options! Both involve read.table()

Page 6: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

read.table(filename, header = F, sep = "", as.is = !stringsAsFactors, …)

In functions, "…" means additional parameters can be passed/used.

These are some of the options associated with this function. All of them can be seen by typing ?read.table in the console.

Page 7: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

filename: the path to your file, in quotes.

A full path looks like "C:\\Docs\\data.txt" on a PC or "/Users/Andrew/data.txt" on a Mac. If no path is specified (e.g. "data.txt"), R will look in your working directory for the file.

For PCs, you need double backslashes to designate paths (e.g. "C:\\Docs\\data.txt"), because a single backslash is the 'escape' character inside an R string. Forward slashes ("C:/Docs/data.txt") also work.
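A minimal sketch of the escaping rule, using the hypothetical path from the slide (the file does not need to exist for this to run):

```r
# Hypothetical Windows path from the slide; nothing is read from disk here.
win_path <- "C:\\Docs\\data.txt"   # each \\ typed here is ONE backslash in the string

# cat() prints the string as R stores it, with single backslashes
cat(win_path, "\n")

# The doubled backslashes are not stored twice: count the characters
n_chars <- nchar(win_path)

# Forward slashes need no escaping and are also accepted by R on Windows
fwd_path <- "C:/Docs/data.txt"
file_name <- basename(fwd_path)    # strips the folder part of the path
```

Here nchar() counts 16 characters, the same as the path written with single backslashes, which shows that the extra backslash exists only in the source code.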

Page 8: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

filename - you can:

Write out the full file path using quotes and the correct syntax

Manually set your working directory to where your script and files are located [setwd()]

Or, if your script and files are in the same place, use Notepad++. It sets the script's location to be the working directory
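The working-directory option can be sketched as follows. The folder (tempdir()) and the file name example.csv are stand-ins chosen for this illustration, not part of the course material:

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() stands in for the folder holding your script and data (an assumption)
setwd(tempdir())
writeLines("x,y\n1,2", "example.csv")   # hypothetical file created just for this demo

# A bare file name is now resolved relative to the working directory
found <- file.exists("example.csv")

# Clean up and restore the original working directory
unlink("example.csv")
setwd(old_wd)
```

Restoring the old directory at the end is a habit worth keeping in scripts, so that later relative paths still point where you expect.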

Page 9: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

header – default is FALSE for read.table()

Does the first row of your file contain column names? If so, include 'header = T' in your read.table() call

Page 10: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

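sep = "" – what character separates columns? The default, "", means any whitespace (spaces, tabs, or newlines).

Special characters are written with the escape character (backslash): Tab: "\t", Newline/Enter/Return: "\n"

Ordinary characters are written as they are: Comma: ",", Ampersand: "&", etc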
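As a small illustration of the sep argument, the sketch below reads two made-up inline "files" through the text argument of read.table(), so no file on disk is needed. The data values are invented for the example:

```r
# Two small inline "files" with the same content but different delimiters
tab_text   <- "id\tage\n1\t40\n2\t36"
space_text <- "id age\n1   40\n2   36"

# Tab-delimited: name the delimiter explicitly
tab_dat <- read.table(text = tab_text, header = TRUE, sep = "\t")

# Default sep = "" splits on any run of whitespace
space_dat <- read.table(text = space_text, header = TRUE)

# Both calls produce a data frame with 2 rows and 2 columns
dim(tab_dat)
dim(space_dat)
```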

Page 11: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

CSV is a special case

A wrapper around read.table() exists: read.csv(), which takes all of the same parameters but defaults to sep = "," and header = TRUE

Analogously, read.delim() defaults to sep = "\t"
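To see the relationship between the two functions, here is a sketch on a tiny invented dataset. For simple input like this, read.csv() and an explicit read.table() call should give the same data frame:

```r
# Made-up comma-separated text, read through the text argument
csv_text <- "id,dog\n1,yes\n2,no"

# read.csv() fills in sep = "," and header = TRUE for you
via_csv <- read.csv(text = csv_text, as.is = TRUE)

# The same request spelled out with read.table()
via_table <- read.table(text = csv_text, header = TRUE, sep = ",", as.is = TRUE)

# Compare the two results
same <- isTRUE(all.equal(via_csv, via_table))
```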

Page 12: Introduction to R:  Lesson 2 - Manipulating Data

Importing data

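as.is = F (equivalent to stringsAsFactors = T): character strings are converted to factors. This was the default when these slides were written; since R 4.0.0, stringsAsFactors defaults to FALSE.

I prefer character strings as characters (i.e. as.is = T) and not factors:

Easier to manipulate, search, and match

You can always change to factors later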
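The difference between the two settings can be checked on a small invented dataset. Both arguments are set explicitly here so the result does not depend on the R version's default:

```r
# Made-up data, read through the text argument so no file is needed
csv_text <- "name,dog\nAnn,yes\nBob,no"

# Keep strings as character vectors
as_chr <- read.csv(text = csv_text, as.is = TRUE)

# Convert strings to factors
as_fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$dog)   # character
class(as_fac$dog)   # factor

# Converting in either direction later is a one-liner
later_factor <- factor(as_chr$dog)
back_to_chr  <- as.character(as_fac$dog)
```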

Page 13: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

Let's open up a new script:

Notepad++: File > New

Mac: File > New Document

Save it somewhere you can find later

Write a header (using #)

If Mac, use setwd() and include the folder where you put the script

Page 14: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

Let's get our data into R

Option 1: remember 'scan' from last session? Point R directly at the URL:

file = "http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv"

Page 15: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

Option 2: Right click on the link to the data on the webpage, and save it as a csv file in the same folder as your script

file = "lecture_2_data.csv"

Page 16: Introduction to R:  Lesson 2 - Manipulating Data

Importing Data

Either way:

dat <- read.csv(file, header = T, as.is=T)

Page 17: Introduction to R:  Lesson 2 - Manipulating Data

Overview

Importing Data
Examining Data
Recoding Variables
Exporting Data

Page 18: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What are the dimensions of the dataset?

Page 19: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What are the dimensions of the dataset?

> dim(dat)

[1] 1000    7
(Rows, Columns)

Page 20: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What variables are included? What are their names?

Page 21: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What variables are included? What are their names?

> head(dat)
  id age sex height weight dog dog_type
1  1  40   F   63.5  134.5  no     <NA>
2  2  36   M   65.6  191.6  no     <NA>
3  3  69   M   68.2  170.0  no     <NA>
4  4  56   F   62.9  134.5  no     <NA>
5  5  66   F   63.7  133.4  no     <NA>
6  6  84   M   70.8  200.6  no     <NA>

Page 22: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What variables are included? What are their names?

> names(dat)
[1] "id"       "age"      "sex"      "height"
[5] "weight"   "dog"      "dog_type"

Page 23: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What class of data is 'id'? 'dog_type'?

Page 24: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What class of data is 'id'? 'dog_type'?

> class(dat$id)
[1] "integer"
> class(dat$dog_type)
[1] "character"

Page 25: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

What class of data is 'id'? 'dog_type'?

> str(dat)
'data.frame': 1000 obs. of 7 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int  40 36 69 56 66 84 40 73 76 38 ...
 $ sex     : chr  "F" "M" "M" "F" ...
 $ height  : num  63.5 65.6 68.2 62.9 63.7 70.8 67 67 62.6 62.2 ...
 $ weight  : num  134 192 170 134 133 ...
 $ dog     : chr  "no" "no" "no" "no" ...
 $ dog_type: chr  NA NA NA NA ...

Page 26: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many total participants are there?

How many men and how many women?

Page 27: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many total participants are there?

How many men and how many women?

> length(unique(dat$id))
[1] 1000

> unique(c(1,1,2,2,3))
[1] 1 2 3
> length(unique(c(1,1,2,2,3)))
[1] 3
> length(c(1,1,2,2,3))
[1] 5

Page 28: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many total participants are there?

How many men and how many women?

> table(dat$sex)

  F   M
493 507

Page 29: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many people have dogs?

Page 30: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many people have dogs?

> table(dat$dog)

 no yes
518 482

Page 31: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many different types of dogs are there? How many of each?

Page 32: Introduction to R:  Lesson 2 - Manipulating Data

Examining Data

How many different types of dogs are there? How many of each?

> table(dat$dog_type)

    husky       lab    poodle retriever
      113       125       111       133
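Note that table() leaves out missing values by default, which is why the dog-type counts cover only the dog owners. A minimal sketch on an invented vector (not the course data) shows how the useNA argument of table() makes the missing entries visible:

```r
# Made-up dog types; NA marks people without a dog
dog_type <- c("lab", NA, "husky", "lab", NA, NA)

# Default behaviour: NA entries are dropped from the counts
counts <- table(dog_type)

# useNA = "ifany" adds a column for NA when any are present
counts_na <- table(dog_type, useNA = "ifany")

# Number of distinct (non-missing) dog types
n_types <- length(unique(dog_type[!is.na(dog_type)]))
```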

Page 33: Introduction to R:  Lesson 2 - Manipulating Data

Overview

Importing Data
Examining Data
Recoding Variables
Exporting Data

Page 34: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

Missingness: represented by 'NA' [default]

read.table(…, na.strings = "NA", …) – you can change based on your data

'NA' is NOT a character string:

> x = rep(NA,3)
> x
[1] NA NA NA
> class(x)
[1] "logical"

Page 35: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

NA values carry through element-wise operations, and summary functions such as mean() return NA unless you ask them to drop missing values (na.rm = TRUE)

> x = c(NA, 1, NA, 3, 4)
> x*2
[1] NA  2 NA  6  8
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 2.666667

Page 36: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

is.na() tests for missing entries

Returns TRUE or FALSE at each entry

> x = c(NA, 1, NA, 3, 4)
> x
[1] NA  1 NA  3  4
> class(x)
[1] "numeric"
> is.na(x)
[1]  TRUE FALSE  TRUE FALSE FALSE

Page 37: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

which() returns the indices for entries that are TRUE

> which(is.na(x))
[1] 1 3

Page 38: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

'!' means 'not':

> which(!is.na(x))
[1] 2 4 5
> x
[1] NA  1 NA  3  4
> Index = which(!is.na(x))
> x[Index]
[1] 1 3 4

Page 39: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

‘which’ is implicit when you subset using ‘is.na’ (or !is.na)

# in one step
> x[!is.na(x)]
[1] 1 3 4
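The two subsetting routes can be compared directly, and the same logical vector also gives a quick count of missing entries. This is a small sketch using the vector from the slides:

```r
x <- c(NA, 1, NA, 3, 4)

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing entries
n_missing <- sum(is.na(x))

# Subsetting with explicit indices from which() ...
by_which <- x[which(!is.na(x))]

# ... and subsetting with the logical vector directly
by_logical <- x[!is.na(x)]

# Both routes keep the same non-missing values
identical(by_which, by_logical)
```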

Page 40: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

Recoding binary variables – ex: change sex from M/F to 0/1

> head(dat$sex)
[1] "F" "M" "M" "F" "F" "M"
> bin.sex = ifelse(dat$sex=="F",1,0)
> head(bin.sex)
[1] 1 0 0 1 1 0

Page 41: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

?ifelse: ifelse(test, yes, no)

test - an object which can be coerced to logical mode (i.e. TRUE or FALSE)

yes - return values for true elements of test

no - return values for false elements of test
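One detail worth knowing when recoding: where the test is NA, ifelse() returns NA rather than the 'no' value. A short sketch on an invented vector (not the course data):

```r
# Made-up values with one missing entry
sex <- c("F", "M", NA, "F")

# The comparison itself yields NA for the missing entry
test <- sex == "F"

# ifelse() returns 'yes' where test is TRUE, 'no' where FALSE, NA where NA
bin_sex <- ifelse(test, 1, 0)
bin_sex
```

So a missing sex stays missing after recoding instead of silently becoming 0.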

Page 42: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

Logical operators: ==, !=, <, >, <=, >=

Also: is.[type] – i.e. is.na, is.character, is.data.frame, is.numeric, etc…

> x = c(1,3,7,9)
> x > 3
[1] FALSE FALSE  TRUE  TRUE
> x == 3
[1] FALSE  TRUE FALSE FALSE

Page 43: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

> bin.sex = ifelse(dat$sex=="F",1,0)
> head(dat$sex == "F")
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
> head(bin.sex)
[1] 1 0 0 1 1 0

Page 44: Introduction to R:  Lesson 2 - Manipulating Data

Recoding Data

Analogously, creating a cut-point in continuous data:

> head(dat$age)
[1] 40 36 69 56 66 84
> bin.age = ifelse(dat$age < 50, 0, 1)
> head(bin.age)
[1] 0 0 1 1 1 1

Page 45: Introduction to R:  Lesson 2 - Manipulating Data

Overview

Importing Data
Examining Data
Recoding Variables
Exporting Data

Page 46: Introduction to R:  Lesson 2 - Manipulating Data

Exporting Data

write.table(data, filename, quote = T, row.names = T, col.names = T, sep = " ")

'data' is an R object – i.e. 'dat' in our case

'filename' is similar to read.table – you should include a '.txt' in the filename

'quote' puts character strings in quotes (I like setting that to be FALSE [or F])

Page 47: Introduction to R:  Lesson 2 - Manipulating Data

Exporting Data

row.names: include the row names in the output? These are usually just a sequence from 1 to nrow(dat) – I prefer FALSE, as Excel automatically has row indices

col.names: include the header names in the output file? Depending on the data, I usually use TRUE
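A minimal round-trip sketch of these options, writing a toy data frame to a temporary file and reading it back. The data frame, the file name toy_output.txt, and the tab separator are choices made for this illustration:

```r
# Toy data standing in for 'dat'
toy <- data.frame(id = 1:3, sex = c("F", "M", "F"), stringsAsFactors = FALSE)

# Hypothetical output file in the session's temporary folder
out_file <- file.path(tempdir(), "toy_output.txt")

# No quotes around strings, no row names, header row included, tab-delimited
write.table(toy, out_file, quote = FALSE, row.names = FALSE,
            col.names = TRUE, sep = "\t")

# Inspect the raw lines that were written
written <- readLines(out_file)
written

# Read the file back with matching options
back <- read.table(out_file, header = TRUE, sep = "\t", as.is = TRUE)

# Remove the temporary file
unlink(out_file)
```

Reading the file back with the same sep and header settings is a simple way to confirm the export looks the way you intended.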

Page 48: Introduction to R:  Lesson 2 - Manipulating Data

Practice

Make a 2 x 2 table of sex and dog

Create a 'BMI' variable using height and weight

Hint: BMI = weight[lbs]*703/(height[in])^2

Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise
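As a warm-up for the last two exercises, here is the hint's formula worked on two invented people (not the course data). It is one possible approach; applying it to 'dat' is left as the exercise:

```r
# Made-up measurements for two people
height <- c(63.5, 70.0)     # inches
weight <- c(134.5, 230.0)   # pounds

# BMI from the hint: weight[lbs] * 703 / (height[in])^2, computed element-wise
bmi <- weight * 703 / height^2
round(bmi, 1)

# Cut-point at 30, in the same style as the age example
overweight <- ifelse(bmi > 30, 1, 0)
overweight
```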