Introduction to R: Lesson 2 - Manipulating Data
Andrew Jaffe
9/13/10
Reminder
Here is the course website:
http://www.biostat.jhsph.edu/~ajaffe/rseminar.html
There is a running collection of functions that we have covered in class
Dataset
For the remaining sessions, we’re going to learn R by using data from the Baltimore Dog Study
Data collection is ongoing, and dataset will be updated weekly
http://metrodog.blogspot.com/
Overview
Importing Data, Examining Data, Recoding Variables, Exporting Data
Importing Data
Here is a link to the data:
http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv
So how do we get it into R? Two options! Both involve read.table()
Importing Data
read.table(filename, header = F, sep = "", as.is = !stringsAsFactors, …)
In function definitions, "…" means additional parameters can be passed
These are some of the options associated with this function; all of them can be seen by typing ?read.table in the console
Importing Data
filename: the path to your file, in quotes
If you give only a file name (ie "data.txt") rather than a full path (ie "C:\\Docs\\data.txt" on a PC or "/Users/Andrew/data.txt" on a Mac), R will look for the file in your working directory
On PCs, you need double backslashes to designate paths (ie "C:\\Docs\\data.txt"), because a single backslash is the 'escape' character
Forward slashes (ie "C:/Docs/data.txt") also work on PCs
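A minimal sketch of the path rules above. The folder and file names are made up for illustration, and nothing is read from disk.

```r
# A single backslash starts an escape sequence, so Windows paths need "\\"
win_path <- "C:\\Docs\\data.txt"

# cat() prints the string as R stores it: one backslash per separator
cat(win_path, "\n")

# Forward slashes avoid the escaping issue altogether
alt_path <- "C:/Docs/data.txt"

# file.path() builds a path from its pieces, joining them with "/"
built_path <- file.path("C:", "Docs", "data.txt")

# nchar() counts stored characters: each "\\" is one character, not two
nchar(win_path)
nchar(alt_path)
```

Both strings have the same length, which shows that the doubled backslash is only a typing convention.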
Importing Data
filename: you can
Write out the full file path using quotes and the correct syntax
Manually set your working directory to where your script and files are located [setwd()]
Or, if your script and files are in the same place, use Notepad++, which sets the script's location to be the working directory
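The working-directory option can be sketched as follows. Here tempdir() stands in for the folder that holds your script, and "wd_demo.csv" is a throwaway file invented for the demonstration.

```r
old_wd <- getwd()            # remember where we started
setwd(tempdir())             # tempdir() stands in for your script's folder

# Write a tiny file, then read it back using only its name (no path)
writeLines(c("a,b", "1,2"), "wd_demo.csv")
found <- file.exists("wd_demo.csv")   # R looks in the working directory

demo <- read.csv("wd_demo.csv")

unlink("wd_demo.csv")        # clean up the demo file
setwd(old_wd)                # restore the original working directory
```

Restoring the old working directory at the end keeps the rest of a session unaffected.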
Importing Data
header: default is FALSE
Does the first row of your file contain column names? If so, include 'header = T' in your read.table() call
Importing Data
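sep = "": what character separates columns? (the default, "", means any whitespace)
Special characters are written with the escape character followed by a letter:
Tab: "\t"
Newline/Enter/Return: "\n"
Ordinary characters need no backslash: comma is ",", ampersand is "&" (writing "\&" is an error in R)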
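A small sketch of how sep changes the way the same values are parsed. The text strings are invented for illustration; the text = argument of read.table() lets us parse a string instead of a file.

```r
# Tab-separated text: "\t" is a tab and "\n" ends each line
tab_text <- "id\tage\n1\t40\n2\t36\n"
tab_dat <- read.table(text = tab_text, header = TRUE, sep = "\t")

# The same values separated by ampersands; "&" needs no backslash
amp_text <- "id&age\n1&40\n2&36\n"
amp_dat <- read.table(text = amp_text, header = TRUE, sep = "&")

# Both calls should produce the same two-column data frame
all.equal(tab_dat, amp_dat)
```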
Importing Data
CSV is an exception
A special case of read.table() exists: read.csv(), which takes all of the same parameters, but defaults to sep = "," and header = TRUE
Analogously, read.delim() defaults to sep = "\t" and header = TRUE
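To see that read.csv() and read.delim() are read.table() with different defaults, here is a minimal comparison on invented text (not the course dataset).

```r
csv_text <- "id,dog\n1,no\n2,yes\n"

# read.csv() already assumes a comma separator and a header row
via_csv <- read.csv(text = csv_text, as.is = TRUE)

# The same result from read.table(), spelling those options out
via_table <- read.table(text = csv_text, header = TRUE, sep = ",", as.is = TRUE)
all.equal(via_csv, via_table)

# read.delim() does the same for tab-separated text
tab_text <- "id\tdog\n1\tno\n2\tyes\n"
via_delim <- read.delim(text = tab_text, as.is = TRUE)
all.equal(via_csv, via_delim)
```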
Importing data
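as.is = F (same as stringsAsFactors = T): should character strings be treated as factors?
(Since R 4.0.0 the default is stringsAsFactors = FALSE, so strings stay as characters unless you ask for factors)
I prefer character strings as characters (ie as.is = T) and not factors
Easier to manipulate, search, and match
You can always change to factors later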
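A short sketch of the difference, using invented data. Both options are set explicitly so the result does not depend on which R version's defaults are in effect.

```r
txt <- "id,dog\n1,no\n2,yes\n3,no\n"

# Keep strings as characters
as_char <- read.csv(text = txt, as.is = TRUE)
class(as_char$dog)

# Convert strings to factors on import
as_fact <- read.csv(text = txt, stringsAsFactors = TRUE)
class(as_fact$dog)

# Characters can always be turned into a factor later
dog_factor <- factor(as_char$dog)
levels(dog_factor)
```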
Importing Data
Let's open up a new script:
Notepad++: File > New
Mac: File > New Document
Save it somewhere you can find later
Write a header (using #)
If on a Mac, use setwd() with the folder where you put the script
Importing Data
Let's get our data into R
Option 1: remember 'scan' from last session?
file = "http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv"
Importing Data
Option 2: Right click on the link to the data on the webpage, and save it as a csv file in the same folder as your script
file = "lecture_2_data.csv"
Importing Data
Either way:
dat <- read.csv(file, header = T, as.is=T)
Examining Data
What are the dimensions of the dataset?
> dim(dat)
[1] 1000    7
(1000 rows, 7 columns)
Examining Data
What variables are included? What are their names?
> head(dat)
  id age sex height weight dog dog_type
1  1  40   F   63.5  134.5  no     <NA>
2  2  36   M   65.6  191.6  no     <NA>
3  3  69   M   68.2  170.0  no     <NA>
4  4  56   F   62.9  134.5  no     <NA>
5  5  66   F   63.7  133.4  no     <NA>
6  6  84   M   70.8  200.6  no     <NA>
> names(dat)
[1] "id"       "age"      "sex"      "height"
[5] "weight"   "dog"      "dog_type"
Examining Data
What class of data is 'id'? 'dog_type'?
> class(dat$id)
[1] "integer"
> class(dat$dog_type)
[1] "character"
Examining Data
What class of data is 'id'? 'dog_type'?
> str(dat)
'data.frame':   1000 obs. of  7 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int  40 36 69 56 66 84 40 73 76 38 ...
 $ sex     : chr  "F" "M" "M" "F" ...
 $ height  : num  63.5 65.6 68.2 62.9 63.7 70.8 67 67 62.6 62.2 ...
 $ weight  : num  134 192 170 134 133 ...
 $ dog     : chr  "no" "no" "no" "no" ...
 $ dog_type: chr  NA NA NA NA ...
Examining Data
How many total participants are there?
How many men and how many women?
> length(unique(dat$id))
[1] 1000
> unique(c(1,1,2,2,3))
[1] 1 2 3
> length(unique(c(1,1,2,2,3)))
[1] 3
> length(c(1,1,2,2,3))
[1] 5
> table(dat$sex)

  F   M
493 507
Examining Data
How many people have dogs?
> table(dat$dog)

 no yes
518 482
Examining Data
How many different types of dogs are there? How many of each?
> table(dat$dog_type)

    husky       lab    poodle retriever
      113       125       111       133
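Note that the four counts add up to the 482 dog owners, not to 1000: table() leaves out NA by default. A minimal sketch on an invented vector shows how to include the missing values; the useNA argument is part of base R's table().

```r
dog_type <- c("lab", NA, "poodle", "lab", NA, "husky")

# Default: NA entries are dropped from the counts
counts <- table(dog_type)
counts

# useNA = "ifany" adds a column for NA when any are present
counts_na <- table(dog_type, useNA = "ifany")
counts_na

# unique() keeps NA as a value, so this counts 3 types plus NA
n_unique <- length(unique(dog_type))
n_unique
```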
Recoding Data
Missingness: represented by 'NA' [default]
read.table(…, na.strings = "NA", …): you can change this based on your data
'NA' is NOT a character string:
> x = rep(NA,3)
> x
[1] NA NA NA
> class(x)
[1] "logical"
Recoding Data
NA values are not silently ignored: arithmetic on an NA returns NA, and summary functions such as mean() return NA unless you set na.rm = TRUE
> x = c(NA, 1, NA, 3, 4)
> x*2
[1] NA  2 NA  6  8
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 2.666667
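A few more cases of the same behaviour, as a sketch using the vector from the slide. The point about comparing with == is an addition, not something the slide states, but it is standard R behaviour.

```r
x <- c(NA, 1, NA, 3, 4)

# sum() behaves like mean(): NA unless na.rm = TRUE
total_all <- sum(x)
total_obs <- sum(x, na.rm = TRUE)

# Comparing with NA gives NA, so x == NA cannot find missing values
cmp <- x == NA

# is.na() is the right test; TRUE counts as 1, so sum() counts the NAs
n_missing <- sum(is.na(x))
prop_missing <- mean(is.na(x))
```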
Recoding Data
is.na() tests for missing entries
Returns TRUE or FALSE at each entry
> x = c(NA, 1, NA, 3, 4)
> x
[1] NA  1 NA  3  4
> class(x)
[1] "numeric"
> is.na(x)
[1]  TRUE FALSE  TRUE FALSE FALSE
Recoding Data
which() returns the indices for entries that are TRUE
> which(is.na(x))
[1] 1 3
Recoding Data
'!' means 'not':
> which(!is.na(x))
[1] 2 4 5
> x
[1] NA  1 NA  3  4
> Index = which(!is.na(x))
> x[Index]
[1] 1 3 4
Recoding Data
‘which’ is not needed when you subset using ‘is.na’ (or !is.na): a TRUE/FALSE vector can be used as the index directly
# in one step
> x[!is.na(x)]
[1] 1 3 4
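The same one-step subsetting works on the rows of a data frame. This is a sketch on a small invented data frame (`toy` is a made-up name, not the course dataset).

```r
toy <- data.frame(
  id = 1:5,
  dog_type = c(NA, "lab", NA, "poodle", "husky"),
  stringsAsFactors = FALSE
)

# Keep only the rows where dog_type is not missing;
# the comma after the condition means "all columns"
owners <- toy[!is.na(toy$dog_type), ]

nrow(owners)
owners$id
```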
Recoding Data
Recoding binary variables, ex: change sex from M/F to 0/1
> head(dat$sex)
[1] "F" "M" "M" "F" "F" "M"
> bin.sex = ifelse(dat$sex=="F",1,0)
> head(bin.sex)
[1] 1 0 0 1 1 0
Recoding Data
?ifelse: ifelse(test, yes, no)
test: an object which can be coerced to logical mode (ie TRUE or FALSE)
yes: return values for true elements of test
no: return values for false elements of test
Recoding Data
Logical (comparison) operators: ==, !=, <, >, <=, >=
Also: is.[type], ie: is.na, is.character, is.data.frame, is.numeric, etc…
> x = c(1,3,7,9)
> x > 3
[1] FALSE FALSE  TRUE  TRUE
> x == 3
[1] FALSE  TRUE FALSE FALSE
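Comparisons can also be combined. The operators & ("and") and | ("or") are not listed on the slide but are standard R; the sketch below uses the first six ages and sexes shown earlier, typed in by hand.

```r
age <- c(40, 36, 69, 56, 66, 84)
sex <- c("F", "M", "M", "F", "F", "M")

# & means "and": TRUE only where both comparisons are TRUE
older_women <- sex == "F" & age >= 50

# | means "or": TRUE where at least one comparison is TRUE
young_or_old <- age < 40 | age > 80

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(older_women)
sum(young_or_old)
```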
Recoding Data
> bin.sex = ifelse(dat$sex=="F",1,0)
> head(dat$sex == "F")
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
> head(bin.sex)
[1] 1 0 0 1 1 0
Recoding Data
Analogously, creating a cut-point in continuous data:
> head(dat$age)
[1] 40 36 69 56 66 84
> bin.age = ifelse(dat$age < 50, 0, 1)
> head(bin.age)
[1] 0 0 1 1 1 1
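A sketch of two related points, using a hand-typed age vector with one NA added for illustration. First, ifelse() passes missing values through as NA. Second, for more than two groups, base R's cut() is an alternative to nesting ifelse() calls; the break points and labels here are arbitrary choices, not from the lecture.

```r
age <- c(40, 36, 69, 56, 66, 84, NA)

# Two groups: a missing age gives a missing result
bin_age <- ifelse(age < 50, 0, 1)
bin_age

# Three groups with cut(); right = FALSE makes intervals [0,50), [50,70), [70,Inf)
age_grp <- cut(age, breaks = c(0, 50, 70, Inf), right = FALSE,
               labels = c("<50", "50-69", "70+"))
grp_counts <- table(age_grp)
grp_counts
```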
Exporting Data
write.table(data, filename, quote = T, row.names = T, col.names = T, sep = " ")
'data' is an R object, ie 'dat' in our case
'filename' is similar to read.table: you should include a '.txt' in the filename
'quote' puts character strings in quotes (I like setting that to be FALSE [or F])
Exporting Data
row.names: includes the row names in the output, which are usually just a sequence from 1 to nrow(dat). I prefer FALSE, as Excel automatically has row indices
col.names: include the header names in the output file? Depending on the data, I usually use TRUE
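A minimal round trip with the settings described above, as a sketch. The data frame and file name are invented, the file is written to tempdir() so nothing lands in your working directory, and the tab separator is one reasonable choice rather than the lecture's.

```r
toy <- data.frame(id = 1:3, dog = c("no", "yes", "no"),
                  stringsAsFactors = FALSE)
out_file <- file.path(tempdir(), "toy.txt")

# No quotes around strings, no row names, header row included, tab-separated
write.table(toy, out_file, quote = FALSE, row.names = FALSE,
            col.names = TRUE, sep = "\t")

# Look at the raw lines that were written
written <- readLines(out_file)
written

# Read the file back in and compare with the original
back <- read.table(out_file, header = TRUE, sep = "\t", as.is = TRUE)
all.equal(toy, back)

unlink(out_file)   # clean up
```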
Practice
Make a 2 x 2 table of sex and dog
Create a 'BMI' variable using height and weight
Hint: BMI = weight[lbs]*703/(height[in])^2
Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise