50
Introduction to R Harvard MIT Data Center April 19, 2013 e Institute for Quantitative Social Science at Harvard University (Harvard MIT Data Center) Introduction to R April 19, 2013 1 / 50

Introduction to the R Statistical Computing Environment

  • Upload
    izahn

  • View
    6.748

  • Download
    3

Embed Size (px)

DESCRIPTION

Get an introduction to R, the open-source system for statistical computation and graphics. With hands-on exercises, learn how to import and manage datasets, create R objects, and conduct basic statistical analyses. Full workshop materials can be downloaded from http://projects.iq.harvard.edu/rtc/event/introduction-r

Citation preview

Page 1: Introduction to the R Statistical Computing Environment

Introduction to R

Harvard MIT Data Center

April 19, 2013

�e Institutefor Quantitative Social Scienceat Harvard University

(Harvard MIT Data Center) Introduction to R April 19, 2013 1 / 50

Page 2: Introduction to the R Statistical Computing Environment

Outline

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 2 / 50

Page 3: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 3 / 50

Page 4: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

Materials and setup

USERNAME: dataclassPASSWORD: dataclassFind class materials at

http://projects.iq.harvard.edu/rtc/event/introduction-r

Download the zip file at the bottom of the page and unzip on yourdesktop!

(Harvard MIT Data Center) Introduction to R April 19, 2013 4 / 50

Page 5: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

Why R?

Vast capabilities, wide range of statistical and graphical techniquesVery popular in academia, growing popularity in business:http://r4stats.com/articles/popularity/

Written primarily by statisticiansFREE as in free speech: collaborative developmentFREE as in free beer: no costExcellent community support: mailing list, blogs, tutorialsEasy to extend by writing new functionsExcellent graphical user interfaces (though you may have to install them)

(Harvard MIT Data Center) Introduction to R April 19, 2013 5 / 50

Page 6: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

R GUI alternatives (no GUI)

The old-school way is to run R directly in a terminal

But hardly anybody does it that way anymore!

(Harvard MIT Data Center) Introduction to R April 19, 2013 6 / 50

Page 7: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

R GUI alternatives (Windows default)

The default windows GUI is not very goodNo parentheses matching or syntax highlightingNo work-space browser

(Harvard MIT Data Center) Introduction to R April 19, 2013 7 / 50

Page 8: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

R GUI Alternatives (Rstudio on Mac)

Rstudio has many useful features, including parentheses matching andauto-completion

(Harvard MIT Data Center) Introduction to R April 19, 2013 8 / 50

Page 9: Introduction to the R Statistical Computing Environment

Getting Started with R and RStudio

R GUI Alternatives (Emacs with ESS)

Emacs + ESS is a very powerful combination, but can be difficult to set up

(Harvard MIT Data Center) Introduction to R April 19, 2013 9 / 50

Page 10: Introduction to the R Statistical Computing Environment

Finding Help

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 10 / 50

Page 11: Introduction to the R Statistical Computing Environment

Finding Help

Getting help in R

Start html help, search/browse using web browserat the R console:help.start()

or use the help menu from you GUI

Look up the documentation for a functionhelp(topicName)

?topicName

Look up documentation for a packagehelp(package="packageName")

Search documentation from R (not always the best way. . . google oftenworks better)help.search("topicName")

(Harvard MIT Data Center) Introduction to R April 19, 2013 11 / 50

Page 12: Introduction to the R Statistical Computing Environment

Finding Help

Reading R documentation

Beginners sometimes find the docs intimidating. It is much easier once youknow the structure:Description What does this function do?

Usage Generic example showing default argumentsArguments What are the inputs to this function?

Value What is the output (or result) of this function?Details Additional details to help you understand how it works

References Citations for, e.g., the algorithms used in the functionSee Also Other related functionsExamples Examples of the function in action; type

example(functionName) to run themIn many cases just looking at the usage section is enough

(Harvard MIT Data Center) Introduction to R April 19, 2013 12 / 50

Page 13: Introduction to the R Statistical Computing Environment

Finding Help

R packages and libraries

There are thousands of R packages that extend R’s capabilities. To getstarted, check out http://cran.r-project.org/web/views/.

To view available packages:library()

To see what packages are loaded:search()

To load a package:library("packageName")

Install new package:install.packages("packageName")

(Harvard MIT Data Center) Introduction to R April 19, 2013 13 / 50

Page 14: Introduction to the R Statistical Computing Environment

Finding Help

Things to keep in mind

Case sensitive, like Stata (unlike SAS)Comments can be put almost anywhere, starting with a hash mark (’#’);everything to the end of the line is a commentThe command prompt “>” indicates that R is ready to receive commandsIf a command is not complete at the end of a line, R will give a differentprompt, ‘+’ by defaultParentheses must always match (first thing to check if you get an error)R Does not care about spaces between commands or argumentsNames should start with a letter and should not contain spacesCan use “.” in object names (e.g., “my.data”)Use / instead of \ in path names

(Harvard MIT Data Center) Introduction to R April 19, 2013 14 / 50

Page 15: Introduction to the R Statistical Computing Environment

Finding Help

Exercise 0

1 Find the R console prompt and type ‘2+2 <enter>’2 Look up the help page for the “mean” topic3 Start the html help browser. From the browser, search for “linear model”4 Use google and search for “R linear model”. Click on the first link.5 Go to http://cran.r-project.org/web/views/ and skim the topic

closest to your field/interests

(Harvard MIT Data Center) Introduction to R April 19, 2013 15 / 50

Page 16: Introduction to the R Statistical Computing Environment

Data and Functions

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 16 / 50

Page 17: Introduction to the R Statistical Computing Environment

Data and Functions

Assignment

Values can be assigned names and used in subsequent operationsThe <- operator (less than followed by a dash) is used to save valuesThe name on the left gets the value on the right.

> x <- 11 # Assign the value 10 to a variable named x> x + 1 # Add 1 to x[1] 12> y <- x + 1 # Assign y the value x + 1> y[1] 12

Saved variables can be listed, overwritten and deleted

> ls() # List variables in workspace[1] "x" "y"> x # Print the value of x[1] 11> x <- 100 # Overwrite x. Note that no warning is given!> x[1] 100> rm(x) # Delete x> ls()[1] "y"

(Harvard MIT Data Center) Introduction to R April 19, 2013 17 / 50

Page 18: Introduction to the R Statistical Computing Environment

Data and Functions

Functions

Using R is mostly about applying functions to variables. Functionstake variable as input argumentsperform operationsreturn values which can be assignedoptionally perform side-effects such as writing a file to disk or opening agraphics window

The general form for calling R functions isFunctionName(arg.1, arg.2, ... arg.n)

Arguments can be matched by position or nameExamples:

> #?sqrt> a <- sqrt(y) # Call the sqrt function with argument x=y> round(a, digits = 2) # Call round() with arguments x=x and digits=2[1] 3.46> # Functions can be nested so an alternative is> round(sqrt(y), digits = 5) # Take sqrt of a and round[1] 3.4641

(Harvard MIT Data Center) Introduction to R April 19, 2013 18 / 50

Page 19: Introduction to the R Statistical Computing Environment

Data and Functions

Vectors and class

Values can be combined into vectors using the c() function

> num.var <- c(1, 2, 3, 4) # numeric vector> char.var <- c("1", "2", "3", "4") # character vector> log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector> char.var2 <- c(num.var, char.var) # numbers coverted to character>

Vectors have a class which determines how functions treat them> class(num.var)[1] "numeric"> mean(num.var) # take the mean of a numeric vector[1] 2.5> class(char.var)[1] "character"> mean(char.var) # cannot average characters[1] NA> class(char.var2)[1] "character"

(Harvard MIT Data Center) Introduction to R April 19, 2013 19 / 50

Page 20: Introduction to the R Statistical Computing Environment

Data and Functions

Vector conversion and info

Vectors can be converted from one class to another> class(char.var2)[1] "character"> num.var2 <- as.numeric(char.var2) # convert to numeric> class(num.var2)[1] "numeric"> mean(as.numeric(char.var2)) # now we can calculate the mean[1] 2.5> as.numeric(c("a", "b", "c")) # cannot convert letters to numeric[1] NA NA NA

In addition to class, you can examine the length()and str() ucture of vectors

> ls() # list objects in our workspace[1] "a" "char.var" "char.var2" "log.var" "num.var"[6] "num.var2" "y"> length(char.var) # how many elements in char.var?[1] 4> str(num.var2) # what is the structure of num.var2?num [1:8] 1 2 3 4 1 2 3 4

(Harvard MIT Data Center) Introduction to R April 19, 2013 20 / 50

Page 21: Introduction to the R Statistical Computing Environment

Data and Functions

Factor vectors

Factors are stored as numbers, but have character labels. Factors are usefulfor

Modeling (automatically contrast coded)Sorting/presenting values in arbitrary order

> lmh.var <- c("l", "m", "h")> sort(lmh.var) # not in order![1] "h" "l" "m"> lmh.factor.var <- factor(lmh.var,+ levels=c("l", "m", "h"))> sort(lmh.factor.var) # sorted correctly[1] l m hLevels: l m h

(Harvard MIT Data Center) Introduction to R April 19, 2013 21 / 50

Page 22: Introduction to the R Statistical Computing Environment

Data and Functions

Missing values

R represents missing values with NA

> x.missing <- c(1, 2, NA, 4, 5)> is.na(x.missing)[1] FALSE FALSE TRUE FALSE FALSE

R does not exclude missing values by default

> mean(x.missing)[1] NA> mean(x.missing, na.rm=TRUE)[1] 3

(Harvard MIT Data Center) Introduction to R April 19, 2013 22 / 50

Page 23: Introduction to the R Statistical Computing Environment

Data and Functions

Exercise 1: vectors and classes

1 Create a new vector called “test” containing five numbers of your choice[ c(), <- ]

2 Create a second vector called “students” containing five common namesof your choice [ c(), <- ]

3 Determine the class of “students” and “test” [ class() or str() ]4 Convert “test” to character class, and confirm that you were successful [

as.numeric(), <-, str() ]5 Remove the objects “students” and “test” and check that they were

properly removed [ rm(), ls() ]6 Create a factor with values a, b, c, d and levels d, c, b, a (in that order),

and check that you were successful [ factor(), c(), str() ]

(Harvard MIT Data Center) Introduction to R April 19, 2013 23 / 50

Page 24: Introduction to the R Statistical Computing Environment

Loading and Saving Data

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 24 / 50

Page 25: Introduction to the R Statistical Computing Environment

Loading and Saving Data

The gss dataset

The next few examples use a subset of the General Social Survey data set.The variables in this subset include> head(read.csv("dataSets/gssInfo.csv"))

var description1 marital marital status2 age age of respondent3 educ highest year of school completed4 sex respondents sex5 inc respondents income6 happy general happiness

(Harvard MIT Data Center) Introduction to R April 19, 2013 25 / 50

Page 26: Introduction to the R Statistical Computing Environment

Loading and Saving Data

The “working directory” and listing files

R knows the directory it was started in, and refers to this as the “workingdirectory”

> getwd() # get the current working directory[1] "/Users/izahn/Documents/Work/IQSS/Classes/Rintro/New"> setwd("dataSets") # set wd to the dataSets folder> getwd()[1] "/Users/izahn/Documents/Work/IQSS/Classes/Rintro/New/dataSets"> setwd("..") # set wd to enclosing folder ("up")>

It can be convenient to list files in a directory without leaving R

> list.files("dataSets") # list files in the dataSets folder[1] "gss.csv" "gss.dta" "gss.rds" "gss.sas7bdat"[5] "gss.sav" "gss.xlsx" "gssInfo.csv"> list.files("dataSets", pattern = ".csv") # restrict to .csv files[1] "gss.csv" "gssInfo.csv"

(Harvard MIT Data Center) Introduction to R April 19, 2013 26 / 50

Page 27: Introduction to the R Statistical Computing Environment

Loading and Saving Data

Reading and writing data in native R format

Use the readRDS and saveRDS functions to open/save data in R’s nativedata format> # read gss data from the gss.rds file> # and save it to gss.data.rds> gss.data.rds <- readRDS("dataSets/gss.rds")>> # write the data in gss.data.rds to a> # file named "gssNew.rds"> saveRDS(gss.data.rds, file = "gssNew.rds")>

(Harvard MIT Data Center) Introduction to R April 19, 2013 27 / 50

Page 28: Introduction to the R Statistical Computing Environment

Loading and Saving Data

Reading and writing delimited plain-text data

Delimited files can be read and written to file with the read.* and write.*fuctions> list.files("dataSets", # list .csv files in the dataSets folder+ pattern = ".csv")[1] "gss.csv" "gssInfo.csv"> gss.data <- read.csv("dataSets/gss.csv") # read gss data> write.csv(gss.data, # write gss data to a new csv file+ file = "dataSets/NewGSS.csv")>

For details see?read.table

and?write.table

(Harvard MIT Data Center) Introduction to R April 19, 2013 28 / 50

Page 29: Introduction to the R Statistical Computing Environment

Loading and Saving Data

Reading data from other programs

R has tools for bringing in data from SPSS, Stata, SAS, etc. First, you needto load the R package, “foreign”

> # Read a Stata dataset and assign the results to a variable named "datGSS"> library(foreign) # load foreign data functions> # read Stata data> datGSS <- read.dta(file="dataSets/gss.dta")>

Always a good idea to examine the imported data set–usually we want theresults to be a data.frame

> class(datGSS) # check to see that test is what we expect it to be[1] "data.frame"> dim(datGSS)[1] 1419 35> names(datGSS)[1:10] # first 10 column names[1] "age" "educ" "emailhrs" "hrs1" "sex" "usecomp"[7] "usemail" "useweb" "webhrs" "hapmar"

For information about other formats, seehelp(package="foreign")

(Harvard MIT Data Center) Introduction to R April 19, 2013 29 / 50

Page 30: Introduction to the R Statistical Computing Environment

Loading and Saving Data

Saving and loading R workspaces

In addition to saving individual datasets, R can save and load entireworkspaces

Save our entire workspace

> ls() # list objects in our workspace[1] "a" "char.var" "char.var2"[4] "dat.sas" "datGSS" "date.char.var"[7] "date.var" "gss.data" "gss.data.rds"

[10] "lmh.factor.var" "lmh.var" "log.var"[13] "num.var" "num.var2" "y"> save.image(file="myWorkspace.RData") # save workspace> rm(list=ls()) # remove all objects from our workspace> ls() # list stored objects to make sure they are deletedcharacter(0)

Load the “myWorkspace.RData” file and check that it is restored

> load("myWorkspace.RData") # load myWorkspace.RData> ls() # list objects[1] "a" "char.var" "char.var2"[4] "dat.sas" "datGSS" "date.char.var"[7] "date.var" "gss.data" "gss.data.rds"

[10] "lmh.factor.var" "lmh.var" "log.var"[13] "num.var" "num.var2" "y"

When you close R you will be asked if you want to save your workspace – ifyou choose yes then your workspace will be restored next time you start R

(Harvard MIT Data Center) Introduction to R April 19, 2013 30 / 50

Page 31: Introduction to the R Statistical Computing Environment

Loading and Saving Data

Exercise 2: loading and manipulating data

1 Look at the help for the foreign package and determine which functionyou need to read data in SPSS format

2 Read the SPSS data set in dataSets/gss.sav and assign the result to anR data object named GSS.sav

3 Make sure that the data loaded in step 2 is a data.frame4 Display the dimensions of the GSS.sav.5 Write out a .csv file containing the values in GSS.sav6 BONUS: figure out how to read the Excel file “gss.xlsx” into R

(Harvard MIT Data Center) Introduction to R April 19, 2013 31 / 50

Page 32: Introduction to the R Statistical Computing Environment

Data Manipulation

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 32 / 50

Page 33: Introduction to the R Statistical Computing Environment

Data Manipulation

data.frame objects

Usually data read into R will be stored as a data.frameA data.frame is a list of vectors of equal length

Each vector in the list forms a columnEach column can be a differnt type of vectorOften the columns are variables and the rows are observations

A data.frame has two dimensions corresponding the number of rows andthe number of columns (in that order)

(Harvard MIT Data Center) Introduction to R April 19, 2013 33 / 50

Page 34: Introduction to the R Statistical Computing Environment

Data Manipulation

data.frame metadata

A number of functions are available for inspecting data.frame objects:

> # row and column names> head(names(datGSS)) # variable names in datGSS[1] "age" "educ" "emailhrs" "hrs1" "sex" "usecomp"> head(rownames(datGSS)) # first few rownames of datGSS[1] "1" "2" "3" "4" "5" "6"> # dimensions> dim(datGSS)[1] 1419 35> # structure> #str(datGSS) # get structure>

(Harvard MIT Data Center) Introduction to R April 19, 2013 34 / 50

Page 35: Introduction to the R Statistical Computing Environment

Data Manipulation

Extracting subsets of data.frames

You can flexibly extract subsets of data.frames using single bracketsThe first index corresponds to rows, the second to columnsEmpty index means “all”

> # extracting subsets> # datGSS[c(1,2), ] # rows 1 and 2, all columns> # datGSS[ , c(1,2)] # all rows, columns 1 and 2> # datGSS[ , c("age", "educ")] # same as above>> datGSS[1:5, 1] # rows 1 through 5, column 1[1] 69 27 19 21 19> datGSS[1:5, "educ"] # rows 1-5, column "educ"[1] 12 10 11 9 11> datGSS[datGSS[, "age"] > 90, c("sex", "age")] # rows where age > 90

sex age315 Female 99665 Male 99

(Harvard MIT Data Center) Introduction to R April 19, 2013 35 / 50

Page 36: Introduction to the R Statistical Computing Environment

Data Manipulation

Replacing subsets of data.frames

You can flexibly create and replace subsets of data.frames using bracketnotation> # creating new variable mean centered age> datGSS[ , "ageC"] <- datGSS[, "age"] - mean(datGSS[, "age"])>> #education difference between wifes and husbands> datGSS[ , "educ.diff"] <- datGSS[ , "wifeduc"] - datGSS[, "husbeduc"]>> # replacing subsets to create young/old variable> datGSS[ , "young"] <- "no" # all values of young = "no"> datGSS[datGSS[ , "age"] < 30, "young"] <- "yes" # change to "yes" if age < 30>> datGSS[1:4, c("age", "ageC", "young", "wifeduc", "husbeduc", "educ.diff")]

age ageC young wifeduc husbeduc educ.diff1 69 22.36364 no NA NA NA2 27 -19.63636 yes 13 10 33 19 -27.63636 yes NA NA NA4 21 -25.63636 yes NA NA NA

(Harvard MIT Data Center) Introduction to R April 19, 2013 36 / 50

Page 37: Introduction to the R Statistical Computing Environment

Data Manipulation

Exercise 3: Data manipulation

Use the gss.rds data set1 Generate the following variables:

“rich” equal to 0 if rincdol is less than 100000, and 1 otherwise“sinc” equal to incomdol - rincdol“dual.earn” equal to 1 if wkftwife = 1 and wkfthusb = 1, and zerootherwise

2 Create a subset of the data containing only rows where “usecomp” =“Yes”

3 Examine the data.frame created in step 2, and answer the followingquestions:

How many rows does it have?How many columns does it have?Is the “satjob” variable numeric?

(Harvard MIT Data Center) Introduction to R April 19, 2013 37 / 50

Page 38: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 38 / 50

Page 39: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Basic statistics

Descriptive statistics of single variables are straightforward:

> mean(datGSS[ , "educ"]) # calculate mean of x[1] 13.47498> sd(datGSS[, "educ"]) # calculate standard deviation of x[1] 5.389476> summary(datGSS[ , "educ"]) # calculate min, max, quantiles, mean

Min. 1st Qu. Median Mean 3rd Qu. Max.0.00 12.00 13.00 13.47 16.00 99.00

If you get tired of typing the data.frame name over and over, use with()instead> with(datGSS, {+ stats <- c(Lowest = min(educ),+ Average = mean(educ),+ Highest = max(educ))+ print(stats)+ })

Lowest Average Highest0.00000 13.47498 99.00000

Some of these functions (e.g., summary) will also work with data.frames andother types of objects

(Harvard MIT Data Center) Introduction to R April 19, 2013 39 / 50

Page 40: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Counts and proportions

Start by using the table() function to tabulate counts, then performadditional computations if needed

> sex.counts <- table(datGSS[, "sex"]) # tabulate sex categories> sex.counts

Male Female622 797

> prop.table(sex.counts) # convert to proportions

Male Female0.4383369 0.5616631

Add variables for crosstabs> table(datGSS[, c("sex", "happy")]) # crosstab marital X happy

happysex NAP VERY HAPPY PRETTY HAPPY NOT TOO HAPPY DK NA

Male 0 189 350 73 0 10Female 0 246 447 84 1 19

(Harvard MIT Data Center) Introduction to R April 19, 2013 40 / 50

Page 41: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Statistics by classification factors

The by() function can be used to perform a calculation separately for eachlevel of a classifying variable

> by(datGSS[, c("income", "educ")],+ INDICES=datGSS["sex"],+ FUN=summary)sex: Male

income educ$40000 TO 49999: 59 Min. : 4.00$50000 TO 59999: 56 1st Qu.:12.00$60000 TO 74999: 49 Median :13.00$35000 TO 39999: 48 Mean :13.68REFUSED : 48 3rd Qu.:16.00$110000 OR OVER: 43 Max. :99.00(Other) :319

----------------------------------------------------sex: Female

income educREFUSED : 76 Min. : 0.00$60000 TO 74999: 62 1st Qu.:12.00$40000 TO 49999: 60 Median :12.00$50000 TO 59999: 52 Mean :13.32$30000 TO 34999: 49 3rd Qu.:15.00$25000 TO 29999: 42 Max. :99.00(Other) :456(Harvard MIT Data Center) Introduction to R April 19, 2013 41 / 50

Page 42: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Correlations

Let’s look at correlations among between age, income, and education

> cor(datGSS[ , c("age", "incomdol", "educ")])age incomdol educ

age 1.00000000 -0.1186564 -0.07362454incomdol -0.11865641 1.0000000 0.21013267educ -0.07362454 0.2101327 1.00000000

For significance tests, use cor.test()

> with(datGSS,+ cor.test(age, educ))

Pearson’s product-moment correlation

data: age and educt = -2.779, df = 1417, p-value = 0.005525alternative hypothesis: true correlation is not equal to 095 percent confidence interval:-0.12518333 -0.02166916

sample estimates:cor

-0.07362454

(Harvard MIT Data Center) Introduction to R April 19, 2013 42 / 50

Page 43: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Multiple regression

Modeling functions generally use the formula interface whith DV on leftfollowed by “~" followed by predictors–for details seehelp("formula")

Predict the number of hours individuals spend on email (emailhrs)

> m1 <- lm(educ ~ sex + age, data = datGSS)> summary(m1)

Call:lm(formula = educ ~ sex + age, data = datGSS)

Residuals:Min 1Q Median 3Q Max

-13.434 -1.785 -0.688 1.955 86.049

Coefficients:Estimate Std. Error t value Pr(>|t|)

(Intercept) 14.652702 0.425691 34.421 < 2e-16sexFemale -0.275235 0.289290 -0.951 0.34156age -0.021938 0.008238 -2.663 0.00783

Residual standard error: 5.377 on 1416 degrees of freedomMultiple R-squared: 0.006056, Adjusted R-squared: 0.004652F-statistic: 4.314 on 2 and 1416 DF, p-value: 0.01356(Harvard MIT Data Center) Introduction to R April 19, 2013 43 / 50

Page 44: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Basic graphics: Frequency bars

Thanks to classes and methods, you can plot() many kinds of objects:

plot(datGSS[ , "marital"]) # Plot a factor

(Harvard MIT Data Center) Introduction to R April 19, 2013 44 / 50

Page 45: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Basic graphics: Boxplots by group

Thanks to classes and methods, you can plot() many kinds of objects:

with(datGSS,plot(marital, educ)) # Plot ordinal by numeric

(Harvard MIT Data Center) Introduction to R April 19, 2013 45 / 50

Page 46: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Basic graphics: Mosaic chart

Thanks to classes and methods, you can plot() many kinds of objects:

with(datGSS, # Plot factor X factorplot(marital, happy))

(Harvard MIT Data Center) Introduction to R April 19, 2013 46 / 50

Page 47: Introduction to the R Statistical Computing Environment

Basic Statistics and Graphs

Exercise 3

Using the datGSS data.frame1 Cross-tabulate sex and emailhrs2 Calculate the mean and standard deviation of incomdol by sex3 Create a scatter plot with educ on the x-axis and incomdol on the y-axis4 Create a subset of the datGSS data frame that only includes female

respondents

(Harvard MIT Data Center) Introduction to R April 19, 2013 47 / 50

Page 48: Introduction to the R Statistical Computing Environment

Wrap-up

Topic

1 Getting Started with R and RStudio

2 Finding Help

3 Data and Functions

4 Loading and Saving Data

5 Data Manipulation

6 Basic Statistics and Graphs

7 Wrap-up

(Harvard MIT Data Center) Introduction to R April 19, 2013 48 / 50

Page 49: Introduction to the R Statistical Computing Environment

Wrap-up

Help us make this workshop better!

Please take a moment to fill out a very short feedback formThese workshops exist for you – tell us what you need!http://tinyurl.com/R-intro-feedback

(Harvard MIT Data Center) Introduction to R April 19, 2013 49 / 50

Page 50: Introduction to the R Statistical Computing Environment

Wrap-up

Additional resources

IQSS workshops:http://projects.iq.harvard.edu/rtc/filter_by/workshops

Software (all free!):R and R package download: http://cran.r-project.orgRstudio download: http://rstudio.orgESS (emacs R package): http://ess.r-project.org/

Getting help:Documentation and tutorials:http://cran.r-project.org/other-docs.htmlRecommended R packages by topic:http://cran.r-project.org/web/views/Mailing list: https://stat.ethz.ch/mailman/listinfo/r-helpStackOverflow: http://stackoverflow.com/questions/tagged/r

(Harvard MIT Data Center) Introduction to R April 19, 2013 50 / 50