R: A Gentle Introduction

Vega Bharadwaj | George Mason University Data Services

Part I: Why R?

What do YOU know about R and why do you want to learn it?

Reasons to use R

• Free and open-source

• User-created “packages” available to download allow for an endless number of things you can do in R

• Highly customizable graphing capabilities

• Lots of free documentation and tutorials available on the web

• Beware of bad documentation, too

• You don’t need to be a computer scientist to code in R

• Packages do a lot of the “programming” (applying fundamental CS concepts) for you

R for historians

https://programminghistorian.org/lessons/data_wrangling_and_management_in_R

R for scientists

https://rcompanion.org/rcompanion/d_10.html

R for social scientists

http://personality-project.org/r/psych/HowTo/factor.pdf

R for text mining

http://tidytextmining.com/tidytext.html

Part II: Why code?

What is pointing and clicking?

• Clicking, dragging, and using buttons and specified text boxes in user-friendly applications to do things

Drawbacks of pointing and clicking

• Does not usually involve an automatic way to keep track of all of your steps

• The things you can do are limited by the number of buttons available

Example: Microsoft Word

• Suppose you had to edit a Microsoft Word document…

• Move paragraph 1 between paragraphs 3 and 4

• Relabel all paragraphs in chronological order

• Move paragraph 2 to the end

• Remove all paragraph labels

• Create headlines by boldfacing the first line of each paragraph and separating it from the rest of the paragraph with a single line break

• Indent each paragraph below headlines

• Change font to 12 pt. “Times New Roman”

Example: Microsoft Word, 2

• Troubles with this situation:

• Word’s capabilities are limited

• Lots of rearranging to do (human effort)

• No buttons available through Word that can automatically put paragraphs in the order you want

• No way to document all these steps while ensuring 100% accuracy

• Might be some ambiguity with English language

• What if you had to hand this task over to someone else?

What is coding?

• The process of writing out a list of instructions for a computer to read, interpret, and do

Summary: pointing and clicking vs. coding

SPECIFICITY

• Pointing and clicking: Limited by the number of buttons available

• Coding: Lets you do certain things that can’t be done through pointing and clicking alone

REPRODUCIBILITY

• Pointing and clicking: No innate way to keep track of steps while minimizing error

• Coding: By nature of coding, you keep track of everything you do and save these steps for future use

Part III: R & RStudio

R and RStudio

• R is a programming language for statisticians

• Uses code to allow you to efficiently reshape datasets, perform statistical tests, and create graphics

• RStudio is an integrated development environment (IDE) for R

• Translates some R commands into point-and-click features

• Provides a user-friendly visual interface in which to code

R vs. RStudio

Open RStudio

Different ways to code

• Console

• Quickly enter temporary commands

• Script file

• A text document in which you save blocks of code you will want to recreate later

Exercise: R as a calculator

• Type “5-3” into the console and hit the “Enter” key

• Things to note:

• “>” indicates where you should enter your input. Never type this in yourself!

• “[1]” is the index of the first value shown on that line of output
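
Typed at the console, the exercise looks roughly like the sketch below. Lines starting with "#>" show what R prints. The second expression is an added illustration, not part of the original exercise.

```r
# Subtraction, as in the exercise
5 - 3
#> [1] 2

# A result with several values: [1] marks the position of the first value on the line
c(10, 20, 30)
#> [1] 10 20 30
```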

Exercise: R as a calculator, 2

• Type “5-” into the console and hit the “Enter” key

• Observe what happens

• Type “3” and hit the “Enter” key

• “+” indicates R is expecting more input
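
The same continuation rule applies inside a script: an expression that ends in an operator carries on to the next line. A minimal sketch, where the object name result is an illustrative assumption:

```r
# The first line ends with "-", so R keeps reading until the expression is complete
result <- 5 -
  3
result
#> [1] 2
```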

Exercise: R as a calculator, 3.1

• Create a new R script

Script files (.R)

• Script files are how you save the R code you want to recreate later

• Every line that begins with “#” is a comment, not interpreted by R as code

Exercise: R as a calculator, 3.2

• Type “5-3” into your R script, then “5+3” on a new line, so that each expression sits on its own line
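
A script for this exercise might look like the following sketch. The comment line is an addition for illustration and is skipped by R when the script runs.

```r
# R as a calculator: each expression sits on its own line
5 - 3
5 + 3
```

Highlighting one line and clicking “Run” sends only that line to the console.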

Exercise: R as a calculator, 3.3

• Highlight the first line only and click the “Run” button

Exercise: R as a calculator, 3.4

• Examine the output in the “Console”

• Do the same for “5+3”

Concepts: the things you code

• Objects (NOUNS)

• The things you work with in R, e.g. datasets and statistical analysis information

• Functions (VERBS)

• The actions you perform in R, usually on objects

Functions in Excel

• Functions are indicated by their name, followed IMMEDIATELY by parentheses (SUM in the example)

• Arguments are references to objects (in this case, specific cells) or other types of descriptors that provide information to the function

=SUM(A1, A2)
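
R functions follow the same pattern as the Excel example. In the sketch below, a1 and a2 are hypothetical object names standing in for the Excel cells A1 and A2, with made-up values.

```r
# Hypothetical stand-ins for the spreadsheet cells A1 and A2
a1 <- 4
a2 <- 6

# Function name, immediately followed by parentheses holding the arguments
total <- sum(a1, a2)
total
#> [1] 10
```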

Object creation

• Use “<-” to create new objects

• Type the object name into the console to get its value
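
A minimal sketch of object creation, assuming hypothetical object names my_number and doubled:

```r
# Create an object called my_number that stores the value 8
my_number <- 8

# Typing the object's name prints its value
my_number
#> [1] 8

# Objects can be used to create other objects
doubled <- my_number * 2
doubled
#> [1] 16
```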

Exercise: objects & functions

• Add the following to your R script and run each line:

Pay attention to spacing!
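
The code for this exercise is not reproduced in the text. As a stand-in, the sketch below uses hypothetical objects (ages, average_age, x) and shows one reason spacing can matter: the assignment arrow must be typed with no space between “<” and “-”.

```r
# Create a vector object with c() and apply a function to it
ages <- c(22, 38, 26)
average_age <- mean(ages)
average_age

# Spacing inside "<-" changes the meaning:
x <- 5      # assignment: x now stores 5
x < -5      # comparison: is x less than negative 5?
#> [1] FALSE
```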

Exercise: objects & functions, 2

• Examine the output and the “Environment”

Other types of objects
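
The slide content for this heading is not reproduced in the text. As an assumption about what it may have covered, the sketch below creates a few common kinds of objects with made-up values and checks each one with class().

```r
name <- "Rose"                       # character (text)
survived <- TRUE                     # logical (TRUE/FALSE)
fares <- c(7.25, 71.28, 8.05)        # numeric vector
passengers <- data.frame(            # data frame: a table with named columns
  age = c(22, 38, 26),
  fare = fares
)

class(name)
#> [1] "character"
class(survived)
#> [1] "logical"
class(fares)
#> [1] "numeric"
class(passengers)
#> [1] "data.frame"
```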

Packages

• A set of functions and object templates available to download and use directly through RStudio

• Ordinarily, you can open them up using checkboxes, but we will do so using code

Summary: what you do in R

Using code…

1. Create objects (“nouns”)

2. Use functions (“verbs”) to do things to objects

Part IV: Working with Data in R

What is a CSV file?

• CSV stands for “comma-separated values”

• Preferred format for working with data in R

• Can be opened in Excel

• Why CSV over Excel format?

• .XLSX files can cause problems in R

CSV: Excel vs. text editor
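
In a text editor, a CSV file is plain text with one row per line and commas between values. The sketch below writes a tiny made-up file (the column names and rows are illustrative assumptions) to a temporary location, then shows it both as raw text and as a parsed table.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ann,29", "Ben,41"), csv_path)

# What a text editor shows: raw lines of comma-separated text
readLines(csv_path)

# What a spreadsheet (or R) shows: rows and columns
people <- read.csv(csv_path)
people
```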

Functions and datasets

• When working with datasets, it may be necessary to work with more complicated functions

• Arguments without an equals sign (“positional”) must always be in the same spot whenever the function is called

read.table(datafile, header=TRUE, sep=",")

• Function: read.table

• Positional argument: datafile

• Named arguments: header=TRUE and sep=","
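
The sketch below illustrates the difference between the two kinds of arguments, assuming a small made-up CSV file created on the fly. Here datafile is a hypothetical object holding the file's path. The named arguments can be swapped without changing the result, while the positional argument stays first.

```r
# Hypothetical data file, written to a temporary location for this example
datafile <- tempfile(fileext = ".csv")
writeLines(c("age,fare", "22,7.25", "38,71.28"), datafile)

# Positional argument first, then named arguments
first <- read.table(datafile, header = TRUE, sep = ",")

# Named arguments in a different order: same result
second <- read.table(datafile, sep = ",", header = TRUE)

identical(first, second)
#> [1] TRUE
```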

Other function examples

help()

library(ggplot2)

read.table(datafile, header=TRUE, sep=",")

read.table(datafile,header=TRUE,sep=",")

ggplot(mydata, aes(age, fare)) + geom_point(aes(color =factor(survived)))

Script and dataset management

• For every project, you must create a unique folder on your computer in which to store all your datasets (.CSV files) and scripts (.R files)

Set working directory

• Direct R to the right file folder
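
In code, the working directory is checked with getwd() and changed with setwd(). The sketch below uses a temporary folder with the hypothetical name my_project so that it can run anywhere; in practice you would pass the path to your own project folder.

```r
# Remember where R is currently looking for files
old_dir <- getwd()

# Create a project folder (hypothetical name) and point R at it
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)
setwd(project_dir)

# Confirm the change
getwd()

# Switch back so the rest of the session is unaffected
setwd(old_dir)
```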

Save R script

NO NEED TO SPECIFY FILE EXTENSION!

Exercise: Load in data

Load in data, 2

Load in data, 3

Load in data, 4

Copy & paste into R script

Load in data, 5

• Highlight and run each line

• Examine output in the CONSOLE

Load in data, 6

• You can close the dataset viewer and click “titanic_r” in ENVIRONMENT to open it again
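
The original slides show these steps as screenshots, so the exact code is not reproduced here. Below is a minimal sketch of loading a CSV into an object named titanic_r with read.csv(), assuming a hypothetical file called titanic.csv. The three rows are made up for illustration; the column names are taken from the code used later in the workshop.

```r
# Stand-in for the workshop dataset: a tiny made-up file in a temporary folder
data_path <- file.path(tempdir(), "titanic.csv")
writeLines(c(
  "pclass,survived,gender,age,fare",
  "1,1,female,29,211.34",
  "3,0,male,22,7.25",
  "2,1,female,30,13.00"
), data_path)

# Read the CSV into an object; it then appears in the Environment pane
titanic_r <- read.csv(data_path)

# Open the dataset in RStudio's viewer (same as clicking it in Environment)
# View(titanic_r)
```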

Installing and loading packages

• To install a package, call the install.packages() function with the package name in double quotes inside the parentheses

• To load a package, call the library() function with the package name inside the parentheses; quotes are not needed
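
A sketch of both steps. To stay runnable without an internet connection, it loads tools, a package that ships with R, as a stand-in. The install line is left commented out and uses ggplot2 only as an example name.

```r
# Install once per computer (needs an internet connection); name in double quotes
# install.packages("ggplot2")

# Load once per R session; quotes are optional here
library(tools)

# Confirm the package is now attached to the session
"package:tools" %in% search()
#> [1] TRUE
```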

Using the ggplot2 package

• Copy and paste the following text into your script (after inserting some line breaks):

# install.packages("ggplot2")

library(ggplot2)

• REMEMBER: the “#” indicates a comment that is not interpreted by R as code

• We left this function call as a comment because ggplot2 is already installed

Exercise: Exploratory data analysis

• Copy and paste the following lines of code into your script (after inserting some line breaks):

head(titanic_r)

str(titanic_r)

summary(titanic_r$gender)

table(titanic_r$pclass, titanic_r$gender)
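
The sketch below runs the same functions on a small made-up data frame, a stand-in for titanic_r with invented rows. One assumption worth flagging: since R 4.0.0, text columns are read as character rather than factor by default, so gender is converted with factor() here to make summary() return a count per category.

```r
# Made-up stand-in for the workshop dataset
titanic_r <- data.frame(
  pclass = c(1, 3, 2, 3),
  gender = c("female", "male", "female", "male"),
  age = c(29, 22, 30, 35)
)

head(titanic_r)   # first rows of the dataset
str(titanic_r)    # column names, types, and a preview of values

# Convert to a factor so summary() counts each category
titanic_r$gender <- factor(titanic_r$gender)
gender_counts <- summary(titanic_r$gender)
gender_counts

# Cross-tabulate passenger class by gender
class_by_gender <- table(titanic_r$pclass, titanic_r$gender)
class_by_gender
```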

Exercise: Create graphs

• Copy and paste the following lines of code into your script (after inserting some line breaks):

qplot(pclass, fill=gender, data=titanic_r)

ggplot(titanic_r, aes(age, fare)) + geom_point(aes(color = factor(survived)))
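
Below is a sketch of the same two plots built with ggplot() on a made-up stand-in dataset named titanic_toy. It assumes ggplot2 is installed and skips the plotting otherwise. The bar chart uses geom_bar() as an equivalent of the qplot() call, since recent ggplot2 releases mark qplot() as deprecated.

```r
# Made-up stand-in for the workshop dataset
titanic_toy <- data.frame(
  pclass = c(1, 3, 2, 3),
  gender = c("female", "male", "female", "male"),
  age = c(29, 22, 30, 35),
  fare = c(211.34, 7.25, 13.00, 8.05),
  survived = c(1, 0, 1, 0)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Bar chart: passengers per class, bars split by gender
  bar_plot <- ggplot(titanic_toy, aes(pclass, fill = gender)) + geom_bar()

  # Scatterplot: age vs. fare, points colored by survival
  scatter_plot <- ggplot(titanic_toy, aes(age, fare)) +
    geom_point(aes(color = factor(survived)))

  print(bar_plot)
  print(scatter_plot)
}
```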

Things to remember

• R is case-sensitive

• No spaces between function name and opening parenthesis

• Comment every block of code

• Leave line breaks after every code block
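
A short sketch of the first point, using hypothetical object names:

```r
# Names that differ only in capitalization are different objects
my_value <- 10

exists("my_value")
#> [1] TRUE
exists("My_Value")
#> [1] FALSE
```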

For more coding practice:

http://infoguides.gmu.edu/learn_r/101

Workshop resources:

https://dataservices.gmu.edu/workshops/r
