Advanced Data Analytics: Getting Started with R
Jeffrey Stanton
School of Information Studies
Syracuse University

DESCRIPTION

Part of advanced analytics course.

Analytics: Key Steps

• Learn the application domain
• Locate or develop a data source or data set
• Clean and preprocess data: May take 60% of effort!
• Data reduction and transformation
  – Find useful pieces, squeeze out redundancies
• Choose analytical approaches
  – summarize, visualize, organize, describe, explore, find patterns, predict, test, infer
• Communicate the results and implications to data users
• Deploy discovered knowledge in a system
• Monitor and evaluate the effectiveness of the system

First Example: Ice Cream Consumption

• We all know the domain: we have all eaten ice cream
• Public data set obtained from the supplement to Verbeek’s text:

http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html

• Let’s read the data into R and summarize it:

ICECREAM=read.csv("[pathname]/icecream.csv",header=TRUE)

summary(ICECREAM)

• What do these two R commands do? Did you get a mean of 84.6 for income? What are “Min,” “1st Qu.” and all of those other things?
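As a rough sketch of what these two commands do, the example below writes a CSV file, reads it back with read.csv, and shows where the numbers printed by summary come from. It assumes R's built-in `cars` data set as a stand-in so that it runs without icecream.csv; the temporary file path and the names `CARS`, `speed_quartiles`, and `speed_mean` are hypothetical.

```r
# Stand-in data: 'cars' ships with R; it is NOT the ice cream data set.
csv_path <- tempfile(fileext = ".csv")   # hypothetical file location
write.csv(cars, csv_path, row.names = FALSE)

# read.csv returns a data frame; header = TRUE takes the
# column names from the first row of the file
CARS <- read.csv(csv_path, header = TRUE)

# For each numeric column, summary() reports
# Min, 1st Qu., Median, Mean, 3rd Qu., and Max
print(summary(CARS))

# The same quantities computed by hand for one column
speed_quartiles <- quantile(CARS$speed)  # 0%, 25%, 50%, 75%, 100%
speed_mean <- mean(CARS$speed)
print(speed_quartiles)
print(speed_mean)
```

“1st Qu.” and “3rd Qu.” are the 25% and 75% points of the sorted values, so half of the observations fall between them.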

Metadata

• There is a text file that goes with the CSV dataset: “icecream.txt”

• This file describes the meaning of the variables provided in the dataset; it is essential if we are to make sense of these data:

Variable labels:

cons: consumption of ice cream per head (in pints)
income: average family income per week (in US Dollars)
price: price of ice cream (per pint)
temp: average temperature (in Fahrenheit)
Time: index from 1 to 30

• We also learn from the metadata that these are time series data: 30 four-weekly observations from 18 March 1951 to 11 July 1953

“Sanity Check” Using Histograms and Boxplots

• Cleaning, screening, and preprocessing are essential to ensure that you understand what your data set contains and that it does not contain garbage. It is impractical to look at every data point, so we use histograms and boxplots to get an overview of our data:

hist(ICECREAM$income)
boxplot(ICECREAM$income)

• What is the purpose of the “$” notation in the commands above? Is there any other way of referring to these variables?
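One possible answer, sketched with R's built-in `cars` data set as a stand-in (an assumption, so the example runs without icecream.csv): the “$” extracts a single named column from a data frame, and base R offers several equivalent ways to refer to the same column. The variable names are hypothetical.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Stand-in data: built-in 'cars' data set, NOT the ice cream data.
by_dollar   <- cars$speed         # $ extracts one column by name
by_brackets <- cars[["speed"]]    # double brackets: name given as a string
by_index    <- cars[, "speed"]    # matrix-style [row, column] indexing
by_with     <- with(cars, speed)  # with() evaluates inside the data frame

hist(cars$speed, main = "Histogram of speed", xlab = "speed")
boxplot(cars$speed, main = "Boxplot of speed")

# boxplot.stats() returns the five numbers the box and whiskers
# are drawn from, plus any points flagged as outliers
box_numbers <- boxplot.stats(cars$speed)
print(box_numbers$stats)
print(box_numbers$out)
```

All four forms return the same vector; the “$” form is simply the shortest to type at the console.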

Interpret These Graphics

Explore

• Perhaps a family with greater income can afford to purchase more ice cream:

plot(ICECREAM$income,ICECREAM$cons)

• How do you interpret a scatterplot?
• Is there a pattern here?
• Does our intuitive hypothesis fit the scatterplot?
• What else could scatterplots show?
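When judging a scatterplot by eye, it can help to pair it with a single number. The sketch below again assumes the built-in `cars` data set as a stand-in for the ice cream data; it labels the axes and computes the correlation coefficient with cor. The name `r` is hypothetical.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Stand-in data: built-in 'cars' data set, NOT the ice cream data.
# The first argument goes on the X axis, the second on the Y axis.
plot(cars$speed, cars$dist,
     xlab = "speed (X axis)", ylab = "dist (Y axis)",
     main = "Scatterplot of dist against speed")

# cor() summarizes the direction and strength of a linear pattern
# on a scale from -1 (falling line) to +1 (rising line)
r <- cor(cars$speed, cars$dist)
print(r)
```

A value near 0 suggests no linear pattern, although a curved pattern can still be present, which is one reason to look at the plot as well as the number.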

More Tools to Support Exploration

results=lm(ICECREAM$cons~ICECREAM$temp)
# This is a comment line
# The previous command calculates a line
# that best fits the scatterplot with temp
# on the X axis and cons on the Y axis

plot(ICECREAM$temp,ICECREAM$cons)
abline(results) # Plots the best fit line

# The new data structure “results” has
# lots of information about the analysis.
# What does this list contain:
results$residuals
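To see what a fitted model object holds, the sketch below repeats the same steps on the built-in `cars` data set (a stand-in, assumed so the example runs without icecream.csv). It lists the components of the object and rebuilds the residuals by hand. The names `fit` and `manual_residuals` are hypothetical.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Stand-in data: built-in 'cars' data set, NOT the ice cream data.
# The data= argument lets the formula use bare column names.
fit <- lm(dist ~ speed, data = cars)

plot(cars$speed, cars$dist)
abline(fit)                 # adds the best fit line to the plot

print(names(fit))           # components stored in the fitted-model list
print(coef(fit))            # intercept and slope of the line

# A residual is the observed Y value minus the Y value
# that the fitted line predicts for the same X
manual_residuals <- cars$dist - fitted(fit)
print(head(fit$residuals))
```

Points above the line have positive residuals and points below it have negative ones; large residuals mark observations the line describes poorly.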

What is the effect of time on these data?

plot(ICECREAM$time,ICECREAM$temp)
plot(ICECREAM$time,ICECREAM$cons)

• What do these plots show? Can you explain why these are shaped the way they are?

• Based on your answer to the previous question, how does the situation affect your strategies for understanding ice cream consumption?
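A small sketch of why plotting against a time index matters, assuming the built-in `nottem` series (monthly average temperatures at Nottingham) as a stand-in for the ice cream data. The first 30 values are plotted in time order and then compared with the values 12 observations later; the lag of 12 fits this monthly stand-in only, and the variable names are hypothetical.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Stand-in data: 'nottem' ships with R; it is NOT the ice cream data.
temps <- as.numeric(nottem)[1:30]   # first 30 monthly observations
time_index <- seq_along(temps)      # index from 1 to 30

# type = "b" draws both points and connecting lines,
# which makes a repeating cycle easier to see
plot(time_index, temps, type = "b",
     xlab = "time index", ylab = "temperature")

# Compare each observation with the one 12 months later
same_month_cor <- cor(temps[1:18], temps[13:30])
print(same_month_cor)
```

When a variable rises and falls with the seasons, anything else that follows the seasons will appear related to it, so time order needs to be considered before reading a scatterplot as cause and effect.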

Demonstrating Mastery

• Find a small numeric dataset; try starting at the Journal of Statistics Education data website: http://www.amstat.org/publications/jse/jse_data_archive.htm
• Read the dataset into R
• Summarize the variables in that dataset
• Use histograms and boxplots to check and understand your data; use the metadata description that came with the dataset to make sure that you know the variables
• Explore the data using plot; look for something interesting
• Put your findings in a slide and communicate them to me or someone else
