Part of advanced analytics course.
Advanced Data Analytics: Getting Started with R
Jeffrey Stanton
School of Information Studies
Syracuse University
Analytics: Key Steps
• Learn the application domain
• Locate or develop a data source or data set
• Clean and preprocess data: May take 60% of effort!
• Data reduction and transformation
– Find useful pieces, squeeze out redundancies
• Choose analytical approaches
– summarize, visualize, organize, describe, explore, find patterns, predict, test, infer
• Communicate the results and implications to data users
• Deploy discovered knowledge in a system
• Monitor and evaluate the effectiveness of the system
First Example: Ice Cream Consumption
• We all know the domain: we have all eaten ice cream
• Public data set obtained from the supplement to Verbeek’s text:
http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html
• Let’s read the data into R and summarize it:
ICECREAM=read.csv("[pathname]/icecream.csv",header=TRUE)
summary(ICECREAM)
• What do these two R commands do? Did you get a mean of 84.6 for income? What are “Min,” “1st Qu.” and all of those other things?
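One way to work out what summary() reports is to compute each number by hand. The sketch below does this on a tiny made-up data set typed inline; the name DEMO and all of its values are hypothetical and are not taken from icecream.csv, so the block runs without the data file.

```r
# Hypothetical mini data set (NOT the real icecream.csv values)
csv_text <- "income,cons
76,0.30
78,0.32
80,0.35
82,0.33
86,0.40
88,0.42
90,0.45
96,0.50"

# read.csv() parses comma-separated text into a data frame;
# header = TRUE says the first row holds the variable names
DEMO <- read.csv(text = csv_text, header = TRUE)

# summary() reports six numbers for a numeric variable
s <- summary(DEMO$income)
print(s)

# The same six numbers, computed one at a time
by_hand <- c(
  min(DEMO$income),             # "Min."    smallest value
  quantile(DEMO$income, 0.25),  # "1st Qu." 25% of values fall below this
  median(DEMO$income),          # "Median"  middle value
  mean(DEMO$income),            # "Mean"    arithmetic average
  quantile(DEMO$income, 0.75),  # "3rd Qu." 75% of values fall below this
  max(DEMO$income)              # "Max."    largest value
)
print(unname(by_hand))
```

Running summary() on the whole data frame, as on the slide, simply repeats this for every column.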
Metadata
• There is a text file that goes with the CSV dataset: “icecream.txt”
• This describes the meaning of the variables provided in the dataset; essential if we are to make sense of these data:
Variable labels:
cons: consumption of ice cream per head (in pints)
income: average family income per week (in US Dollars)
price: price of ice cream (per pint)
temp: average temperature (in Fahrenheit)
Time: index from 1 to 30
• We also learn from the metadata that these are time series data: 30 four-weekly observations from 18 March 1951 to 11 July 1953
“Sanity Check” Using Histograms and Boxplots
• Cleaning, screening, and preprocessing are essential: they ensure that you understand what your data set contains and that it does not contain garbage. It is impractical to look at every data point, so we use histograms and boxplots to get an overview of our data:
hist(ICECREAM$income)
boxplot(ICECREAM$income)
• What is the purpose of the “$” notation in the commands above? Is there any other way of referring to these variables?
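As a hint for the second question, here is a minimal sketch of several equivalent ways to pull one column out of a data frame. DEMO is a hypothetical stand-in for ICECREAM with made-up values.

```r
# Hypothetical stand-in for ICECREAM (values are made up)
DEMO <- data.frame(income = c(76, 78, 80, 82),
                   cons   = c(0.30, 0.32, 0.35, 0.33))

by_dollar   <- DEMO$income         # "$" extracts one column by name
by_brackets <- DEMO[["income"]]    # double brackets do the same
by_colname  <- DEMO[, "income"]    # row/column indexing by column name
by_position <- DEMO[, 1]           # row/column indexing by column position
by_with     <- with(DEMO, income)  # with() allows bare column names

# All five give the same numeric vector
print(identical(by_dollar, by_brackets) &&
      identical(by_dollar, by_colname) &&
      identical(by_dollar, by_position) &&
      identical(by_dollar, by_with))
```

Any of these forms can be passed to hist() or boxplot() in place of ICECREAM$income.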
Interpret These Graphics
Explore
• Perhaps a family with greater income can afford to purchase more ice cream:
plot(ICECREAM$income,ICECREAM$cons)
• How do you interpret a scatterplot?
• Is there a pattern here?
• Does our intuitive hypothesis fit the scatterplot?
• What else could scatterplots show?
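To calibrate your eye, it can help to compare a scatterplot with a clear pattern against one with none. The sketch below uses made-up vectors (x, y_up, and y_flat are hypothetical and say nothing about what the ice cream data show) and uses cor() as a rough numeric summary of a straight-line pattern.

```r
# Hypothetical data, for illustration only
x      <- 1:8
y_up   <- c(2, 3, 3, 5, 6, 6, 8, 9)  # tends to rise as x rises
y_flat <- c(3, 1, 4, 2, 2, 4, 1, 3)  # no straight-line trend with x

# Draw the two scatterplots side by side
op <- par(mfrow = c(1, 2))
plot(x, y_up,   main = "Clear upward pattern")
plot(x, y_flat, main = "No linear pattern")
par(op)

# cor() runs from -1 to 1; values near 0 mean no straight-line pattern
r_up   <- cor(x, y_up)
r_flat <- cor(x, y_flat)
print(c(r_up = r_up, r_flat = r_flat))
```

Try cor(ICECREAM$income, ICECREAM$cons) and decide which of the two panels your own scatterplot resembles more.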
More Tools to Support Exploration
results=lm(ICECREAM$cons~ICECREAM$temp)
# This is a comment line
# The previous command calculates a line
# that best fits the scatterplot with temp
# on the X axis and cons on the Y axis
plot(ICECREAM$temp,ICECREAM$cons)
abline(results) # Plots the best fit line
# The new data structure “results” has
# lots of information about the analysis.
# What does this list contain:
results$residuals
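A minimal sketch of how one might look inside a fitted model object follows. The temp and cons vectors here are hypothetical values chosen for illustration, not the real data.

```r
# Hypothetical data (NOT the real ice cream values)
temp <- c(41, 56, 63, 68, 69, 65, 61, 47)
cons <- c(0.30, 0.33, 0.36, 0.41, 0.42, 0.38, 0.35, 0.31)

results <- lm(cons ~ temp)

# The fitted model is a list; names() shows what it contains
print(names(results))

# Intercept and slope of the best-fit line
print(coef(results))

# A residual is the observed value minus the value predicted by the line
by_hand <- cons - unname(fitted(results))
print(all.equal(unname(results$residuals), by_hand))
```

summary(results) prints a fuller report, including standard errors and R-squared.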
What is the effect of time on these data?
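# R names are case-sensitive: the metadata lists this variable as "Time",
# so check names(ICECREAM) if the commands below fail
plot(ICECREAM$time,ICECREAM$temp)
plot(ICECREAM$time,ICECREAM$cons)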
• What do these plots show? Can you explain why these are shaped the way they are?
• Based on your answer to the previous question, how does the situation affect your strategies for understanding ice cream consumption?
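When two variables both follow the calendar, stacking their time plots makes the shared shape easier to see. The sketch below uses simulated series: the sine-wave form, the 13-period cycle (about 13 four-week periods per year), and every number in it are assumptions for illustration, not the ice cream data.

```r
# Simulated series sharing one seasonal cycle (all values are assumptions)
time <- 1:30
season <- sin(2 * pi * (time - 1) / 13)  # about 13 four-week periods per year
temp <- 50 + 20 * season                 # hypothetical temperature
cons <- 0.35 + 0.10 * season             # hypothetical consumption

# Stack the two time plots so their shapes can be compared
op <- par(mfrow = c(2, 1))
plot(time, temp, type = "b", main = "temp over time (simulated)")
plot(time, cons, type = "b", main = "cons over time (simulated)")
par(op)

# Two series driven by the same cycle move together
r <- cor(temp, cons)
print(r)
```

If the real plots look like this, a pattern between two variables may partly reflect the time of year that drives both of them.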
Demonstrating Mastery
• Find a small numeric dataset; try starting at the Journal of Statistics Education data website: http://www.amstat.org/publications/jse/jse_data_archive.htm
• Read the dataset into R
• Summarize the variables in that dataset
• Use histograms and boxplots to check and understand your data; use the metadata description that came with the dataset to make sure that you know the variables
• Explore the data using plot; look for something interesting
• Put your findings in a slide and communicate them to me or someone else