18 August 2015 1
Statistical Analysis with R
Questionnaires
Variables organization
Descriptive analysis
Graphs
Statistical tests
R
Statistical package
4th generation programming language
  extensible through functions and extensions
Environment for statistical computing and graphics
  statistical and graphical techniques
  extensible through packages
Competitors: SPSS, Matlab
Variables
Scale or numeric variables
  time, age, weight, distance in kilometers, length, number of children, GDP
Nominal or categorical variables
  country of residence, sex, degree course
Ordinal variables
  education level, rankings, Likert scale
  in statistical analysis they are often treated as nominal or scale variables
Questionnaire overview
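A minimal sketch of how the three kinds of variables can be represented in R. The answers below are invented for illustration and are not part of the original questionnaire.

```r
# Hypothetical questionnaire answers, invented for illustration
age <- c(23, 31, 45, 27)                                       # scale variable
country <- factor(c("Italy", "Germany", "Italy", "Austria"))   # nominal variable
education <- factor(c("high school", "bachelor", "master", "bachelor"),
                    levels = c("high school", "bachelor", "master"),
                    ordered = TRUE)                            # ordinal variable

mean(age)             # averages make sense for scale variables
table(country)        # counts make sense for nominal variables
education < "master"  # comparisons make sense for ordinal variables
```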
Missing values
NA: means "not available"; you insert it manually whenever a datum is missing
NaN: means "not a number"; it appears whenever a calculation cannot be done for a datum
Many analyses skip them, but base functions such as mean() return NA unless na.rm=TRUE is given
Any math operation with them gives NA or NaN
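A short sketch of how missing values behave in base R, using a made-up vector:

```r
x <- c(4, NA, 10)         # made-up data with one missing datum

x + 1                     # the missing datum stays missing
mean(x)                   # NA, because one value is unknown
mean(x, na.rm = TRUE)     # the missing datum is skipped
impossible <- 0 / 0       # a calculation that cannot be done gives NaN
is.na(x)                  # finds the missing data
is.nan(impossible)        # TRUE
```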
Portable R
Use Portable R
  download it from my website already preconfigured
  or download it from http://rportable.sourceforge.net
  uncompress it on your computer’s hard disk or on a USB pendrive
or install R on your computer
  download it from www.r-project.org
  install it on your computer
  try desperately to set the language to English
Installing packages
To install R commander
  Packages → Install Package(s)... → CRAN Mirror → Rcmdr
  wait for installation of Rcmdr and additional packages
To load R commander
  Packages → Load Package... → Rcmdr
  to the warning on missing packages answer Yes
  answer to download them from CRAN
Learn to load an R package
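The menu steps above also have a line-command equivalent. The helper `ensure_package` below is a hypothetical name introduced only for this sketch; `install.packages`, `requireNamespace` and `library` are the standard R functions.

```r
# Hypothetical helper: install a package from CRAN only if it is missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here
ok <- ensure_package("stats")

# For R commander one would type, in an interactive session:
# ensure_package("Rcmdr")
# library(Rcmdr)
```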
Running R commander
Whenever you want to run it
  Packages → Load Package... → Rcmdr
  File → Change Working directory
R commander has problems navigating through your directory tree
Choose an easy-to-find directory, such as your Desktop or the place where you keep your R exercises.
Files to save
R commander windows
  script, contains the written instructions
    R commander File → Save Script as…
  output, contains the output
    R commander File → Save Output as…
    or pasting them into a text file
Workspace
  contains the data structure
    File → Save Workspace…
    R commander File → Save R workspace As…
    File → Load Workspace…
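Saving and loading can also be done by line command. This is only a sketch: the file name and the small data set are invented, and a temporary directory stands in for your own folder.

```r
# Invented file name inside a temporary directory
ws <- file.path(tempdir(), "exercise.RData")

answers <- data.frame(id = 1:3, age = c(23, 31, 45))  # invented data
save(answers, file = ws)   # save selected objects (save.image() saves everything)

rm(answers)                # the object is gone from the session...
load(ws)                   # ...and comes back from the file
```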
data.frame or dataset
database table suited for statistical analysis
case names are optional
Building a new dataset
R commander Data → New data set…
  Insert all variables first
  Only afterwards insert data and build a codebook
    use numbers for nominal and ordinal variables
Convert nominal and ordinal variables to factor
  R commander Data → Manage variables in active data set → Convert numeric variables to factor
Convert ordinal variables to ordered
  Submit the 3 lines of code with ordered instead of factor
ls.str() and str(dataset)
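The same workflow can be sketched by line command. The data set `survey`, its variables and its codebook are invented for illustration.

```r
# Invented data set: numbers are used for nominal and ordinal variables
survey <- data.frame(
  age          = c(23, 31, 45, 27),
  sex          = c(1, 2, 2, 1),   # codebook: 1 = male, 2 = female
  satisfaction = c(3, 1, 2, 3)    # codebook: 1 = low, 2 = medium, 3 = high
)

# Nominal variable: convert to factor
survey$sex <- factor(survey$sex, levels = c(1, 2),
                     labels = c("male", "female"))

# Ordinal variable: same call with ordered instead of factor
survey$satisfaction <- ordered(survey$satisfaction, levels = 1:3,
                               labels = c("low", "medium", "high"))

str(survey)   # shows the type of every variable
```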
Importing dataset
R commander Data
  import from a package
    Data in packages
  import from a text file
    Import Data → from text file, clipboard or URL…
  import from Excel (hoping that it works)
    Import Data → from Excel, Access or dBase data set…
  export to a text file
    Active data set → Export active data set…
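A sketch of exporting to and importing from a text file by line command, using the built-in CO2 data set. The file name is invented and a temporary directory stands in for your working directory.

```r
path <- file.path(tempdir(), "co2_export.txt")   # invented file name

# Export: tab-separated text file with variable names in the first row
write.table(CO2, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Import: read it back, turning text columns into factors
co2_in <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = TRUE)
str(co2_in)
```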
Importing dataset from SPSS
written here just in case you'll ever need it; converting to a text file is better and easier!
R commander Data → Import Data → from SPSS data set…
Pay attention to value labels and factors
Date importing is wrong! Fix it with
  library(chron)
  var <- as.chron(ISOdate(1582, 10, 14) + var)
Univariate descriptive analysis
Statistics → Summaries
  For scale variables
    Numerical summaries
  For ordinal and nominal variables
    Frequency distributions
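A line-command sketch of the same univariate descriptions, using the built-in CO2 data set (uptake is a scale variable, Type a nominal one):

```r
# Scale variable: numerical summaries
summary(CO2$uptake)
sd(CO2$uptake)

# Nominal variable: frequency distribution, in counts and in percentages
type_counts <- table(CO2$Type)
type_counts
round(100 * prop.table(type_counts), 1)
```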
Graphs for one nominal variable
Pie chart
Radar graph
[Figure: pie chart and radar graph of values by day of the week, Mon to Sun]
Graphs for one nominal variable
Bar plot
Line plot
[Figure: example plots of monthly percentages (Apr to Sep, 0.0% to 1.4%) and of values for JP, US and EPO]
Graphs for one nominal variable
Area plot
3D variants
[Figure: area plot of monthly percentages, Apr to Sep, 0.0% to 1.4%]
Graphs for one nominal variable
R commander Graphs
  Color palette…
  Bar graph…
  Pie chart…
To change colors, add the option col=c(numbers of the colors from the palette) to the text command, then select the text command and submit it
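A sketch of the bar graph and pie chart commands with the col option added, using the built-in chickwts data set (feed is a nominal variable with six categories). The color numbers refer to positions in the current palette.

```r
feed_counts <- table(chickwts$feed)   # frequencies of the nominal variable

palette()                             # shows which color each number stands for

# One color number per category
barplot(feed_counts, col = c(2, 3, 4, 5, 6, 7), ylab = "Frequency")
pie(feed_counts, col = c(2, 3, 4, 5, 6, 7))
```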
Graphs for one scale variable
Building a histogram
Grouping into bins
[Figure: histogram of amounts from $1,000 to $5,000]
Graphs for one scale variable
Choosing the bins carefully
[Figure: two histograms of amounts from $1,000 to $5,000 built with different bins]
Graphs for one scale variable
Boxplot
  Median is the black line
  Central 50% is in the rectangle
  Whiskers reach the most extreme values lying within 1.5 rectangle lengths of the rectangle (R's default)
  More extreme values are drawn as symbols
One scale variable case by case
Only for scale variables with few cases
Use any appropriate nominal variable graph
Graphs for one scale variable
R commander Graphs
  Histogram…
  Boxplot…
  Index plot…
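A base R sketch of the three graphs for one scale variable, using uptake from the built-in CO2 data set. The number of bins is only a suggestion that R may adjust.

```r
uptake <- CO2$uptake

# Histogram: breaks suggests how many bins to use
h <- hist(uptake, breaks = 10, main = "Histogram of uptake", xlab = "uptake")

# Boxplot of the same variable
boxplot(uptake, ylab = "uptake")

# Index plot: one vertical line per case, in data set order
plot(uptake, type = "h", xlab = "case", ylab = "uptake")
```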
Bivariate analysis: nominal vs nominal
Statistics → Contingency table → Two-way table…
  Percentages
Understand clearly when to use row percentages and when column percentages
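A sketch of a two-way table with row and column percentages, using the built-in mtcars data set (cyl and am are treated here as nominal variables):

```r
tab <- table(cylinders = mtcars$cyl, automatic = mtcars$am)
tab

# Row percentages: each row sums to 100
row_pct <- 100 * prop.table(tab, 1)
round(row_pct, 1)

# Column percentages: each column sums to 100
col_pct <- 100 * prop.table(tab, 2)
round(col_pct, 1)
```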
Graphs for nominal vs nominal
Side by side
Stacked
[Figure: side-by-side and stacked bar plots of app categories (Entertainment, Games, Lifestyle, News, Social networking) for iPhone and Android]
Graphs for nominal vs nominal
Appropriate 3D variants
[Figure: 3D bar plot of app categories (Entertainment, Games, Lifestyle, News, Social networking) for iPhone and Android]
Graphs for nominal vs nominal
A rare example of a useful stacked area chart
Graphs for nominal vs nominal
No graph available in the R commander menus as far as I know
How to export your graphics into Word
  right-click → copy as bitmap
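For nominal vs nominal graphs, base R's barplot accepts a two-way table by line command. This is a sketch on the built-in mtcars data set, not an R commander menu command.

```r
tab <- table(automatic = mtcars$am, cylinders = mtcars$cyl)

# Side by side: one bar per combination of categories
barplot(tab, beside = TRUE, legend.text = TRUE, xlab = "cylinders")

# Stacked: one bar per column, split by row category
barplot(tab, legend.text = TRUE, xlab = "cylinders")
```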
Bivariate analysis: scale vs nominal
Statistics → Summaries
  Numerical summaries → Summarize by groups…
  Table of statistics…
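A line-command sketch of summaries by groups on the built-in CO2 data set (uptake is scale, Type and Treatment are nominal):

```r
# Average of the scale variable for each category of one nominal variable
group_means <- tapply(CO2$uptake, CO2$Type, mean)
group_means

# Table of statistics: averages for every combination of two nominal variables
stats_table <- aggregate(uptake ~ Type + Treatment, data = CO2, FUN = mean)
stats_table
```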
Graphs for scale vs nominal
Boxplot side by side
Histogram one above the other
Graphs for two variables
R commander Graphs
  Boxplot… → Plot by groups…
Bivariate analysis: scale vs scale
Statistics → Summaries → Correlation matrix
  Pearson linear correlation
  Spearman rank correlation
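A sketch of both correlation matrices by line command, on three scale variables of the built-in mtcars data set:

```r
vars <- mtcars[, c("mpg", "wt", "hp")]

pearson  <- cor(vars)                       # Pearson linear correlation (default)
spearman <- cor(vars, method = "spearman")  # Spearman rank correlation

round(pearson, 2)
round(spearman, 2)
```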
Scale versus scale
Mathematical graph
Regression line
Graphs for two variables
R commander Graphs
  Scatterplot…
    Remove all the unnecessary options
  Line graph… (mathematical graph)
    X variable must have values in order
Multivariate analysis
Statistics
  three nominal
    Contingency table → Multi-way table
  three scale
    Summaries → Correlation matrix
Graphs for three scale variables
Bubble chart
www.gapminder.org
Graphs for two scale and one nominal variables
R commander Graphs
  Scatterplot… → Plot by groups…
Restrict data set
R commander Data → Active Data Set
  Subset active data set…
    Used to restrict data set to some cases
    Use labels and not numbers for nominal variables!
  Remove cases with missing data…
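A line-command sketch of both operations, using the built-in CO2 and airquality data sets:

```r
# Restrict the data set to some cases: use the label of the nominal variable
quebec <- subset(CO2, Type == "Quebec")
nrow(quebec)

# Conditions on scale variables work in the same way
may <- subset(airquality, Month == 5)

# Remove cases with missing data
complete <- na.omit(airquality)
nrow(airquality) - nrow(complete)   # number of removed cases
```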
Recode
Used to create or modify factor/ordered variables
R commander Data → Manage variables in active data set → Recode variables…
  "Bolzano"="here"
  c("Munich","Hannover","Bonn")="Germany"
  Do not use "Munich","Hannover","Bonn"="Germany" as suggested by the help
  else="Others"
For numerical variables we may also use 8:27="high" together with lo and hi
Massive recoding
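The recoding rules above can be mimicked in base R. This sketch does not use the R commander recode dialog; it applies the same rules with ifelse to an invented vector of cities.

```r
# Invented data
city <- c("Bolzano", "Munich", "Hannover", "Bonn", "Paris")

# Same rules as in the recode directives: Bolzano, the German cities, everything else
area <- ifelse(city == "Bolzano", "here",
        ifelse(city %in% c("Munich", "Hannover", "Bonn"), "Germany", "Others"))
area <- factor(area)

table(area)
```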
Binning
Used to group a scale variable into ordered classes (but it produces a factor, not an ordered)
R commander Data → Manage variables in active data set → Bin numeric variable…
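By line command, base R's cut does the binning and can return an ordered variable directly. The break points and labels below are arbitrary choices for this sketch on the built-in CO2 data set.

```r
# Arbitrary bins: (0,200], (200,500], (500,1000]
conc_level <- cut(CO2$conc,
                  breaks = c(0, 200, 500, 1000),
                  labels = c("low", "medium", "high"),
                  ordered_result = TRUE)   # ordered instead of plain factor

level_counts <- table(conc_level)
level_counts
```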
Compute
Used to create a new variable through math operations
R commander Data → Manage variables in active data set → Compute new variable…
  newvector <- with(dataset, formula)
  CO2$myname <- with(CO2, uptake*7-sqrt(conc))
it is identical to
  CO2$myname <- CO2$uptake*7-sqrt(CO2$conc)
Computing (line command)
The instruction produced by compute
  CO2$myname <- with(CO2, uptake*7-sqrt(conc))
can be easily typed directly by you! Or you can type
  CO2$myname <- CO2$uptake*7-sqrt(CO2$conc)
Variables’ names must be preceded by dataset’s name and $
<- means take things from the right and put on the left
Computing (line command)
If you do not specify dataset$, the variable is created outside the dataset with only one case (unless otherwise specified)
  print(variable) to look at it
Variable assignment
  variable <- value or formula
  value or formula -> variable
Operators: + - * / **
Computing (line command)
A variable with many cases outside a dataset is called “vector”
  vector <- c(list of items) to create it manually
  vector[index] to access a specific element of the vector
  vector[from:to] to access a sequence of elements of the vector
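A short sketch of creating a vector and accessing its elements, with made-up values:

```r
ages <- c(25, 26, 27, 28, 29)   # create the vector manually

second <- ages[2]     # a specific element (indexes start at 1)
middle <- ages[2:4]   # a sequence of elements

print(second)
print(middle)
length(ages)          # number of cases in the vector
```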
Statistical tests
Example: we want to study the age of Internet users, checking whether the average age is 35 years or not
The only information we have is the
observations on a sample of 100 users, which are: 25; 26; 27; 28; 29; 30; 31; 30; 33; 34; 35; 36; 37; 38; 30; 30; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 20; 54; 55; 56; 57; 20; 20; 20; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35.
Statistical tests
Test’s hypotheses:
  H0: average age on population is 35
  H1: average age on population is not 35
We calculate the average age on the sample, 36.2, which is an estimate of the population’s average age. We compare this result with the 35 of the H0 hypothesis and we find a difference of +1.2.
We ask ourselves whether this difference is:
  large, implying that the population’s average age is not 35 and thus H0 must be rejected
  small, so that it can be caused by random fluctuation in the sample choice and therefore H0 cannot be rejected.
Statistical tests
In order to answer, the test provides us with a significance: the probability of observing a difference at least this large if H0 is true
  In this example significance is 16%
If significance is large, we do not reject H0 (loosely, we "accept" it)
  this implies that we do not know
If significance is small, we reject H0
  this implies that we are almost sure that H0 is false
Significance is also called p-value
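The example can be reproduced by line command. The sketch below types in the 100 observations listed earlier and runs a single-sample Student's t test against 35; the sample average and the significance it returns match the 36.2 and 16% quoted above. The variable names are chosen only for this sketch.

```r
ages <- c(25, 26, 27, 28, 29, 30, 31, 30, 33, 34,
          35, 36, 37, 38, 30, 30, 41, 42, 43, 44,
          45, 46, 47, 48, 49, 50, 51, 52, 20, 54,
          55, 56, 57, 20, 20, 20, 30, 31, 32, 33,
          34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
          44, 45, 46, 47, 48, 49, 50, 20, 21, 22,
          23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
          33, 34, 35, 36, 37, 38, 39, 40, 35, 36,
          37, 35, 36, 37, 35, 36, 37, 35, 36, 37,
          35, 36, 37, 35, 36, 37, 35, 36, 37, 35)

mean(ages)                        # sample average

result <- t.test(ages, mu = 35)   # H0: average age on population is 35
result$p.value                    # significance (p-value)
```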
Typical univariate analysis techniques
Variables | Numerical description | Graphical description | Parametric test | Non-parametric test
nominal | Frequencies (one-dimensional contingency table) | Column plot, Pie chart | --- | Chi-square for a one-dimensional contingency table
scale | Descriptive statistics | Histogram, Boxplot | Student’s t for one variable | Sign test
Tests for one scale variable
Student’s t test for one variable
  H0: average on the population = m
  Statistics → Means → Single-sample t-test
Sign test
  H0: median on the population = m
  Not available in R commander
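Since the sign test has no R commander menu entry, one way to run it by line command is through base R's exact binomial test: under H0 a value is equally likely to fall above or below m. This is a sketch on the built-in CO2 data set, with m = 30 chosen arbitrarily.

```r
x <- CO2$uptake
m <- 30                   # hypothesized median (arbitrary choice for this sketch)

above <- sum(x > m)       # cases above the hypothesized median
n     <- sum(x != m)      # cases equal to m are discarded

sign_test <- binom.test(above, n, p = 0.5)
sign_test$p.value
```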
Tests for one nominal variable
Chi-square test for a one-dimensional contingency table
  H0: classification follows a predetermined distribution
  Statistics → Summaries → Frequency distributions… → Chi-square
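A line-command sketch of the same test on the built-in chickwts data set. The predetermined distribution assumed here is the uniform one (all six feed types equally frequent), which is only an example.

```r
feed_counts <- table(chickwts$feed)

# H0: the six categories follow the given distribution (here: all equal)
gof <- chisq.test(feed_counts, p = rep(1/6, 6))
gof
gof$expected   # expected frequencies under H0
```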
Typical bivariate analysis techniques
Variables | Numerical description | Graphical description | Parametric test | Non-parametric test
nominal vs nominal | 2D contingency table | Clustered or stacked or 3D column plot | --- | Chi-square for a 2D contingency table
binary nominal vs scale | Descriptive statistics by groups | Boxplots or histograms by groups | Student’s t for two populations | Mann-Whitney
non-binary nominal vs scale | Descriptive statistics by groups | Boxplots or histograms by groups | One-way analysis of variance (ANOVA) | Kruskal-Wallis
scale vs scale | Pearson’s or Spearman’s correlation | Scatterplot | Pearson’s correlation; Student’s t for paired data | Spearman’s correlation; Wilcoxon signed rank test
Tests for two nominal variables
Chi-square test for a two-dimensional contingency table
  H0: classification of two variables is independent
  Statistics → Contingency table → Two-way table… → Statistics → Chi-square test of independence
  Warning: you should have no expected frequency less than 5
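A line-command sketch of the test, including the check on expected frequencies, using hair and eye color from the built-in HairEyeColor table:

```r
# Two-way table: hair color vs eye color (summed over sex)
tab <- margin.table(HairEyeColor, c(1, 2))
tab

indep <- chisq.test(tab)   # H0: the two classifications are independent
indep$p.value

# Check the warning: no expected frequency should be less than 5
min(indep$expected)
```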
Test for binary nominal vs scale
Student’s t test for two populations
  H0: average group 1 = average group 2
  Statistics → Means → Independent samples t-test
  Warning: scale variable should be normally distributed in each of the two groups
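A line-command sketch on the built-in CO2 data set, where Type is a binary nominal variable and uptake a scale variable. Note that t.test by default does not assume equal variances in the two groups.

```r
# H0: average uptake is the same for the two types
two_groups <- t.test(uptake ~ Type, data = CO2)
two_groups
two_groups$p.value
```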
Non-parametric test for binary nominal vs scale
Mann-Whitney (Wilcoxon rank-sum)
  It tests the ranks
  H0: position group 1 = position group 2
  Statistics → Nonparametric tests → Two-sample Wilcoxon test
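The corresponding line command, sketched on the built-in CO2 data set. The option exact = FALSE asks for the normal approximation, because the data contain ties.

```r
# H0: uptake has the same position in the two types
rank_sum <- wilcox.test(uptake ~ Type, data = CO2, exact = FALSE)
rank_sum
rank_sum$p.value
```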
Test for non-binary nominal vs scale
ANOVA (ANalysis Of VAriance)
  H0: average is the same for all groups
  Statistics → Means → One-way ANOVA
  Test rejects if just one population’s average is different from the others
  Warning: scale variable should be normally distributed for each group
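A line-command sketch of one-way ANOVA on the built-in chickwts data set (feed has six categories, weight is a scale variable):

```r
# H0: average weight is the same for all feed types
fit <- aov(weight ~ feed, data = chickwts)
anova_table <- summary(fit)
anova_table

# Significance of the test
anova_p <- anova_table[[1]][["Pr(>F)"]][1]
anova_p
```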
Non-parametric test for non-binary nominal vs scale
Kruskal-Wallis
  It tests the ranks
  H0: position is the same for all groups
  Statistics → Nonparametric tests → Kruskal-Wallis test
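The corresponding line command, sketched on the built-in chickwts data set:

```r
# H0: weight has the same position for all feed types
kw <- kruskal.test(weight ~ feed, data = chickwts)
kw
kw$p.value
```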
Tests for two scale variables
Pearson’s and Spearman’s correlation tests
  H0: correlation = 0
  Statistics → Summaries → Correlation test
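Both correlation tests by line command, sketched on two scale variables of the built-in mtcars data set. For Spearman, exact = FALSE is used because the data contain ties.

```r
# H0: correlation between consumption (mpg) and weight (wt) is 0
pearson_test  <- cor.test(mtcars$mpg, mtcars$wt)
spearman_test <- cor.test(mtcars$mpg, mtcars$wt,
                          method = "spearman", exact = FALSE)

pearson_test$estimate;  pearson_test$p.value
spearman_test$estimate; spearman_test$p.value
```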
Tests for difference of two scale variables
When using tests on variables’ differences
Student’s t test for paired data
  H0: average (var 1 – var 2) = 0
  Statistics → Means → Paired t test
  Warning: distribution of the difference of the scale variables must be normal
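A line-command sketch on the built-in sleep data set, where the same ten patients were measured under two drugs, so the two sets of values are paired:

```r
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# H0: average (drug1 - drug2) = 0
paired <- t.test(drug1, drug2, paired = TRUE)
paired
paired$p.value
```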
Nonparametric test for two scale paired variables
Wilcoxon signed-rank test
  It tests the ranks
  H0: var 1 – var 2 is positioned around 0
  Statistics → Nonparametric tests → Paired-samples Wilcoxon test
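The corresponding line command, sketched on the built-in sleep data set. The option exact = FALSE asks for the normal approximation, because the differences contain ties and a zero.

```r
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# H0: drug1 - drug2 is positioned around 0
signed_rank <- wilcox.test(drug1, drug2, paired = TRUE, exact = FALSE)
signed_rank
signed_rank$p.value
```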
Is a variable normally distributed?
Histogram with normal curve
  Find out average a and standard deviation s
  Build a histogram with appropriate binning
  close it, add prob=TRUE and rebuild it
  do not close it!
  curve(dnorm(x, mean=a, sd=s), col="blue", lwd=2, add=TRUE, yaxt="n")
Q-Q plot (data must be on the line)
  Graphs → Quantile-comparison Plot
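The whole procedure can be sketched by line command on uptake from the built-in CO2 data set. Here qqnorm and qqline are base R stand-ins for the quantile-comparison plot of the menu.

```r
values <- CO2$uptake
a <- mean(values)   # average
s <- sd(values)     # standard deviation

# Histogram on the density scale, then the normal curve on top of it
hist(values, prob = TRUE, main = "uptake with normal curve", xlab = "uptake")
curve(dnorm(x, mean = a, sd = s), col = "blue", lwd = 2, add = TRUE)

# Q-Q plot: for normal data the points lie close to the line
qqnorm(values)
qqline(values)
```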
Is a variable normally distributed?
Skewness
  negative: tail left, positive: tail right
Excess kurtosis
  negative: flat, 0: normal, positive: too pointy
  Statistics → Summaries → Numerical summaries → Options
Shapiro-Wilk normality test
  H0: variable comes from a normal distribution
  Statistics → Summaries → Shapiro-Wilk test of normality
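A line-command sketch on uptake from the built-in CO2 data set. Base R has no skewness or kurtosis function, so the moment-based formulas below are one common definition, written out by hand; packages may use slightly different ones.

```r
values <- CO2$uptake
dev <- values - mean(values)
m2  <- mean(dev^2)                        # second central moment

skewness        <- mean(dev^3) / m2^1.5   # negative: tail left, positive: tail right
excess_kurtosis <- mean(dev^4) / m2^2 - 3 # 0 for a normal distribution

skewness
excess_kurtosis

# H0: the variable comes from a normal distribution
sw <- shapiro.test(values)
sw$p.value
```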