R Program.pdf

R R

( ) (Ph.D.)

Source: http://en.wikipedia.org/wiki/R_programming_language (accessed June 23, 2011)

1. R

2. Download R

3. R

4.

1. R

Everitt and Hothorn (2006: 1) describe the R program as follows:

The R system for statistical computing is an environment for data analysis and graphics. The root of R is the S language, developed by John Chambers and colleagues (Becker et al., 1988, Chambers and Hastie, 1992, Chambers, 1998) at Bell Laboratories (formerly AT&T, now owned by Lucent Technologies) starting in the 1960s. The base distribution of R and a large number of user contributed extensions are available under the terms of the Free Software Foundation's GNU General Public License in source code form.

1. R (continued)

R consists of 2 parts: (1) the base system and (2) user contributed add-on packages.

Advantage of R: it is open source.

Limitation of R: it is a non-standard program compared with packages such as SPSS and Minitab.

2. Download R

2.1 Go to the R website, http://www.r-project.org/, and under Download, Packages choose CRAN.

2.2 Choose a mirror: Thailand.

2.3 Choose the operating system: Windows.

2.4 Choose the subdirectory: base.

2.5 Choose "Download R 2.13.0 for Windows".

2.6 Install R.

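2. Download R (continued)

Besides the base system, R uses add-on packages. library() lists the installed packages, and library(name) loads one of them:

> library()
> library(survival)

Additional packages are installed from the menu Packages, Install packages.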

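3. R

Basic rules

Commands on the same line are separated by a semicolon (;), and a group of commands is enclosed in braces { }.

Text after # is a comment and is not run by R.

R shows the > prompt when it is ready for a command.

history() shows the commands entered earlier.

R reports problems as warnings and errors (Warning and Error).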

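Assigning values in R

Values are assigned with <- or =.

> x <- 9.5
> x = 9.5

A vector such as (2, 3, 6, 8) is created with c() (concatenation operator).

> y = c(2,3,6,8)

rep() repeats values: rep(1:6, 2) repeats 1 to 6 two times, and rep("AB", 5) repeats "AB" five times.

> aa = rep(1:6, 2)
> rep("AB", 5)

NA marks a missing value. A mean that includes NA is NA:

> e=c(1,4,6,4,NA,3); mean(e)
[1] NA

is.na() shows which values are missing:

> is.na(e)
[1] FALSE FALSE FALSE FALSE  TRUE FALSE

na.rm=TRUE removes the missing values before the calculation:

> mean(e, na.rm=TRUE)
[1] 3.6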
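The missing-value handling above can be sketched as a short script. It reuses the vector e from the slide; the names m_all, m_obs and n_missing are introduced here only for illustration.

```r
# Vector with one missing value, as on the slide
e <- c(1, 4, 6, 4, NA, 3)

m_all <- mean(e)                 # NA, because one value is missing
m_obs <- mean(e, na.rm = TRUE)   # mean of the 5 observed values
n_missing <- sum(is.na(e))       # number of missing values

print(m_all)
print(m_obs)
print(n_missing)
```

Counting the missing values with sum(is.na(e)) before dropping them is a simple way to see how much data na.rm=TRUE will ignore.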

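Selecting elements of a vector in R

x[i] gives element i of the vector x.

> y=1:6               y is 1,2,3,4,5,6
> x=rev(sort(y))      x is y in descending order: 6,5,4,3,2,1
> x[2]                element 2 of x: 5
> x[3:5]              elements 3-5 of x: 4,3,2
> x[-length(x)]       x without its last element (position length(x)): 6,5,4,3,2
> x[-1]               x without element 1: 5,4,3,2,1
> edit(x)             opens x in an editor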

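Entering and arranging data in R

scan() reads values typed at the keyboard.

gl(n, k, length = n*k, labels = 1:n) generates a factor: n is the number of levels, k is the number of replications of each level, length is the length of the result (n*k by default), and labels gives the labels of the levels.

> gl(5,2)             5 levels, each repeated 2 times
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5

sort(x) sorts x in ascending order.

rev(sort(x)) sorts x in descending order.

order() gives the positions that put the data in order.

merge(dataA, dataB, by= ) joins data sets A and B on a common variable.

rbind(dataA, dataB) appends the rows of data set B to data set A.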

Arithmetic in R

+      (addition)
-      (subtraction)
*      (multiplication)
/      (division)             > x <- 2/3       (x = 0.6667)
^      (exponentiation)       > x = 2^3        (x = 8)
sqrt() (square root)          > x = sqrt(4)    (x = 2)
abs()  (absolute value)       > x = abs(-2)    (x = 2)
log()  (natural logarithm)    > x = log(2)     (x = 0.6931472)
log10() (common logarithm)    > x = log10(2)   (x = 0.30103)
exp()  (exponential)          > x = exp(2)     (x = 7.389056)
%/%    (integer division)     > 6 %/% 4        1
%%     (modulus)              > 6 %% 4         2

Logical and relational operators in R

==     (exactly equal to)
!=     (not equal to)
<      (less than)
>      (greater than)
<=     (less than or equal to)
>=     (greater than or equal to)
&      (AND)
|      (OR)
!      (NOT)

Example 1: Let x1 = (1,2,3), x2 = (4,5,6) and x3 = (7,8,9). Find

1. 3*x1
2. x1*x2
3. x3 - x2
4. x3 + x2
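One way to work through Example 1 is sketched below. Arithmetic on vectors in R is done element by element; the result names r1 to r4 are introduced here only for illustration.

```r
x1 <- c(1, 2, 3)
x2 <- c(4, 5, 6)
x3 <- c(7, 8, 9)

r1 <- 3 * x1    # each element of x1 multiplied by 3
r2 <- x1 * x2   # element-by-element product, not an inner product
r3 <- x3 - x2   # element-by-element difference
r4 <- x3 + x2   # element-by-element sum

print(rbind(r1, r2, r3, r4))
```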

Basic statistics in R

sum(x)      sum of x                   > sum(x1)       6
mean(x)     mean of x                  > mean(x1)      2
median(x)   median of x                > median(x1)    2
var(x)      variance of x              > var(x1)       1
sd(x)       standard deviation of x    > sd(x1)        1

Writing functions in R

function () { expr; return(value) }

The arguments are written inside ( ) and the commands of the function inside { }.

Example: the skewness statistic

$$ SK = \frac{m_3}{m_2^{3/2}}, \qquad m_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3, \qquad m_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 $$

> sk = function(x) {
    xbar = mean(x)                 ## sample mean
    m2 = mean((x-xbar)^2)          ## second central moment
    m3 = mean((x-xbar)^3)          ## third central moment
    sk1 = m3/m2^1.5                ## skewness statistic
    cat("x =", x, "\n", "Skewness statistic =", sk1, "\n")
    invisible(sk1)                 ## return the value without printing it again
  }

cat() prints text and values; "\n" starts a new line.

> x = sample(1:20, size=5, replace=TRUE)
> sk(x)

Combining vectors in R

cbind(v1, v2, ..., vn) combines the vectors v1, v2, ..., vn as columns.

rbind(v1, v2, ..., vn) combines the vectors v1, v2, ..., vn as rows.

Example 1: Let x1 = (1,2,3), x2 = (4,5,6) and x3 = (7,8,9).

> x1=c(1,2,3)
> x2=c(4,5,6)
> x3=c(7,8,9)
> X=cbind(x1,x2,x3)
> X
     x1 x2 x3
[1,]  1  4  7
[2,]  2  5  8
[3,]  3  6  9
> Y=rbind(x1,x2,x3)
> Y
   [,1] [,2] [,3]
x1    1    2    3
x2    4    5    6
x3    7    8    9

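Creating a matrix

matrix(data = , nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

data is the vector of values, nrow is the number of rows, ncol is the number of columns, byrow = FALSE fills the matrix column by column, and dimnames gives the row and column names.

Example 2: Create the matrices

$$ A = \begin{pmatrix} -3 & 1 & 4 \\ -2 & 2 & 5 \\ -1 & 3 & 6 \end{pmatrix}, \qquad B = \begin{pmatrix} -3 & -2 & -1 \\ 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} $$

> a=c(-3,-2,-1,1,2,3,4,5,6)
> A=matrix(a, ncol = 3, dimnames = list(c("r1","r2","r3"), c("c1","c2","c3")))
> B=matrix(a, ncol = 3, byrow=TRUE)
> A
   c1 c2 c3
r1 -3  1  4
r2 -2  2  5
r3 -1  3  6
> B
     [,1] [,2] [,3]
[1,]   -3   -2   -1
[2,]    1    2    3
[3,]    4    5    6

x[i,j] gives the element in row i and column j of the matrix x.

> A[3,2]
[1] 3

dim() gives the dimensions of a matrix.

> dim(A)
[1] 3 3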

Matrix arithmetic in R

A*B multiplies element by element; A%*%B is matrix multiplication (Matrix multiplication). With the matrices A and B of Example 2:

> A*B
   c1 c2 c3
r1  9 -2 -4
r2 -2  4 15
r3 -4 15 36

> A%*%B
   [,1] [,2] [,3]
r1   26   28   30
r2   28   33   38
r3   30   38   46

> B-A
   c1 c2 c3
r1  0 -3 -5
r2  3  0 -2
r3  5  2  0

> B%*%A
      c1  c2  c3
[1,]  14 -10 -28
[2,] -10  14  32
[3,] -28  32  77

Transpose, determinant and inverse in R

t() gives the transpose, det() gives the determinant, and solve() gives the inverse of a matrix. With the matrices A and B of Example 2:

> t(A)
   r1 r2 r3
c1 -3 -2 -1
c2  1  2  3
c3  4  5  6

> det(A)
[1] 0
> det(B)
[1] 5.828671e-16

Both determinants are zero (up to rounding error), so A and B have no inverse. The inverse is therefore shown for D = A*A, the element-by-element squares of A:

> D=A*A
> solve(D)
           r1         r2         r3
c1  0.2410714 -0.3214286  0.1160714
c2  0.3541667 -0.9166667  0.4791667
c3 -0.0952381  0.2380952 -0.0952381

Row and column summaries in R

colMeans(), rowMeans(), colSums() and rowSums() give the means and sums of the columns and rows of a matrix. With the matrix A of Example 2:

> colMeans(A)
c1 c2 c3
-2  2  5

> rowMeans(A)
       r1        r2        r3
0.6666667 1.6666667 2.6666667

> colSums(A)
c1 c2 c3
-6  6 15

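Solving systems of linear equations in R

solve(A,b) solves the system b = Ax for x [x = A^{-1}b].

Example 3: Find x, y and z such that

x + 2y - z = 1
2x - y + z = 3
-x + 2y + 3z = 7

Set

$$ A = \begin{pmatrix} 1 & 2 & -1 \\ 2 & -1 & 1 \\ -1 & 2 & 3 \end{pmatrix}, \qquad x = \begin{pmatrix} x \\ y \\ z \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 3 \\ 7 \end{pmatrix} $$

> A=matrix(c(1,2,-1,2,-1,2,-1,1,3), ncol=3)
> b=matrix(c(1,3,7), ncol=1)
> x=solve(A,b)            ## or: x=solve(A)%*%b
> x
     [,1]
[1,]    1
[2,]    1
[3,]    2

Check the first equation, x + 2y - z = 1:

> 1*x[1,1]+2*x[2,1]-x[3,1]
[1] 1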
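The check above covers only one equation. A minimal sketch that checks all three at once is given below; the name residual is introduced here for illustration and is not part of the slides.

```r
# Coefficient matrix (filled column by column) and right-hand side of Example 3
A <- matrix(c(1, 2, -1,   2, -1, 2,   -1, 1, 3), ncol = 3)
b <- c(1, 3, 7)

x <- solve(A, b)          # solves A x = b without forming the inverse
residual <- A %*% x - b   # should be (numerically) zero in every row

print(x)
print(residual)
```

Calling solve(A, b) directly is usually preferable to solve(A) %*% b, since it avoids computing the full inverse.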

Eigenvalues and eigenvectors in R

eigen() gives the eigenvalues and eigenvectors of a matrix, from $(A - \lambda I)X = 0$. With the matrix A of Example 3:

> eigen(A)
$values
[1]  3.483612 -2.766435  2.282824

$vectors
            [,1]       [,2]       [,3]
[1,] -0.32804779 -0.5102976  0.8639730
[2,]  0.06387783  0.7812687  0.4830132
[3,]  0.94249895 -0.3594656 -0.1422988

Exercise: Find the 4 unknowns x1, x2, x3 and x4 such that

x1 + 3x3 = 2
3x2 - 3x4 = 3
-2x2 + 3x3 + 2x4 = 1
3x1 + 7x4 = -5

4.

Probability distributions in R

Each distribution has four functions, named with a one-letter prefix:

p for "probability", the cumulative distribution function (c.d.f.)
q for "quantile", the inverse c.d.f.
d for "density", the density function (p.f. or p.d.f.)
r for "random", a random variable having the specified distribution

Poisson distribution: dpois(x, lambda), ppois(q, lambda), qpois(p, lambda), rpois(n, lambda)

> set.seed(1259)
> x1=rpois(30,5); x1
> mean(x1); var(x1)
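The relation between the four prefixes can be sketched for the Poisson distribution with lambda = 5, the value used above. The names d2, p2 and q50 are introduced here only for illustration.

```r
lambda <- 5

d2  <- dpois(2, lambda)     # P(X = 2)
p2  <- ppois(2, lambda)     # P(X <= 2), the sum of dpois(0:2, lambda)
q50 <- qpois(0.5, lambda)   # smallest x with P(X <= x) >= 0.5

set.seed(1259)              # makes the random sample reproducible
x1 <- rpois(30, lambda)     # 30 random values from Poisson(5)

print(c(d2 = d2, p2 = p2, q50 = q50))
print(c(mean = mean(x1), var = var(x1)))
```

For a Poisson distribution the mean and the variance are both lambda, so the sample mean and variance printed at the end should both be in the neighbourhood of 5.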


Distribution                      Functions

Beta                              pbeta      qbeta      dbeta      rbeta
Binomial                          pbinom     qbinom     dbinom     rbinom
Cauchy                            pcauchy    qcauchy    dcauchy    rcauchy
Chi-Square                        pchisq     qchisq     dchisq     rchisq
Exponential                       pexp       qexp       dexp       rexp
F                                 pf         qf         df         rf
Gamma                             pgamma     qgamma     dgamma     rgamma
Geometric                         pgeom      qgeom      dgeom      rgeom
Hypergeometric                    phyper     qhyper     dhyper     rhyper
Logistic                          plogis     qlogis     dlogis     rlogis
Log Normal                        plnorm     qlnorm     dlnorm     rlnorm
Negative Binomial                 pnbinom    qnbinom    dnbinom    rnbinom
Normal                            pnorm      qnorm      dnorm      rnorm
Poisson                           ppois      qpois      dpois      rpois
Student t                         pt         qt         dt         rt
Studentized Range                 ptukey     qtukey     (R provides no dtukey or rtukey)
Uniform                           punif      qunif      dunif      runif
Weibull                           pweibull   qweibull   dweibull   rweibull
Wilcoxon Rank Sum Statistic       pwilcox    qwilcox    dwilcox    rwilcox
Wilcoxon Signed Rank Statistic    psignrank  qsignrank  dsignrank  rsignrank

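Statistical functions (Statistical Functions) in R

sum(x)         sum of x
mean(x)        mean of x
median(x)      median of x
var(x)         variance of x
sd(x)          standard deviation of x
min(x)         minimum of x
max(x)         maximum of x
range(x)       min and max of x
quantile(x)    quantiles of x
summary(x)     summary statistics of x
factorial(x)   x!
choose(n, r)   $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$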

Graphs in R

hist(x)                     histogram of x.
pie(data)                   pie chart of data.
stem(x)                     stem and leaf plot of x.
plot(x,y)                   scatterplot of y against x.
barplot(x)                  barplot of x (where x contains the heights of the bars).
boxplot(y ~ x)              boxplots of y for each level of x.
boxplot(list(x1,x2,...))    boxplots of x1, x2, etc.
qqnorm(x)                   normal probability plot of x.
qqline(x)                   adds to the normal probability plot a line passing through 1Q and 3Q.

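Sampling and statistical tests in R

sample(x, size=n, replace=FALSE)    sampling without replacement.
sample(x, size=n, replace=TRUE)     sampling with replacement.
t.test(x, mu= )                     t test for 1 sample.
t.test(x1, x2)                      t test for 2 samples.
aov(y ~ x, data=data.df)            analysis of variance of y.
chisq.test(x, p= )                  chi-squared goodness-of-fit test.
chisq.test(x)                       chi-squared test of independence.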

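Example 4: A sample of 20 observations, 10 from males and 10 from females:

male      300 600 199 500 350 400 650 500 300 450
female    500 300 200 250 400 550 360 499 250 650

At the 0.05 significance level, test 1) whether the mean of the male group is 420, and 2) whether the means of the 2 groups differ.

> male=c(300,600,199,500,350,400,650,500,300,450)
> female=c(500,300,200,250,400,550,360,499,250,650)
> t.test(male,mu=420)                    ## (question 1)
> t.test(male,female)                    ## (question 2, independent samples)
> t.test(male,female, paired = TRUE)     ## (question 2, paired samples)
> qqnorm(male); shapiro.test(male)       ## (check normality)
> ex4=stack(list(x1=male,x2=female))     ## (stack the 2 samples into one data frame)
> bartlett.test(values~ind, data=ex4)    ## (test equality of variances)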
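The decision step of Example 4 can be sketched by extracting the p-value from the test object instead of reading it off the printout. The names one, two_eq and reject_one are introduced here for illustration; var.equal = TRUE is an assumption that would normally be justified by the variance test first.

```r
male   <- c(300, 600, 199, 500, 350, 400, 650, 500, 300, 450)
female <- c(500, 300, 200, 250, 400, 550, 360, 499, 250, 650)

one    <- t.test(male, mu = 420)                      # one-sample t test
two_eq <- t.test(male, female, var.equal = TRUE)      # pooled two-sample t test

alpha <- 0.05
reject_one <- one$p.value < alpha   # TRUE would mean: reject H0 that the mean is 420

print(one$estimate)    # sample mean of the male group
print(one$p.value)
print(two_eq$p.value)
```

Without var.equal = TRUE, t.test() uses the Welch test, which does not assume equal variances.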

Checking the assumptions of the t test

1) Normality: $X_1, X_2, \ldots, X_n \sim NID(\mu, \sigma^2)$

> qqnorm(male)
> shapiro.test(male)

2) Equal variances of the 2 groups

> bartlett.test(y~G)

Menu-driven use of R (R Commander)

Install R Commander from the menu Packages, Install packages, choosing the mirror Thailand and the package Rcmdr, or with the command

> install.packages("Rcmdr",dependencies=TRUE)

Start R Commander with

> library(Rcmdr)

1) Completely randomized design (CRD)

Example 5: 8 observations from each of 3 groups:

Thai     Japan    China
14.20    12.85    14.15
14.30    13.65    13.90
15.00    13.40    13.65
14.60    14.20    13.60
14.55    12.75    13.20
15.15    13.35    13.20
14.60    12.50    14.05
14.55    12.80    13.80

At the 0.01 significance level, test whether the means of the 3 groups are equal.

1) Completely randomized design (CRD) in R

aov(y ~ x, data= ) fits the analysis of variance; TukeyHSD() gives the multiple comparisons and plot(TukeyHSD()) plots them.

> Thai=c(14.20,14.30,15.00,14.60,14.55,15.15,14.60,14.55)
> Japan=c(12.85,13.65,13.40,14.20,12.75,13.35,12.50,12.80)
> China=c(14.15,13.90,13.65,13.60,13.20,13.20,14.05,13.80)
> ex5=stack(list(x1=Thai,x2=Japan,x3=China))
> summary(aov(values~ind,data=ex5))
> DTk=TukeyHSD(aov(values~ind,data=ex5)); DTk
> plot(DTk)
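The ANOVA table of Example 5 can also be inspected programmatically. The sketch below keeps the fitted model in one object and pulls out the F-test p-value; the names fit, tab, p_value and group_means are introduced here, and the groups are labelled by country rather than x1, x2, x3.

```r
Thai  <- c(14.20, 14.30, 15.00, 14.60, 14.55, 15.15, 14.60, 14.55)
Japan <- c(12.85, 13.65, 13.40, 14.20, 12.75, 13.35, 12.50, 12.80)
China <- c(14.15, 13.90, 13.65, 13.60, 13.20, 13.20, 14.05, 13.80)

# stack() gives a data frame with columns 'values' and 'ind' (the group)
ex5 <- stack(list(Thai = Thai, Japan = Japan, China = China))

fit <- aov(values ~ ind, data = ex5)
tab <- summary(fit)[[1]]             # ANOVA table as a data frame
p_value <- tab[["Pr(>F)"]][1]        # p-value of the F test for 'ind'
group_means <- tapply(ex5$values, ex5$ind, mean)

print(tab)
print(group_means)
print(p_value < 0.01)                # compare with the 0.01 level of the example
print(TukeyHSD(fit))                 # pairwise comparisons of the 3 groups
```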


Multiple comparisons (Multiple Comparisons)

When H0 for the k means is rejected, the means are compared 2 at a time. Methods:

Fisher's Least Significant Difference : LSD
Duncan's New Multiple Range Test : Duncan
Student-Newman-Keuls : SNK
Tukey's Multiple Comparison test : Tukey
Bonferroni
Dunnett (comparison with a control group, Control)

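Checking the assumptions (Assumption)

Method 1: from the observations Yij

Normality: shapiro.test, Q-Q plot
Equal variances: Bartlett's Test, Levene's Test

Method 2: from the residuals (eij)

Normality of (eij): shapiro.test, Q-Q plot
Equal variances of (eij): Bartlett's Test, Levene's Test
Independence of (eij): plot (eij) against the fitted values; the plot should show no pattern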

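2) Randomized complete block design (Randomized complete block design, RBD)

The treatment and block effects are assumed to be additive (Additive), with no interaction (No Interaction).

Example 6: Observations for 3 temperatures (H, M, L) and 4 fruits:

Temp    Fruit 1    Fruit 2    Fruit 3    Fruit 4
H       12         9          13         15
M       9          7          8          10
L       8          6          8          9

Test at the 0.05 significance level.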

> y=c(12,9,13,15,9,7,8,10,8,6,8,9)
> Temp=c("H","H","H","H","M","M","M","M","L","L","L","L")
> Fruit=c("1","2","3","4","1","2","3","4","1","2","3","4")
> ex6=data.frame(y=y,z=Temp,x=Fruit)
> anova(lm(y~z+x,data=ex6))
> boxplot(y~z,data=ex6)
> boxplot(y~x,data=ex6)
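A slightly more explicit variant of the same analysis is sketched below, assuming the same data. The two classification variables are built as factors with rep(), which avoids typing each label; tab is a name introduced here for illustration.

```r
y <- c(12, 9, 13, 15, 9, 7, 8, 10, 8, 6, 8, 9)

# 3 temperatures x 4 fruits, in the same order as y
Temp  <- factor(rep(c("H", "M", "L"), each = 4))
Fruit <- factor(rep(1:4, times = 3))
ex6 <- data.frame(y = y, z = Temp, x = Fruit)

# Additive model: temperature effect + fruit effect, no interaction term
tab <- anova(lm(y ~ z + x, data = ex6))
print(tab)
```

With one observation per cell, the 6 residual degrees of freedom are what is left after the 2 for temperature and 3 for fruit, which is why an interaction term cannot be fitted here.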


Example 7: Myers and Montgomery (1995) describe a study in which the transistor gain in an integrated circuit device between emitter and collector (hFE) is reported along with two variables that can be controlled at the deposition process, emitter drive-in time (x1, in minutes) and emitter dose (x2, in ions x 10^14). Fourteen observations were obtained following deposition, and the resulting data are shown in the table below.

observation    x1 (drive-in time, minutes)    x2 (dose, ions x 10^14)    y (gain or hFE)
1              195                            4                          1004
2              255                            4                          1636
3              195                            4.6                        852
4              255                            4.6                        1506
5              225                            4.2                        1272
6              225                            4.1                        1270
7              225                            4.6                        1269
8              195                            4.3                        903
9              255                            4.3                        1555
10             225                            4                          1260
11             225                            4.7                        1146
12             225                            4.3                        1276
13             225                            4.72                       1225
14             230                            4.3                        1321

> y=c(1004,1636,852,1506,1272,1270,1269,903,1555,1260,1146,1276,1225,1321)
> x1=c(195,255,195,255,225,225,225,195,255,225,225,225,225,230)
> x2=c(4,4,4.6,4.6,4.2,4.1,4.6,4.3,4.3,4,4.7,4.3,4.72,4.3)
> ymodel=glm(y ~ x1+x2)
> summary(ymodel)

The residuals are obtained with

> residuals(ymodel)
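Since glm() uses the Gaussian family with identity link by default, the model above is an ordinary multiple regression. A minimal sketch comparing it with lm() on the same data follows; fit_glm, fit_lm, same_coef and res are names introduced here.

```r
y  <- c(1004, 1636, 852, 1506, 1272, 1270, 1269, 903, 1555, 1260, 1146, 1276, 1225, 1321)
x1 <- c(195, 255, 195, 255, 225, 225, 225, 195, 255, 225, 225, 225, 225, 230)
x2 <- c(4, 4, 4.6, 4.6, 4.2, 4.1, 4.6, 4.3, 4.3, 4, 4.7, 4.3, 4.72, 4.3)

fit_glm <- glm(y ~ x1 + x2)   # default family: gaussian, identity link
fit_lm  <- lm(y ~ x1 + x2)    # ordinary least squares

same_coef <- all.equal(coef(fit_glm), coef(fit_lm))
res <- residuals(fit_lm)      # observed minus fitted values

print(same_coef)
print(coef(fit_lm))
```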

Generalized Linear Models (GLMs)

A GLM consists of three components:

1. A random component for the response, y ~ exponential family distributions,

$$ f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}, $$

where $\theta$ represents the location, called the canonical parameter, and $\phi$ represents the scale, called the dispersion parameter. Then $E[Y] = b'(\theta) = \mu$ and $Var[Y] = b''(\theta)\,a(\phi)$.

2. The linear predictor, $\eta = X\beta$.

3. Link function, $g(\mu) = \eta$.

Family      Canonical Link                                      Mean Function
Gaussian    Identity; $\eta = \mu$                              $\mu = X\beta$
Binomial    Logit; $\eta = \log_e\frac{\mu}{1-\mu}$             $\mu = \frac{\exp(X\beta)}{1+\exp(X\beta)}$
Poisson     Log; $\eta = \log_e \mu$                            $\mu = \exp(X\beta)$
Gamma       Inverse; $\eta = \frac{1}{\mu}$                     $\mu = (X\beta)^{-1}$
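The link and mean functions in the table can be checked against the family objects that R's glm() uses. This is a small sketch; it relies on the linkfun and linkinv components of a family object, and the variable names are introduced here.

```r
mu <- c(0.2, 0.5, 0.8)

# Binomial: logit link and its inverse (the mean function)
eta  <- binomial()$linkfun(mu)    # log(mu / (1 - mu))
back <- binomial()$linkinv(eta)   # exp(eta) / (1 + exp(eta)), recovers mu

# Poisson: log link; Gamma: inverse link; Gaussian: identity link
log_link <- poisson()$linkfun(c(1, 2, 3))
inv_link <- Gamma()$linkfun(c(1, 2, 4))
id_name  <- gaussian()$link

print(eta)
print(back)
print(inv_link)
print(id_name)
```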

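GLM estimation

Consider the log-likelihood,

$$ \ell(\beta, \phi; y) = \log_e L(\beta, \phi; y) = \sum_{i=1}^{n}\left[ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. $$

Then,

$$ \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \frac{(y_i - \mu_i)\,x_{ij}}{a(\phi)\,V(\mu_i)}\,\frac{\partial \mu_i}{\partial \eta_i}, \qquad V(\mu_i) = b''(\theta_i). $$

To find estimates, use the Newton-Raphson Method,

$$ \hat\beta^{(r)} = \hat\beta^{(r-1)} - \left[\ell''\big(\hat\beta^{(r-1)}\big)\right]^{-1} \ell'\big(\hat\beta^{(r-1)}\big), $$

for r = 1, 2, 3, ..., with a reasonable starting value $\hat\beta^{(0)}$, and

$$ \ell' = \frac{\partial \ell}{\partial \beta}, \qquad \ell'' = \frac{\partial^2 \ell}{\partial \beta\,\partial \beta^{t}}. $$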
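The Newton-Raphson iteration can be sketched for one concrete case, a Poisson model with log link, where the first derivative is X'(y - mu) and the second derivative is -X' diag(mu) X. The data below are simulated purely for illustration (the true coefficients 0.5 and 1.2 are arbitrary assumptions), and the result is compared with glm().

```r
set.seed(1)
n <- 200
x <- runif(n)
X <- cbind(1, x)                        # design matrix with intercept
y <- rpois(n, exp(0.5 + 1.2 * x))       # simulated Poisson responses

beta <- c(log(mean(y)), 0)              # a reasonable starting value
for (r in 1:25) {
  mu    <- as.vector(exp(X %*% beta))   # mean under the log link
  score <- t(X) %*% (y - mu)            # first derivative of the log-likelihood
  hess  <- -t(X) %*% (X * mu)           # second derivative (X * mu scales row i by mu[i])
  step  <- solve(hess, score)
  beta  <- as.vector(beta - step)       # Newton-Raphson update
  if (max(abs(step)) < 1e-10) break     # stop when the update is negligible
}

fit <- glm(y ~ x, family = poisson())
print(beta)
print(coef(fit))
```

For a canonical link the observed and expected second derivatives coincide, so this iteration gives the same estimates as the iteratively reweighted least squares used by glm().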

Goodness-of-fit Test

Deviance is a quality of fit statistic for a model that is often used for statistical hypothesis testing. [Wikipedia, 2010]

The deviance for a model M0 is defined as

$$ D(y) = -2\left[ \log p(y \mid \hat\theta_0) - \log p(y \mid \hat\theta_s) \right]. $$

Here $\hat\theta_0$ denotes the fitted values of the parameters in the model M0, while $\hat\theta_s$ denotes the fitted parameters for the "full model": both sets of fitted values are implicitly functions of the observations y.

In GLMs, the deviance D is given by

$$ D = 2\sum_{i=1}^{n}\left[ y_i\{\theta(y_i) - \theta(\hat\mu_i)\} - b\{\theta(y_i)\} + b\{\theta(\hat\mu_i)\} \right]. $$
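For the Poisson family, theta(mu) = log(mu) and b(theta) = exp(theta), so the GLM deviance formula reduces to D = 2 * sum[ y*log(y/mu) - (y - mu) ], with y*log(y) taken as 0 when y = 0. The sketch below checks this against deviance() on simulated data; the data and names are assumptions made for the illustration.

```r
set.seed(42)
x <- seq(0, 1, length.out = 50)
y <- rpois(50, exp(1 + x))          # simulated counts, for illustration only

fit <- glm(y ~ x, family = poisson())
mu  <- fitted(fit)                  # fitted means

# y * log(y / mu), defined as 0 where y == 0
term <- ifelse(y == 0, 0, y * log(y / mu))
D <- 2 * sum(term - (y - mu))

print(D)
print(deviance(fit))                # should agree with D
```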

Logistic Regression

Exercise 4.12 (Myers et al., 2002): The experimenters recorded the number of rodents in the litter and the number of dead embryos, with 0.1 of boric acid used.

Dead    Litter Size        Dead    Litter Size
0       6                  0       10
1       14                 1       12
1       11                 0       12
1       11                 0       14
1       14                 0       10
1       12                 1       12
0       10                 0       11
1       11                 2       10
0       11                 0       10
0       13                 2       12
0       10                 3       13
2       14                 2       10
0       12                 2       12
0       13                 1       12
3       13                 1       12

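> ymodel=glm(cbind(Dead, LS-Dead) ~ LS, family=binomial("logit"))   ## Dead out of LS (litter size); a count response is given as cbind(dead, alive)
> summary(ymodel)
> pchisq(deviance(ymodel), df.residual(ymodel), lower.tail=FALSE)   ## deviance goodness-of-fit test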
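The structure of such a fit can be sketched on a small data set. The eight (dead, size) pairs below are hypothetical values chosen only to make the example run; they are not the exercise data, and the names dead, size, fit and gof_p are introduced here.

```r
# Hypothetical data: number dead out of litter size (NOT the exercise data)
dead <- c(0, 1, 1, 2, 0, 3, 1, 2)
size <- c(6, 14, 11, 12, 10, 13, 12, 10)

# Grouped binomial response: cbind(successes, failures)
fit <- glm(cbind(dead, size - dead) ~ size, family = binomial("logit"))

# Deviance goodness-of-fit test: a large p-value gives no evidence of lack of fit
gof_p <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

print(coef(fit))
print(fitted(fit))   # fitted probabilities of death per litter
print(gof_p)
```

The chi-squared approximation for the deviance is rough when the litters are small, so the p-value is best read as an informal check.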

Poisson Regression

Exercise 4.10 (Myers et al., 2002): A student conducted a project looking at the impact of popping temperature, amount of oil, and the popping time on the number of inedible kernels of popcorn.

Temperature    7   5   7   7   6   6   5   6   5   6   5   7   6   6   6
Oil            4   3   3   2   4   3   3   2   4   2   2   3   3   3   4
Time           90  105 105 90  105 90  75  105 90  75  90  75  90  90  75
y              24  28  40  42  11  16  126 34  32  32  34  17  30  17  50

> ypoi=glm(y~ Tem+oil+time, family=poisson()); anova(ypoi)
> summary(ypoi)
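A self-contained version of this fit is sketched below, with the table entered as vectors under the names used in the command (Tem, oil, time). The ratio of residual deviance to residual degrees of freedom is added here as a rough overdispersion check; it is not part of the original exercise.

```r
Tem  <- c(7, 5, 7, 7, 6, 6, 5, 6, 5, 6, 5, 7, 6, 6, 6)
oil  <- c(4, 3, 3, 2, 4, 3, 3, 2, 4, 2, 2, 3, 3, 3, 4)
time <- c(90, 105, 105, 90, 105, 90, 75, 105, 90, 75, 90, 75, 90, 90, 75)
y    <- c(24, 28, 40, 42, 11, 16, 126, 34, 32, 32, 34, 17, 30, 17, 50)

ypoi <- glm(y ~ Tem + oil + time, family = poisson())

# Values well above 1 suggest more variation than the Poisson model allows
dispersion_ratio <- deviance(ypoi) / df.residual(ypoi)

print(summary(ypoi))
print(dispersion_ratio)
```

A ratio far above 1 is one common reason to consider a negative binomial model instead.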

Negative Binomial Regression

> library(MASS)
> ynb=glm.nb(y ~ Tem+oil+time); anova(ynb)
> summary(ynb)

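Reading and writing data files

read.table(file, header = TRUE, sep = ",", dec = ".", row.names=1)

header=TRUE: the first row holds the variable names; sep=",": the values are separated by commas; row.names=1: the first column holds the row names.

To write the data set dataa to a file:

write.table(dataa, file="D:/data3.csv", sep=",", col.names=NA)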
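A round trip through a CSV file can be sketched as follows. The small data frame and the temporary file path are made up for the illustration; the arguments are the ones described above.

```r
# Small example data set with row names (values are illustrative only)
dataa <- data.frame(y = c(12, 9, 13),
                    z = c("H", "M", "L"),
                    row.names = c("a", "b", "c"))

path <- tempfile(fileext = ".csv")   # temporary file instead of D:/data3.csv

# col.names = NA writes an empty header cell above the row names
write.table(dataa, file = path, sep = ",", col.names = NA)

# row.names = 1 reads the first column back as row names
datab <- read.table(path, header = TRUE, sep = ",", dec = ".", row.names = 1)
print(datab)
```

Using col.names = NA when writing together with row.names = 1 when reading keeps the header aligned with the data columns.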

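Note: Excel files can be read with the package RODBC, and SPSS files with read.spss() from the package foreign.

> library(RODBC)
> dataa=odbcConnectExcel("data1.xls")     ## opens a connection to the Excel file; sqlFetch() then reads a sheet
> library(foreign)
> datab=read.spss("data1.sav")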

Everitt, B. S. and Hothorn, T. 2006. A Handbook of Statistical Analyses Using R. New York: Chapman & Hall/CRC.

Fox, J. 2005. The R Commander: A Basic-Statistics Graphical User Interface to R. Journal of Statistical Software, Vol. 14.

Myers, R.H., Montgomery, D.C. and Vining, G.G. 2002. Generalized Linear Models with Applications in Engineering and the Sciences. New York: John Wiley & Sons.

R Development Core Team. 2009. An Introduction to R. Available from URL: http://cran.r-project.org/doc/manuals/R-intro.pdf [accessed September 28, 2009].

Wikipedia. R (programming language). Retrieved June 23, 2011 from http://en.wikipedia.org/wiki/R_programming_language
