Peter Fox. Data Analytics – ITWS-4963/ITWS-6965. Week 14b, May 2, 2014. PCA and return to Big Data infrastructure… and assignment time


Visual approaches for PCA/DR

• Scree plot - A plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In the context of factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors: a sharp drop in the plot signals that subsequent factors are ignorable.
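The definition above can be checked by hand. A minimal sketch, assuming the built-in USArrests data: it takes the eigenvalues of the correlation matrix, plots them in descending order, and compares them with the squared standard deviations that prcomp() reports on scaled data. The variable names (ev, pc) are illustrative only.

```r
# Eigenvalues of the correlation matrix of the built-in USArrests data.
# For a symmetric matrix, eigen() returns them in decreasing order.
ev <- eigen(cor(USArrests))$values

# Scree plot drawn by hand: eigenvalue against component number
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot (USArrests, correlation matrix)")

# The same quantities are the squared standard deviations from prcomp()
# when the variables are scaled to unit variance
pc <- prcomp(USArrests, scale. = TRUE)
print(all.equal(ev, pc$sdev^2))
```

screeplot(pc) draws the same information directly from the fitted object.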


require(graphics)

## The variances of the variables in the USArrests data vary by
## orders of magnitude, so scaling is appropriate.
prcomp(USArrests) # inappropriate
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))

> prcomp(USArrests) # inappropriate
Standard deviations:
[1] 83.732400 14.212402  6.489426  2.482790

Rotation:
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502

> prcomp(USArrests, scale = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation:
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
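Why the unscaled fit is "inappropriate" shows up in the raw variances. A small sketch, assuming only the built-in USArrests data (v, share and v_scaled are illustrative names): one variable carries almost all of the total variance, so the unscaled PC1 is essentially that variable.

```r
# Per-variable variances of the raw data
v <- sapply(USArrests, var)
print(round(v, 2))

# Share of the total variance carried by each variable before scaling
share <- v / sum(v)
print(round(share, 3))

# After scaling, every variable has variance 1, so none dominates
v_scaled <- apply(scale(USArrests), 2, var)
print(v_scaled)
```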


screeplot


> prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
Standard deviations:
[1] 1.5357670 0.6767949 0.4282154

Rotation:
               PC1        PC2        PC3
Murder  -0.5826006  0.5339532 -0.6127565
Assault -0.6079818  0.2140236  0.7645600
Rape    -0.5393836 -0.8179779 -0.1999436

> summary(prcomp(USArrests, scale = TRUE))
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
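The "Importance of components" rows can be rebuilt from the standard deviations alone. A minimal sketch: the proportion of variance is each squared standard deviation divided by their sum. The 85% cutoff used to pick a number of components is an illustrative assumption, not a value from the slides.

```r
pc <- prcomp(USArrests, scale. = TRUE)

eig     <- pc$sdev^2          # variance along each principal component
prop    <- eig / sum(eig)     # "Proportion of Variance"
cumprop <- cumsum(prop)       # "Cumulative Proportion"

print(round(rbind(sdev = pc$sdev, prop = prop, cumprop = cumprop), 4))

# Smallest number of components reaching an assumed 85% of the variance
k <- which(cumprop >= 0.85)[1]
print(k)
```

With the cumulative proportions shown in the summary (0.6201, 0.8675, ...), this cutoff selects two components.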

biplot


Line plots (lab 6): prcomp (top) and MetaPCA (bottom)


Looking for convergence as iteration increases

MetaPCA methods: Eigen, Angle, RobustAngle, SparseAngle

http://cran.r-project.org/web/packages/MetaPCA/MetaPCA.pdf



prostate data (lab 7) 2D plot.


Lab 9

library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) +
         log(Hc) + log(Ferr), data = ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8, ncol+3)
summary(s2 <- update(s1, nslices = 10, method = "save"))
# Refit, using phdres; output is similar for phdy, but tests are not justifiable.
summary(s3 <- update(s1, method = "phdres"))
# fit using ire:
summary(s4 <- update(s1, method = "ire"))
# fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)

> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)

Estimated Basis Vectors for Central Subspace:
                  Dir1          Dir2        Dir3         Dir4
log(SSF)   0.150963358 -0.0501785457  0.10898336 -0.002210206
log(Wt)   -0.916480522 -0.1942298625 -0.20123696 -0.089722026
log(Hg)   -0.131538894  0.6854750758  0.71997546 -0.663097774
log(Ht)   -0.093358860 -0.0433408964  0.46445398  0.290838658
log(WCC)   0.004467838  0.0001833808  0.04497590  0.071904557
log(RCC)  -0.188973540  0.3475652934  0.29496908  0.037056363
log(Hc)    0.274758965 -0.6058301419 -0.34196615  0.678877114
log(Ferr) -0.005631238  0.0130588502 -0.08702709  0.015547214

Eigenvalues:
[1] 0.95766163 0.24504161 0.10707594 0.09041305

> summary(s1 <- update(s0,slice.function=dr.slices.arc))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)

Method:
sir with 11 slices, n = 202.

Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15

Estimated Basis Vectors for Central Subspace:
               Dir1       Dir2     Dir3      Dir4
log(SSF)   0.143177 -0.0476079 -0.02815  0.003785
log(Wt)   -0.879504 -0.1425841  0.23303 -0.094970
log(Hg)   -0.195963  0.6318503  0.24483 -0.509424
log(Ht)   -0.058923 -0.1100757 -0.87893  0.217803
log(WCC)  -0.007276 -0.0029772 -0.05309  0.043056
log(RCC)  -0.167736  0.3924936 -0.19711 -0.213689
log(Hc)    0.368652 -0.6418658 -0.26373  0.796849
log(Ferr) -0.002697  0.0002593  0.03492  0.039116

              Dir1   Dir2    Dir3    Dir4
Eigenvalues 0.9572 0.2275 0.09368 0.07319
R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864

Large-sample Marginal Dimension Tests:
              Stat df p.value
0D vs >= 1D 284.78 80 0.00000
1D vs >= 2D  91.43 63 0.01113
2D vs >= 3D  45.48 48 0.57690
3D vs >= 4D  26.55 35 0.84694
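One way to read the marginal dimension tests is to walk down the rows and stop at the first hypothesis that is not rejected. A sketch using the numbers printed above for s1; the 0.05 level is an assumption, and the chi-squared check is included only because it appears to reproduce the printed p-values for this sir fit.

```r
# Marginal dimension tests copied from the summary(s1) output
tests <- data.frame(
  hypothesis = c("0D vs >= 1D", "1D vs >= 2D", "2D vs >= 3D", "3D vs >= 4D"),
  stat       = c(284.78, 91.43, 45.48, 26.55),
  df         = c(80, 63, 48, 35),
  p.value    = c(0.00000, 0.01113, 0.57690, 0.84694)
)

alpha <- 0.05  # assumed significance level, not from the slides

# Row i tests "dimension = i - 1" against "dimension >= i".
# Keep the smallest dimension whose null hypothesis is not rejected.
d       <- seq_len(nrow(tests)) - 1
est_dim <- d[which(tests$p.value > alpha)[1]]
print(est_dim)

# Upper-tail chi-squared probabilities for the same statistics (assumed
# reference distribution), for comparison with the printed p-values
p_check <- pchisq(tests$stat, tests$df, lower.tail = FALSE)
print(round(p_check, 5))
```

At this level the tests point to a two-dimensional structure, which matches the sharp drop in the eigenvalues after Dir2.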

> summary(s2<-update(s1,nslices=10,method="save"))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc,
    nslices = 10, method = "save")

Method:
save with 10 slices, n = 202.

Slice Sizes:
21 21 20 20 20 25 24 22 20 9

Estimated Basis Vectors for Central Subspace:
               Dir1     Dir2     Dir3     Dir4
log(SSF)   0.127709 -0.00907  0.01018 -0.06144
log(Wt)   -0.905004 -0.07107 -0.15734  0.25774
log(Hg)   -0.056187  0.50674 -0.34064 -0.38087
log(Ht)    0.399868  0.36613  0.68439 -0.54216
log(WCC)   0.032608  0.02733  0.02277  0.03474
log(RCC)  -0.008463  0.15137 -0.24136 -0.47219
log(Hc)   -0.021630 -0.76164  0.57591  0.51526
log(Ferr)  0.002116 -0.01670  0.01631 -0.03360

              Dir1   Dir2   Dir3   Dir4
Eigenvalues 0.9389 0.6611 0.5129 0.4653
R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989

Large-sample Marginal Dimension Tests:
             Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D 378.3     324      0.02012       0.1071
1D vs >= 2D 279.6     252      0.11214       0.3116
2D vs >= 3D 179.9     189      0.67101       0.5160
3D vs >= 4D 134.3     135      0.50176       0.2786

S0 v. S2


S3 and S4


Infrastructure tools

• In RStudio – install the rmongodb package
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
• MongoDB - http://www.mongodb.org/
– http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - get familiar with the choices
• General idea:
– These are “backend” stores that can do various “things”




Back-ends

• Files (e.g. csv), application files (e.g. Rdata, xls, mat, …) – essentially for reading/input

• Databases – for reading and writing

– Also – for advanced operations inside the database!!

– Operations range from simple summaries to array operations and analytics functions

– Overhead is opening/ maintaining connections/ closing – easy on your laptop – harder when they are remote (network, authentication, etc.)

– Overhead is also around their internal storage formats (e.g. BSON for MongoDB)
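The file back-ends can be tried without any server. A minimal sketch, assuming only base R and temporary files (csv_path, rds_path and the other names are illustrative): it round-trips a data frame through a text format and a native R format, then reads a few lines through an explicit connection, which follows the same open/use/close pattern as a database connection, minus the network and authentication.

```r
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Write the same data frame in a text format and in R's own binary format
write.csv(USArrests, csv_path)      # row names become the first column
saveRDS(USArrests, rds_path)

# Read both back; the text format has to re-infer column types
from_csv <- read.csv(csv_path, row.names = 1)
from_rds <- readRDS(rds_path)
print(identical(from_rds, USArrests))
print(dim(from_csv))

# Explicit connection: open, use, close
con <- file(csv_path, open = "r")
first_lines <- readLines(con, n = 3)
close(con)
print(first_lines)

unlink(c(csv_path, rds_path))       # clean up the temporary files
```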


Functions versus languages

• Libraries for R mean that you code in R and call functions and the result returns into R
– Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters)
• Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/database
– Cost is learning this new language

Example layout - Hadoop


Relating Open-Source and Commercial


Even further

• http://projects.apache.org/indexes/category.html#database
– Hadoop (MapReduce) – distributed execution (via disk when data is large)
– Pig (http://wiki.apache.org/pig/RunPig)
– HIVE (http://hive.apache.org/releases.html)
– Spark – in memory (RSpark still not easy to find/install) http://gigaom.com/2014/02/27/as-mapreduce-fades-apache-spark-is-now-a-top-level-project/
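The MapReduce idea behind Hadoop can be mimicked in a few lines of base R. This is a toy, in-memory sketch on made-up input (docs is hypothetical data), not Hadoop itself: there the map and reduce phases run across machines and intermediate results go via disk.

```r
# Hypothetical input: three small "documents"
docs <- c("pig hive spark", "spark hadoop", "hadoop pig spark")

# Map: each document emits one (word, 1) pair per word
mapped <- unlist(lapply(docs, function(d) {
  words <- strsplit(d, " ", fixed = TRUE)[[1]]
  setNames(rep(1L, length(words)), words)
}))

# Shuffle: group the emitted values by key (the word)
grouped <- split(unname(mapped), names(mapped))

# Reduce: sum the values for each key
counts <- vapply(grouped, sum, integer(1))
print(counts)
```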


~ Objectives

• Provide an application, i.e. predictive/prescriptive model view of data analytics by focusing on the “front-end” (RStudio)
• Over a variety of data…
• Provide enough of a view of the back-end to know how you will need to interface to them (both open-source and commercial)


Layers across the Analytics Stack


Time for assignments
