Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 14b, May 2, 2014
PCA and return to Big Data infrastructure… and assignment time.
Visual approaches for PCA/DR
• Screeplot - A plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In the context of factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors: a sharp drop in the plot signals that subsequent factors are ignorable.
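The scree plot definition above can be sketched in a few lines of R. This is a minimal illustration using the built-in USArrests data; the variable names `ev` and `pc` are illustrative only, and the hand-drawn plot is just one way to show the eigenvalues.

```r
# Eigenvalues of the correlation matrix; eigen() returns them in decreasing order
ev <- eigen(cor(USArrests))$values

# Scree plot drawn by hand: eigenvalue against component number
plot(seq_along(ev), ev, type = "b",
     xlab = "Component", ylab = "Eigenvalue", main = "Scree plot (by hand)")

# The same information from a prcomp object: screeplot() plots sdev^2,
# which equals the correlation-matrix eigenvalues when the data are scaled
pc <- prcomp(USArrests, scale. = TRUE)
screeplot(pc, type = "lines", main = "screeplot() on scaled prcomp")
```

A sharp elbow in either plot is the visual cue for how many components to keep.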
require(graphics)
## the variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
prcomp(USArrests) # inappropriate
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))
> prcomp(USArrests) # inappropriate
Standard deviations:
[1] 83.732400 14.212402  6.489426  2.482790
Rotation:
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502
> prcomp(USArrests, scale = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494
Rotation:
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
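To see why the unscaled call is marked inappropriate, it may help to compare the raw column variances with the first-component loadings. The sketch below uses only base R; the names `v`, `raw`, and `scaled` are illustrative.

```r
# Per-variable variances: these differ by orders of magnitude
v <- sapply(USArrests, var)
print(v)
print(max(v) / min(v))   # ratio of largest to smallest variance

raw    <- prcomp(USArrests)                  # covariance-based PCA
scaled <- prcomp(USArrests, scale. = TRUE)   # correlation-based PCA

# Without scaling, PC1 is almost entirely the highest-variance variable;
# with scaling, the loadings are spread across the variables
print(round(abs(raw$rotation[, "PC1"]), 3))
print(round(abs(scaled$rotation[, "PC1"]), 3))
```

Absolute values are used because the sign of a principal component is arbitrary.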
screeplot
> prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
Standard deviations:
[1] 1.5357670 0.6767949 0.4282154
Rotation:
               PC1        PC2        PC3
Murder  -0.5826006  0.5339532 -0.6127565
Assault -0.6079818  0.2140236  0.7645600
Rape    -0.5393836 -0.8179779 -0.1999436
> summary(prcomp(USArrests, scale = TRUE))
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
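The "Importance of components" rows can be reproduced directly from the standard deviations, which shows where the proportions come from. This is a small sketch; the 85% cutoff used to pick `k` is an arbitrary assumption for illustration, not a rule from the lecture.

```r
pc <- prcomp(USArrests, scale. = TRUE)

# Variance of each component is the squared standard deviation
comp_var <- pc$sdev^2

# Proportion of variance and its running total
prop <- comp_var / sum(comp_var)
cum  <- cumsum(prop)
print(round(rbind(prop, cum), 5))

# Smallest number of components whose cumulative proportion reaches
# an assumed threshold of 85%
k <- which(cum >= 0.85)[1]
print(k)
```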
biplot
Line plots lab 6 prcomp (top) and metaPCA (bottom)
Looking for convergence as iteration increases
MetaPCA methods: Eigen, Angle, RobustAngle, SparseAngle
http://cran.r-project.org/web/packages/MetaPCA/MetaPCA.pdf
prostate data (lab 7) 2D plot.
Lab 9
library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM~log(SSF)+log(Wt)+log(Hg)+log(Ht)+log(WCC)+log(RCC)+
log(Hc)+log(Ferr),data=ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0,slice.function=dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8,ncol+3)
summary(s2<-update(s1,nslices=10,method="save"))
# Refit using phdres; output is similar for phdy, but its tests are not justifiable.
summary(s3<- update(s1,method="phdres"))
# fit using ire:
summary(s4 <- update(s1,method="ire"))
# fit using Sex as a grouping variable.
s5 <- update(s4,group=~Sex)
> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
Estimated Basis Vectors for Central Subspace:
                  Dir1          Dir2        Dir3         Dir4
log(SSF)   0.150963358 -0.0501785457  0.10898336 -0.002210206
log(Wt)   -0.916480522 -0.1942298625 -0.20123696 -0.089722026
log(Hg)   -0.131538894  0.6854750758  0.71997546 -0.663097774
log(Ht)   -0.093358860 -0.0433408964  0.46445398  0.290838658
log(WCC)   0.004467838  0.0001833808  0.04497590  0.071904557
log(RCC)  -0.188973540  0.3475652934  0.29496908  0.037056363
log(Hc)    0.274758965 -0.6058301419 -0.34196615  0.678877114
log(Ferr) -0.005631238  0.0130588502 -0.08702709  0.015547214
Eigenvalues:
[1] 0.95766163 0.24504161 0.10707594 0.09041305
> summary(s1 <- update(s0,slice.function=dr.slices.arc))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)
Method:
sir with 11 slices, n = 202.
Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15
Estimated Basis Vectors for Central Subspace:
               Dir1       Dir2     Dir3      Dir4
log(SSF)   0.143177 -0.0476079 -0.02815  0.003785
log(Wt)   -0.879504 -0.1425841  0.23303 -0.094970
log(Hg)   -0.195963  0.6318503  0.24483 -0.509424
log(Ht)   -0.058923 -0.1100757 -0.87893  0.217803
log(WCC)  -0.007276 -0.0029772 -0.05309  0.043056
log(RCC)  -0.167736  0.3924936 -0.19711 -0.213689
log(Hc)    0.368652 -0.6418658 -0.26373  0.796849
log(Ferr) -0.002697  0.0002593  0.03492  0.039116
              Dir1   Dir2    Dir3    Dir4
Eigenvalues 0.9572 0.2275 0.09368 0.07319
R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864
Large-sample Marginal Dimension Tests:
              Stat df p.value
0D vs >= 1D 284.78 80 0.00000
1D vs >= 2D  91.43 63 0.01113
2D vs >= 3D  45.48 48 0.57690
3D vs >= 4D  26.55 35 0.84694
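One way to read the marginal dimension tests above is sequentially: step through the hypotheses until one is no longer rejected. The sketch below re-enters the printed SIR results by hand, so it does not need the dr package. The 0.05 level is an assumed convention, and the degrees-of-freedom formula (p - d)(h - d - 1) and the chi-squared reference distribution are the standard large-sample SIR results, stated here as assumptions rather than taken from the slides.

```r
# Values copied from the summary(s1) output (SIR, 11 slices, 8 predictors)
tests <- data.frame(
  d       = 0:3,
  stat    = c(284.78, 91.43, 45.48, 26.55),
  df      = c(80, 63, 48, 35),
  p.value = c(0.00000, 0.01113, 0.57690, 0.84694)
)

# Assumed df formula for SIR: (p - d) * (h - d - 1)
p <- 8; h <- 11
df_check <- (p - tests$d) * (h - tests$d - 1)

# Recompute p-values from the chi-squared reference distribution
p_check <- pchisq(tests$stat, tests$df, lower.tail = FALSE)

# Estimated dimension: first d whose null hypothesis is not rejected
alpha   <- 0.05   # assumed significance level
est_dim <- tests$d[which(tests$p.value > alpha)[1]]
print(est_dim)
```

With these numbers the 0D and 1D hypotheses are rejected and 2D is not, which suggests keeping two directions.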
> summary(s2<-update(s1,nslices=10,method="save"))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc,
nslices = 10, method = "save")
Method:
save with 10 slices, n = 202.
Slice Sizes:
21 21 20 20 20 25 24 22 20 9
Estimated Basis Vectors for Central Subspace:
               Dir1     Dir2     Dir3     Dir4
log(SSF)   0.127709 -0.00907  0.01018 -0.06144
log(Wt)   -0.905004 -0.07107 -0.15734  0.25774
log(Hg)   -0.056187  0.50674 -0.34064 -0.38087
log(Ht)    0.399868  0.36613  0.68439 -0.54216
log(WCC)   0.032608  0.02733  0.02277  0.03474
log(RCC)  -0.008463  0.15137 -0.24136 -0.47219
log(Hc)   -0.021630 -0.76164  0.57591  0.51526
log(Ferr)  0.002116 -0.01670  0.01631 -0.03360
              Dir1   Dir2   Dir3   Dir4
Eigenvalues 0.9389 0.6611 0.5129 0.4653
R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989
Large-sample Marginal Dimension Tests:
             Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D 378.3     324      0.02012       0.1071
1D vs >= 2D 279.6     252      0.11214       0.3116
2D vs >= 3D 179.9     189      0.67101       0.5160
3D vs >= 4D 134.3     135      0.50176       0.2786
S0 v. S2
S3 and S4
Infrastructure tools
• In RStudio, install the rmongodb package
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
• MongoDB - http://www.mongodb.org/
– http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - get familiar with the choices
• General idea:
– These are “backend” stores that can do various “things”
Back-ends
• Files (e.g. csv), application files (e.g. Rdata, xls, mat, …) – essentially for reading/input
• Databases – for reading and writing
– Also for advanced operations inside the database
– Operations range from simple summaries to array operations and analytics functions
– Overhead is opening, maintaining, and closing connections: easy on your laptop, harder when they are remote (network, authentication, etc.)
– Overhead is also around their internal storage formats (e.g. BSON for MongoDB)
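The contrast between file and database back-ends can be sketched in R. The file part uses base R only. The database part swaps in an in-memory SQLite database through the DBI and RSQLite packages purely as a stand-in (it is not MongoDB or rmongodb), and it is skipped if those packages are not installed. The table name `arrests` and all variable names are illustrative.

```r
arrests <- USArrests   # local copy of the built-in data

# --- File back-ends: write to disk, then read back into R ---
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

write.csv(arrests, csv_path)                    # row names become the first column
from_csv <- read.csv(csv_path, row.names = 1)

save(arrests, file = rdata_path)                # R's own binary format
e <- new.env()
load(rdata_path, envir = e)                     # restores the object 'arrests' into e

# --- Database back-end (stand-in): open, operate inside the DB, close ---
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")   # open connection
  DBI::dbWriteTable(con, "arrests", arrests)             # write
  # A simple summary computed inside the database, not in R
  db_mean <- DBI::dbGetQuery(con, "SELECT AVG(Murder) AS mean_murder FROM arrests")
  DBI::dbDisconnect(con)                                 # close connection
  print(db_mean)
}
```

The connect and disconnect calls are the overhead the slide refers to; with a remote server they would also carry host and credential arguments.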
Functions versus languages
• Libraries for R mean that you code in R, call functions, and the result returns into R
– Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters)
• Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/database
– Cost is learning this new language
Example layout - Hadoop
Relating Open-Source and Commercial
Even further
• http://projects.apache.org/indexes/category.html#database
– Hadoop (MapReduce) – distributed execution (via disk when data is large)
– Pig (http://wiki.apache.org/pig/RunPig)
– HIVE (http://hive.apache.org/releases.html)
– Spark – in memory (RSpark still not easy to find/install) http://gigaom.com/2014/02/27/as-mapreduce-fades-apache-spark-is-now-a-top-level-project/
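The MapReduce idea behind Hadoop can be imitated in a single R process. The sketch below is not Hadoop and involves no distribution; it only shows the pattern of mapping a function over chunks of data and reducing the partial results. The chunk count and the function names `map_fn` and `reduce_fn` are illustrative assumptions.

```r
# Split the 50 rows into 5 chunks, standing in for blocks stored on different nodes
chunks <- split(USArrests, rep(1:5, each = 10))

# Map step: each chunk independently produces a small partial summary
map_fn <- function(chunk) list(n = nrow(chunk), sums = colSums(chunk))

# Reduce step: combine two partial summaries into one
reduce_fn <- function(a, b) list(n = a$n + b$n, sums = a$sums + b$sums)

mapped <- lapply(chunks, map_fn)
total  <- Reduce(reduce_fn, mapped)

# Column means recovered from the combined partial sums
means <- total$sums / total$n
print(means)
```

Because only the small partial summaries are combined, the full data never has to sit in one place, which is the point of the pattern.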
~ Objectives
• Provide an application, i.e. predictive/prescriptive model view of data analytics by focusing on the “front-end” (RStudio)
• Over a variety of data…
• Provide enough of a view of the back-end to know how you will need to interface to them (both open-source and commercial)
Layers across the Analytics Stack
Time for assignments