Upload
harold-snow
View
218
Download
0
Tags:
Embed Size (px)
Citation preview
1
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 4b, February 14, 2014
Lab exercises: regression, kNN and K-means
Today• Linear regression
• K Nearest Neighbors
• K Means
2
The Dataset(s)• http://escience.rpi.edu/data/DA
• Some new ones; nyt/ and sales/ and the fb100/ (.mat files)
• 3 script (fragments, i.e. they will not run as-is, I think) to help with code for today: Lab4b_{1,2,3}.R
3
Linear and least-squares> multivariate <- read.csv(”EPI_data.csv")
> attach(EPI_data);
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)
4
Predict> DALYNEW<-c(seq(5,95,5))
> AIR_HNEW<-c(seq(5,95,5))
> WATER_HNEW<-c(seq(5,95,5))
> NEW<-data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)
> pENV<- predict(lmENV,NEW,interval=“prediction”)
> cENV<- predict(lmENV,NEW,interval=“confidence”) 5
Repeat forAIR_E
CLIMATE
6
Remember a few useful cmdshead(<object>)
tail(<object>)
summary(<object>)
7
K Nearest Neighbors (classification)> nyt1<-read.csv(“nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
#
> classif
> attributes(.Last.value)
# interpretation to come!8
Regression> bronx<-read.xlsx(”sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE) )
> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
What’s wrong?9
Clean up…> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
#
> summary(m1)10
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
Min 1Q Median 3Q Max
-14.4529 0.0377 0.4160 0.6572 3.8159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0271 0.3088 22.75 <2e-16 ***
log(GROSS.SQUARE.FEET) 0.7013 0.0379 18.50 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16 11
Plot> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> abline(m1,col="red",lwd=2)
# then
> plot(resid(m1)) 12
Another model (2)?Add two more variables to the linear model LAND.SQUARE.FEET and NEIGHBORHOOD
Repeat but suppress the intercept (2a)
13
Model 3/4Model 3
Log(SALE.PRICE) vs. no intercept Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD, BUILDING.CLASS.CATEGORY
Model 4
Log(SALE.PRICE) vs. no intercept Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD*BUILDING.CLASS.CATEGORY
14
Solution model 2> m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2a)
> plot(resid(m2a))15
16
Solution model 3 and 4> m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m4)
> plot(resid(m4))
17
18
And now… a complex example> install.packages("geoPlot")
> install.packages(”xslx")
> require(class)
> require(gdata)
> require(geoPlot)
> require(”xslx”)
#if not already read-in:
bronx<-read.xlsx(”sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)
19
View(bronx)#clean up with regular expressions
bronx$SALE.PRICE<- as.numeric(gsub("[^]:digit:]]","",bronx$SALE.PRICE))
#missing values?
sum(is.na(bronx$SALE.PRICE))
#zero sale prices
sum(bronx$SALE.PRICE==0)
#clean these numeric and date fields
bronx$GROSS.SQUARE.FEET<- as.numeric(gsub("[^]:digit:]]","",bronx$GROSS.SQUARE.FEET))
bronx$LAND.SQUARE.FEET<- as.numeric(gsub("[^]:digit:]]","",bronx$LAND.SQUARE.FEET))
bronx$SALE.DATE<- as.Date(gsub("[^]:digit:]]","",bronx$SALE.DATE))
bronx$YEAR.BUILT<- as.numeric(gsub("[^]:digit:]]","",bronx$YEAR.BUILT))
bronx$ZIP.CODE<- as.character(gsub("[^]:digit:]]","",bronx$ZIP.CODE))
20
More corrections#filter out low prices
minprice<-10000
bronx<-bronx[which(bronx$SALE.PRICE>=minprice),]
#how many left?
nval<-dim(bronx)[1]
#addresses contain apartment #'s even though there is another column for that - remove them - compresses addresses
bronx$ADDRESSONLY<- gsub("[,][[:print:]]*","",gsub("[ ]+","",trim(bronx$ADDRESS)))
#new data frame for sorting the addresses, fixing etc.
bronxadd<-unique(data.frame(bronx$ADDRESSONLY, bronx$ZIP.CODE,stringsAsFactors=FALSE))
# fix the names
names(bronxadd)<-c("ADDRESSONLY","ZIP.CODE")
bronxadd<-bronxadd[order(bronxadd$ADDRESSONLY),]21
Yep, more…# duplicates?
duplicates<-duplicated(bronxadd$ADDRESSONLY)
##if(duplicates) dupadd<-bronxadd[bronxadd$duplicates,1]
##bronxadd<-bronxadd[(bronxadd$ADDRESSONLY!=dupadd[1] & bronxadd$ADDRESSONLY != dupadd[2]),]
#how many?
nadd<-dim(bronxadd)[1] 22
Oh, we want nearest neighbors? How?
#problem, we need a spatial distribution since none of the columns have that
#we will use google maps so limit the number to under 500 (ask me why)
nsample=450
addsample<-bronxadd[sample.int(nadd,size=nsample),]
#new data frame for the full address
addrlist<-data.frame(1:nsample,addsample$ADDRESSONLY,rep("NEW YORK",times=nsample),rep("NY",times=nsample),addsample$ZIP.CODE,rep("US",times=nsample))
#look them up
querylist<-addrListLookup(addrlist)23
Lots missing – why?# how many returned valid lat/long?
querylist$matched <- (querylist$latitude !=0)
unmatchedid<- which(!querylist$matched)
#MANY missing - what's up?
unmatched<- length(unmatchedid)
#WEST -> W and EAST -> E - do again.
addrlist2<-data.frame(1:unmatched,gsub(" WEST "," W ",gsub(" EAST "," E ",addsample[unmatchedid,1])),rep("NEW YORK",times=unmatched),rep("NY",times=unmatched),addsample[unmatchedid,2],rep("US",times=unmatched))
querylist[unmatchedid,1:4]<-addrListLookup(addrlist2)[,1:4]
querylist$matched <- (querylist$latitude !=0)
unmatchedid<- which(!querylist$matched)
unmatched<- length(unmatchedid)24
Not enough#this fixed a LOT but we need more: STREET and AVENUE (could have done PLACE) and others
addrlist3<-data.frame(1:unmatched,gsub("WEST","W",gsub("EAST","E",gsub("STREET","ST ", gsub("AVENUE","AVE", addsample[unmatchedid,1])))),rep("NEW YORK", times=unmatched), rep("NY",times=unmatched), addsample[unmatchedid,2], rep("US",times=unmatched))
querylist[unmatchedid,1:4]<-addrListLookup(addrlist3)[,1:4]
querylist$matched <- (querylist$latitude !=0)
unmatchedid<- which(!querylist$matched)
unmatched<- length(unmatchedid)
# 9 left now? good enough.25
Rebuild!addsample<-cbind(addsample,querylist$latitude,querylist$longitude)
##names(addsample[3:4])<-c("latitude","longitude") - this was meant to correct the column names but did not work for me
addsample<-addsample[addsample$'querylist$latitude'!=0,] # note ' ' to work around column name
adduse<-merge(bronx,addsample)
adduse<-adduse[!is.na(adduse$latitude),]
26
Most satisfying part!mapcoord<-adduse[,c(2,4,24,25)]
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
#
geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD)
27
28
Did you forget the KNN?#almost there.
mapcoord$class<as.numeric(mapcoord$NEIGHBORHOOD)
nclass<-dim(mapcoord)[1]
split<-0.8
trainid<-sample.int(nclass,floor(split*nclass))
testid<-(1:nclass)[-trainid]
##mappred<-mapcoord[testid,]
##mappred$class<as.numeric(mappred$NEIGHBORHOOD) 29
KNN!kmax<-10
knnpred<-matrix(NA,ncol=kmax,nrow=length(testid))
knntesterr<-rep(NA,times=kmax)
for (i in 1:kmax){ # loop over k
knnpred[,i]<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=i)
knntesterr[i]<-sum(knnpred[,i]!=mapcoord[testid,2])/length(testid)
} 30
Finally K-Means!> mapmeans<-data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobj,method=c("centers","classes"))
> plot(mapmeans,mapobj$cluster) 31
32
Assignment 3• Preliminary and Statistical Analysis. Due ~
Feb 28. 15% (written)– Distribution analysis and comparison, visual
‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.
33
Tentative assignments
• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);
• Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);
• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).
34
Admin info (keep/ print this slide)• Class: ITWS-4963/ITWS 6965• Hours: 12:00pm-1:50pm Tuesday/ Friday• Location: SAGE 3101• Instructor: Peter Fox• Instructor contact: [email protected], 518.276.4862 (do not
leave a msg)• Contact hours: Monday** 3:00-4:00pm (or by email appt)• Contact location: Winslow 2120 (sometimes Lally 207A
announced by email)• TA: Lakshmi Chenicheri [email protected] • Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.
35
Table: Matlab/R/scipy-numpy• http://hyperpolyglot.org/numerical-analysis
36