Web Scraping the Ministry of Land, Infrastructure and Transport (MOLIT) Real Transaction Price Site with R

Overview

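• Definitions of Open API and web scraping

• Essential technologies for implementation

• Retrieving FRED and ECOS data through APIs in R

• Retrieving apartment real transaction prices by scraping in R

• Model-based analyses using the retrieved data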

Open API and Web Scraping

Open API (often written OpenAPI) is a term for sets of technologies that let websites interact with each other using REST, SOAP, JavaScript, and other web technologies. Although its use is not limited to web-based applications, it is increasingly common in so-called Web 2.0 applications.

Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites. Such programs usually simulate human exploration of the World Wide Web, either by implementing low-level Hypertext Transfer Protocol (HTTP) requests or by embedding a full web browser such as Internet Explorer or Mozilla Firefox.

Essential technologies for implementation - R

R Base

Packages

RStudio

Essential technologies for implementation - JSON

JSON (JavaScript Object Notation)

The same member list in XML:

<root>
<ZFPMember><name>문**</name></ZFPMember>
<ZFPMember><name>박**</name></ZFPMember>
<ZFPMember><name>김**</name></ZFPMember>
<ZFPMember><name>최**</name></ZFPMember>
</root>

and in JSON:

{ "ZFPMember": [
{ "name": "문**" }, { "name": "박**" }, { "name": "김**" }, { "name": "최**" }
] }
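To make the JSON form concrete, here is a minimal sketch of reading that member list into R with jsonlite. The variable names (`json_text`, `members`, `member_names`) are illustrative and not from the slides.

```r
library(jsonlite)

# The member list from the slide, written as valid JSON text.
json_text <- '{ "ZFPMember": [
  { "name": "문**" }, { "name": "박**" }, { "name": "김**" }, { "name": "최**" }
] }'

# fromJSON() simplifies an array of same-shaped objects into a data frame.
members <- fromJSON(json_text)
str(members)

# Pull out the names as a plain character vector.
member_names <- members$ZFPMember$name
print(member_names)
```

Because the array elements all share the single key `name`, the result is a one-column data frame, which is the same shape the later API responses take.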

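Retrieving ECOS and FRED data through APIs in R

Five steps for retrieving data through an API

① Check whether an API key is required

② Install the required package (jsonlite)

③ Build the query

④ Retrieve the data

⑤ Parse the response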
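Steps ③ to ⑤ can be sketched offline as follows. This is only an illustration under stated assumptions: `build_query()` and `fetch_json()` are hypothetical helpers, the `api.example.com` URL and its parameters are placeholders, and a temporary file stands in for a live API response.

```r
library(jsonlite)

# Step 3: build the query from a base URL and named parameters.
# build_query() is a hypothetical helper, not part of any package.
build_query <- function(base_url, params) {
  pairs <- paste0(names(params), "=", unlist(params))
  paste0(base_url, "?", paste(pairs, collapse = "&"))
}

# Steps 4 and 5: read the raw text, then parse it as JSON.
# `source` may be a URL or a local file path; readLines() accepts both.
fetch_json <- function(source) {
  raw <- readLines(source, warn = FALSE, encoding = "UTF-8")
  fromJSON(paste(raw, collapse = "\n"))
}

# Placeholder endpoint and parameters (assumptions, not a real service).
query <- build_query("http://api.example.com/data",
                     list(series_id = "ABC", api_key = "YOUR_KEY",
                          file_type = "json"))
print(query)

# Stand-in for a live response so the sketch runs without network access.
tmp <- tempfile(fileext = ".json")
writeLines('{"observations":[{"date":"2014-01-01","value":"1.5"}]}', tmp)
dat <- fetch_json(tmp)
str(dat)
```

With a real service, `fetch_json(query)` would replace the temporary-file call.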

Check whether an API key is required

Install the required package (jsonlite)

> install.packages("jsonlite")
> library(jsonlite)

Building the query (FRED)

http://api.stlouisfed.org/fred/series/observations?

series_id=CPIAUCSL

&api_key=b55d00cc4e7ea4483038c2f6edad____

&file_type=json

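Retrieving the data (FRED)

library(jsonlite)
series_id <- "CPIAUCSL"
api_key <- "b55d00cc4e7ea4483038c2f6edad____"
file_type <- "json"
# Assemble the request URL from its query parameters
url <- paste0("http://api.stlouisfed.org/fred/series/observations",
              "?series_id=", series_id,
              "&api_key=", api_key,
              "&file_type=", file_type)
# Read the raw JSON response; warn = FALSE suppresses the incomplete-final-line warning
raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")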

Processing the data (FRED)

> dat <- fromJSON(raw.data)
> str(dat)
List of 13
 $ realtime_start   : chr "2014-06-12"
 $ realtime_end     : chr "2014-06-12"
 $ observation_start: chr "1776-07-04"
 :
 $ limit            : num 1e+05
 $ observations     :'data.frame': 808 obs. of 4 variables:
  ..$ realtime_start: chr [1:808] "2014-06-12" "2014-06-12" ...
  ..$ realtime_end  : chr [1:808] "2014-06-12" "2014-06-12" ...
  ..$ date          : chr [1:808] "1947-01-01" "1947-02-01" ...
  ..$ value         : chr [1:808] "21.48" "21.62" "22.0" "22.0" ...
> dat$observations$value
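The observations arrive as character columns, so they need converting before any numeric work. The sketch below uses a small stand-in for `dat$observations`; the `"."` entry is an assumed missing-value marker added for illustration.

```r
# A small stand-in for dat$observations; the last value uses "." as an
# assumed missing-value marker.
obs <- data.frame(
  date  = c("1947-01-01", "1947-02-01", "1947-03-01", "1947-04-01"),
  value = c("21.48", "21.62", "22.0", "."),
  stringsAsFactors = FALSE
)

# Convert text to proper types; non-numeric entries become NA.
obs$date  <- as.Date(obs$date)
obs$value <- suppressWarnings(as.numeric(obs$value))

str(obs)
summary(obs$value)
```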

Building the query (ECOS)

http://ecos.bok.or.kr/api/StatisticTableList/SCES3Y78SI__/xml/kr/1/10

Building the query (ECOS)

http://ecos.bok.or.kr/api/StatisticItemList/sample/xml/kr/1/10/021Y123/

http://ecos.bok.or.kr/api/StatisticSearch/SCES3Y78SI__/xml/kr/1/1000/021Y123/MM/196501/201405/0/

Retrieving the data (ECOS)

library(jsonlite)
# Each path segment carries its trailing slash so paste0() can join them directly
api_key <- "SCES3Y78SI__/"; file_type <- "json/"; lang_type <- "kr/"
start_no <- "1/"; end_no <- "100/"
stat_code <- "021Y123/"; cycle_type <- "MM/"
start_date <- "196501/"; end_date <- "201405/"
item_no <- "0"
url <- paste0("http://ecos.bok.or.kr/api/StatisticSearch/", api_key, file_type, lang_type,
              start_no, end_no, stat_code, cycle_type, start_date, end_date, item_no)
raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
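The StatisticSearch URL is a fixed sequence of path segments, so it can also be assembled by a small function that adds the slashes itself. `build_ecos_url()` and its defaults are hypothetical; the segment order simply follows the URL shown in the slides, and `"sample"` is the key that appears in the StatisticItemList example.

```r
# build_ecos_url() is a hypothetical helper; the segment order follows the
# StatisticSearch URL shown in the slides.
build_ecos_url <- function(api_key, stat_code, cycle_type, start_date, end_date,
                           item_no = "0", file_type = "json", lang_type = "kr",
                           start_no = 1, end_no = 100,
                           base = "http://ecos.bok.or.kr/api/StatisticSearch") {
  # c() coerces the numeric row bounds to text alongside the other segments.
  segments <- c(base, api_key, file_type, lang_type, start_no, end_no,
                stat_code, cycle_type, start_date, end_date, item_no)
  paste(segments, collapse = "/")
}

url <- build_ecos_url(api_key = "sample", stat_code = "021Y123",
                      cycle_type = "MM", start_date = "196501",
                      end_date = "201405")
print(url)
```

Keeping the slashes out of the individual values means each argument can be reused elsewhere, for example as a column label, without trimming.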

Processing the data (ECOS)

> raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
> dat <- fromJSON(raw.data)
> str(dat)
List of 1
 $ StatisticSearch:List of 2
  ..$ list_total_count: num 25
  ..$ row             :'data.frame': 10 obs. of 8 variables:
  .. ..$ UNIT_NAME : chr [1:10] "십억원 " "십억원
  .. ..$ STAT_NAME : chr [1:10] "1.1.주요 통화금융지
  .. ..$ STAT_CODE : chr [1:10] "010Y002" "010Y002" "010Y002"
  .. ..$ ITEM_NAME1: chr [1:10] "화폐발행잔액(말잔)" "화폐발
  .. ..$ ITEM_NAME2: chr [1:10] " " " " " " "
  .. ..$ DATA_VALUE: chr [1:10] "49777.5" "50528" "50226.
  .. ..$ ITEM_NAME3: chr [1:10] " " " " " " " "
  .. ..$ TIME      : chr [1:10] "201204" "201205"
> dat$StatisticSearch$row$DATA_VALUE
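As with FRED, the ECOS fields are text. A minimal sketch of turning the monthly `TIME` codes into dates and `DATA_VALUE` into numbers follows; the data frame is a stand-in for `dat$StatisticSearch$row`, built from the two values visible in the output, and the `DATE` and `VALUE` column names are illustrative.

```r
# Stand-in for dat$StatisticSearch$row, using values visible in the output.
row <- data.frame(
  TIME       = c("201204", "201205"),
  DATA_VALUE = c("49777.5", "50528"),
  stringsAsFactors = FALSE
)

# "YYYYMM" has no day, so append "01" before parsing (first of the month).
row$DATE  <- as.Date(paste0(row$TIME, "01"), format = "%Y%m%d")
row$VALUE <- as.numeric(row$DATA_VALUE)

print(row[, c("DATE", "VALUE")])
```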

Analyzing the data (FRED)

library(jsonlite)
library(zoo)
# Consumer price index, unemployment rate, federal funds rate (policy rate)
lst_series <- c("CPIAUCSL", "UNRATE", "FEDFUNDS")
api_key <- "b55d00cc4e7ea4483038c2f6edad____"
file_type <- "json"
ts <- zoo()
for (i in seq_along(lst_series)) {
  url <- paste0("http://api.stlouisfed.org/fred/series/observations",
                "?series_id=", lst_series[i], "&api_key=", api_key, "&file_type=", file_type)
  raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
  dat <- fromJSON(raw.data)
  # One zoo series per indicator, indexed by observation date
  temp <- zoo(as.numeric(dat$observations$value), as.Date(dat$observations$date))
  if (i == 1) {
    ts <- temp
  } else {
    # Join on date, then carry the last observation forward over gaps
    ts <- na.locf(merge(ts, temp))
    colnames(ts)[i] <- lst_series[i]
  }
}
colnames(ts)[1] <- lst_series[1]  # name the first column
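The loop relies on two operations: an outer join by date and last-observation-carried-forward filling. The base-R sketch below reproduces both on toy data to show what they do. It is not the zoo implementation; `locf()` is a hypothetical helper, the values are made up, and leading NAs are kept in this sketch.

```r
# Two toy series on different dates (values are made up for illustration).
a <- data.frame(date = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
                a = c(1, 2, 3))
b <- data.frame(date = as.Date(c("2014-01-01", "2014-03-01")),
                b = c(10, 30))

# Outer join: keep every date from either series; gaps become NA.
m <- merge(a, b, by = "date", all = TRUE)

# locf() is a hypothetical helper: replace each NA with the latest earlier
# non-NA value; leading NAs are left alone.
locf <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  out <- x
  out[idx > 0] <- x[idx[idx > 0]]
  out
}

m$b <- locf(m$b)
print(m)
```

After the join, series `b` has no February value; the fill copies the January value into that row, which is how a lower-frequency series ends up aligned with a monthly one.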

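Analyzing the data (FRED)

# Drop rows where the third series is NA
ts <- ts[!is.na(ts[, 3]), ]
# First difference
ts.diff1 <- diff(ts, lag = 1)
# ACF (autocorrelation) plot
acf(as.numeric(ts.diff1[, 1]), main = colnames(ts)[1])
# Change relative to the previous period: divide by the previous period's level
ts.rate <- ts.diff1 / lag(ts, k = -1)
# Convert to a data frame
df <- data.frame(ts)
# Plot
plot(x = as.Date(rownames(df)), y = df[, 1], type = "l", xlab = "date", ylab = colnames(df)[1])
# Regression
summary(lm(CPIAUCSL ~ UNRATE + FEDFUNDS, data = df))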

Web Scraping (MOLIT)

library(jsonlite)
dongCode <- "1168010600"
danjiCode <- "ALL"
srhYear <- "2014"
srhPeriod <- "1"
gubunRadio2 <- "1"
url <- paste0("http://rt.molit.go.kr/rtApt.do?cmd=getTradeAptLocal&dongCode=",
              dongCode, "&danjiCode=", danjiCode, "&srhYear=", srhYear,
              "&srhPeriod=", srhPeriod, "&gubunRadio2=", gubunRadio2)
raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
dat <- fromJSON(raw.data)
str(dat)
# Keep the APT_CODE, AREA, MONTH, and SUM_AMT fields
df <- data.frame(cbind(
  dat$detailList$APT_CODE, dat$detailList$AREA,
  dat$detailList$MONTH, dat$detailList$SUM_AMT))
write.csv(df, file = "aptTrans.csv")

Web Scraping (MOLIT) - bulk download

library(jsonlite)
dongCode <- "1168010600"
danjiCode <- "ALL"
gubunRadio2 <- "1"
dft <- data.frame()
# Loop over each year (i) and period (j, 1 to 4), appending each result to dft
for (i in 2006:2014) {
  for (j in 1:4) {
    url <- paste0("http://rt.molit.go.kr/rtApt.do?cmd=getTradeAptLocal&dongCode=",
                  dongCode, "&danjiCode=", danjiCode, "&srhYear=", i,
                  "&srhPeriod=", j, "&gubunRadio2=", gubunRadio2)
    raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
    dat <- fromJSON(raw.data)
    df <- data.frame(cbind(dat$detailList$APT_CODE, dat$detailList$AREA,
                           dat$detailList$MONTH, dat$detailList$SUM_AMT))
    dft <- rbind(dft, df)
  }
}
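For longer runs, one possible variation is to collect each period's result in a list, skip failed requests, and bind everything once at the end. The sketch below runs offline under stated assumptions: `fetch_period()` is a hypothetical stand-in for the download-and-parse step, and the commented-out delay is an assumed courtesy toward the server, not something the slides specify.

```r
# fetch_period() is a hypothetical stand-in for the readLines()/fromJSON() step;
# it returns a one-row data frame so the sketch runs without network access.
fetch_period <- function(year, period) {
  data.frame(YEAR = year, PERIOD = period, SUM_AMT = NA_character_,
             stringsAsFactors = FALSE)
}

years   <- 2006:2014
periods <- 1:4
results <- list()
for (i in years) {
  for (j in periods) {
    # A failed request yields NULL instead of stopping the whole loop.
    res <- tryCatch(fetch_period(i, j), error = function(e) NULL)
    if (!is.null(res)) {
      results[[length(results) + 1]] <- res
    }
    # Sys.sleep(1)  # assumed courtesy delay between live requests
  }
}

# Bind all collected pieces in a single call.
dft <- do.call(rbind, results)
print(dim(dft))
```

Recording the year and period in each piece also keeps track of which request every row came from.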

But, Quantmod

Provides data from Yahoo! Finance, FRED, Google Finance, and Oanda, The Currency Site, through function calls

- http://www.quantmod.com/

And, Quandl

Provides data from more than 9 million datasets through function calls

- http://www.quandl.com/
