8/19/2019 03 Data Input Output
1/43
Data Analysis & Data Science with R
Data Input and Output (Import and Export)
By Marin Fotache
Al.I. Cuza University of Iași, Faculty of Economics and Business Administration, Department of Accounting, Information Systems and Statistics
Scripts associated with this presentation
R scripts
◦ 03a_basics_of_data_input_output.R: http://1drv.ms/1DRMOTC
◦ 03b_intermediate_data_input_output: http://1drv.ms/1JKeKwi
PostgreSQL scripts (for creating the DB to be imported in R - see script 03a...)
◦ 00a_1_creating_tables__sales.sql: http://1drv.ms/1JKeRbh
◦ 00a_2_populating_tables__sales.sql: http://1drv.ms/1JKeYU0
◦ 01_creare_bd_vinzari_PostgreSQL.sql: http://1drv.ms/1JKf522
◦ 02_populare_bd_vinzari_PostgreSQL.sql: http://1drv.ms/1JKfem5
Further links on this slide: http://1drv.ms/1y1UeAw, http://1drv.ms/1AqPIvP, http://1drv.ms/1AqPFjG
Scripts associated with this presentation (cont.)
Oracle scripts (for creating the DB to be imported in R - see script 03b...)
◦ 01-01a_creating_tables__sales.sql: http://1drv.ms/1LBNImg
◦ 01-01a_ro_creare_bd_vinzari.sql: http://1drv.ms/1AqQ…vg
◦ 01-01b_populating_tables__sales.sql: http://1drv.ms/1LBN…
Web sites with R tutorials for data input/output
R Data Import/Export: http://cran.r-project.org/doc/manuals/r-release/R-data.html
Beginner's guide to R: Hlist 02? l>%:/ v ; Dfc 0FJ7gzJ
Web sites with R tutorials for data input/output (cont.)
Importing Data Into R from Different Sources: http://www.r-bloggers.com/importing-data-into-r-from-different-sources/
Data Import & Export in R: http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/index.cfm
Reading data from the new version of
Loading data into statistical packages
Traditional solutions:
◦ Direct import from external data files (Excel, CSV, text files etc.) using their menus
◦ Save intermediate results from the data sources into common format files (XML, CSV, JSON) and then import these intermediate files into the package
◦ Create data sources using ODBC or JDBC
Some more recent options:
◦ Customized (for the data source and the destination package) ETL procedures
◦ Connecting to special APIs or web data services which provide data sets in formats easy to import (e.g. Google Analytics)
◦ Import data from web server logs into NoSQL data stores
◦ Performing database queries on a database server directly from the statistical package
Sources of Data in R (adapted from Kabacoff, 2011)
[Figure: sources of data in R; recoverable labels include NoSQL Data Stores and Hadoop]
Loading data sets stored within packages
See previous presentation
Many packages include datasets, such as ggplot2; after a package is loaded, all of its datasets are available:
> library(ggplot2)
> str(diamonds)
'data.frame':	53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Loading data objects saved in the workspace of a previous session
Save the workspace associated with the current session (the workspace contains all of the data objects existing at a point in time):
> ws.name <- paste("work", Sys.Date(),
    ".RData", sep = "")
> save.image(file = ws.name)
Restore (load) a previously saved workspace (and all the data objects in the workspace):
> load("work2014-09-12.RData")
When several workspaces have been saved, one can choose which one to load
> load(file.choose())
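The save/restore cycle above can be tried without touching the working directory. The following is a minimal sketch, assuming a temporary directory as target and two made-up objects (scores, groups); the "work_" prefix and the separate environment for loading are illustrative choices, not part of the original slides.

```r
# Hypothetical objects standing in for the data of a real session
scores <- c(7.5, 8.0, 9.25)
groups <- factor(c("a", "b", "a"))

# Dated file name in a temporary directory, e.g. work_2014-09-12.RData
ws.name <- file.path(tempdir(), paste0("work_", Sys.Date(), ".RData"))

# save.image() writes the objects of the global environment to the file
save.image(file = ws.name)

# load() returns the names of the restored objects; loading into a fresh
# environment avoids overwriting objects of the current session
restored.env <- new.env()
restored.names <- load(ws.name, envir = restored.env)
print(restored.names)
```

Loading into a separate environment is optional; a plain load(ws.name) restores the objects directly into the current session, as on the slide.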
Data entered from the keyboard (1)
The simplest method of data entry: function edit() launches a text editor that allows entering your data manually
If the data frame exists:
> student_gi <- edit(student_gi)
or
> fix(student_gi)
Data entered from the keyboard (2)
If the data frame does not exist, follow two steps:
◦ 1. Create an empty data frame (or matrix) with the variable names and types you want to have in the final dataset.
> mydata <- data.frame(age = numeric(0),
    gender = character(0),
    weight = numeric(0))
◦ 2. Invoke the text editor on this data object, enter data, and save the results back to the data object.
> mydata <- edit(mydata)
Or
> fix(mydata)
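Since edit() and fix() need an interactive session, a script can fill the same empty data frame programmatically. This is a minimal sketch with made-up values; stringsAsFactors = FALSE is set explicitly so that gender stays a character column in every R version.

```r
# Step 1: empty data frame with the desired variable names and types
mydata <- data.frame(age = numeric(0),
                     gender = character(0),
                     weight = numeric(0),
                     stringsAsFactors = FALSE)
str(mydata)

# Step 2 (non-interactive stand-in for edit()): append rows with rbind()
mydata <- rbind(mydata,
                data.frame(age = 25, gender = "f", weight = 61.2,
                           stringsAsFactors = FALSE))
mydata <- rbind(mydata,
                data.frame(age = 31, gender = "m", weight = 80.4,
                           stringsAsFactors = FALSE))
print(mydata)
```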
Data entered from clipboard
One can copy into the clipboard small sections of data in a table (e.g. a spreadsheet, a web HTML table) using Ctrl-C (copy command)
On Windows, command read.table handles clipboard data with a header row that is separated by tabs, and stores the data in a data frame (x):
> x <- read.table(file = "clipboard", sep = "\t",
    header = TRUE)
On Mac OS, the clipboard is read through pipe("pbpaste"):
◦ copy without header
> x <- read.table(pipe("pbpaste"), sep = "\t",
    header = FALSE)
◦ copy with header
> y <- read.table(pipe("pbpaste"), sep = "\t",
    header = TRUE)
Import from local CSV/delimited text files
read.table() reads a file in table format and saves it as a data frame
> mydataframe <- read.table(file, header = logical_value,
    sep = "delimiter", row.names = "name")
◦ file is a delimited ASCII file
◦ header is a logical value indicating whether the first row contains variable names (TRUE or FALSE)
◦ sep specifies the delimiter separating the data values
◦ row.names is an optional parameter specifying one or more variables to represent row identifiers
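The effect of the three parameters can be seen on a small inline table. This is a minimal sketch: the text= argument stands in for a real file, and the column names and values (id, age, weight) are made up for illustration.

```r
# Small inline table standing in for a delimited ASCII file (made-up values)
raw.text <- "id;age;weight
p1;25;70.5
p2;31;82.0
p3;47;64.3"

# header = TRUE    : the first row holds the variable names
# sep = ";"        : the delimiter between data values
# row.names = "id" : the id column becomes the row identifiers
patients <- read.table(text = raw.text, header = TRUE, sep = ";",
                       row.names = "id")

str(patients)
print(patients["p2", "age"])
```

With row.names set, rows can be addressed by identifier (patients["p2", ]) instead of by position.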
Data input from local delimited text files
Data frame births2006 is located in directory DataSets/births2006
The delimiter in the source file is Tab (\t)
The name of the file to be imported is births2006.txt
Qualifying the subdirectory differs slightly from one operating system to another
> switch(Sys.info()[['sysname']],
    Windows = {births2006 <- read.table(
      "births2006\\births2006.txt",
      fileEncoding = "UTF-8", header = ...
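As an alternative to branching on the operating system, base R's file.path() builds the path with a separator that works on every platform. This is a sketch under assumptions: the directory is created under a temporary folder, and the tiny tab-delimited file (columns month, sex, weight) is a made-up stand-in for the real births2006.txt.

```r
# Hypothetical layout: <temporary directory>/births2006/births2006.txt
data.dir <- file.path(tempdir(), "births2006")
dir.create(data.dir, showWarnings = FALSE)
data.file <- file.path(data.dir, "births2006.txt")

# Write a tiny tab-delimited stand-in file (made-up values)
writeLines(c("month\tsex\tweight",
             "1\tfemale\t3200",
             "2\tmale\t3450"),
           data.file)

# The same call works unchanged on Windows, Mac OS and Linux
births.sample <- read.table(data.file, header = TRUE, sep = "\t",
                            fileEncoding = "UTF-8")
print(births.sample)
```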
Data input from local delimited text files (cont.)
When we are not sure about the file name, one can use the function file.choose() instead of the file name:
> births2006.2 <- read.table(file.choose(),
    fileEncoding = "UTF-8",
    header = ...
Data input from local CSV files
Import one dataset from Irina Dan's Ph.D. thesis concerning a study of using e-documents in companies
Data source is a tab delimited file - companyinfo.csv
Current working directory is .../DataSets
File to be imported is located in directory .../DataSets/IrinaDan
> switch(Sys.info()[['sysname']],
    Windows = {comp <- read.table(
      "IrinaDan\\companyinfo.csv",
      header = ...
Data input from local CSV files (cont.)
Import a dataset from Dragos Cogean's Ph.D. thesis which compares two cloud database services, Mongo and MySQL
Data sources are tab delimited files
The .txt (tab-delimited) file resides in directory \DataSets\DragosCogean
Notice the second version of function switch()
> InsertMongo_ALL <- switch(Sys.info()[['sysname']],
    Windows = {read.table(
      "DragosCogean\\InsertMongo_ALL.txt",
      fileEncoding = "UTF-8", header = ...
Data input from local CSV files (cont.)
Import Toyota Corolla second hand cars data set (located in ...\DataSets\ToyotaCorolla directory);
Also notice the third version of function switch()
> ...
Data input from text file available on web
Heart attack data set
◦ Description available at: http://courses.statistics.com/software/R/tables4R.htm
◦ The data set (as delimited text file) available at: http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt
> heart.att <- read.table(
    "http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt",
    header = TRUE)
> head(heart.att)
  Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
1       1     41041   F 122    0 4752.00  10  79
2       2     41041   F 122    0 3941.00   6  34
3       3     41091   F 122    0 3657.00   5  76
4       4     41081   F 122    0 1481.00   2  80
5       5     41091   M 122    0 1681.00   1  55
Data input from CSV file available on web
Smoking data set: 356 people polled on their smoking status (Smoke) and their socioeconomic status (SES).
The data file contains only two columns, and when read R interprets them both as factors (in R versions before 4.0, where stringsAsFactors defaults to TRUE):
> smoker <- read.csv("http://www.cyclismo.org/tutorial/R/_static/smoker.csv")
> head(smoker)
   Smoke  SES
1 former High
2 former High
3 former High
4 former High
5 former High
6 former High
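Whether text columns arrive as factors depends on the stringsAsFactors argument, so it can be worth setting it explicitly. This is a minimal sketch on a few made-up rows that only mimic the layout of the smoking data set (columns Smoke and SES); it does not download the real file.

```r
# Made-up rows with the same two columns as the smoking data set
csv.text <- "Smoke,SES
former,High
never,Low
current,Middle
never,High"

# Request factors explicitly, independent of the R version's default
smoker.sample <- read.csv(text = csv.text, stringsAsFactors = TRUE)
str(smoker.sample)

# Factors make cross-tabulation straightforward
print(table(smoker.sample$Smoke, smoker.sample$SES))
```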
Download and read
When a data set is large, instead of the direct import...
> dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
... one can proceed in two steps:
◦ 1. download the file
> download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data",
    destfile = "data.csv")
trying URL 'http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data'
Content type 'text/plain; charset=UTF-8' length 402355 bytes (392 Kb)
opened URL
downloaded 392 Kb
◦ 2. import the downloaded file
> df.2 <- read.csv("data.csv")
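The two-step approach also allows skipping the download when a local copy already exists. This is a minimal sketch that runs offline: a small local CSV exposed through a file:// URL stands in for the remote data set, and the file names (data_copy.csv) are illustrative assumptions.

```r
# Stand-in for a remote data set: a small local CSV behind a file:// URL
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), src,
          row.names = FALSE)
src.url <- paste0("file://",
                  if (.Platform$OS.type == "windows") "/" else "",
                  normalizePath(src, winslash = "/"))

# Step 1: download only if a local copy is not already there
dest <- file.path(tempdir(), "data_copy.csv")
if (!file.exists(dest)) {
  download.file(src.url, destfile = dest, mode = "wb", quiet = TRUE)
}

# Step 2: import the downloaded file
df.local <- read.csv(dest)
print(df.local)
```

For a real data set, src.url would simply be the http address used on the slide.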
Importing data from Excel files
The "simplest" way to read an Excel (.xls/.xlsx) file is to save it in Excel as a text (tab delimited) or csv file and then to read it as in the previous slides
Loading .xls/.xlsx files directly into R is possible through various packages:
◦ RODBC
◦ gdata
◦ xlsReadWrite
◦ XLConnect
◦ xlsx
Problems (on Windows systems) when loading some packages
> install.packages("xlsx")
> library("xlsx")
Loading required package: rJava
Loading required package: xlsxjars
xlsx requires package rJava; on Windows systems that sometimes creates problems (e.g. R 64 bits on Windows)
On my computer (Windows 64 bit), with a downloaded Java runtime in directory C:\Program Files\Java\jre7:
> options(java.home = "C:\\Program Files\\Java\\jre7\\")
Import data from local PostgreSQL databases (cont.)
Launch the PostgreSQL query; the result of the query will be saved into data frame invoice_detailed:
> invoice_detailed <-
    dbGetQuery(con,
    "SELECT i.invoiceNo, invoiceDate, i.customerId,
    customerName, place, countyName, region,
    comments, invoiceRowNumber, id.productId,
    productName, unitOfMeasurement, category,
    quantity, unitPrice, quantity * unitPrice AS amountWithoutV...
Import data from local PostgreSQL databases (cont.)
Inspect the first rows of the resulting data frame invoice_detailed:
> head(invoice_detailed, 3)
  invoiceno invoicedate customerid customername place
1      1111  2012-08-01       1001 Client 1 SRL  Iasi
2      1111  2012-08-01       1001 Client 1 SRL  Iasi
3      1111  2012-08-01       1001 Client 1 SRL  Iasi
  countyname  region comments invoicerownumber productid
1       Iasi Moldova     <NA>                1         1
2       Iasi Moldova     <NA>                2         2
3       Iasi Moldova     <NA>                3         5
  productname unitofmeasurement   category quantity
1   Product 1            b500ml Category A       50
2   Product 2                kg Category B       75
3   Product 5              unit Category A       50
  unitprice amountwithoutvat amount
1      1000            50000  62000
2      1050            78750  88200
3      7060           353000 437720
Saving the data frame(s)
Data frame(s) will be saved (for further use) in directory .../DataSets/sales
Path qualification is different between Windows and Mac systems:
> file.name <- switch(Sys.info()[['sysname']],
    Windows = {"sales\\invoice_detailed.RData"},
    Darwin = {"sales/invoice_detailed.RData"})
> save(invoice_detailed, file = file.name)
After saving, whenever needed, the data frame can be loaded into an RStudio session with the load function
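The save/load round trip for a single data frame can be sketched as follows. The data frame invoice.sample and its values are made up, the target is a temporary directory, and file.path() is used here in place of the switch() on the operating system as one possible portable alternative.

```r
# Made-up data frame standing in for invoice_detailed
invoice.sample <- data.frame(invoiceno = c(1111, 1111),
                             productid = c(1, 2),
                             quantity  = c(50, 75))

# Portable path: file.path() inserts a separator valid on every platform
out.dir <- file.path(tempdir(), "sales")
dir.create(out.dir, showWarnings = FALSE)
file.name <- file.path(out.dir, "invoice_sample.RData")

save(invoice.sample, file = file.name)

# Remove the object, then restore it under its original name
rm(invoice.sample)
loaded <- load(file.name)
print(loaded)
print(invoice.sample)
```

Note that load() recreates the object under the name it had when saved and returns that name as a character vector.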
Close connections/drivers
After the import, the resources must be freed
Close all PostgreSQL connections
> for (connection in dbListConnections(drv)) {
    dbDisconnect(connection)
  }
Free all the resources on the driver
> dbUnloadDriver(drv)
Import data from remote PostgreSQL databases ( )
...for the moment it is impossible to externally (outside FEAA) access the database servers
Access Oracle databases through JDBC
Package ROracle was intended to provide access to Oracle databases
Unfortunately, package ROracle is now not available
The next example was inspired by http://www.r-bloggers.com/connecting-r-to-an-oracle-database-with-rjdbc/
As the name suggests, the solution requires dealing with some Java "things"
◦ Requirements: JDK/JRE previously installed
◦ Download the ojdbc jar from www.oracle.com (in my case, ojdbc6.jar)
◦ Set JAVA_HOME, set max. memory, and load the rJava library
◦ Sys.setenv(JAVA_HOME = '/path/to/java_home')
◦ on my Mac OS:
> Sys.setenv(JAVA_HOME = '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home')
> options(java.parameters = "-Xmx2g")
> install.packages("rJava")
Access Oracle databases through JDBC (cont.)
> .jinit()
> print(.jcall("java/lang/System", "S", "getProperty", "java.version"))
◦ classPath (just for the record)
> .jclassPath()
Load RJDBC package
> install.packages("RJDBC")
> library(RJDBC)
Create connection driver and open connection
> jdbcDriver <- JDBC(driverClass = "oracle.jdbc.OracleDriver",
    classPath = "/Users/admin/Downloads/ojdbc6.jar")
Import data from MongoDB
Import data from Cassandra
Import data from Hadoop
Read HTML tables from the web
Package needed: XML
> install.packages("XML")
> library(XML)
> myURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
> dfH...
Import XML files
Package needed: XML
> library(XML)
Web address of the xml file
> url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"
Import
> indata <- xml...
Reading HTML pages with multiple tables
Package needed: XML
> library(XML)
The web page contains text and a number of tables
> url.1 <- 'http://en.wikipedia.org/wiki/World_population'
> tbls.1 <- readH...
Save/export R data objects
Save a data frame as a .csv file
> write.csv(spss2, file = "spss2.csv")
Save a data frame as a tab delimited text file
> write.table(spss2, file = "spss2.txt",
    sep = "\t", fileEncoding = "UTF-8")
Save a data frame as an Excel (xlsx) file (requires package xlsx)
> write.xlsx(spss2, file = "spss2.xlsx", sheetName = "spss2")
> write.xlsx(echipe.4,
    file = "Centralizator_BD2_2013_I_1.xlsx",
    sheetName = "t4.echipe",
    row.names = FALSE, append ...
Save/export R data objects (cont.)
Save a data frame as a .dta file (requires package foreign)
> write.dta(spss2, file = "spss2.dta")
Save to binary R format (can save multiple datasets and R objects)
> save(invoice.details.ro,
    file = "invoice.details.ro.RData")
> save(states, spss2, dat.xls,
    file = "temp.RData")
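A quick way to check an export is to read the file back and compare it with the original. This is a minimal sketch with a made-up data frame (results) written to a temporary directory; saveRDS()/readRDS() are not used in the slides and are shown only as a base R alternative to save() for a single object.

```r
# Made-up data frame to export
results <- data.frame(team = c("alpha", "beta"), score = c(91.5, 78.0))

out.csv <- file.path(tempdir(), "results.csv")
out.txt <- file.path(tempdir(), "results.txt")
out.rds <- file.path(tempdir(), "results.rds")

# Text formats: comma separated and tab delimited, without row names
write.csv(results, file = out.csv, row.names = FALSE)
write.table(results, file = out.txt, sep = "\t", row.names = FALSE,
            fileEncoding = "UTF-8")

# Binary format for one object (alternative to save() with several objects)
saveRDS(results, out.rds)

# Read everything back to verify the exports
back.csv <- read.csv(out.csv)
back.txt <- read.delim(out.txt)
back.rds <- readRDS(out.rds)
print(back.csv)
```

Unlike load(), readRDS() returns the object itself, so it can be assigned to any name.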