03 Data Input Output

  • 8/19/2019 03 Data Input Output

    1/43

    Data Analysis & Data Science with R

    Data Input and Output (Import and Export)

    By Marin Fotache

    Al.I. Cuza University of Iași
    Faculty of Economics and Business Administration
    Department of Accounting, Information Systems and Statistics

    Scripts associated with this presentation

    R scripts:
    ◦ 03a_basics_of_data_input_output.R: http://1drv.ms/1DRMOTC
    ◦ 03b_intermediate_data_input_output: http://1drv.ms/1JKeKwi

    PostgreSQL scripts (for creating the DB to be imported in R; see script 03a...):
    ◦ 00a_1_creating_tables__sales.sql: http://1drv.ms/1JKeRbh
    ◦ 00a_2_populating_tables__sales.sql: http://1drv.ms/1JKeYU0
    ◦ 01_creare_bd_vinzari_PostgreSQL.sql: http://1drv.ms/1JKf522
    ◦ 02_populare_bd_vinzari_PostgreSQL.sql: http://1drv.ms/1JKfem5

    Scripts associated with this presentation (cont.)

    Oracle scripts (for creating the DB to be imported in R; see script 03b...):
    ◦ 01-01a_creating_tables__sales.sql: http://1drv.ms/1LBNImg
    ◦ 01-01a_ro_creare_bd_vinzari.sql: http://1drv.ms/1AqQ vg
    ◦ 01-01b_populating_tables__sales.sql: http://1drv.ms/1LBN

    Web sites with R tutorials for data input/output

    R Data Import/Export: http://cran.r-project.org/doc/manuals/r-release/R-data.html

    Beginner's guide to R: Hlist 02? l>%:/ v ; Dfc 0FJ7gzJ

    Web sites with R tutorials for data input/output (cont.)

    Importing Data Into R from Different Sources: http://www.r-bloggers.com/importing-data-into-r-from-different-sources/

    Data Import & Export in R: http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/index.cfm

    Reading data from the new version of

    Loading data into statistical packages

    Traditional solutions:
    ◦ Direct import from external data files (Excel, CSV, text files etc.) using their menus
    ◦ Save intermediate results from the data sources into common format files (XML, CSV, JSON) and then import these intermediate files into the package
    ◦ Create data sources using ODBC or JDBC

    Some more recent options:
    ◦ Customized (for the data source and the destination package) ETL procedures
    ◦ Connecting to special APIs or web data services which provide data sets in formats easy to import (e.g. Google Analytics)
    ◦ Import data from web server logs into NoSQL data stores
    ◦ Performing a database query on a database server directly from the statistical package

    Sources of Data in R (adapted from Kabacoff, 2011)

    [Figure: data sources that R can import from; recoverable labels: NoSQL data stores, Hadoop]

    Loading data sets stored within packages

    See previous presentation.
    Many packages include datasets, such as ggplot2; after a package is loaded, all of its datasets are available:

    > library(ggplot2)
    > str(diamonds)
    'data.frame':   53940 obs. of  10 variables:
     $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
     $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
     $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
     $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
     $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
     $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
     $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
     $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
     $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
     $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
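Packaged data sets can also be explored without installing anything, because the datasets package ships with R. The following is only a minimal sketch under that assumption; the choice of mtcars and iris is ours, not the presentation's.

```r
# List the data sets bundled with a package, then inspect one of them.
# Uses the 'datasets' package that ships with R (an assumption made so the
# example runs without ggplot2 installed).
available <- data(package = "datasets")$results[, "Item"]
length(available) > 0        # the package exposes many data sets

# Data sets of an attached package can be referenced directly by name
str(mtcars)

# data() can also copy a data set into a chosen environment
env <- new.env()
data("iris", package = "datasets", envir = env)
dim(env$iris)
```

Loading into a separate environment keeps the global workspace clean when only a quick look at a data set is needed.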

    Loading data objects saved in the workspace of a previous session

    Save the workspace associated with the current session (the workspace contains all of the data objects existing at a point in time):

    > ws.name <- paste("work_", Sys.Date(), ".RData", sep = "")
    > save.image(file = ws.name)

    Restore (load) a previously saved workspace (and all the data objects in the workspace):

    > load("work_2014-09-12.RData")

    When several workspaces have been saved, one can choose which one to load:

    > load(file.choose())
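The save and restore cycle can be tried safely without touching the real workspace. The sketch below is an illustration under stated assumptions: a private environment stands in for the global workspace, a temporary directory stands in for the working directory, and the object names and the dated file name pattern are made up.

```r
# Save a set of objects to a dated .RData file and restore them later.
# Assumption: a temporary directory and a private environment stand in for
# the real working directory and the global workspace.
ws <- new.env()
assign("sales", data.frame(id = 1:3, amount = c(10, 20, 30)), envir = ws)
assign("rate", 0.24, envir = ws)

ws.name <- file.path(tempdir(), paste0("work_", Sys.Date(), ".RData"))
save(list = ls(ws), envir = ws, file = ws.name)

# load() returns (invisibly) the names of the restored objects
restored <- new.env()
loaded.names <- load(ws.name, envir = restored)
sort(loaded.names)
```

Capturing the value returned by load() is a cheap way to see what a file actually contained before relying on it.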

    Data entered from the keyboard (1)

    The simplest method of data entry: function edit() launches a text editor that allows entering your data manually.

    If the data frame exists:

    > student_gi <- edit(student_gi)
    or
    > fix(student_gi)

    Data entered from the keyboard (2)

    If the data frame does not exist, follow two steps:
    ◦ 1. Create an empty data frame (or matrix) with the variable names and types you want to have in the final dataset.

    > mydata <- data.frame(age = numeric(0),
          gender = character(0),
          weight = numeric(0))

    ◦ 2. Invoke the text editor on this data object, enter data, and save the results back to the data object.

    > mydata <- edit(mydata)
    or
    > fix(mydata)
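Because edit() and fix() need an interactive session, the two steps can be mimicked in a script by appending rows in code. This is a sketch only; the two rows of values are invented for illustration.

```r
# Step 1: an empty data frame with typed columns (zero rows)
mydata <- data.frame(age = numeric(0),
                     gender = character(0),
                     weight = numeric(0),
                     stringsAsFactors = FALSE)

# Step 2 (non-interactive stand-in for edit()/fix()): append rows in code.
# The two rows below are made-up illustration values.
new.rows <- data.frame(age = c(25, 30),
                       gender = c("m", "f"),
                       weight = c(166, 115),
                       stringsAsFactors = FALSE)
mydata <- rbind(mydata, new.rows)
str(mydata)
```

Declaring the column types up front means later rows are checked against the intended structure instead of silently changing it.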

    Data entered from clipboard

    One can copy into the clipboard small sections of data in a table (e.g. a spreadsheet, a web HTML table) using Control-C (copy command).

    On Windows, command read.table handles clipboard data with a header row that is separated by tabs, and stores the data in a data frame (x):

    > x <- read.table(file = "clipboard", sep = "\t",
          header = TRUE)

    On Mac OS, the clipboard is read through pipe("pbpaste"):
    ◦ copy without header
    > x <- read.table(pipe("pbpaste"), sep = "\t",
          header = FALSE)
    ◦ copy with header
    > y <- read.table(pipe("pbpaste"), sep = "\t",
          header = TRUE)

    Import from local CSV/delimited text files

    read.table() reads a file in table format and saves it as a data frame:

    > mydataframe <- read.table(file, header = logical_value,
          sep = "delimiter", row.names = "name")

    ◦ file is a delimited ASCII file
    ◦ header is a logical value indicating whether the first row contains variable names (TRUE or FALSE)
    ◦ sep specifies the delimiter separating data values
    ◦ row.names is an optional parameter specifying one or more variables to represent row identifiers.
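The four arguments described above can be seen together in a small self-contained run. This is a sketch with invented data: the file name, column names, and values are assumptions chosen only to make the example runnable.

```r
# Write a small tab-delimited file, then read it back with read.table().
# File name, column names and values are made up for illustration.
path <- file.path(tempdir(), "grades.txt")
writeLines(c("id\tname\tscore",
             "s1\tAna\t9.5",
             "s2\tDan\t8.0"), path)

grades <- read.table(path,
                     header = TRUE,      # first row holds variable names
                     sep = "\t",         # tab is the delimiter
                     row.names = "id",   # use column 'id' as row identifiers
                     stringsAsFactors = FALSE)
grades
```

Note that the column named in row.names is consumed as row labels, so the resulting data frame has one column fewer than the file.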

    Data input from local delimited text files

    Data frame births2006 is located in directory DataSets/births2006.
    The delimiter in the source file is Tab (\t).
    The name of the file to be imported is births2006.txt.
    Qualifying the subdirectory differs slightly from one operating system to another:

    > switch(Sys.info()[['sysname']],
          Windows = {births2006 <- read.table(
              "births2006\\births2006.txt",
              fileEncoding = "UTF-8", header = TRUE ...
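A portable alternative to branching on the operating system is to build the path with file.path(), which joins components with a forward slash that R accepts on Windows as well. The sketch below assumes a temporary base directory and two invented data rows; only the directory and file names mirror the slide.

```r
# Build the path with file.path() instead of branching on the OS.
# The directory and file names mirror the slide; the temporary base
# directory and the two data rows are assumptions made for the example.
base.dir <- file.path(tempdir(), "DataSets")
dir.create(file.path(base.dir, "births2006"), recursive = TRUE,
           showWarnings = FALSE)
births.file <- file.path(base.dir, "births2006", "births2006.txt")
writeLines(c("month\tsex", "1\tF", "2\tM"), births.file)

births2006 <- read.table(births.file, header = TRUE, sep = "\t",
                         fileEncoding = "UTF-8",
                         stringsAsFactors = FALSE)
births2006
```

With this approach the same script runs unchanged on Windows, Mac OS, and Linux.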

    Data input from local delimited text files (cont.)

    When we are not sure about the file name, we can use the function file.choose() instead of the file name:

    > births2006.2 <- read.table(file.choose(),
          fileEncoding = "UTF-8", header = TRUE ...

    Data input from local CSV files

    Import one dataset from Irina Dan's Ph.D. thesis, concerning a study of using e-documents in companies.
    Data source is a tab delimited file: companyinfo.csv
    Current working directory is .../DataSets
    File to be imported is located in directory .../DataSets/IrinaDan

    > switch(Sys.info()[['sysname']],
          Windows = {comp <- read.table(
              "IrinaDan\\companyinfo.csv", header = TRUE ...

    Data input from local CSV files (cont.)

    Import a dataset from Dragos Cogean's Ph.D. thesis, which compares two cloud database services, Mongo and MySQL.
    Data sources are tab delimited files.
    The .txt (tab-delimited) file resides in directory \DataSets\DragosCogean.
    Notice the second version of function switch(): its result is assigned directly to the data frame.

    > InsertMongo_ALL <- switch(Sys.info()[['sysname']],
          Windows = {read.table(
              "DragosCogean\\InsertMongo_ALL.txt",
              fileEncoding = "UTF-8", header = TRUE ...
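The difference between the two styles of switch() is easiest to see side by side. In this sketch nothing is read from disk; the file name InsertMongo.txt is a hypothetical placeholder, and the unnamed last argument acts as the default branch for any non-Windows system.

```r
# Two ways of using switch() on the operating system name.
# 'DragosCogean' is taken from the slide; the file name is a hypothetical
# placeholder and nothing is read from disk here.
os <- Sys.info()[["sysname"]]

# Version 1: each branch performs the assignment itself
switch(os,
       Windows = { path.v1 <- "DragosCogean\\InsertMongo.txt" },
       { path.v1 <- "DragosCogean/InsertMongo.txt" })   # default branch

# Version 2: switch() returns the branch value, assigned once
path.v2 <- switch(os,
                  Windows = "DragosCogean\\InsertMongo.txt",
                  "DragosCogean/InsertMongo.txt")        # default branch
identical(path.v1, path.v2)
```

The second form keeps the target name in one place, which makes it harder for the branches to drift apart.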

    Data input from local CSV files (cont.)

    Import the Toyota Corolla second-hand cars data set (located in the ...\DataSets\ToyotaCorolla directory).
    Also notice the third version of function switch()

    > ...

    Data input from text file available on web

    Heart attack data set
    ◦ Description available at: http://courses.statistics.com/software/R/tables4R.htm
    ◦ The data set (as delimited text file) available at: http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt

    > heart.att <- read.table(
          "http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt",
          header = TRUE ...
    > head(heart.att)
      Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
    1       1     41041   F 122    0 4752.00  10  79
    2       2     41041   F 122    0 3941.00   6  34
    3       3     41091   F 122    0 3657.00   5  76
    4       4     41081   F 122    0 1481.00   2  80
    5       5     41091   M 122    0 1681.00   1  55

    Data input from CSV file available on web

    Smoking data set: 356 people polled on their smoking status (Smoke) and their socioeconomic status (SES).
    The data file contains only two columns, and when read, R interprets them both as factors (in R versions before 4.0.0, where stringsAsFactors defaults to TRUE):

    > smoker <- read.csv("http://www.cyclismo.org/tutorial/R/_static/smoker.csv")
    > head(smoker)
       Smoke  SES
    1 former High
    2 former High
    3 former High
    4 former High
    5 former High
    6 former High
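Whether text columns arrive as factors depends on the stringsAsFactors argument, whose default changed to FALSE in R 4.0.0, so it is safer to state it explicitly. The sketch below uses a made-up six-row sample in the same two-column layout rather than the downloaded file.

```r
# Whether text columns become factors depends on stringsAsFactors.
# Since R 4.0.0 the default is FALSE, so it is set explicitly here.
# The six rows are a made-up sample in the layout shown on the slide.
csv.text <- "Smoke,SES
former,High
former,High
never,Low
current,Middle
never,High
current,Low"

smoker.f <- read.csv(text = csv.text, stringsAsFactors = TRUE)
smoker.c <- read.csv(text = csv.text, stringsAsFactors = FALSE)

sapply(smoker.f, class)   # both columns are factors
sapply(smoker.c, class)   # both columns are character
table(smoker.f$Smoke, smoker.f$SES)
```

Reading as factors is convenient for cross-tabulation, while character columns are easier to clean before converting.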

    Download and read

    When a data set is large, instead of the direct import...

    > dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

    ... one can proceed in two steps:
    ◦ 1. download the file
    > download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data",
          destfile = "data.csv")
    trying URL 'http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data'
    Content type 'text/plain; charset=UTF-8' length 402355 bytes (392 Kb)
    opened URL
    downloaded 392 Kb
    ◦ 2. import the downloaded file
    > df.2 <- read.csv("data.csv")
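The two steps can be wrapped so that the download happens only once and later runs reuse the local copy. The helper fetch_then_read() below is hypothetical (it is not part of the original scripts) and is offered as a sketch of that idea.

```r
# Download a file only if it is not already on disk, then read it.
# fetch_then_read() is a hypothetical helper, not part of the original
# scripts; extra arguments are passed on to read.csv().
fetch_then_read <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    # mode = "wb" avoids line-ending translation on Windows
    status <- download.file(url, destfile = destfile, mode = "wb")
    if (status != 0) stop("download failed with status ", status)
  }
  read.csv(destfile, ...)
}

# Usage (needs network access; URL as given on the slide):
# df.2 <- fetch_then_read(
#   "http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data",
#   destfile = "data.csv")
```

Keeping the local copy also makes the analysis reproducible if the remote file later moves or changes.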

    Importing data from Excel files

    The "simplest" way to read an Excel (.xls/.xlsx) file is to save it in Excel as a text (tab delimited) or csv file and then to read it as in the previous slides.

    Loading .xls/.xlsx files directly into R is possible through various packages:
    ◦ RODBC
    ◦ gdata
    ◦ xlsReadWrite
    ◦ XLConnect
    ◦ xlsx

    Problems (on Windows systems) when loading some packages

    > install.packages("xlsx")
    > library("xlsx")
    Loading required package: rJava
    Loading required package: xlsxjars

    xlsx requires package rJava; on Windows systems this sometimes creates problems (e.g. R 64 bits on Windows).

    On my computer (Windows 64 bit), with a downloaded Java runtime in directory C:\Program Files\Java\jre7:

    > options(java.home = "C:\\Program Files\\Java\\jre7\\")

    Import data from local PostgreSQL databases (cont.)

    Launch the PostgreSQL query; the result of the query will be saved into data frame invoice_detailed:

    > invoice_detailed <- dbGetQuery(con,
          "SELECT i.invoiceNo, invoiceDate, i.customerId,
              customerName, place, countyName, region,
              comments, invoiceRowNumber, i_d.productId,
              productName, unitOfMeasurement, category,
              quantity, unitPrice, quantity * unitPrice AS amountWithoutV ...

    Import data from local PostgreSQL databases (cont.)

    The result of the query, saved into data frame invoice_detailed:

    > head(invoice_detailed, 3)
      invoiceno invoicedate customerid customername place
    1      1111  2012-08-01       1001 Client 1 SRL  Iasi
    2      1111  2012-08-01       1001 Client 1 SRL  Iasi
    3      1111  2012-08-01       1001 Client 1 SRL  Iasi
      countyname  region comments invoicerownumber productid
    1       Iasi Moldova     <NA>                1         1
    2       Iasi Moldova     <NA>                2         2
    3       Iasi Moldova     <NA>                3         5
      productname unitofmeasurement   category quantity
    1   Product 1            b500ml Category A       50
    2   Product 2                kg Category B       75
    3   Product 5              unit Category A       50
      unitprice amountwithoutvat amount
    1      1000            50000  62000
    2      1050            78750  88200
    3      7060           353000 437720

    Saving the data frame(s)

    Data frame(s) will be saved (for further use) in directory .../DataSets/sales.

    Path qualification differs between Windows and Mac systems:

    > file.name <- switch(Sys.info()[['sysname']],
          Windows = {"sales\\invoice_detailed.RData"},
          Darwin = {"sales/invoice_detailed.RData"})
    > save(invoice_detailed, file = file.name)

    After saving, whenever needed, the data frame can be loaded into an RStudio session with the load function.

    Close connections/drivers

    After the import, the resources must be freed.

    Close all PostgreSQL connections:

    > for (connection in dbListConnections(drv)) {
          dbDisconnect(connection)
      }

    Free all the resources on the driver:

    > dbUnloadDriver(drv)

    Import data from a remote PostgreSQL database ( )

    ...for the moment it is impossible to access the database servers externally (outside FEAA)

    Access Oracle databases through JDBC

    Package ROracle was intended to provide access to Oracle databases.
    Unfortunately, package ROracle is currently not available.
    The next example was inspired by http://www.r-bloggers.com/connecting-r-to-an-oracle-database-with-rjdbc/
    As the name suggests, the solution requires dealing with some Java "things":

    ◦ Requirements: JDK/JRE previously installed
    ◦ Download the ojdbc jar from www.oracle.com (in my case, ojdbc6.jar)
    ◦ Set JAVA_HOME, set max. memory, and load the rJava library
    ◦ Sys.setenv(JAVA_HOME = '/path/to/java_home')
    ◦ on my Mac OS:

    > Sys.setenv(JAVA_HOME = '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home')
    > options(java.parameters = "-Xmx2g")
    > install.packages("rJava")

    Access Oracle databases through JDBC (cont.)

    > .jinit()
    > print(.jcall("java/lang/System", "S", "getProperty", "java.version"))

    ◦ classPath (just for the record)

    > .jclassPath()

    Load the RJDBC package:

    > install.packages("RJDBC")
    > library(RJDBC)

    Create the connection driver and open the connection:

    > jdbcDriver <- JDBC(driverClass = "oracle.jdbc.OracleDriver",
          classPath = "/Users/admin/Downloads/ojdbc6.jar")

    Import data from MongoDB

    Import data from Cassandra

    Import data from Hadoop

    Read HTML tables from the web

    Package needed: XML

    > install.packages("XML")
    > library(XML)
    > myURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
    > dfH ...

    Import XML files

    Package needed: XML

    > library(XML)

    Web address of the xml file:

    > url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"

    Import:

    > indata <- xml ...

    Reading HTML pages with multiple tables

    Package needed: XML

    > library(XML)

    The web page contains text and a number of tables:

    > url.1 <- 'http://en.wikipedia.org/wiki/World_population'
    > tbls.1 <- readH ...

    Save/export R data objects

    Save a data frame as a .csv file:

    > write.csv(spss2, file = "spss2.csv")

    Save a data frame as a tab delimited text file:

    > write.table(spss2, file = "spss2.txt",
          sep = "\t", fileEncoding = "UTF-8")

    Save a data frame as an Excel (xlsx) file (requires package xlsx):

    > write.xlsx(spss2, file = "spss2.xlsx", sheetName = "spss2")
    > write.xlsx(echipe.4,
          file = "Centralizator_BD2_2013_I_1.xlsx",
          sheetName = "t4.echipe",
          row.names = FALSE, append ...
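A quick way to check a text export is to read the file back and compare it with the original. The sketch below covers only the CSV and tab-delimited cases (base R, no extra packages); the small spss2 data frame is invented to stand in for the real one, and row.names = FALSE is our addition to keep row numbers out of the files.

```r
# Export a data frame to CSV and to tab-delimited text, then read both back.
# 'spss2' here is a small made-up data frame standing in for the real one.
spss2 <- data.frame(id = 1:3,
                    city = c("Iasi", "Cluj", "Bacau"),
                    score = c(7.5, 9, 8.25),
                    stringsAsFactors = FALSE)

csv.file <- file.path(tempdir(), "spss2.csv")
txt.file <- file.path(tempdir(), "spss2.txt")

# row.names = FALSE keeps the row numbers out of the exported files
write.csv(spss2, file = csv.file, row.names = FALSE)
write.table(spss2, file = txt.file, sep = "\t",
            row.names = FALSE, fileEncoding = "UTF-8")

from.csv <- read.csv(csv.file, stringsAsFactors = FALSE)
from.txt <- read.table(txt.file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
all.equal(spss2, from.csv)
all.equal(spss2, from.txt)
```

Without row.names = FALSE the files gain an extra unnamed first column, which then has to be handled on import.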

    Save/export R data objects (cont.)

    Save a data frame as a .dta file (requires package foreign):

    > write.dta(spss2, file = "spss2.dta")

    Save to binary R format (can save multiple datasets and R objects):

    > save(invoice.details.ro,
          file = "invoice.details.ro.RData")
    > save(states, spss2, dat.xls,
          file = "temp.RData")