2/10/2021
Using Data Files: R with XML Files
Dr Bill Mihajlovic, 2021
Topics
• R & XML
– Installing & loading the XML package
– Loading an XML file into an R data frame
• Data structures & complex structure data files
File & OSFS
• Operating System File System (OSFS) is the OS software that manages:
– Storage devices, and
– Data delivery (streaming data) to application programs such as the R-console.
[Figure: application program code (software) connects through I/O to the operating system (software: file system and drivers), which connects through I/O to the hardware storage]
Questions: User Session
• When two entities interact (communicate two-way) we have a session. A session is a dynamic process that covers a time interval.
• Precisely what sort of session do we have here?
[Figure: the User issues a Command and receives Reports]
Answer: User Session
• When two entities interact (communicate two-way) we have a session. A session is a dynamic process that covers a time interval.
• Precisely what sort of session do we have here?
• A User-to-R (User2Rshell) session.
[Figure: the User issues a Command and receives Reports]
Running R Integrated Development Environment (IDE) & Command Line Interface (CLI) Shell
• Start the R-IDE developer tool (application), which includes the R-CLI shell.
• The R-CLI shell is embedded in both the R-IDE and the R-Console.
[Figure: the R-Console (all in memory) runs on top of the OS, which reaches the hardware storage through I/O, the file system, and drivers]
R & XML Data Files
• XML is both a markup language and a file format. Because the markup is standardized, file reader programs (parsers) can extract the data easily.
R ‐ XML Files
• XML is a file format for sharing both the data and its format (structure) on the World Wide Web, intranets, and elsewhere, using standard plain text (ASCII, or more generally Unicode such as UTF-8).
• XML stands for Extensible Markup Language. Like HTML, it contains markup tags.
• Unlike HTML, where the markup tags describe the structure of the page, in XML the markup tags describe the meaning of the data contained in the file.
XML Data File Format
• XML data files are ASCII text files that are easy to read and to exchange between different application programs.
[Figure: a Sender Program (XML creator, processor, parser) writes FileX.xml, an ASCII text file that holds data and metadata combined in the same file, and a Receiver Program (XML parser) reads it]
XML Languages (Standard & Custom)
• XML stands for Extensible Markup Language.
– XML is not one data description language but many languages.
• Some XML languages are standardized, such as SVG.
• Some XML languages are custom made to accommodate a given application program.
HTML & XML Files
• XML is similar to HTML (both are MLs).
– Both contain markup tags.
• XML and HTML are fundamentally different.
– HTML markup tags describe the structure of the rendered (displayed) page.
– XML markup tags describe the structure of the data records and the type of the individual record fields contained in the file.
[Figure: FileX.HTML is read by a Web Browser; FileY.XML is read by an XML Parser]
HTML & XML Files
• XML file data can be read in R using the "XML" package.
– This package has to be:
• Installed from the CRAN repository site of your choice, and
• Loaded into the memory code/instructions segment of R.
R ‐ XML Files
• Install the XML package and load all its functions into the CSR (the code segment of R).
> # Install package on the local storage (OSFS)
> install.packages("XML")
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cloud.r-project.org/bin/windows/contrib/3.5/XML_3.99-0.3.zip'
Content type 'application/zip' length 4246397 bytes (4.0 MB)
downloaded 4.0 MB

package ‘XML’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\User\AppData\Local\Temp\RtmpKUjscb\downloaded_packages
> # Load the package required to read XML files.
> library("XML")
> # Also load the other required package.
> library("methods")
>
Edit XML File
• Manually edit the following 8-record XML data file:
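As an alternative to typing the file in a text editor, the file can also be written from R itself. The following is a minimal sketch that writes only the first two records; the file name input_sample.xml and the use of a temporary directory are assumptions made here so that an existing input.xml is not overwritten.

```r
# Build a two-record sample of the XML data file as text lines.
xml_lines <- c(
  '<?xml version="1.0"?>',
  '<RECORDS>',
  '  <EMPLOYEE>',
  '    <ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY>',
  '    <STARTDATE>1/1/2012</STARTDATE><DEPT>IT</DEPT>',
  '  </EMPLOYEE>',
  '  <EMPLOYEE>',
  '    <ID>2</ID><NAME>Dan</NAME><SALARY>515.2</SALARY>',
  '    <STARTDATE>9/23/2013</STARTDATE><DEPT>Operations</DEPT>',
  '  </EMPLOYEE>',
  '</RECORDS>'
)

# Hypothetical file name, placed in the session's temporary directory.
sample_file <- file.path(tempdir(), "input_sample.xml")

# Ask the OS file system (OSFS) to store the text lines as a file.
writeLines(xml_lines, sample_file)

# Confirm that the file now exists on storage.
print(file.exists(sample_file))
```

The remaining six records follow the same pattern and can be appended before the closing RECORDS tag.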
R ‐ XML Files
• Read the XML file input.xml into the DSR (the data segment of R) as the “result” data object.
> # Give the input file name to the function.
> result <- xmlParse(file = "input.xml")
>
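When no file is at hand, xmlParse() can also parse XML held in a character string. The sketch below assumes the XML package is installed; the one-record string xml_text is an illustration made up here, not the contents of input.xml.

```r
library(XML)

# A tiny XML document held in memory as a single character string.
xml_text <- paste0(
  '<RECORDS>',
  '<EMPLOYEE><ID>1</ID><NAME>Rick</NAME></EMPLOYEE>',
  '</RECORDS>'
)

# asText = TRUE tells xmlParse() that the argument is XML content,
# not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# The parsed document is an in-memory tree; its root is <RECORDS>.
root <- xmlRoot(doc)
print(xmlName(root))
```

Parsing from a string is convenient for quick experiments in the R console, while the file-based call is the usual route for real data.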
R ‐ XML Files
• Print the content of the “result” data object (R-variable).
> # Give the input file name to the function.
> result <- xmlParse(file = "input.xml")
> # Print the result.
> print(result)
<?xml version="1.0"?>
<RECORDS>
  <EMPLOYEE>
    <ID>1</ID>
    <NAME>Rick</NAME>
    <SALARY>623.3</SALARY>
    <STARTDATE>1/1/2012</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>2</ID>
    <NAME>Dan</NAME>
    <SALARY>515.2</SALARY>
    <STARTDATE>9/23/2013</STARTDATE>
    <DEPT>Operations</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>3</ID>
    <NAME>Michelle</NAME>
    <SALARY>611</SALARY>
    <STARTDATE>11/15/2014</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>4</ID>
    <NAME>Ryan</NAME>
    <SALARY>729</SALARY>
    <STARTDATE>5/11/2014</STARTDATE>
    <DEPT>HR</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>5</ID>
    <NAME>Gary</NAME>
    <SALARY>843.25</SALARY>
    <STARTDATE>3/27/2015</STARTDATE>
    <DEPT>Finance</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>6</ID>
    <NAME>Nina</NAME>
    <SALARY>578</SALARY>
    <STARTDATE>5/21/2013</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>7</ID>
    <NAME>Simon</NAME>
    <SALARY>632.8</SALARY>
    <STARTDATE>7/30/2013</STARTDATE>
    <DEPT>Operations</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>8</ID>
    <NAME>Guru</NAME>
    <SALARY>722.5</SALARY>
    <STARTDATE>6/17/2014</STARTDATE>
    <DEPT>Finance</DEPT>
  </EMPLOYEE>
</RECORDS>
Wrangling with the XML File
• Get the Number of Nodes Present in the XML File
> # Extract the root node from the xml file.
> rootnode <- xmlRoot(result)
>
> # Find number of nodes in the root.
> rootsize <- xmlSize(rootnode)
>
> # Print the result.
> print(rootsize)
[1] 8
>
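The same counting idea extends one level down the tree. The sketch below assumes the XML package is installed and uses a three-record string (xml_text, invented here to keep the example self-contained) instead of input.xml; it counts the records, the fields in each record, and lists the field names.

```r
library(XML)

# Three records, written without blanks between tags.
xml_text <- paste0(
  '<RECORDS>',
  '<EMPLOYEE><ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY>',
  '<STARTDATE>1/1/2012</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>',
  '<EMPLOYEE><ID>2</ID><NAME>Dan</NAME><SALARY>515.2</SALARY>',
  '<STARTDATE>9/23/2013</STARTDATE><DEPT>Operations</DEPT></EMPLOYEE>',
  '<EMPLOYEE><ID>3</ID><NAME>Michelle</NAME><SALARY>611</SALARY>',
  '<STARTDATE>11/15/2014</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>',
  '</RECORDS>'
)
doc <- xmlParse(xml_text, asText = TRUE)
rootnode <- xmlRoot(doc)

# Number of EMPLOYEE records directly under the root.
record_count <- xmlSize(rootnode)

# Number of fields in each record (apply xmlSize to every child node).
fields_per_record <- xmlSApply(rootnode, xmlSize)

# Tag names of the fields in the first record.
field_names <- names(xmlChildren(rootnode[[1]]))

print(record_count)
print(fields_per_record)
print(field_names)
```

Checking that every record has the same number of fields is a quick sanity test before converting the file to a data frame.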
Wrangling with the XML File
• Get the First Node of the XML File
> # Print the result.
> print(rootnode[1])
$EMPLOYEE
<EMPLOYEE>
  <ID>1</ID>
  <NAME>Rick</NAME>
  <SALARY>623.3</SALARY>
  <STARTDATE>1/1/2012</STARTDATE>
  <DEPT>IT</DEPT>
</EMPLOYEE>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
>
Wrangling with the XML File
• Get Different Elements of a Node
> # Get the first element of the first node.
> print( rootnode[[1]][[1]] )
<ID>1</ID>
>
> # Get the fifth element of the first node.
> print( rootnode[[1]][[5]] )
<DEPT>IT</DEPT>
>
> # Get the second element of the third node.
> print( rootnode[[3]][[2]] )
<NAME>Michelle</NAME>
>
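Indexing returns whole nodes, tags included. To obtain only the text inside a node, the XML package offers xmlValue(), and xpathSApply() can collect one field from every record. The following sketch assumes the XML package is installed and again uses a three-record string (xml_text, invented here) rather than input.xml.

```r
library(XML)

# Three records, written without blanks between tags.
xml_text <- paste0(
  '<RECORDS>',
  '<EMPLOYEE><ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY>',
  '<STARTDATE>1/1/2012</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>',
  '<EMPLOYEE><ID>2</ID><NAME>Dan</NAME><SALARY>515.2</SALARY>',
  '<STARTDATE>9/23/2013</STARTDATE><DEPT>Operations</DEPT></EMPLOYEE>',
  '<EMPLOYEE><ID>3</ID><NAME>Michelle</NAME><SALARY>611</SALARY>',
  '<STARTDATE>11/15/2014</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>',
  '</RECORDS>'
)
doc <- xmlParse(xml_text, asText = TRUE)
rootnode <- xmlRoot(doc)

# Text content of the second element of the third node (tags stripped).
third_name <- xmlValue(rootnode[[3]][[2]])

# Text content of every NAME element, selected with an XPath expression.
all_names <- xpathSApply(doc, "//EMPLOYEE/NAME", xmlValue)

# Salaries arrive as text; convert them to numbers before doing arithmetic.
all_salaries <- as.numeric(xpathSApply(doc, "//EMPLOYEE/SALARY", xmlValue))

print(third_name)
print(all_names)
print(all_salaries)
```

The XPath route avoids hard-coded positions such as [[3]][[2]], so it keeps working if the order of fields inside a record changes.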
XML to Data Frame
• To handle the data in large files effectively, we read the data in the XML file as a data frame.
– Then we process the data frame for data analysis.
> # Convert the input xml file to a data frame.
> xmldataframe <- xmlToDataFrame("input.xml")
> print(xmldataframe)
  ID     NAME SALARY  STARTDATE       DEPT
1  1     Rick  623.3   1/1/2012         IT
2  2      Dan  515.2  9/23/2013 Operations
3  3 Michelle    611 11/15/2014         IT
4  4     Ryan    729  5/11/2014         HR
5  5     Gary 843.25  3/27/2015    Finance
6  6     Nina    578  5/21/2013         IT
7  7    Simon  632.8  7/30/2013 Operations
8  8     Guru  722.5  6/17/2014    Finance
>
• Because the data is now available as an R data frame object, the usual R functions can be used to manipulate the loaded file content.
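One practical point: xmlToDataFrame() delivers every column as text (or as factors, depending on the stringsAsFactors setting), so numeric and date columns usually need converting before analysis. The sketch below assumes the XML package is installed, uses a three-record string (xml_text, invented here) rather than input.xml, and assumes the dates are written as month/day/year.

```r
library(XML)

# Three records, written without blanks between tags.
xml_text <- paste0(
  '<RECORDS>',
  '<EMPLOYEE><ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY>',
  '<STARTDATE>1/1/2012</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>',
  '<EMPLOYEE><ID>2</ID><NAME>Dan</NAME><SALARY>515.2</SALARY>',
  '<STARTDATE>9/23/2013</STARTDATE><DEPT>Operations</DEPT></EMPLOYEE>',
  '<EMPLOYEE><ID>3</ID><NAME>Michelle</NAME><SALARY>611</SALARY>',
  '<STARTDATE>11/15/2014</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>',
  '</RECORDS>'
)
doc <- xmlParse(xml_text, asText = TRUE)

# Convert the parsed document to a data frame; keep text as character.
employees <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Give each column a suitable type.
employees$ID        <- as.integer(employees$ID)
employees$SALARY    <- as.numeric(employees$SALARY)
employees$STARTDATE <- as.Date(employees$STARTDATE, format = "%m/%d/%Y")

# Ordinary data frame operations now work as expected.
mean_salary <- mean(employees$SALARY)
it_staff    <- employees[employees$DEPT == "IT", ]

str(employees)
print(mean_salary)
print(it_staff$NAME)
```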
Question: File Format
• What precisely does a file format name mean?
Answer: File Format
• What precisely does a file format name mean?
• A file format name (e.g., CSV, XML, DOC) indicates the file's internal higher-level organizational code, that is, the structural layout of the lower-level encoded data elements (e.g., int, float, char).
Question: File Format
• What is the purpose of the Data Structures course?
Answer: File Format
• What is the purpose of the Data Structures course?
• The analysis or study of different reusable organizational codes (High-Level data Codes, or HLdC) of basic data elements, and the design of the applicable algorithms/operations.
Data Structures
• Elementary/basic/trivial/simple Low-Level data Codes (LLdC) are used to encode elementary/primitive/simple data values:
– Numerical Integer (int)
– Numerical Floating Point (FP float)
– Character (char)
– . . .
– Example: int binary data code (LLdC)
[Figure: int binary code layout with a sign bit (0 for +, 1 for -) and the binary encoded value]
Data Structures
• Compound/combined/aggregated Mid-Level Codes (MLdC) are used to organize data as:
– Arrays of single elements
– Records/C-structures made of fields of a fixed type/code
– Unions made of fields of variable type/code in the same field
– Arrays of records (table, data frame)
– Example: an array of records as a memory-compact MLdC data object.
[Figure: array of records r[0], r[1], . . ., r[n-1], each made of record fields]
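In R, the closest everyday counterpart of an array of records is the data frame. The sketch below is an illustration under two caveats: the variable names are made up here, and R stores a data frame column by column, so it mirrors only the logical layout of the figure, not the memory-compact layout.

```r
# Three records with three fields each (values taken from the XML example).
records <- data.frame(
  ID     = c(1L, 2L, 3L),
  NAME   = c("Rick", "Dan", "Michelle"),
  SALARY = c(623.3, 515.2, 611),
  stringsAsFactors = FALSE
)

# R indexes from 1, so the figure's r[0] is row 1 here.
first_record <- records[1, ]

# A single field of a single record: the SALARY field of the second record.
second_salary <- records$SALARY[2]

# n, the number of records in the array.
n <- nrow(records)

print(first_record)
print(second_salary)
print(n)
```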
Data Structures
• Complex High-Level Codes (HLdC) are used to encode/organize structured sets of LLdC and/or MLdC encoded data elements:
– Trees,
– Linked-Lists,
– Queues,
– Stacks, . . .
– Example: a linked list is an HLdC data object that is not memory compact.
• NEXT, HEAD & NULL are pointer metadata (data about data, or helping data) that help locate a particular elementary data sub-object.
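The roles of HEAD, NEXT and NULL can be sketched in R with nested lists. This is only an illustration: R has no explicit pointers, the field is named nxt because next is a reserved word in R, and all names are chosen here for the example.

```r
# HEAD refers to the first node; an empty list is represented by NULL.
head_node <- NULL

# Push the values 1, 2, 3 onto the front of the list.
# Each node holds a data value and a reference (nxt) to the following node.
for (v in c(1, 2, 3)) {
  head_node <- list(value = v, nxt = head_node)
}

# Traverse from HEAD, following nxt until NULL marks the end of the list.
visited <- c()
node <- head_node
while (!is.null(node)) {
  visited <- c(visited, node$value)
  node <- node$nxt
}

# The last value pushed is the first one visited.
print(visited)
```

Unlike an array of records, locating the k-th element here requires following k references from HEAD, which is the price of not being memory compact.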
Complex Structure Data Files
• A data file can contain:
– An array of text lines (e.g., ASCII text file)
– An array of DataBase (DB) records (RDBMS/SQL-DBMS data file)
– A tree of tagged XML or HTML (marked up) elements
• Marked up files (HTML or XML) are known as DOCUMENTS
– A forest or graph of linked data elements
– . . .
Other Data File Formats
• .csv and XML are not the only common data file formats. Other formats include:
– .tsv (tab-separated values),
– pipe-separated files,
– Microsoft Excel workbooks,
– JSON data.
• R’s built-in read.table() command can be made to read most separated value formats.
• Many deeper data formats have corresponding R packages:
– XLS/XLSX: http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
– JSON: http://cran.r-project.org/web/packages/rjson/index.html
– MongoDB: http://cran.r-project.org/web/packages/rmongodb/index.html
– SQL: http://cran.r-project.org/web/packages/DBI/index.html
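As a brief illustration of the read.table() point, the sketch below writes the same two records as a tab-separated and as a pipe-separated file and reads both back by changing only the sep argument. The temporary file names and the two-record sample are assumptions made for this example.

```r
# Write a small tab-separated file to a temporary location.
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("ID\tNAME\tSALARY",
             "1\tRick\t623.3",
             "2\tDan\t515.2"), tsv_file)

# Write the same records as a pipe-separated file.
psv_file <- tempfile(fileext = ".txt")
writeLines(c("ID|NAME|SALARY",
             "1|Rick|623.3",
             "2|Dan|515.2"), psv_file)

# read.table() handles both; only the field separator differs.
tsv_data <- read.table(tsv_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
psv_data <- read.table(psv_file, header = TRUE, sep = "|",
                       stringsAsFactors = FALSE)

print(tsv_data)
print(psv_data)
```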
Topics Covered
• R & XML
– Installing & loading the XML package
– Loading an XML file into an R data frame
• Data structures & complex structure data files
Homework
• Repeat all presented user sessions
• Repeat/rewrite all questions and answers
The End