
Introduction to XML and RSS Data Management Issues


Types of data

Structured
Semi-structured

Structured Data

data is organized in entities (tables)
entities have attributes

Current Database World

– Structure: Relational Database Management System (RDBMS): everything is a table
– Query languages: SQL
– Software: MS Access, Oracle…

Example of a table (patients)

Example of a group of tables

MS Access Table Links

World of Web Data

– Easy document exchange
– Unstructured (or poorly structured) data: everything is a document
– No standard for query languages

World of Web Data

Example
– An organization A publishes financial data on its web pages (HTML), generated from its DBMS.
– A second organization B wants some financial analyses; it can access only the web data.

[Diagram labels: RDBMS, A, B, HTML]

Semi-structured Data

data can be of any type
not necessarily following any format
does not follow any rules
is not predictable
examples include
– text
– video
– sound
– images

Characteristics of Semi-Structured Data

structure is irregular: missing or additional attributes
parts of data lack structure, e.g., images
some may yield little structure, e.g., plain text

Semi-structured Data Definition

Data that is inherently self-describing and does not conform to an explicit and fixed schema is known as Semistructured Data
Data Structure is contained within data itself

Example of Semi-Structured Data

name: Peter Wood
email: [email protected], [email protected]
------------------------------------
name:
• first name: Mark
• last name: Levene
email: [email protected]
------------------------------------
name: Alex Smith
affiliation: StFX
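The irregularity in these three records can be made concrete with a minimal Python sketch. The e-mail values are placeholders (the addresses are redacted in the source), and the variable names are chosen only for illustration.

```python
# Three records with irregular structure, mirroring the example above.
# The e-mail values are placeholders; the real addresses are redacted.
records = [
    {"name": "Peter Wood",
     "email": ["first@example.org", "second@example.org"]},
    {"name": {"first name": "Mark", "last name": "Levene"},
     "email": "third@example.org"},
    {"name": "Alex Smith", "affiliation": "StFX"},
]

# Each record describes itself: its field names travel with the data.
for record in records:
    print(sorted(record))

# Fields carried by every record versus fields carried by only some.
common = set.intersection(*(set(r) for r in records))
optional = set.union(*(set(r) for r in records)) - common
print("common:", sorted(common), "optional:", sorted(optional))
```

Only `name` is shared by all three records, and even that field is a plain string in two records and a nested structure in the third, which is exactly what a fixed relational schema cannot express directly.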

IMDB – A Motivating Example

The Internet Movie Database is a classical example of a collection of semi-structured data
Although the information pertaining to different movies may be essentially similar, their structure may be different!
Let us consider an example movie database

An Example Movie Database

IMDB – Irregularity in Structure

• Different layout for movies and TV series
• Movie entries show Director, Writers and Stars
• TV entries show just Stars

Captain Phillips (Movie)

Lost (TV Series)

Traditional Data Management

[Diagram labels: Universe of Discourse, Model of the UoD, Database, Query]

Post-Internet Data Management

[Diagram labels: Universe of Discourse, Retrieval?, Data, Query]

XML – An Embodiment of Semi-structured Data

XML can be used to represent semistructured data

What is XML?

XML stands for EXtensible Markup Language
XML is a markup language much like HTML (tags)
XML was designed to describe data
XML tags are not predefined. You must define your own tags

The main difference between XML and HTML

XML and HTML were designed with different goals:
XML was designed to describe data and to focus on what data is.
HTML was designed to display data and to focus on how data looks.
It is important to understand that XML is not a replacement for HTML.

XML does not DO anything

Maybe it is a little hard to understand, but XML DOES NOT DO ANYTHING. XML is created to structure, store and send information.

The note below has a header and a message body. It also has sender and receiver information. But still, this XML document does not DO anything. It is just pure information wrapped in XML tags. Someone must write a piece of software to send, receive or display it.

<note>

<to>John</to>

<from>Mary</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>
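As a minimal sketch of the "piece of software" that has to do the work, the following Python snippet reads the note with the standard-library `xml.etree.ElementTree` module and turns it into a one-line message. The function name `render_note` and the output layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NOTE = """<note>
  <to>John</to>
  <from>Mary</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

def render_note(xml_text):
    """Turn the note document into a one-line, human-readable message."""
    note = ET.fromstring(xml_text)
    return "{} -> {}: [{}] {}".format(
        note.findtext("from"),
        note.findtext("to"),
        note.findtext("heading"),
        note.findtext("body"),
    )

message = render_note(NOTE)
print(message)
```

The XML itself stays passive; all of the behaviour (choosing which fields to show and in what order) lives in the program that reads it.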

XML is free and extensible

XML tags are not predefined. You must "invent" your own tags.
The tags used to mark up HTML documents and the structure of HTML documents are predefined (like <b>, <i>, <h1>, etc.).
XML allows authors to define their own tags and their own document structure.
The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.

XML is used to Exchange Data

With XML, data can be exchanged between incompatible systems.
In the real world, computer systems and databases contain data in incompatible formats. One of the most time-consuming challenges for developers has been to exchange data between such systems over the Internet.
Since XML data is stored in plain text format, XML provides a software- and hardware-independent way of sharing data.

XML can be used to Create new Languages

XML is the mother of WAP (Wireless Application Protocol) and WML (The Wireless Markup Language).
WML is used to mark up Internet applications for handheld devices like mobile phones.
MathML (for creating math formulas), CML (Chemical Markup Language), comicML (for describing comic characters) and musicXML (for musical notes) are written in XML.

XML and Microsoft Office

Starting with Office 2007, Microsoft changed the format of all Office documents.
They are all saved in XML format.
So a Word file is a ZIP folder holding a number of files including the text in XML format.
Advantages:
– Small file size
– Compatibility with other software
– Older Word files have the extension DOC, new ones use DOCX

XML Syntax

The syntax rules of XML are very simple and very strict. The rules are very easy to learn, and very easy to use.
Because of this, creating software that can read and manipulate XML is very easy to do.

All XML elements must have a closing tag

Elements (tags) are the basic building blocks of any XML document.
With XML, it is illegal to omit the closing tag.
In HTML some elements do not have to have a closing tag. The following code is legal in HTML:

<p>This is a paragraph

In XML all elements must have a closing tag, like this:

<par>This is a paragraph</par>
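The closing-tag rule can be observed with Python's standard-library parser. This is a small sketch; the helper name `parses` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True when the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

missing_close = parses("<par>This is a paragraph")
with_close = parses("<par>This is a paragraph</par>")
print(missing_close, with_close)
```

The parser rejects the first string outright instead of guessing where the element should end, which is the behaviour the rule describes.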

XML tags are case sensitive

Unlike HTML, XML tags are case sensitive.
With XML, the tag <Letter> is different from the tag <letter>.
Opening and closing tags must therefore be written with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>
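A short Python sketch of both points, using the standard-library parser: element names that differ only in case are different names, and a start tag closed by a differently cased end tag is rejected. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# <Letter> and <letter> parse to two different element names.
upper = ET.fromstring("<Letter/>").tag
lower = ET.fromstring("<letter/>").tag
names_differ = upper != lower

# A start tag and an end tag that differ in case do not match each other.
try:
    ET.fromstring("<Message>This is incorrect</message>")
    mixed_case_accepted = True
except ET.ParseError:
    mixed_case_accepted = False

same_case = ET.fromstring("<message>This is correct</message>")
print(names_differ, mixed_case_accepted, same_case.text)
```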

All XML elements must be properly nested

Improper nesting of tags makes no sense to XML.
In HTML some elements can be improperly nested within each other like this:

<b><i>This text is bold and italic</b></i>

In XML all elements must be properly nested within each other like this:

<bold><italic>This text is bold and italic</italic></bold>
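Proper nesting is what lets a parser build a tree. In this Python sketch the overlapping version is rejected, while the properly nested version yields a parent element with one child.

```python
import xml.etree.ElementTree as ET

# Overlapping tags cannot be turned into a tree, so parsing fails.
try:
    ET.fromstring("<b><i>This text is bold and italic</b></i>")
    overlapping_accepted = True
except ET.ParseError:
    overlapping_accepted = False

# Properly nested tags give a parent (<bold>) with one child (<italic>).
bold = ET.fromstring(
    "<bold><italic>This text is bold and italic</italic></bold>"
)
italic = bold[0]
print(overlapping_accepted, bold.tag, italic.tag, italic.text)
```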

All XML documents must have a root element (tag)

All XML documents must contain a single tag pair to define a root element.
All other elements must be within this root element.
All elements can have sub elements (child elements). Sub elements must be correctly nested within their parent element:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>
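The single-root rule can be checked the same way. In this sketch two sibling elements without an enclosing element are rejected, and wrapping them in one root makes the document acceptable. The element contents `a` and `b` are placeholders.

```python
import xml.etree.ElementTree as ET

# Two top-level elements: there is no single root, so parsing fails.
try:
    ET.fromstring("<child>a</child><child>b</child>")
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False

# The same two elements inside one root element parse without error.
root = ET.fromstring("<root><child>a</child><child>b</child></root>")
children = [child.text for child in root]
print(two_roots_accepted, root.tag, children)
```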

With XML, white space is preserved

With XML, the white space in your document is not truncated.
This is unlike HTML. With HTML, a sentence like this:

Hello              my name is John,

will be displayed like this:

Hello my name is John,

because HTML strips off the white space.
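A small Python sketch of the difference. The element name `greeting` is made up for illustration, and the last step only imitates, roughly, how an HTML renderer collapses runs of white space when it displays text.

```python
import xml.etree.ElementTree as ET

sentence = "Hello" + " " * 14 + "my name is John"
element = ET.fromstring("<greeting>" + sentence + "</greeting>")

# The XML parser hands back the text exactly as written, spaces included.
preserved = element.text

# Rough imitation of HTML display: runs of white space become one space.
collapsed = " ".join(preserved.split())
print(repr(preserved))
print(repr(collapsed))
```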

Element Naming

XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names must not start with a number or punctuation character
Names must not start with the letters xml (or XML or Xml ..)
Names cannot contain spaces
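The naming rules above can be approximated with a short Python check. This is a simplified, ASCII-only sketch and not the full name grammar of the XML specification; the function name `follows_naming_rules` and the exact pattern are assumptions. One deliberate detail: the underscore is accepted as a first character, since XML itself permits it.

```python
import re

# Simplified, ASCII-only pattern (an assumption for illustration): a letter
# or underscore first, then letters, digits, periods, underscores, hyphens.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def follows_naming_rules(name):
    """Check a candidate element name against the rules listed above."""
    if name.lower().startswith("xml"):
        return False  # names must not start with xml, XML, Xml, ...
    return _NAME.fullmatch(name) is not None

for candidate in ["first_name", "book-title", "2ndname", "xmlData", "first name"]:
    print(candidate, follows_naming_rules(candidate))
```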

Element Naming

Any name can be used, no words are reserved, but the idea is to make names descriptive
XML documents often have a corresponding database, in which fields exist corresponding to elements in the XML document. A good practice is to use the naming rules of your database for the elements in the XML documents.

Comments in XML

The syntax for writing comments in XML is similar to that of HTML.

<!-- This is a comment -->

XML Attributes

XML elements can have attributes in the start tag, just like HTML.
Attributes are used to provide additional information about elements.
In HTML (and also in XML) attributes provide additional information about elements:

<img src="computer.gif">
<a href="demo.asp">

XML Attributes

Attribute values must always be enclosed in quotes

<person sex="female">

XML Attributes Cont.

<?xml version="1.0" encoding="UTF-8"?>
<note date=12/11/2002>
  <to>John</to>
  <from>Mary</from>
</note>
------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<note date="12/11/2002">
  <to>John</to>
  <from>Mary</from>
</note>

The error in the first document is that the date attribute in the note element is not quoted.
The first line in the document is the XML declaration
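The effect of the missing quotes can be seen with Python's standard-library parser. This sketch leaves out the XML declaration for brevity; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

unquoted = "<note date=12/11/2002><to>John</to><from>Mary</from></note>"
quoted = '<note date="12/11/2002"><to>John</to><from>Mary</from></note>'

# The unquoted attribute value is a syntax error, so no tree is produced.
try:
    ET.fromstring(unquoted)
    unquoted_error = None
except ET.ParseError as exc:
    unquoted_error = str(exc)

# With quotes, the attribute is available on the element.
date = ET.fromstring(quoted).get("date")
print(unquoted_error)
print(date)
```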

Use of Elements vs. Attributes

Data can be stored in child elements or in attributes.
Take a look at these examples:

<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
------------------------------------
<person>
  <sex>female</sex>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

In the first example sex is an attribute. In the last, sex is a child element. Both examples provide the same information.
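Although both forms carry the same information, a program reads them differently. A minimal Python sketch with the standard-library parser:

```python
import xml.etree.ElementTree as ET

as_attribute = ET.fromstring(
    '<person sex="female">'
    "<firstname>Anna</firstname><lastname>Smith</lastname>"
    "</person>"
)
as_element = ET.fromstring(
    "<person><sex>female</sex>"
    "<firstname>Anna</firstname><lastname>Smith</lastname>"
    "</person>"
)

# An attribute is read from the element itself ...
sex_from_attribute = as_attribute.get("sex")
# ... while a child element is looked up and its text is read.
sex_from_element = as_element.findtext("sex")
print(sex_from_attribute, sex_from_element)
```

Whichever form a document uses, the reading code has to match it, so the choice is best made once and applied consistently.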

Errors in XML will stop the XML program

The World Wide Web Consortium (W3C) XML specification states that a program should not continue to process an XML document if it finds a well-formedness error. The reason is that XML software should be easy to write, and that all XML documents should be compatible.
With HTML it was possible to create documents with lots of errors (like when you forget an end tag). One of the main reasons that HTML browsers are so big and incompatible is that they have their own ways to figure out what a document should look like when they encounter an HTML error.
With XML this should not be possible.

XML and Web Browsers

Internet Explorer 5.0+, Google Chrome & Firefox support XML

Viewing XML Files

If you open an XML document in IE (or other browsers), it will display the document with color coded root and child elements. A plus (+) or minus sign (-) to the left of the elements can be clicked to expand or collapse the element structure.
If you want to view the raw XML source, you must select "View Source" from the browser menu.
If an erroneous XML file is opened, the browser will report the error.

Other Examples

Viewing some XML documents will help you get the XML feeling.
An XML CD catalog: This is some CD collection, stored as XML data.
An XML plant catalog: This is a plant catalog from a plant shop, stored as XML data.
A Simple Food Menu: This is a breakfast food menu from a restaurant, stored as XML data.

Why does XML display like this?

XML documents do not carry information about how to display the data.
Since XML tags are "invented" by the author of the XML document, browsers do not know if a tag like <table> describes an HTML table or a dining table.
Without any information about how to display the data, most browsers will just display the XML document as it is.

The XML Rules (Summary)

1. Single, unique root element
2. Matching open/close tags
3. Consistent capitalisation
4. Correctly nested elements
5. Attribute values enclosed in quotes

<?xml version="1.0"?>

<company id="4859">

<name>3Months.com</name>

<type>Web Development</type>

<address>

<street>Wakefield st</street>

<city>Wellington</city>

<country>New Zealand</country>

</address>

</company>

Authoring XML Documents

A basic XML document is an XML element that can, but might not, include nested XML elements.

Example:

<books>
  <book ISBN="123456789">
    <title>Second Chance</title>
    <author>Matthew Dunn</author>
  </book>
</books>

Use of XML and HTML together

This is pure data in XML file
This is a pure Format file to display the same data
View the result with Google Chrome or IE 6+

Converting Relational Database to XML

Example: Export the following data into XML and group books by store

Relational Database:

Store (sid, name, phone)
Book (bid, title, authors)
StoreBook (sid, bid, price, stock)

[Diagram: Store (sid, name, phone), Book (bid, title, authors), StoreBook (price, stock)]

Converting Relational Database to XML (Cont'd)

XML:

<store>
  <sid>123</sid>
  <name>Chapter</name>
  <phone>429-8976</phone>
  <book>
    <title>The Da Vinci Code</title>
    <authors>Dan Brown</authors>
    <bid>987</bid>
  </book>
  <book>…</book>
  …
</store>
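The export can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the rows are hypothetical (only store 123 and book 987 echo the example; the second store, the prices and the stock counts are invented), the function name `export_stores` is made up, and a `<stores>` wrapper element is added so that the output has a single root.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows for the three relations (values partly invented).
stores = [(123, "Chapter", "429-8976"),          # Store(sid, name, phone)
          (456, "Example Books", "555-0100")]
books = [(987, "The Da Vinci Code", "Dan Brown")]  # Book(bid, title, authors)
store_books = [(123, 987, 19.99, 5),             # StoreBook(sid, bid, price, stock)
               (456, 987, 17.50, 2)]

def export_stores(stores, books, store_books):
    """Build one <store> element per store, nesting its books inside it."""
    book_by_id = {bid: (title, authors) for bid, title, authors in books}
    root = ET.Element("stores")  # assumed wrapper to get a single root
    for sid, name, phone in stores:
        store = ET.SubElement(root, "store")
        ET.SubElement(store, "sid").text = str(sid)
        ET.SubElement(store, "name").text = name
        ET.SubElement(store, "phone").text = phone
        # Grouping step: follow StoreBook rows that belong to this store.
        for link_sid, bid, _price, _stock in store_books:
            if link_sid != sid:
                continue
            title, authors = book_by_id[bid]
            book = ET.SubElement(store, "book")
            ET.SubElement(book, "title").text = title
            ET.SubElement(book, "authors").text = authors
            ET.SubElement(book, "bid").text = str(bid)
    return root

root = export_stores(stores, books, store_books)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Like the example above, the sketch leaves price and stock out of the output. Note that the same book appears once under each store that sells it, so grouping by store trades the normalised relational layout for a nested, partly redundant one.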

Examples

Example of database
Example of database converted to XML

XML representation of a sample Movie Database

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<IMDb>
  <Movies>
    <Movie>
      <Title>The Notebook</Title>
      <Actor>Ryan Gosling</Actor>
      <Actor>Rachel McAdams</Actor>
      <Director>Nick Cassavetes</Director>
    </Movie>
    <Movie>
      <Title>300</Title>
      <Actor>Gerard Butler</Actor>
      <Actor>Lena Headey</Actor>
      <Director>Zack Snyder</Director>
    </Movie>
  </Movies>
  <TVShow>FRIENDS</TVShow>
  <TVShow>Seinfeld</TVShow>
</IMDb>
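A short Python sketch of querying this document with the standard-library parser. The XML declaration is omitted for brevity, and the variable names are illustrative. Movies and TV shows have different structure, so each is read with its own path.

```python
import xml.etree.ElementTree as ET

IMDB = """<IMDb>
  <Movies>
    <Movie>
      <Title>The Notebook</Title>
      <Actor>Ryan Gosling</Actor>
      <Actor>Rachel McAdams</Actor>
      <Director>Nick Cassavetes</Director>
    </Movie>
    <Movie>
      <Title>300</Title>
      <Actor>Gerard Butler</Actor>
      <Actor>Lena Headey</Actor>
      <Director>Zack Snyder</Director>
    </Movie>
  </Movies>
  <TVShow>FRIENDS</TVShow>
  <TVShow>Seinfeld</TVShow>
</IMDb>"""

root = ET.fromstring(IMDB)

# Movies carry nested fields: map each title to its list of actors.
cast = {
    movie.findtext("Title"): [actor.text for actor in movie.findall("Actor")]
    for movie in root.iter("Movie")
}

# TV shows are plain text elements directly under the root.
shows = [show.text for show in root.findall("TVShow")]
print(cast)
print(shows)
```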

Brief Introduction to RSS

RSS (Really Simple Syndication)

RSS is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts.
Users of RSS content use programs called feed "readers" or "aggregators": the user "subscribes" to a feed by supplying to their reader a link to the feed; the reader can then check the user's subscribed feeds to see if any of those feeds have new content since the last time it checked, and if so, retrieve that content and present it to the user.
RSS formats are specified in XML (a generic specification for data formats). RSS delivers its information as an XML file called an "RSS feed," "webfeed," "RSS stream," or "RSS channel".
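The reader's "check for new content" step might look like the following Python sketch. The feed is a hypothetical RSS 2.0 document embedded as a string (a real reader would download it from the subscribed link), and the function name `new_items` and the use of `guid` to recognise items already seen are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; titles and URLs are invented for illustration.
FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.org/</link>
    <description>A hypothetical feed</description>
    <item>
      <title>First story</title>
      <link>https://example.org/first</link>
      <guid>https://example.org/first</guid>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.org/second</link>
      <guid>https://example.org/second</guid>
    </item>
  </channel>
</rss>"""

def new_items(feed_xml, seen_guids):
    """Return (title, link) for items whose guid has not been seen before."""
    channel = ET.fromstring(feed_xml).find("channel")
    fresh = []
    for item in channel.findall("item"):
        guid = item.findtext("guid")
        if guid not in seen_guids:
            fresh.append((item.findtext("title"), item.findtext("link")))
            seen_guids.add(guid)  # remember it for the next check
    return fresh

seen = {"https://example.org/first"}   # items shown on an earlier check
first_check = new_items(FEED, seen)
second_check = new_items(FEED, seen)   # nothing new the second time
print(first_check, second_check)
```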

RSS Feed representation

On Web pages, web feeds (RSS) are typically linked with the word "Subscribe", an orange square, or a rectangle with the letters [image] or [image].
Many news aggregators such as msnbc.com publish subscription buttons for use on Web pages to simplify the process of adding news feeds.

Podcasting

A podcast is a media file that is distributed over the Internet using syndication feeds, for playback on portable media players and personal computers.
The term "podcast" is derived from Apple's portable music player, the iPod.
Though podcasters' web sites may also offer direct download or streaming of their content, a podcast is distinguished from other digital audio formats by its ability to be downloaded automatically, using software capable of reading feed formats such as RSS.

Podcasting

Podcasting is an automatic mechanism whereby multimedia computer files are transferred from a server to a client, which pulls down XML files containing the Internet addresses of the media files. In general, these files contain audio or video, but also could be images, text, PDF, or any file type.
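In RSS 2.0 the address of a media file is carried by an item's `enclosure` element. The following Python sketch pulls those addresses out of a hypothetical feed; the feed contents and the function name `media_files` are invented for illustration, and the actual download step is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical podcast feed; titles, URLs and sizes are invented.
PODCAST_FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <link>https://example.org/podcast</link>
    <description>A hypothetical podcast feed</description>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.org/media/ep1.mp3"
                 length="1234567" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.org/media/ep2.mp4"
                 length="7654321" type="video/mp4"/>
    </item>
  </channel>
</rss>"""

def media_files(feed_xml):
    """List (url, MIME type) for every enclosure in the feed."""
    root = ET.fromstring(feed_xml)
    return [(enc.get("url"), enc.get("type")) for enc in root.iter("enclosure")]

files = media_files(PODCAST_FEED)
for url, mime_type in files:
    print(mime_type, url)
```

A podcast client would run a step like this on each check of the feed and then fetch any address it has not downloaded yet.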

Example: StFX Podcast

XML Joke

Question: When should I use XML?
Answer: When you need a buzzword in your resume.