The World Wide Web and What We Would Like to Do with It
• XML has a lot of hype surrounding it
• This week we discuss:
– Why XML is needed
– Basic technologies used together with XML
• In the next few weeks: challenges in using
XML
XML in One Slide
• Basically, XML looks like HTML.
• However, in XML, you can use any tag
names that you want
• Example:
<person>
  <name> Lisa Simpson</name>
  <tel> 02-828-1234 </tel>
  <tel> 054-470-777 </tel>
  <email> [email protected] </email>
</person>
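A minimal sketch of why free tag names matter: a program can ask for data by name. It uses Python's standard xml.etree.ElementTree module; the email address in the sample is a made-up placeholder, not the one from the slide.

```python
import xml.etree.ElementTree as ET

# Sample adapted from the slide; the email address is a made-up placeholder.
doc = """
<person>
  <name> Lisa Simpson</name>
  <tel> 02-828-1234 </tel>
  <tel> 054-470-777 </tel>
  <email> lisa@example.org </email>
</person>
"""

person = ET.fromstring(doc)

# Look up data by tag name instead of guessing from layout.
name = person.findtext("name").strip()
phones = [tel.text.strip() for tel in person.findall("tel")]

print(name, phones)
```

The tag names carry no built-in meaning; the program and the document author simply have to agree on them.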
Is that all? Big Deal?!
Example 1: A Homepage on the Web
Tom Sawyer's Homepage
Tom's Friends
Tom's Hobbies:
• Boating on the Mississippi River
• Chewing Gum
• Painting the Fence
Web Pages are Written in HTML
• HTML is a markup language
• An HTML page consists of tags with
attributes and data
• HTML describes the style of the page (e.g.,
color, font type, etc.)
<html> <body>
<h1>Tom Sawyer's Homepage</h1> <img src="tom.jpg">
Hi'ya all. Did you know that my best friend is <b>Huckleberry Finn</b>? Sometimes, I like <b>Becky Thatcher</b>?
<p> <font color = "red">
Here are some of my hobbies:
<ul>
<li> Boating on the Mississippi River
<li> Chewing gum
<li> Painting the fence
</ul>
</font>
If you want to discuss common interests, contact me at
<a href="mailto:[email protected]">[email protected]</a>
</body></html>
Automatically Using Information
• Tom Sawyer has a homepage. So do a lot of
other people. It would be nice to be able to
do the following things automatically (via a
computer program)
– Querying the Page: Find Tom Sawyer's email
address and the names of his friends
– Querying Similar Pages: Find people who
have interests in common with Tom Sawyer
Automatically Using Information
• Site Personalization: Tom Sawyer's interests
should be automatically recognized by sites
– When Tom Sawyer enters Amazon, he should get
"book recommendations" that match his interests
– When Tom Sawyer enters a site that sells food, he
should be told about sales on gum
– This should all happen without Tom having to tell
every site about his interests
Can We Automatically Use the Information?
• In order to perform the tasks described before, we
have to:
– Find web pages that describe people
– Extract the relevant information
• Problems:
– How can we know if a page describes a person?
– How can we know what to extract? (Everyone has their
own style for their homepage...)
– How can we "understand" the extracted information
(What parts of the page describe which information?)
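A rough sketch of the extraction problem, using Python's standard html.parser module. The page text and both addresses are invented for illustration. The program can find every mailto link, but nothing in the HTML says which address belongs to the page owner.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collects the address of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value and value.startswith("mailto:"):
                self.addresses.append(value[len("mailto:"):])


# Hypothetical homepage; the addresses are placeholders.
page = """<html><body>
<h1>Tom Sawyer's Homepage</h1>
My best friend is <b>Huckleberry Finn</b>.
Contact me at <a href="mailto:tom@example.org">tom@example.org</a>
or ask <a href="mailto:huck@example.org">Huck</a>.
</body></html>"""

collector = MailtoCollector()
collector.feed(page)
print(collector.addresses)
```

Both addresses come back in one flat list. Deciding which one is "Tom's email" requires guessing from the surrounding wording, which is exactly the problem listed above.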
Example 2: Weather Forecasting
National Weather Service: Weather Forecasting and Weather Alerts
Flood Alerts in Mississippi
Wouldn't it be great if…
Wouldn't it be great if Tom could get
automatic updates of weather problems in
Mississippi? It is dangerous to go boating if there are
floods…
Wouldn't it be great if…
Wouldn't it be great if Tom could get
automatic updates of important news
related to Mississippi?
He might want to choose a different
river to go boating…
Can these things be done?
• Once again, we need to FIND the relevant pages
and EXTRACT the relevant data
• HTML pages are constantly changing
• How can we figure out what data is relevant and
what the data is talking about automatically? (even
when the page changes)
• HTML describes only style and not meaning (or
semantics)
It is difficult (perhaps impossible) to perform these tasks
Two Basic Approaches
• If the information on the Web were neatly organized
in a huge database, these problems could be
solved.
But it's not. What should we do?
• AI, NLP Approach: Use smart techniques to
recognize information, e.g., recognize patterns
about how things are written
• DB Approach: Turn the Web into a “database”, by
writing it in XML
The Semantic Web
• The Semantic Web is a machine-understandable
Web
• The meaning of data (i.e., the semantics of data)
should be encoded together with the data
• Tim Berners-Lee, the inventor of the Web (by
putting together the ideas of hyper-text, TCP/IP,
DNS) is one of the main people behind the
Semantic Web
Main Technologies Needed
• XML: The syntax for marking up text with meaning
• RDF: Defines objects and relationships between
them
• OWL: Defines ontologies, which connect different
concepts (e.g., a car is an automobile, a car is a
type of vehicle)
• Web Services: Allow services given online to be
accessed programmatically
Here is a simplified version of how it could work
<Person>
<name>Thomas Sawyer</name>
<gender>Male</gender>
<mbox resource="mailto:[email protected]"/>
<picture resource="http://www.cs.huji.ac.il/~sarina/tom.jpg"/>
<speaks>English</speaks>
<interest resource="Boating on the Mississippi"/>
<interest resource="Chewing Gum" />
<knows>
<Person>
<name>Huckleberry Finn</name>
<mbox resource="mailto:[email protected]"/>
</Person>
</knows>
</Person>
Simplified version of the FOAF standard
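A sketch of how a program might query such a description, assuming the simplified element names shown above (they are the slide's simplification, not the actual FOAF vocabulary). The email address is a placeholder.

```python
import xml.etree.ElementTree as ET

# Shortened version of the slide's example; the address is a placeholder.
doc = """
<Person>
  <name>Thomas Sawyer</name>
  <mbox resource="mailto:tom@example.org"/>
  <interest resource="Boating on the Mississippi"/>
  <interest resource="Chewing Gum"/>
  <knows>
    <Person>
      <name>Huckleberry Finn</name>
    </Person>
  </knows>
</Person>
"""

tom = ET.fromstring(doc)

# "Find Tom Sawyer's email address and the names of his friends."
email = tom.find("mbox").get("resource")
friends = [p.findtext("name") for p in tom.findall("knows/Person")]

# "Find people who have interests in common" starts from this list.
interests = [i.get("resource") for i in tom.findall("interest")]

print(email, friends, interests)
```

Each query from the earlier slides becomes a lookup by name, because the markup says what each piece of data is.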
Is there XML on the Web? (1)
• The weather forecasting site exports its forecasts
as RSS (a standard for marking up news) - this
data can easily be used by a program
Is there XML on the Web? (2)
• Yahoo News (seen before) exports its news as
RSS - this data can easily be used by a program
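A small sketch of "easily used by a program", assuming an RSS 2.0 feed. The feed content and links below are invented; only the element names (rss, channel, item, title, link) come from RSS itself.

```python
import xml.etree.ElementTree as ET

# Invented sample feed in RSS 2.0 layout.
feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Weather Alerts (sample)</title>
    <item>
      <title>Flood warning for the Mississippi River</title>
      <link>http://example.org/alerts/1</link>
    </item>
    <item>
      <title>Sunny weekend ahead</title>
      <link>http://example.org/alerts/2</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)

# Keep only the items whose title mentions floods.
flood_alerts = [
    item.findtext("title")
    for item in root.iter("item")
    if "flood" in item.findtext("title").lower()
]

print(flood_alerts)
```

The keyword match is still crude, but finding the items and their titles no longer depends on how the page happens to be laid out.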
The Sky’s the Limit: Doctor’s Appointment
Source: “The Semantic Web”, Scientific American, May 2001
[Figure: software agents arranging a doctor’s appointment. Labels: Mom, Physician’s Agent, Lucy’s Agent, Pete’s Agent, required treatment, schedule appointment, Insurance Co., provider sites, rating, in-plan?, close-by?, specialist?, driving schedule]
Exchanging Data
• Problem: Many data sources, each of a
different type (different vendor), with a different
schema.
– How can the data be combined and used together?
– How can different companies collaborate on their
data?
– What (proprietary?) format should be used to
exchange the data?
Usage Scenario: Company Collaboration
• Several companies want to collaborate
• Need to share data
• Each company has a different type of database
system with a different schema
• Solution: Agree on an XML schema for exchange.
Import to and export from this schema
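A minimal sketch of the import/export idea. The exchange format here (a customers element holding customer elements with an id attribute and a name child) is a hypothetical schema chosen for illustration; each company would map its own tables to and from it.

```python
import xml.etree.ElementTree as ET


def export_customers(rows):
    """Turn one company's rows into the agreed (hypothetical) XML format."""
    root = ET.Element("customers")
    for row in rows:
        customer = ET.SubElement(root, "customer", id=str(row["id"]))
        ET.SubElement(customer, "name").text = row["name"]
    return ET.tostring(root, encoding="unicode")


def import_customers(xml_text):
    """Read the agreed format back into plain rows for another system."""
    root = ET.fromstring(xml_text)
    return [
        {"id": int(c.get("id")), "name": c.findtext("name")}
        for c in root.findall("customer")
    ]


rows = [{"id": 1, "name": "Tom Sawyer"}, {"id": 2, "name": "Becky Thatcher"}]
exchanged = export_customers(rows)
print(exchanged)
print(import_customers(exchanged))
```

Neither side needs to know the other's database system; both only need to know the exchange format.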
Web Site Development
• Web sites develop over time
• Important to separate style from data in order to allow changes to the site structure and appearance
• CSS separates style from data only in a limited way: the HTML will still contain tables, lists, etc.
• Using XML, we can store data alone
• Using XSL, this data can be translated into HTML
• The data can be translated differently as the site develops
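Write Once Use Everywhere
[Diagram: one XML source is translated by different XSL stylesheets into several output formats]
XML Stock Data -> XSL -> WML (hand-held devices)
XML Stock Data -> XSL -> HTML (web browser)
XML Stock Data -> XSL -> TEXT (Excel)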
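A sketch of the "one source, many outputs" idea. Python's standard library has no XSL processor, so two plain Python functions stand in for the XSL stylesheets here; the stock element names and values are invented.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented stock data; element and attribute names are assumptions.
stock_xml = """
<stocks>
  <stock symbol="AAA"><price>10.50</price></stock>
  <stock symbol="BBB"><price>7.25</price></stock>
</stocks>
"""


def to_html(root):
    """Stand-in for an XSL stylesheet that targets a web browser."""
    rows = "".join(
        f"<tr><td>{escape(s.get('symbol'))}</td>"
        f"<td>{escape(s.findtext('price'))}</td></tr>"
        for s in root.findall("stock")
    )
    return f"<table>{rows}</table>"


def to_text(root):
    """Stand-in for a stylesheet that emits tab-separated text."""
    return "\n".join(
        f"{s.get('symbol')}\t{s.findtext('price')}"
        for s in root.findall("stock")
    )


root = ET.fromstring(stock_xml)
print(to_html(root))
print(to_text(root))
```

The data is written once; changing the site's appearance means changing a renderer, not the data.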
HTML
• Used for publishing hypertext on the World Wide Web
• Designed to describe how a Web browser
should arrange text, images and push-buttons on a page
• Easy to learn, but does not convey structure
• Fixed tag set
HTML Example
<HTML>
<HEAD><TITLE>Welcome to the DBI course</TITLE></HEAD>
<BODY>
<H1>Introduction</H1>
<IMG SRC="dragon.gif" WIDTH="200" HEIGHT="150">
</BODY>
</HTML>
[Slide callouts: opening tag, closing tag, text (PCDATA), “bachelor” tag, attribute name, attribute value]
XML vs. HTML
• XML and HTML are “brothers”. They are both
special cases of SGML.
• HTML has specific tag and attribute names. These
are associated with a specific meaning
• XML can have any tag and attribute name. These
are not associated with any meaning
• HTML is used to specify visual style
• XML is used to specify meaning
[Diagram: HTML and XML shown as derived from SGML]
Terminology
The segment of an XML document between an
opening and a corresponding closing tag is called
an element
<person>
  <name> Bart Simpson </name>
  <tel> 02 – 444 7777 </tel>
  <tel> 051 – 011 022 </tel>
  <email> [email protected] </email>
</person>
[Slide callouts: element; element, a sub-element of; not an element]
XML Document is a Tree
• XML documents are abstractly modeled as trees,
as reflected by their nesting
• Sometimes, XML documents are graphs (by using
IDs and IDREFs)
person
  name: Bart Simpson
  tel: 02 – 444 7777
  tel: 051 – 011 022
  email: [email protected]
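A short sketch of the tree view, using Python's xml.etree.ElementTree. The recursive helper outline is a name made up for this example; it prints each element indented by its nesting depth, with any text as a leaf below it.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return the element tree as indented lines, one node per line."""
    lines = ["  " * depth + element.tag]
    text = (element.text or "").strip()
    if text:
        # Text content is a leaf hanging under its element.
        lines.append("  " * (depth + 1) + repr(text))
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


doc = "<person><name> Bart Simpson </name><tel> 02-444-7777 </tel></person>"
tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```

Nesting in the document corresponds directly to parent and child relations in the tree.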
Example XML Fragment
<addresses>
  <person>
    <name> Donald Duck</name>
    <tel> 04-828-1345 </tel>
    <tel> 04-828-1374 </tel>
    <email> [email protected] </email>
  </person>
  <person>
    <name> Miki Mouse</name>
    <tel> 03-426-1142 </tel>
  </person>
</addresses>
Another Example
An element may contain a mixture of sub-elements and PCDATA
<airline>
  <name> British Airways </name>
  <motto> World’s <dubious> favorite</dubious> airline </motto>
</airline>
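A sketch of how mixed content looks to a program, using Python's xml.etree.ElementTree. In that library, text before the first sub-element is stored in the parent's text, and text following a sub-element is stored in that sub-element's tail.

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World's <dubious>favorite</dubious> airline</motto>"
)
dubious = motto.find("dubious")

before = motto.text        # PCDATA before the sub-element
inside = dubious.text      # PCDATA inside the sub-element
after = dubious.tail       # PCDATA after the sub-element's closing tag

# itertext() walks all text pieces in document order.
full_text = "".join(motto.itertext())

print(repr(before), repr(inside), repr(after))
print(full_text)
```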
A Complete XML Document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE addresses SYSTEM "http://www.addbook.com/addresses.dtd">
<addresses>
  <person>
    <name>Lisa Simpson</name>
    <tel> 02-828-1234 </tel>
    <tel> 054-470-777 </tel>
    <email> [email protected] </email>
  </person>
</addresses>
[Slide callouts: Required, Optional]
Attributes
• An opening tag may contain attributes
• These are typically used to describe the
contents of an element
<entry>
  <word language="en"> cheese</word>
  <word language="fr"> fromage</word>
  <word language="ro"> branza </word>
  <meaning> A food made … </meaning>
</entry>
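A sketch of reading attributes from a program with Python's xml.etree.ElementTree, using a shortened copy of the dictionary entry above.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring(
    """<entry>
  <word language="en">cheese</word>
  <word language="fr">fromage</word>
  <word language="ro">branza</word>
</entry>"""
)

# The attribute describes the content: map language code -> word.
by_language = {w.get("language"): w.text for w in entry.findall("word")}

# Attributes can also be used to select one element directly.
french = entry.find("word[@language='fr']").text

print(by_language, french)
```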
When to Use Attributes
• It’s not always clear when to use attributes
<person ssno="123 4589">
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>

<person>
  <ssno> 123 4589 </ssno>
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>
General Rule:
• Use an element if you need to nest data
• Use an attribute for “IDs”, i.e., identifying data
Rules for XML (1)
• XML is order-sensitive, i.e., the following are
different:
<entry>
  <word language="en"> cheese</word>
  <word language="fr"> fromage</word>
</entry>
<entry>
  <word language="fr"> fromage</word>
  <word language="en"> cheese</word>
</entry>
• XML is case-sensitive, i.e., the following are
different: <person>, <Person>, <PERSON>
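A sketch that checks both rules with Python's xml.etree.ElementTree. The people wrapper element in the second part is made up so that the three differently cased tags can sit in one document.

```python
import xml.etree.ElementTree as ET

# Order: the two entries hold the same words in a different order.
first = ET.fromstring(
    '<entry><word language="en">cheese</word>'
    '<word language="fr">fromage</word></entry>'
)
second = ET.fromstring(
    '<entry><word language="fr">fromage</word>'
    '<word language="en">cheese</word></entry>'
)
first_words = [w.text for w in first]
second_words = [w.text for w in second]

# Case: three tags that differ only in case are three different names.
people = ET.fromstring("<people><person/><Person/><PERSON/></people>")
lowercase_matches = len(people.findall("person"))

print(first_words, second_words, lowercase_matches)
```

A program that reads the children in sequence sees two different documents, and a lookup for person does not match Person or PERSON.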
Rules for XML (2)
• Tags come in pairs: <date> ... </date>
• They must be properly nested. Which of the following are good?
– <date> ... <day> ... </day> ... </date>
– <date> ... <day> ... </date> ... </day>
– <date> ... <day> ... </day> </Date>
• There is a special shortcut for tags that have no text in between them (bachelor tags):
– <person fname="Sam" lname="Iam" />
– <person fname="Sam" lname="Iam"></person>
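One way to check the answers is to hand each candidate to an XML parser. This sketch uses Python's xml.etree.ElementTree; the helper name is_well_formed is made up, and the "..." placeholders are replaced by a digit so the snippets are concrete.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


candidates = [
    "<date><day>1</day></date>",   # properly nested
    "<date><day>1</date></day>",   # closing tags cross
    "<date><day>1</day></Date>",   # closing tag differs in case
]
results = [is_well_formed(c) for c in candidates]
print(results)

# The shortcut form and the open/close pair describe the same element.
short = ET.fromstring('<person fname="Sam" lname="Iam" />')
expanded = ET.fromstring('<person fname="Sam" lname="Iam"></person>')
same_element = (
    short.tag == expanded.tag
    and short.attrib == expanded.attrib
    and len(short) == len(expanded) == 0
)
print(same_element)
```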
Rules for XML (3)
• There should be exactly one top-level element.
This element is also called the root element
• Which of the following is legal?
<?xml version="1.0"?>
<Question> Is this legal? </Question>

<?xml version="1.0"?>
<Question> Is this legal? </Question>
<Answer> You tell me. </Answer>
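A sketch that lets a parser answer the question, again with Python's xml.etree.ElementTree. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

one_root = '<?xml version="1.0"?>\n<Question> Is this legal? </Question>'
two_roots = (
    '<?xml version="1.0"?>\n'
    "<Question> Is this legal? </Question>\n"
    "<Answer> You tell me. </Answer>"
)

# A single top-level element parses and becomes the root.
root = ET.fromstring(one_root)

# A second top-level element makes the parser give up.
error = None
try:
    ET.fromstring(two_roots)
except ET.ParseError as exc:
    error = exc

print(root.tag)
print(error)
```

Wrapping both elements in one enclosing element would make the second document legal.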
Well-Formed Documents
• A document is well-formed if it
– obeys all the above rules, and in addition
– does not repeat an attribute within a tag, i.e., the
following is illegal:
<a val='12' val='13'> … </a>
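A sketch showing that a parser rejects the repeated attribute, using Python's xml.etree.ElementTree. The helper first_error and the attribute name other are made up for the example.

```python
import xml.etree.ElementTree as ET


def first_error(text):
    """Return the parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Same attribute name twice in one tag: not well-formed.
repeated = first_error("<a val='12' val='13'> ... </a>")

# Two different attribute names: fine.
distinct = first_error("<a val='12' other='13'> ... </a>")

print(repeated)
print(distinct)
```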