From Semistructured Data to XML: Migrating The Lore Data Model and Query Language Roy Goldman, Jason McHugh, Jennifer Widom Stanford University http://www-db.stanford.edu/lore/

Introduction

• Lore
  – Originally a DBMS designed specifically for semistructured data
  – Semistructured data models and XML share many similarities
  – Migrating Lore to work with XML
    • Modifications to data model
    • Changes to query language
    • Changes to DataGuides

OEM (Object Exchange Model)

• Lore’s original data model

• All entities are atomic or complex objects

• Each object has a unique object identifier (oid)

• Atomic objects contain a value from one of the atomic types (integer, real, string, etc.)

• Complex objects are sets of <label, subobject> pairs

• Can be thought of as a labeled directed graph
  – objects are nodes
  – complex objects have labeled outgoing edges
  – atomic objects contain their value
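
A minimal Python sketch of the OEM description above, not Lore's actual implementation. The class names, the `follow` helper, the oids, and the `Member`/`Name` labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class AtomicObject:
    oid: str
    value: Union[int, float, str]  # a value from one of the atomic types


@dataclass
class ComplexObject:
    oid: str
    # A set of (label, subobject oid) pairs; OEM imposes no order on them.
    children: set = field(default_factory=set)


def follow(db, oid, label):
    """Return the oids reached from `oid` along outgoing edges labeled `label`."""
    obj = db[oid]
    if isinstance(obj, AtomicObject):
        return set()  # atomic objects have no outgoing edges
    return {child for (lbl, child) in obj.children if lbl == label}


# Hypothetical database: objects are nodes, labeled pairs are edges.
db = {
    "&1": ComplexObject("&1", {("Member", "&2"), ("Member", "&3")}),
    "&2": ComplexObject("&2", {("Name", "&4")}),
    "&3": ComplexObject("&3", {("Name", "&5")}),
    "&4": AtomicObject("&4", "Jeff Ullman"),
    "&5": AtomicObject("&5", "Jennifer Widom"),
}

members = follow(db, "&1", "Member")
names = sorted(db[n].value for m in members for n in follow(db, m, "Name"))
```

Using a set for the label/subobject pairs reflects the fact that OEM has no notion of order among subobjects.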

Differences between XML and OEM

• XML has attributes

• XML is ordered, OEM is not

• XML does not directly support graph structure
  – Uses special attribute types to encode graph structure
  – Example:

    <Person Id='P1' Name='Jeff Ullman' Colleague='P2'/>
    <Person Id='P2' Name='Jennifer Widom' Colleague='P1'/>
    <Publication Title='A First Course In Database Systems' Author='P1 P2'/>

    Attribute Id is of type ID, Colleague is of type IDREF, and Author is of type IDREFS

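[Figure: graph of the example. The Jeff Ullman and Jennifer Widom nodes point to each other through Colleague edges, and the publication node points to both through Author edges.]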
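
The encoding above can be turned back into graph edges along the following lines. This is a sketch under stated assumptions: the enclosing `<DB>` root element is added only so the fragment parses as one document, and the attribute types are hard-coded because Python's `xml.etree.ElementTree` does not expose DTD attribute types.

```python
import xml.etree.ElementTree as ET

# The three elements from the example, wrapped in a hypothetical <DB> root.
DOC = """<DB>
  <Person Id='P1' Name='Jeff Ullman' Colleague='P2'/>
  <Person Id='P2' Name='Jennifer Widom' Colleague='P1'/>
  <Publication Title='A First Course In Database Systems' Author='P1 P2'/>
</DB>"""

# Assumption: in a real system these types would come from the DTD.
ID_ATTR = "Id"                      # type ID
REF_ATTRS = {"Colleague", "Author"}  # types IDREF and IDREFS

root = ET.fromstring(DOC)

# Index every element that carries an ID so references can be resolved.
by_id = {e.get(ID_ATTR): e for e in root.iter() if e.get(ID_ATTR) is not None}

edges = []
for elem in root.iter():
    # Name the source node by its Id, or by its Title when it has no Id.
    source = elem.get(ID_ATTR) or elem.get("Title")
    for name, value in elem.attrib.items():
        if name in REF_ATTRS:
            # IDREFS is a whitespace-separated list; IDREF is the one-item case.
            for ref in value.split():
                edges.append((source, name, ref))

colleague_of_p1 = by_id["P2"].get("Name")
```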

Literal vs. Semantic Data Model

• Should an XML data model be a literal tree corresponding to XML’s text representation? (where IDREF(S) are nothing but string attributes)

• Or should it be a graph that includes all the intended links? (preserving the semantic graph structure)

• It should be... BOTH!
  – Both literal and semantic modes should be supported
  – The user or application can select between the two

Lore’s XML Data Model

• An XML element is a pair <eid, value>

• eid is a unique element identifier

• value is either an atomic text string or a complex value containing the following four components:
  – A string-valued tag corresponding to the XML tag for that element
  – An ordered list of attribute-name/atomic-value pairs (attribute-name is a string, atomic-value has an atomic type)
  – An ordered list of crosslink subelements of the form <label, eid> where label is a string. Crosslink subelements are introduced via an attribute of type IDREF(S)
  – An ordered list of normal subelements of the form <label, eid> where label is a string. Normal subelements are introduced via lexical nesting within an XML document
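
The four components can be written down as a small pair of Python record types. This is only a sketch of the definition above; the class and field names and the eids are assumptions, and attribute values are kept as strings for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class ComplexValue:
    tag: str                                                          # XML tag
    attributes: List[Tuple[str, str]] = field(default_factory=list)   # (name, value), ordered
    crosslinks: List[Tuple[str, str]] = field(default_factory=list)   # (label, eid), from IDREF(S)
    subelements: List[Tuple[str, str]] = field(default_factory=list)  # (label, eid), from nesting


@dataclass
class Element:
    eid: str
    value: Union[str, ComplexValue]  # atomic text string or complex value


# The two Person elements from the earlier example, with hypothetical eids.
store = {
    "&1": Element("&1", ComplexValue(
        tag="Person",
        attributes=[("Id", "P1"), ("Name", "Jeff Ullman"), ("Colleague", "P2")],
        crosslinks=[("Colleague", "&2")],
    )),
    "&2": Element("&2", ComplexValue(
        tag="Person",
        attributes=[("Id", "P2"), ("Name", "Jennifer Widom"), ("Colleague", "P1")],
        crosslinks=[("Colleague", "&1")],
    )),
}

# Follow the Colleague crosslink from &1 and read the target's Name attribute.
label, target_eid = store["&1"].value.crosslinks[0]
target_name = dict(store[target_eid].value.attributes)["Name"]
```

Lists rather than sets are used for all three collections because, unlike OEM, the XML model is ordered.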

XML Document/Graph Example

• eids appear within nodes (&1, &2, etc.)

• Attributes appear within brackets next to the nodes

• Two types of edges:
  – Normal subelement edges labeled with destination subelement’s tag (solid line)
  – Crosslink edges labeled with the attribute name that introduced the link (dashed line)

• Semantic vs. Literal:
  – In semantic mode, omit attributes of type IDREF(S)
  – In literal mode, omit crosslink edges
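
The two modes can be read as two filters over the same stored element. The sketch below assumes a plain-dictionary representation and a hypothetical `view` function; it is not how Lore itself exposes the choice.

```python
def view(element, mode, idref_attrs):
    """Return the (attributes, edges) visible for `element` in the given mode."""
    if mode == "semantic":
        # Hide IDREF(S) attributes; the crosslink edges carry that information.
        attrs = [(n, v) for n, v in element["attributes"] if n not in idref_attrs]
        edges = element["subelements"] + element["crosslinks"]
    elif mode == "literal":
        # Keep every attribute as a plain string; drop the crosslink edges.
        attrs = list(element["attributes"])
        edges = list(element["subelements"])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return attrs, edges


# The first Person element from the earlier example, with hypothetical eids.
person = {
    "tag": "Person",
    "attributes": [("Id", "P1"), ("Name", "Jeff Ullman"), ("Colleague", "P2")],
    "crosslinks": [("Colleague", "&2")],
    "subelements": [],
}

semantic = view(person, "semantic", idref_attrs={"Colleague", "Author"})
literal = view(person, "literal", idref_attrs={"Colleague", "Author"})
```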

Migrating Lorel (Lore’s query language)

• Distinguishing between attributes and subelements
  – Lorel uses path expressions
    • A sequence of labels such as DBGroup.Member.Project.Title
    • Can also contain wildcards and regular expressions
  – Path expression qualifiers differentiate between attributes and subelements
    • Placing a '>' before a label matches subelements only
    • Placing a '@' before a label matches attributes only
    • Absence of qualifier means match both
  – Examples:
    • DBGroup.Member.>Name will match Name elements that are subelements of DBGroup.Member elements
    • DBGroup.Member.@Name will match Name attributes of DBGroup.Member elements
    • DBGroup.Member.Name will match both
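
The effect of the qualifiers can be imitated with a small evaluator over Python's `xml.etree.ElementTree`. This is a sketch, not Lorel: it handles plain labels only (no wildcards or regular expressions), and the `evaluate` function and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: one Member stores Name as an attribute, the other as a subelement.
DOC = """<DBGroup>
  <Member Name="Jennifer Widom"/>
  <Member><Name>Roy Goldman</Name></Member>
</DBGroup>"""


def evaluate(root, path):
    """Match a dotted path; '>' means subelements only, '@' means attributes only."""
    first, *rest = path.split(".")
    if first != root.tag:
        return []
    current = [root]
    matches = [("element", root.text)]
    for step in rest:
        qualifier = step[0] if step[0] in "@>" else ""
        label = step[1:] if qualifier else step
        matches, next_elements = [], []
        for elem in current:
            if qualifier != ">" and label in elem.attrib:
                matches.append(("attribute", elem.attrib[label]))
            if qualifier != "@":
                for child in elem:
                    if child.tag == label:
                        matches.append(("element", child.text))
                        next_elements.append(child)
        # Only elements can be extended by later steps; attributes are leaves.
        current = next_elements
    return matches


root = ET.fromstring(DOC)
only_elements = evaluate(root, "DBGroup.Member.>Name")
only_attributes = evaluate(root, "DBGroup.Member.@Name")
both = evaluate(root, "DBGroup.Member.Name")
```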

Migrating Lorel (continued...)

• Comparisons
  – How do we compare two different things? (for example, comparing constants with attribute values)
    • All XML components are treated as atomic values...

• Functions that transform elements into strings:
  – Flatten(e) : Ignoring all tags, recursively serialize all text values in the subtree rooted at element e
  – Concatenate(e) : Concatenates all immediate text children of element e (subelements are ignored)
  – Tag(e) : Returns the XML tag of element e
  – Eid(e) : Returns the eid of element e as a string
  – XML(e) : Transforms the graph, starting with element e, into an XML document

• Default Semantics (when no functions are specified):
  – atomic (Text) element : the text itself
  – elements with no attributes and only one or more Text elements as children : concatenation of the children’s text values
  – all others : the element’s eid represented as a string
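
A rough Python analogue of three of these functions and of the default coercion, built on `xml.etree.ElementTree`. Several things here are assumptions: ElementTree stores text in `.text`/`.tail` rather than as separate Text elements, eids are simulated with a dictionary, and the sample markup (an `<i>` element inside the title) is invented to show the difference between Flatten and Concatenate.

```python
import xml.etree.ElementTree as ET


def flatten(e):
    """Ignore tags and serialize all text in the subtree rooted at e."""
    return "".join(e.itertext())


def concatenate(e):
    """Join only the immediate text children of e; subelements are skipped."""
    return (e.text or "") + "".join(child.tail or "" for child in e)


def tag(e):
    return e.tag


def default_value(e, eids):
    """Coerce e to a string when no function is specified."""
    if not e.attrib and len(e) == 0 and e.text:
        return e.text  # no attributes, only text children
    return eids[e]     # everything else compares by eid


root = ET.fromstring(
    "<Publication><Title>A First Course In <i>Database</i> Systems</Title></Publication>"
)
title = root[0]
italic = title[0]

# Simulated eids, assigned in document order: &1, &2, &3.
eids = {e: f"&{n}" for n, e in enumerate(root.iter(), start=1)}

flat = flatten(title)
concat = concatenate(title)
```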

Migrating Lorel (continued...)

• Range qualifiers
  – The expression [range] can be optionally applied to any path expression component or variable
    • Example: select y from DBGroup.Member x, x.Office[1-2] y
      – returns the first two Office subelements of every group member
    • Example: select y[1-2] from DBGroup.Member x, x.Office y
      – returns the first two Office subelements over ALL members

• Order-by clause
  – Query results are ordered lists of eids that identify the elements selected by the query (attributes are coerced into elements)
  – order-by-document-order orders results based on original XML document
    • Newly constructed elements are placed at the end of the document order with no specified order among them
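
The difference between the two range-qualifier examples comes down to where the range is applied: inside each member's Office list, or on the combined result. A small Python sketch with made-up placeholder data (`take_range` is a hypothetical helper, not a Lorel construct):

```python
def take_range(items, first, last):
    """Select positions first..last, 1-based and inclusive, like [1-2]."""
    return items[first - 1:last]


# Hypothetical members with their Office subelements in document order.
members = [
    ("Member A", ["A1", "A2", "A3"]),
    ("Member B", ["B1"]),
    ("Member C", ["C1", "C2"]),
]

# x.Office[1-2] y : the range applies within each member.
per_member = [
    office
    for _, offices in members
    for office in take_range(offices, 1, 2)
]

# y[1-2] : the range applies to the sequence of all bindings of y.
all_offices = [office for _, offices in members for office in offices]
overall = take_range(all_offices, 1, 2)
```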

Migrating Lorel (continued...)

• Transformations and structured results
  – Using queries to restructure XML data

• The with clause (added to the standard select-from-where construct)
  – Query result will replicate all data selected by the select clause, along with all data reachable via a set of path expressions in the with clause

• Skolem functions
  – Allows more expressive data restructuring
  – Accepts a list of variables as arguments and produces one unique element for every binding of elements and/or attributes to the arguments

• Updates
  – Lorel supports an expressive update language
  – Changes for XML model:
    • ability to create both attributes and elements
    • order-relevant updates
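
The "one unique element for every binding" behavior of a Skolem function resembles a memoized constructor. The sketch below illustrates that idea only; the `SkolemFunction` class, the eid format, and the argument values are assumptions rather than Lorel syntax.

```python
import itertools


class SkolemFunction:
    """Return the same new element for the same argument binding, every time."""

    def __init__(self, name, eid_counter):
        self.name = name
        self._eid_counter = eid_counter
        self._elements = {}  # argument binding -> eid of the created element

    def __call__(self, *args):
        if args not in self._elements:
            # First time this binding is seen: create a fresh element.
            self._elements[args] = f"&{next(self._eid_counter)}"
        return self._elements[args]


# Hypothetical eid supply for newly constructed elements.
counter = itertools.count(start=100)
person_of = SkolemFunction("PersonOf", counter)

first = person_of("P1")
again = person_of("P1")   # same binding, same element
other = person_of("P2")   # different binding, different element
```

Because repeated calls with the same binding return the same element, a query can attach several pieces of data to one constructed element instead of creating duplicates.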

Migrating Lorel (continued...)

• DataGuides
  – Can be used when a DTD is not supplied
  – A notion of order must be introduced
    • Problem: could result in very large DataGuides
  – When DTDs exist, DataGuides are built from those DTDs
  – Combining DTDs and DataGuides
    • DTDs available for specific portions of an XML database
    • DataGuides can be used over portions not specified by DTDs
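
As a rough illustration of what a DataGuide summarizes when no DTD is supplied, the sketch below collects each distinct label path of a document once. It assumes tree-shaped data, ignores order and crosslinks, and uses a made-up document and a hypothetical `label_paths` function, so it is far simpler than Lore's DataGuides.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two Members with different structure.
DOC = """<DBGroup>
  <Member Name="Jennifer Widom"/>
  <Member><Name>Roy Goldman</Name></Member>
</DBGroup>"""


def label_paths(root):
    """Collect every distinct label path once; attributes are marked with '@'."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        paths.add(path)
        for name in elem.attrib:
            paths.add(path + ("@" + name,))
        for child in elem:
            walk(child, path)

    walk(root, ())
    return paths


summary = label_paths(ET.fromstring(DOC))
```

Although the document has two Member elements, the path DBGroup.Member appears only once in the summary. Recording the order of subelements as well would multiply the number of distinct entries, which is the size problem noted above.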

Conclusion

• As of June 1999, the migration of Lore to an XML model is nearly complete