Efficiently Publishing Relational Data as XML Documents. Jayavel Shanmugasundaram, University of Wisconsin-Madison / IBM Almaden Research Center. Joint work with: Rimon Barr, Michael Carey, Bruce Lindsay, Hamid Pirahesh, Berthold Reinwald, Eugene Shekita


Efficiently Publishing Relational Data as XML Documents

Jayavel Shanmugasundaram

University of Wisconsin-Madison / IBM Almaden Research Center

Joint work with: Rimon Barr, Michael Carey, Bruce Lindsay, Hamid Pirahesh, Berthold Reinwald, Eugene Shekita


Outline

• Why?

• How?

• Which?

• Hence


XML Example

<department name="Purchasing">
  <emplist>
    <employee> John </employee>
    <employee> Mary </employee>
  </emplist>
  <projlist>
    <project> Internet </project>
    <project> Recycling </project>
  </projlist>
</department>


What is the big deal about XML?

• Elegantly models complex, hierarchical/graph-structured data

• Domain-specific tags (unlike HTML)

• Simple!

Fast emerging as the dominant standard for data exchange on the WWW


Why Relational Data?

• Most business data stored in relational databases

• Unlikely to change in the near future
  – Scalability, Reliability, Performance, Tools

Need efficient means to publish relational data as XML documents


Usage Scenario

[Diagram: an application or user sends a query to produce XML documents across the Internet to an existing database system (RDBMS); the XML result is processed or displayed in a browser]


Example Relational Schema

Department
  DeptId  DeptName
  10      Purchasing

Employee
  EmpId  DeptId  EmpName  Salary
  101    10      John     50K
  91     10      Mary     70K

Project
  ProjId  DeptId  ProjName
  888     10      Internet
  795     10      Recycling


XML Representation

<department name="Purchasing">
  <emplist>
    <employee> John </employee>
    <employee> Mary </employee>
  </emplist>
  <projlist>
    <project> Internet </project>
    <project> Recycling </project>
  </projlist>
</department>


Main Issues

• Relational data is flat, XML is a tagged graph

• How do we specify translation from flat model to a graph model?
  – A query language to map from relations to XML

• How do we transform flat representations to tagged nested representations?
  – Efficient implementation strategies


Outline

• Why?

• How?
  – Language?
  – Mechanism?

• Which?

• Hence


Transformation Languages

• Two obvious choices:
  – XML Query Language
  – SQL


XMLQL: Default XML View

<defaultview>

<department>

<row> <deptid>10</> <deptname>Purchasing</> </row>

</department>

<employee>

<row> <empid>101</> <deptid>10</> <empname>John</> <salary>50K</> </row>

<row> <empid>91</> <deptid>10</> <empname>Mary</> <salary>70K</> </row>

</employee>

<project>

<row> <projid>888</> <deptid>10</> <projname>Internet</> </row>

<row> <projid>795</> <deptid>10</> <projname>Recycling</> </row>

</project>

</defaultview>
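
A default view of this shape can be derived mechanically from the tables. The following is a minimal sketch, assuming an in-memory SQLite database that mirrors the example schema; the helper names `build_example_db` and `default_view` are hypothetical, and the output uses full closing tags rather than the `</>` shorthand shown on the slide.

```python
import sqlite3
from xml.sax.saxutils import escape


def build_example_db():
    """Create the example Department/Employee/Project tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
        CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
        CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
        INSERT INTO Department VALUES (10, 'Purchasing');
        INSERT INTO Employee VALUES (101, 10, 'John', '50K');
        INSERT INTO Employee VALUES (91, 10, 'Mary', '70K');
        INSERT INTO Project VALUES (888, 10, 'Internet');
        INSERT INTO Project VALUES (795, 10, 'Recycling');
    """)
    return conn


def default_view(conn, tables):
    """Render each table as <table><row><col>value</col>...</row></table>."""
    parts = ["<defaultview>"]
    for table in tables:
        # Table names come from the caller and are trusted in this sketch.
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [col[0].lower() for col in cursor.description]
        parts.append(f"<{table.lower()}>")
        for row in cursor:
            cells = "".join(
                f"<{name}>{escape(str(value))}</{name}>"
                for name, value in zip(columns, row)
            )
            parts.append(f"<row>{cells}</row>")
        parts.append(f"</{table.lower()}>")
    parts.append("</defaultview>")
    return "".join(parts)


view_xml = default_view(build_example_db(), ["Department", "Employee", "Project"])
print(view_xml)
```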


XMLQL: Query Over Default View

WHERE <defaultview.department.row>
        <deptid> $did </> <deptname> $dname </>
      </> IN DefaultView
CONSTRUCT <department name=$dname>
            <emplist>
              { WHERE <defaultview.employee.row>
                        <deptid> $did </> <empname> $ename </>
                      </> IN DefaultView
                CONSTRUCT <employee> $ename </> }
            </emplist>
            <projlist>
              { WHERE <defaultview.project.row>
                        <deptid> $did </> <projname> $pname </>
                      </> IN DefaultView
                CONSTRUCT <project> $pname </> }
            </projlist>
          </>


XMLQL: Query Result

<department name="Purchasing">
  <emplist>
    <employee> John </employee>
    <employee> Mary </employee>
  </emplist>
  <projlist>
    <project> Internet </project>
    <project> Recycling </project>
  </projlist>
</department>


XMLQL: Pros and Cons

• Pros:
  – Natural for XML users
  – Infrastructure to build hierarchies of XML views
  – One query language for XML and relational data

• Cons:
  – Ignores existing API (JDBC), tools, support
  – Need to mature new query language (aggregates etc.)


SQL: Key Ideas

• Sub-queries to specify nesting

• Scalar functions to specify tags/attributes
  – XML Constructors

• Aggregate functions to group child elements


SQL: Query to publish XML

Select DEPT(d.name,
            <subquery to produce emplist>,
            <subquery to produce projlist>
       )
From Department d


SQL: XML Constructor

Define XML Constructor DEPT(dname: varchar(20), emplist: xml, projlist: xml) As (
  <department name=$dname>
    <emplist> $emplist </emplist>
    <projlist> $projlist </projlist>
  </department>
)


SQL: Query to publish XML

Select DEPT(d.name,
            (Select XMLAGG(EMP(e.name))
             From Employee e
             Where e.deptno = d.deptno),
            <subquery to produce projlist>
       )
From Department d


SQL: XML Constructor

Define XML Constructor EMP(ename: varchar(20)) As (

<employee> <name> $ename </name></employee>

)


SQL: Query to publish XML

Select DEPT(d.name,
            (Select XMLAGG(EMP(e.name))
             From Employee e
             Where e.deptno = d.deptno),
            (Select XMLAGG(PROJ(p.name))
             From Project p
             Where p.deptno = d.deptno)
       )
From Department d
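
The shape of this query can be tried out on a stock engine. The following is a rough sketch, assuming SQLite and the column names of the example schema (DeptId, DeptName, EmpName, ProjName): plain string concatenation stands in for the DEPT, EMP, and PROJ constructors and `group_concat` stands in for XMLAGG. These are substitutes chosen for illustration, not the constructs proposed in the talk, and values are not XML-escaped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
    CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
    CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
    INSERT INTO Department VALUES (10, 'Purchasing');
    INSERT INTO Employee VALUES (101, 10, 'John', '50K');
    INSERT INTO Employee VALUES (91, 10, 'Mary', '70K');
    INSERT INTO Project VALUES (888, 10, 'Internet');
    INSERT INTO Project VALUES (795, 10, 'Recycling');
""")

# '||' plays the role of the scalar XML constructors; group_concat(x, '')
# plays the role of XMLAGG. COALESCE covers departments with no children,
# where the aggregate over zero rows yields NULL.
PUBLISH_QUERY = """
SELECT '<department name="' || d.DeptName || '">'
    || '<emplist>'
    || COALESCE((SELECT group_concat('<employee>' || e.EmpName || '</employee>', '')
                 FROM Employee e WHERE e.DeptId = d.DeptId), '')
    || '</emplist>'
    || '<projlist>'
    || COALESCE((SELECT group_concat('<project>' || p.ProjName || '</project>', '')
                 FROM Project p WHERE p.DeptId = d.DeptId), '')
    || '</projlist>'
    || '</department>'
FROM Department d
"""

# One result row (one XML string) per department.
documents = [row[0] for row in conn.execute(PUBLISH_QUERY)]
for doc in documents:
    print(doc)
```

Each sub-query is correlated on the department id, which mirrors how nesting is specified in the query above.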


Query Result

<department name="Purchasing">
  <emplist>
    <employee> John </employee>
    <employee> Mary </employee>
  </emplist>
  <projlist>
    <project> Internet </project>
    <project> Recycling </project>
  </projlist>
</department>

(<XML Result>)


SQL: Pros and Cons

• Pros:
  – Reuses SQL infrastructure/API
  – Natural for SQL users
  – Efficient execution inside relational engine

• Cons:
  – Limited support for XML View Composition


Relations to XML: Issues

• Two main differences:
  – Nesting (structuring)
  – Tagging

• Space of alternatives:
  – Early Tagging, Early Structuring: Inside Engine or Outside Engine
  – Late Tagging, Early Structuring: Inside Engine or Outside Engine
  – Late Tagging, Late Structuring: Inside Engine or Outside Engine


Stored Procedure Approach

• Issue queries for sub-structures and tag them

• Could be a Stored Procedure

• Problem: Too many SQL queries!

[Diagram: separate queries against the Department, Employee, and Project tables in the DBMS Engine return (10, Purchasing), then (John) and (Mary), then (Internet) and (Recycling)]

Early Tagging, Early Structuring, Outside Engine
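
To make the query-count problem concrete, here is a minimal sketch of this approach written as client-side Python against an in-memory SQLite database. The function names `build_example_db` and `publish_departments` and the query counter are hypothetical, added only to show that the number of SQL statements grows with the number of parent tuples.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr


def build_example_db():
    """Create the example Department/Employee/Project tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
        CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
        CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
        INSERT INTO Department VALUES (10, 'Purchasing');
        INSERT INTO Employee VALUES (101, 10, 'John', '50K');
        INSERT INTO Employee VALUES (91, 10, 'Mary', '70K');
        INSERT INTO Project VALUES (888, 10, 'Internet');
        INSERT INTO Project VALUES (795, 10, 'Recycling');
    """)
    return conn


def publish_departments(conn):
    """Return (list of XML documents, number of SQL queries issued)."""
    queries = 1
    departments = conn.execute(
        "SELECT DeptId, DeptName FROM Department ORDER BY DeptId"
    ).fetchall()
    docs = []
    for dept_id, dept_name in departments:
        # Two more queries for every department: one per child list.
        employees = conn.execute(
            "SELECT EmpName FROM Employee WHERE DeptId = ? ORDER BY EmpName",
            (dept_id,),
        ).fetchall()
        projects = conn.execute(
            "SELECT ProjName FROM Project WHERE DeptId = ? ORDER BY ProjName",
            (dept_id,),
        ).fetchall()
        queries += 2
        # Tag each sub-structure as soon as it is fetched (early tagging).
        parts = [f"<department name={quoteattr(dept_name)}>", "<emplist>"]
        parts += [f"<employee>{escape(name)}</employee>" for (name,) in employees]
        parts += ["</emplist>", "<projlist>"]
        parts += [f"<project>{escape(name)}</project>" for (name,) in projects]
        parts += ["</projlist>", "</department>"]
        docs.append("".join(parts))
    return docs, queries


docs, query_count = publish_departments(build_example_db())
print(query_count, docs[0])
```

With N departments this sketch issues 1 + 2N queries, and deeper nesting multiplies the count further.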


Correlated CLOB Approach

• Problem: Correlated execution of sub-queries

Select DEPT(d.name,
            (Select XMLAGG(EMP(e.name))
             From Employee e
             Where e.deptno = d.deptno),
            (Select XMLAGG(PROJ(p.name))
             From Project p
             Where p.deptno = d.deptno)
       )
From Department d

Early Tagging, Early Structuring, Inside Engine


De-Correlated CLOB Approach

• Problem: CLOBs during processing

With EmpStruct (deptname, empinfo) AS (
  Select d.deptname,
         XMLAGG(EMP(employee, e.empname))
  From department d left join employee e
       on d.deptid = e.deptid
  Group By d.deptname),

ProjStruct (deptname, projinfo) AS (
  Select d.deptname,
         XMLAGG(PROJ(employee, p.projname))
  From department d left join project p
       on d.deptid = p.deptid
  Group By d.deptname)

Select DEPT(d1.deptname, d1.empinfo, d2.projinfo)
From EmpStruct d1 full join ProjStruct d2
     on d1.deptname = d2.deptname

Early Tagging, Early Structuring, Inside Engine


Late Tagging, Late Structuring

• XML document content produced without structure (in arbitrary order)

• Tagger enforces order as final step

[Diagram: Relational Query Processing -> Unstructured content -> Tagging -> Result XML Document]


Redundant Relation Approach

• How do we represent nested content as relations?

Input tuples:
  Department: (10, Purchasing)
  Project:    (10, Internet), (10, Recycling)
  Employee:   (10, John), (10, Mary)

Joined result:
  (Purchasing, John, Internet)
  (Purchasing, John, Recycling)
  (Purchasing, Mary, Internet)
  (Purchasing, Mary, Recycling)

• Problem: Large relation due to data redundancy!

Late Tagging, Late Structuring
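
The redundancy is easy to reproduce. This is a small sketch, assuming SQLite and the example schema: joining the department with both child tables yields one row per (employee, project) pair, so the result size is the product of the child counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
    CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
    CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
    INSERT INTO Department VALUES (10, 'Purchasing');
    INSERT INTO Employee VALUES (101, 10, 'John', '50K');
    INSERT INTO Employee VALUES (91, 10, 'Mary', '70K');
    INSERT INTO Project VALUES (888, 10, 'Internet');
    INSERT INTO Project VALUES (795, 10, 'Recycling');
""")

# A single join across both child tables: every employee is paired with
# every project of the same department.
redundant_rows = conn.execute("""
    SELECT d.DeptName, e.EmpName, p.ProjName
    FROM Department d
    JOIN Employee e ON e.DeptId = d.DeptId
    JOIN Project p ON p.DeptId = d.DeptId
""").fetchall()

for row in redundant_rows:
    print(row)
print(len(redundant_rows), "rows for 2 employees and 2 projects")
```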


Outer Union Approach

• How do we represent nested content as relations?

[Diagram: Department is joined separately with Employee and with Project, and the two results are combined with a Union]

Department tuple: (10, Purchasing)
Department-Project tuples: (Purchasing, Internet), (Purchasing, Recycling)
Department-Employee tuples: (Purchasing, John), (Purchasing, Mary)

Outer union result:
  (Purchasing, null, Internet,  0)
  (Purchasing, null, Recycling, 0)
  (Purchasing, John, null,      1)
  (Purchasing, Mary, null,      1)

• Problem: Wide tuples (having many columns)

Late Tagging, Late Structuring
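
An outer union of this kind can be written with UNION ALL and null padding. The sketch below assumes SQLite and the example schema; the column alias `Type` and its 0/1 encoding follow the tuples shown above, and the row count is the sum of the child counts rather than their product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
    CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
    CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
    INSERT INTO Department VALUES (10, 'Purchasing');
    INSERT INTO Employee VALUES (101, 10, 'John', '50K');
    INSERT INTO Employee VALUES (91, 10, 'Mary', '70K');
    INSERT INTO Project VALUES (888, 10, 'Internet');
    INSERT INTO Project VALUES (795, 10, 'Recycling');
""")

# Each branch fills only its own child column and pads the other with NULL.
# The last column records which branch a row came from (0 = project,
# 1 = employee).
outer_union_rows = conn.execute("""
    SELECT d.DeptName, NULL AS EmpName, p.ProjName, 0 AS Type
    FROM Department d JOIN Project p ON p.DeptId = d.DeptId
    UNION ALL
    SELECT d.DeptName, e.EmpName, NULL, 1
    FROM Department d JOIN Employee e ON e.DeptId = d.DeptId
""").fetchall()

for row in outer_union_rows:
    print(row)
```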


Hash-based Tagger

• Results not structured early
  – In arbitrary order

• Tagger has to enforce order during tagging
  – Hash-based approach

• Inside/Outside engine tagger

• Problem: Requires memory for entire document

Late Tagging, Late Structuring
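
A hash-based tagger along these lines might look like the following sketch. It assumes rows shaped like outer-union tuples (DeptName, EmpName, ProjName, Type) arriving in any order; the function name `hash_tag` and the row layout are illustrative assumptions. Note that the dictionary holds every row before any tag is emitted, which is the memory problem named above.

```python
from xml.sax.saxutils import escape, quoteattr


def hash_tag(rows):
    """Group unordered (dept, emp, proj, type) rows by department, then tag."""
    # Hash table keyed on the parent: the whole document is buffered here.
    departments = {}
    for dept, emp, proj, kind in rows:
        entry = departments.setdefault(dept, {"emps": [], "projs": []})
        if kind == 1:
            entry["emps"].append(emp)
        else:
            entry["projs"].append(proj)

    # Only after all rows are seen can the nested structure be written out.
    out = []
    for dept, entry in departments.items():
        out.append(f"<department name={quoteattr(dept)}>")
        out.append("<emplist>")
        out += [f"<employee>{escape(e)}</employee>" for e in entry["emps"]]
        out.append("</emplist>")
        out.append("<projlist>")
        out += [f"<project>{escape(p)}</project>" for p in entry["projs"]]
        out.append("</projlist>")
        out.append("</department>")
    return "".join(out)


# Rows deliberately interleaved to mimic an arbitrary arrival order.
unordered_rows = [
    ("Purchasing", None, "Internet", 0),
    ("Purchasing", "John", None, 1),
    ("Purchasing", None, "Recycling", 0),
    ("Purchasing", "Mary", None, 1),
]
hashed_xml = hash_tag(unordered_rows)
print(hashed_xml)
```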


Late Tagging, Early Structuring

• Structured XML document content produced

• Tagger just adds tags (constant space)

[Diagram: Relational Query Processing -> Structured content -> Tagging -> Result XML Document]


Sorted Outer Union Approach

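[Diagram: element tree with root A; A has children B and C; B has children D and E; C has children F and G]

Outer union tuples (n = null), sorted by Aid, Bid, Cid:
  A B n D n n n
  A B n n E n n
  A n C n n F n
  A n C n n n G

• Problem: Only partial ordering required

Late Tagging, Early Structuring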


Constant Space Tagger

• Detects changes in XML document hierarchy

• Adds appropriate opening/closing tags

• Inside/outside engine

Late Tagging, Early Structuring
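
A constant-space tagger can be sketched as a streaming generator. The version below assumes rows of the form (DeptName, EmpName, ProjName, Type) that arrive already sorted by department, with employee rows (Type 1) before project rows (Type 0); that ordering and the names `constant_space_tag` and `_close_department` are assumptions made for illustration. The only state kept is the current department and which list is open.

```python
from xml.sax.saxutils import escape, quoteattr


def _close_department(in_projlist):
    """Emit the closing tags for the department that is currently open."""
    if not in_projlist:
        # No project row was seen: close emplist and emit an empty projlist.
        yield "</emplist><projlist>"
    yield "</projlist></department>"


def constant_space_tag(sorted_rows):
    """Yield XML fragments for rows sorted by department, employees first."""
    current_dept = None
    in_projlist = False
    for dept, emp, proj, kind in sorted_rows:
        if dept != current_dept:
            # The parent changed: close the previous one, open the next.
            if current_dept is not None:
                yield from _close_department(in_projlist)
            yield f"<department name={quoteattr(dept)}><emplist>"
            current_dept, in_projlist = dept, False
        if kind == 1:
            yield f"<employee>{escape(emp)}</employee>"
        else:
            if not in_projlist:
                # First project row of this department: switch lists.
                yield "</emplist><projlist>"
                in_projlist = True
            yield f"<project>{escape(proj)}</project>"
    if current_dept is not None:
        yield from _close_department(in_projlist)


sorted_rows = [
    ("Purchasing", "John", None, 1),
    ("Purchasing", "Mary", None, 1),
    ("Purchasing", None, "Internet", 0),
    ("Purchasing", None, "Recycling", 0),
    ("Sales", None, "Web", 0),
]
streamed_xml = "".join(constant_space_tag(sorted_rows))
print(streamed_xml)
```

The "Sales" department and its "Web" project are made-up rows, included only to show the transition between parents and the handling of an empty employee list.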


Classification of Alternatives

Early Tagging, Early Structuring
  Inside Engine: De-Correlated CLOB, Correlated CLOB
  Outside Engine: Stored Procedure

Late Tagging, Early Structuring
  Inside Engine: Sorted Outer Union (Tagging inside)
  Outside Engine: Sorted Outer Union (Tagging outside)

Late Tagging, Late Structuring
  Inside Engine: Unsorted Outer Union (Tagging inside)
  Outside Engine: Unsorted Outer Union (Tagging outside)


Performance Evaluation

[Diagram: table hierarchy with TABLE0 at the root, children TABLE00 and TABLE01, and leaves TABLE000, TABLE001, TABLE010, TABLE011]

Parameters: Query Depth, Query Fan Out, Database Size


Inside vs. Outside Engine

[Chart: time (in seconds, axis 0 to 60) against query fan out (2, 3, 4). Series: Stored Proc, CLOB-Corr, CLOB-DeCorr, Redundant R, Unsorted OU (Out), Unsorted OU (In), Sorted OU (Out), Sorted OU (In)]


Where Does Time Go?

[Chart: time (in seconds, axis 0 to 35), broken down into Execution, Bind Out, Tagging, and XML File]


Effect of Query Fan Out

[Chart: time (in seconds, axis 0 to 15) against query fan out (2, 3, 4). Series: CLOB-Corr, CLOB-DeCorr, Unsorted OU, Sorted OU]


Effect of Query Depth

[Chart: time (in seconds, axis 0 to 60) against query depth (2, 3, 4). Series: CLOB-Corr, CLOB-DeCorr, Unsorted OU, Sorted OU]


Memory Considerations

• Sorted outer union more robust

• Relational sort highly scalable!


Conclusion

• Publishing XML from relational sources is important on the Internet

• Language alternatives:
  – SQL based
  – XML query language based

• Implementation alternatives:
  – Inside engine >> Outside engine
  – Unsorted Outer Union: when there is sufficient main memory
  – Sorted Outer Union: otherwise
