AutoJoin: Providing Freedom from Specifying Joins
Terrence Mason ([email protected]), Lixin Wang ([email protected]), Dr. Ramon Lawrence ([email protected])
Iowa Database and Emerging Application Laboratory, University of Iowa
7th International Conference on Enterprise Information Systems (ICEIS 2005), Miami, Florida



Presentation Outline

Define query inference
Query languages that require inference
AutoJoin architecture
Join graph represents a schema
Queries and query interpretations on a join graph
Pre-compute maximal join trees
Algorithm EMO
Query time processing: example
Performance evaluation

Query Inference Problem: New Languages

The query inference problem requires enumerating and ranking query interpretations of a query such that the query interpretation desired by the user is among the highest ranked interpretations.

State-of-the-art query languages require it
  Keyword Search: automatically relate keywords across relations of a schema
  Conceptual Queries: concepts mapped to the database must be related
  Natural Language Queries
    Natural language query mapped to concepts
    Relate concepts as in Conceptual Queries
Current approaches are not scalable
  Tied to a specific language
  Or to a conceptual model

Motivation for Query Inference

Reduces to a graph problem
  Connect relations (nodes) with joins (edges)
  Exponential solutions for highly connected graphs (database graphs are less connected)
Approaches to join determination
  Grow all ways
    Universal Relation (Maier and Ullman, 1983)
    Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003, 2004)
  Shortest paths
    CQL Conceptual Query Language (Owei and Navathe, 2001)
  Limited interpretations
    Steiner Tree (2-Trees) (Wald and Sorenson, 1984)
    Limit number of joins and interpretations (Zhang et al., 1999)
  Query time: find spanning trees of keywords
    DBXplorer Keyword Search (Agrawal et al., 2002)

Motivation for Query Inference: Goal of AutoJoin

Consistent, scalable inference engine
Abstract the database schema from users
Automatically determine joins to relate relations and attributes
Consistent approach to handle ambiguity in queries
Efficient algorithm to pre-compute potential joins
Minimal overhead at query time
Demonstrate efficiency and scalability
Structured on the relational model without any required conceptual models

Example Query on TPC-H Schema

English query: List all parts ordered by Customers in the United States.

Attribute-only SQL
Determine joins with AutoJoin
New formulation for the query inference problem.

Table      Attributes
Part       partkey, name, mfgr, brand, type, size, container, retailprice, comment
Supplier   suppkey, name, address, nationkey, phone, acctbal, comment
PartSupp   partkey, suppkey, availqty, supplycost, comment
Customer   custkey, name, address, nationkey, phone, acctbal, mktsegment, comment
Order      orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment
LineItem   orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment
Nation     nationkey, name, regionkey, comment
Region     regionkey, name, comment

TPC-H Schema, TPC-H BENCHMARK™ (http://www.tpc.org/)

List all parts ordered by Customers in the United States.

Attribute-only query: Select Part.Name where Nation.Name = 'United States';
  Part.Name: name attribute in the Part table
  Nation.Name: name attribute in the Nation table
  Select and where are similar to SQL
  No From clause or joins specified

Keyword query: Part 'United States'
  Maps Part to the Part relation
  Maps 'United States' to a tuple in the Nation relation
  No joins specified

SQL Query

Select Part.Name where Nation.Name = 'United States';

SELECT P.name
FROM part P, nation N, partsupp PS, lineitem LI, orders O, customer C
WHERE N.name = 'United States'
  AND P.partkey = PS.partkey
  AND PS.partkey = LI.partkey
  AND PS.suppkey = LI.suppkey
  AND O.custkey = C.custkey
  AND C.nationkey = N.nationkey
  AND LI.orderkey = O.orderkey;

Specified joins and tables: the FROM clause and the join conditions in the WHERE clause.
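The expansion from the attribute-only query to full SQL can be sketched in Python as follows. This is only an illustration, not AutoJoin's Query Builder: the `TABLES` and `JOIN_CONDITIONS` lookups and the `build_sql` function are hypothetical names, and the list of joins is assumed to come from an already inferred interpretation.

```python
# Hypothetical lookup tables for this sketch; aliases follow the SQL on the slide.
TABLES = {
    "Part": "part P", "Nation": "nation N", "PartSupp": "partsupp PS",
    "LineItem": "lineitem LI", "Order": "orders O", "Customer": "customer C",
}
# Join conditions keyed by (referencing table, referenced table).
JOIN_CONDITIONS = {
    ("PartSupp", "Part"): ["PS.partkey = P.partkey"],
    ("LineItem", "PartSupp"): ["LI.partkey = PS.partkey", "LI.suppkey = PS.suppkey"],
    ("LineItem", "Order"): ["LI.orderkey = O.orderkey"],
    ("Order", "Customer"): ["O.custkey = C.custkey"],
    ("Customer", "Nation"): ["C.nationkey = N.nationkey"],
}


def build_sql(select, where, joins):
    """Expand an attribute-only query into SQL, given the joins of one interpretation."""
    tables = []
    for edge in joins:
        for table in edge:
            if table not in tables:  # keep first-seen order, no duplicates
                tables.append(table)
    conditions = [where]
    for edge in joins:
        conditions.extend(JOIN_CONDITIONS[edge])
    from_clause = ", ".join(TABLES[t] for t in tables)
    return f"SELECT {select}\nFROM {from_clause}\nWHERE " + "\n  AND ".join(conditions) + ";"


# Joins of the interpretation that relates Part to Nation through Customer.
joins = [("PartSupp", "Part"), ("LineItem", "PartSupp"), ("LineItem", "Order"),
         ("Order", "Customer"), ("Customer", "Nation")]
sql = build_sql("P.name", "N.name = 'United States'", joins)
print(sql)
```

The user supplies only the select and where parts; the FROM list and the six join conditions are derived from the interpretation.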

AutoJoin Architecture

Figure: AutoJoin architecture. Labeled elements: User, Query Interface, Inference Request, Query Builder, AutoJoin Inference Engine, Generator, Ranker, Iterator, Loader, XML Document, Relational Database, Execute Queries, Interpretations.

Representing Joins of a Schema: Join Graph

Graph representation of a relational schema
Nodes: relations in the schema
Directed edges: foreign key constraints between relations
  Edges are directed from the N side to the 1 side of a relationship
  Maintain the lossless property (no spurious tuples on joins)

Create Join Graph: TPC-H

Nodes Joined             Foreign key/Join
Line Item to Part        partkey = partkey
Line Item to PartSupp    (partkey, suppkey) = (partkey, suppkey)
Line Item to Supplier    suppkey = suppkey
Line Item to Order       l_orderkey = o_orderkey
PartSupp to Part         ps_partkey = p_partkey
PartSupp to Supplier     ps_suppkey = s_suppkey
Supplier to Nation       s_nationkey = n_nationkey
Order to Customer        o_custkey = c_custkey
Customer to Nation       c_nationkey = n_nationkey
Nation to Region         n_regionkey = r_regionkey

Figure: tables as nodes (PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region).
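A join graph of this kind could be held in memory as plain adjacency lists. The sketch below is a minimal illustration under that assumption; `FOREIGN_KEYS` and `build_join_graph` are hypothetical names, and the join conditions use the prefixed TPC-H column names.

```python
from collections import defaultdict

# One entry per foreign key: (referencing table, referenced table, join condition).
FOREIGN_KEYS = [
    ("LineItem", "Part", "l_partkey = p_partkey"),
    ("LineItem", "PartSupp", "l_partkey = ps_partkey AND l_suppkey = ps_suppkey"),
    ("LineItem", "Supplier", "l_suppkey = s_suppkey"),
    ("LineItem", "Order", "l_orderkey = o_orderkey"),
    ("PartSupp", "Part", "ps_partkey = p_partkey"),
    ("PartSupp", "Supplier", "ps_suppkey = s_suppkey"),
    ("Supplier", "Nation", "s_nationkey = n_nationkey"),
    ("Order", "Customer", "o_custkey = c_custkey"),
    ("Customer", "Nation", "c_nationkey = n_nationkey"),
    ("Nation", "Region", "n_regionkey = r_regionkey"),
]


def build_join_graph(foreign_keys):
    """Return adjacency lists: node -> list of (referenced node, join condition)."""
    graph = defaultdict(list)
    for source, target, condition in foreign_keys:
        graph[source].append((target, condition))  # edge from the N side to the 1 side
        graph[target]  # touch the key so referenced-only tables also become nodes
    return dict(graph)


graph = build_join_graph(FOREIGN_KEYS)
referenced = {target for edges in graph.values() for target, _ in edges}
roots = sorted(set(graph) - referenced)  # nodes without incoming edges
edge_count = sum(len(edges) for edges in graph.values())
print(len(graph), "nodes,", edge_count, "edges, roots:", roots)
```

For the ten foreign keys listed above this yields eight nodes, and LineItem is the only node with no incoming edge.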

Pre-compute Maximal Join Trees

EMO algorithm on the join graph
  Efficiently computes all trees
  Executes where the previous strategy failed
  Direction of edges results in lossless join trees
Pre-computed
  Executed once, prior to query time
  Structures built for query-time performance

Compute Lossless Joins

Maximal sets of lossless joins
Ambiguity is inherent in the schema
Two types of ambiguity:
  A single relation that plays multiple roles
    Node with more than one incoming edge in the join graph
  Multiple semantic relationships between entities
    Strongly connected components of more than one node
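Both kinds of ambiguity can be read off the join graph directly. The following sketch assumes the TPC-H edges from the join graph slide and uses mutual reachability to find strongly connected components, which is adequate for small schemas; the function names are hypothetical.

```python
# TPC-H join graph edges: (referencing table, referenced table).
EDGES = [
    ("LineItem", "Part"), ("LineItem", "PartSupp"), ("LineItem", "Supplier"),
    ("LineItem", "Order"), ("PartSupp", "Part"), ("PartSupp", "Supplier"),
    ("Supplier", "Nation"), ("Order", "Customer"), ("Customer", "Nation"),
    ("Nation", "Region"),
]


def multi_role_nodes(edges):
    """Nodes with more than one incoming edge (a relation playing multiple roles)."""
    indegree = {}
    for _, target in edges:
        indegree[target] = indegree.get(target, 0) + 1
    return sorted(node for node, degree in indegree.items() if degree > 1)


def reachable(edges, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for source, target in edges:
            if source == node and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def multi_node_sccs(edges):
    """Strongly connected components with more than one node."""
    nodes = {node for edge in edges for node in edge}
    reach = {node: reachable(edges, node) for node in nodes}
    components = set()
    for node in nodes:
        component = frozenset(m for m in nodes if m in reach[node] and node in reach[m])
        if len(component) > 1:
            components.add(component)
    return components


print(multi_role_nodes(EDGES))
print(multi_node_sccs(EDGES))
```

On the full TPC-H graph, Nation, Part, and Supplier each have two incoming edges, and there is no component of more than one node.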

Creation of Maximal Join Trees: Lossless Joins

Efficient algorithm EMO
Determine all reachable graphs from nodes that may be a root for a maximal set of lossless joins
  Identify all strongly connected components (SCCs)
  For each SCC:
    If the SCC is a single node with no incoming edges, create the reachable graph from this node
    If the SCC has multiple nodes, create a reachable graph for each node in the SCC that has no incoming edges from outside the SCC
For each reachable graph, find all spanning trees
Spanning trees represent maximal join trees
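The steps above can be sketched in Python. This is a brute-force illustration of the outline only, not the EMO algorithm itself, whose efficient implementation the slides do not detail: roots are found by mutual reachability, and spanning trees are enumerated by letting every non-root node pick one incoming edge and discarding cyclic choices. All function names are hypothetical.

```python
from itertools import product

EDGES = [
    ("LineItem", "Part"), ("LineItem", "PartSupp"), ("LineItem", "Supplier"),
    ("LineItem", "Order"), ("PartSupp", "Part"), ("PartSupp", "Supplier"),
    ("Supplier", "Nation"), ("Order", "Customer"), ("Customer", "Nation"),
    ("Nation", "Region"),
]


def reachable(edges, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for source, target in edges:
            if source == node and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def roots(edges):
    """Nodes with no incoming edge from outside their own SCC.

    For a single-node SCC this is the same as having no incoming edges at all.
    """
    nodes = sorted({node for edge in edges for node in edge})
    reach = {node: reachable(edges, node) for node in nodes}

    def same_scc(a, b):
        return b in reach[a] and a in reach[b]

    return [n for n in nodes
            if not any(t == n and not same_scc(s, n) for s, t in edges)]


def reaches_root(parent, node, root):
    """Follow parent pointers; False if a cycle is hit before the root."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = parent[node]
    return True


def spanning_trees(edges, root):
    """Every spanning tree of the graph reachable from root, as sorted edge lists."""
    nodes = reachable(edges, root)
    others = sorted(nodes - {root})
    incoming = {n: [(s, t) for s, t in edges if t == n and s in nodes] for n in others}
    trees = []
    for choice in product(*(incoming[n] for n in others)):
        parent = {target: source for source, target in choice}
        if all(reaches_root(parent, n, root) for n in others):
            trees.append(sorted(choice))
    return trees


def maximal_join_trees(edges):
    return [tree for root in roots(edges) for tree in spanning_trees(edges, root)]


trees = maximal_join_trees(EDGES)
print(roots(EDGES), len(trees))
```

For the TPC-H join graph the only root is LineItem, and the three nodes with two incoming edges give 2 × 2 × 2 = 8 spanning trees, matching the eight maximal objects reported for TPC-H in the performance chart. The enumeration is exponential in the worst case, so it is suitable only for illustration.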

Maximal Join Trees of TPC-H

LineItem is the only root for a reachable graph.
No strongly connected components of more than one node
  The join graph is the reachable graph
Enumerate spanning trees on the original graph
Remove shortcut joins and re-compute

Figure: TPC-H Join Graph (nodes: PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region).

Figure: TPC-H Maximal Join Trees, eight trees numbered 1 to 8.

Shortcut Joins

Semantically equivalent join paths
  A shortcut join is a join that is semantically equivalent to a longer join path
  The core (longer) join path is preserved in the join graph
The shortcut join is removed for join determination
  Otherwise it appears to be a semantically different interpretation of the query
The shortcut join is substituted back into the query
  When no nodes on the core path are in the query (faster execution)
TPC-H has two shortcut joins
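Semantic equivalence cannot be decided from the graph alone, but candidate shortcut joins can be found structurally: an edge is a candidate when its target is also reachable from its source along a longer path. The sketch below rests on that assumption; `shortcut_candidates` is a hypothetical helper, not part of AutoJoin, and a designer would still need to confirm each candidate.

```python
EDGES = [
    ("LineItem", "Part"), ("LineItem", "PartSupp"), ("LineItem", "Supplier"),
    ("LineItem", "Order"), ("PartSupp", "Part"), ("PartSupp", "Supplier"),
    ("Supplier", "Nation"), ("Order", "Customer"), ("Customer", "Nation"),
    ("Nation", "Region"),
]


def has_longer_path(edges, source, target):
    """True if target is reachable from source without the direct edge."""
    remaining = [edge for edge in edges if edge != (source, target)]
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for s, t in remaining:
            if s == node and t not in seen:
                seen.add(t)
                stack.append(t)
    return target in seen


def shortcut_candidates(edges):
    """Edges that duplicate a longer path; only candidates, pending a semantic check."""
    return [edge for edge in edges if has_longer_path(edges, *edge)]


candidates = shortcut_candidates(EDGES)
core_edges = [edge for edge in EDGES if edge not in candidates]
print(candidates)
```

On the TPC-H graph this reports two candidates, LineItem to Part and LineItem to Supplier, both of which are also reachable through PartSupp. That agrees with the count of two shortcut joins stated above.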

Figure: TPC-H Join Graph, Remove Shortcut Joins (red: shortcut joins).

Figure: Original TPC-H Maximal Join Trees, eight trees numbered 1 to 8.

Figure: TPC-H Semantically Unique Maximal Join Trees, two trees numbered 1 and 2.

Query and Query Interpretation: AutoJoin Join Graphs

Query:
  Sub-graph of the join graph
  Nodes and (optionally) edges
  If not connected, requires inference
Query interpretation:
  Connected sub-graph of the join graph
  Includes all specified nodes and edges

Example Query

SELECT Part.Name WHERE Nation.Name = 'United States';

Relate Part.Name to Nation.Name
  Part and Nation nodes
  The query of the Part and Nation nodes is passed to AutoJoin
The query is ambiguous
  More than one query interpretation
  Nation relates to both Supplier and Customer
Return the interpretation with the fewest joins first

Efficient Query Time Execution

Find maximal join trees with the query nodes
  Reverse index: relation to its set of join trees
  Intersect lists
Build interpretations
  Least common ancestor (vs. recursive prune)
  Pre-compute ancestor lists
No lossless interpretations (no trees)
  Find a lossy interpretation
Rank interpretations by cost function

Maximal join trees: maximal sets of lossless joins
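The query-time steps can be sketched as follows, assuming the two semantically unique TPC-H trees are stored as child-to-parent maps. The names `TREES`, `build_reverse_index`, `interpretation`, and `infer` are hypothetical, and the number of joins stands in for the cost function, which the slides do not specify beyond returning the fewest joins first.

```python
# Shared part of both trees, as child -> parent; only Nation's parent differs.
BASE = {"PartSupp": "LineItem", "Part": "PartSupp", "Supplier": "PartSupp",
        "Order": "LineItem", "Customer": "Order", "Region": "Nation"}
TREES = [dict(BASE, Nation="Customer"), dict(BASE, Nation="Supplier")]


def build_reverse_index(trees):
    """Map each relation to the set of tree ids that contain it."""
    index = {}
    for tree_id, parent in enumerate(trees):
        for node in set(parent) | set(parent.values()):
            index.setdefault(node, set()).add(tree_id)
    return index


def ancestors(parent, node):
    """Path from node up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def interpretation(parent, query_nodes):
    """Joins on the paths from each query node up to their least common ancestor."""
    paths = [ancestors(parent, node) for node in query_nodes]
    common = set(paths[0]).intersection(*paths[1:])
    lca = next(node for node in paths[0] if node in common)
    joins = set()
    for path in paths:
        for child in path[:path.index(lca)]:
            joins.add((parent[child], child))
    return frozenset(joins)


def infer(trees, index, query_nodes):
    """Distinct interpretations of a non-empty node list, fewest joins first."""
    tree_ids = set.intersection(*(index.get(node, set()) for node in query_nodes))
    found = {interpretation(trees[i], query_nodes) for i in tree_ids}
    return sorted(found, key=lambda joins: (len(joins), sorted(joins)))


index = build_reverse_index(TREES)
ambiguous = infer(TREES, index, ["Part", "Nation"])
unambiguous = infer(TREES, index, ["Supplier", "Order"])
print([len(joins) for joins in ambiguous], len(unambiguous))
```

For Part and Nation the two trees yield different sub-trees, one with three joins through Supplier and one with five joins through Customer, so the query is ambiguous. For Supplier and Order both trees yield the same three joins, so a single interpretation remains.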

Both Trees Contain Query Nodes
Select Part.Name where Nation.Name = 'United States';

Figure: maximal join trees 1 and 2 (red: target nodes).

Query Processing

Figure: maximal join trees 1 and 2 (red: target nodes; blue: tree nodes; gray: nodes to prune).

Query Interpretations

Interpretation 1 (nodes: Part, PartSupp, Line Item, Order, Customer, Nation):
  Select Part.Name where Customer.Nation.Name = 'United States';
Interpretation 2 (nodes: Part, PartSupp, Supplier, Nation):
  Select Part.Name where Supplier.Nation.Name = 'United States';

Unambiguous Query
Select Supplier.Name where Order.Id = 73;

Figure: maximal join trees 1 and 2 (red: target nodes).

Query Processing
Select Supplier.Name where Order.Id = 73;

Figure: maximal join trees 1 and 2 (red: target nodes; blue: tree nodes; gray: nodes to prune).

Query Interpretations
Select Supplier.Name where Order.Id = 73;

Figure: interpretations from trees 1 and 2; both contain the nodes PartSupp, Supplier, Line Item, and Order.

The Unambiguous Query Interpretation
Select Supplier.Name where Order.Id = 73;

Figure: the single interpretation, with the nodes PartSupp, Supplier, Line Item, and Order.

Additional Interpretations: Lossy Joins

Related through a node involved in two distinct roles
  Two maximal join trees contain all query nodes and have at least one node in common
  Union the maximal join trees
  Common nodes provide the relation between the trees
  Interpretation in which a node will have two incoming edges
  No longer lossless
Example: Customer and Supplier related through Nation in TPC-H
  Cross products of Customers and Suppliers with the same nation

Beyond Natural Joins

Theta joins
  Merge the two nodes related by the theta join into a single node and re-compute maximal objects
  Expand this node for the final query interpretation with the theta join
Tuple variables
  A query interface may specify tuple variables
  Additional nodes and edges will be added to the join graph to complete the query interpretations

Performance Experiments

Broad range of schemas
  caBIO (NCI): 149 relations, 213 joins, and 1253 maximal join trees
  TPC-H standard database
    Inferred standard queries (21 specified queries)
    Ambiguity reduced by removing shortcut joins
  Tenant: 9 nodes, 50 joins, and 1286 maximal join trees

Performance Results

Time to generate all maximal join trees
  Handles schemas where the previous method failed
  Worst test: 2.7 seconds
  Average: < 1 second
Reduce ambiguity
  Removing shortcut joins reduces ambiguity
  Increased number of unambiguous queries
    From 45% to 68% for TPC-H benchmark queries
Minimal overhead of inference at query time
  Average: < 1 millisecond
  Worst test: 7.4 milliseconds

Compute Maximal Join Trees: EMO vs. All Ways

Figure: bar chart of time (seconds) by schema, with the number of maximal objects in parentheses: TPC-H (8), Claims (31), Tenant (1286), caBIO (1253), ACID (20), MONDIAL (117). Series: All Ways and EMO.

Reducing Ambiguity: Remove Shortcut Joins

Figure: bar chart of percent unambiguous joins for TPC-H (All), TPC-H (Benchmark), and EDS. Series: Original and Shortcuts Removed. Data labels: 8%, 45%, 33%, 26%, 68%, 100%.

Query Inference Time (Milliseconds)

Figure: bar chart of average query time (milliseconds per query) for TPC-H Queries, TPC-H Shortcut Queries, TPC-H All, TPC-H Shortcut All, ACID, Claims, Tenant, MONDIAL, and caBIO. Data labels: 0.355, 0.282, 0.392, 0.057, 0.055, 0.147, 7.420, 0.173, 2.764.

AutoJoin Conclusions

Scalable inference engine
Efficiently pre-computes maximal join trees
Reduced ambiguity by removing shortcut joins
Overhead is minimal
Complex queries can be inferred
Built directly on the relational model

Future Work

Develop a query language
  Remove the requirement of understanding the underlying schema
  Automatically determines joins
End-user interface based on AutoJoin
Query inference for integration systems

Query Inference (Previous)

The translation of a query in a query language into an unambiguous representation of the query [Wald and Sorenson, 1984]

Universal Relation
  First model to require query inference
  Maximal Objects (Maier and Ullman, 1983)
    Lossless join property to identify potential joins
    Grows all ways on a hyper-graph
    Returns a union of all query interpretations
  Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984)
    Limited to partial 2-trees
    Returns only the lowest cost query interpretation
Generate a single interpretation
  Do not meet the needs of new query languages
  Limited query interpretations possible

State of the Art Query Languages

Keyword Searches
  Keywords map to either specific data, attribute names, or relation names in a database.
  Must identify joins to relate keywords spread across multiple relations.
  Multiple approaches to identifying the top-k relationships between keywords.

Keyword Search: Top-K Relationships

Discover (Hristidis and Papakonstantinou, 2002, 2003, 2004)
  Grow all ways from a keyword
  Limit on number of joins
  Creates extra graphs
DBXplorer (Agrawal et al., 2002)
  Generates spanning trees at query time
BANKS ( )
  Graph of all tuples related by joins
  Must fit in memory (limited to smaller databases)

State of the Art Query Languages

Conceptual Query Languages or Models
  Queries built with concepts that map to a database.
  Remove the burden of knowledge of the schema.
  Must determine joins to relate concepts in a query.
  Use a conceptual model to determine joins

Conceptual Query Languages

CQL (Owei and Navathe, 2001)
  Queries may include roles or joins required for a query
  Pathfinder algorithm for completing the query
    Based on the shortest path between source and target concepts in the query
  Semantically Constrained ER Diagram as a graph used to determine joins.
Conceptual Model (Zhang et al., 1999)
  Semantic graph of the database
  Search algorithm constrained by number of joins or number of interpretations

State of the Art Query Languages

Natural Language Queries
  Natural language queries map the language to concepts in a database
  Joins must be determined to relate concepts in the database, similar to Conceptual Query Languages

Functional Dependencies due to Primary Keys: TPC-H

Table      Functional Dependencies
Part       p_partkey → p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment
Supplier   s_suppkey → s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment
PartSupp   ps_partkey, ps_suppkey → ps_availqty, ps_supplycost, ps_comment
Customer   c_custkey → c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
Order      o_orderkey → o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment
LineItem   l_orderkey, l_linenumber → l_partkey, l_suppkey, l_orderkey, l_quantity, l_extendedprice, l_discount, l_returnflag, l_tax, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
Nation     n_nationkey → n_name, n_regionkey, n_comment
Region     r_regionkey → r_name, r_comment

Legend: Primary Keys, Foreign Keys

Functional Dependencies Implied by Foreign Keys: TPC-H

Table with Foreign Key   Table Referenced   Functional Dependencies
LineItem                 Part               l_partkey → p_partkey
LineItem                 Supplier           l_suppkey → s_suppkey
LineItem                 PartSupp           l_partkey, l_suppkey → ps_partkey, ps_suppkey
LineItem                 Order              l_orderkey → o_orderkey
PartSupp                 Part               ps_partkey → p_partkey
PartSupp                 Supplier           ps_suppkey → s_suppkey
Supplier                 Nation             s_nationkey → n_nationkey
Customer                 Nation             c_nationkey → n_nationkey
Order                    Customer           o_custkey → c_custkey
Nation                   Region             n_regionkey → r_regionkey

Legend: Primary Keys, Foreign Keys

Figure: TPC-H Join Graph (nodes: PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region).