© 2008 Hewlett-Packard

© 2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Explaining Structured Queries in Natural Language

Georgia Koutrika, Stanford Univ., USA, [email protected] (now with IBM Almaden)

Alkis Simitsis, HP Labs, USA, [email protected]

Yannis Ioannidis, University of Athens, [email protected]

G. Koutrika, A. Simitsis, Y. Ioannidis, ICDE’10 – Long Beach, CA, USA

Outline
• Introduction
• Modeling
  − Graph model
  − Template Mechanism
• Traversal strategies
  − Graph-traversal
  − Template-based
• Experiments
• Summary


Role of Natural Language

[Figure: a Data Management System, with the labels NL, SQL, XQuery, SPARQL, and VisUI on the query side and NL, Tables, XML, and RDF on the result side]

From Natural Language to Databases

[Figure: NL and SQL (queries); NL and Tables (results)]

From Databases to Natural Language?

[Figure: NL and SQL (queries); NL and Tables (results)]

• mainly: EDBT’08, CIDR’09
• related: ICDE’06/’07, VLDB J. 17(1) ’08
• …this paper

Motivation
• Interesting applications
  − education and training on query languages
  − query debugging
    • e.g., (sub)queries responsible for empty results
    • e.g., (sub)queries responsible for too many results
  − query explanation and automatic commenting
  − do-it-yourself applications
  − query-by-form applications

Motivation
• Gartner analysis on technologies that will have a broad impact on all aspects of people’s lives [2008]
  − seven most important IT challenges for the next 25 years
  − one of them: automating computer-to-human speech translation

Example Query

What courses has Andreas taught?

SELECT title
FROM Instructors I, CourseSched S, Courses C
WHERE I.name = 'Andreas'
  and I.instrID = S.instrID
  and S.courseID = C.courseID

Example Query #2

WHAT????

SELECT S.name, count(distinct CO.CourseID)
FROM STUDENTS S, STUDENTHISTORY SH, COURSES CO, COURSESCHED R, INSTRUCTOR I
WHERE S.name=I.name and S.class=SH.year and S.SuID=SH.SuID
  and SH.CourseID=CO.CourseID and CO.CourseID = R.CourseID and R.InstrID=I.InstrID
  and R.year > all
      (SELECT CO.year
       FROM COURSES CO, COURSESCHED R, INSTRUCTORS I
       WHERE CO.CourseID = R.CourseID and R.InstrID = I.InstrID
         and I.name = 'Baeza'
         and CourseID NOT IN
             (SELECT C.CourseID
              FROM COURSES CO, COURSESCHED R, INSTRUCTORS I
              WHERE CO.CourseID=R.CourseID and R.InstrID=I.InstrID and R.Year>2000
              GROUP BY C.CourseID
              HAVING COUNT(distinct R.Year) > 3) )
  and not exists
      (SELECT *
       FROM COURSES C1, STUDENTHISTORY SH1
       WHERE SH1.SuID = S.SuID and SH1.CourseID = C1.CourseID
         and C1.DepID = D.DepID and D.name = 'EE')
GROUP BY S.name
HAVING count(distinct CO.CourseID) =< ALL
      (SELECT count(C.CourseID)
       FROM STUDENTS S, STUDENTHISTORY SH, COURSES CO
       WHERE S.SuID = SH.SuID and SH.CourseID = CO.CourseID and S.CLASS >2008
       GROUP BY S.SuID)

Query Translation Challenges
• Equivalent query expressions that should be translated into the same natural-language expression
  − commutativity, associativity, and other algebraic properties
• Equivalent natural-language expressions among which one should be chosen
• Choice between declarative and procedural translations
• Natural translations that don’t follow the query form but are based on mathematical semantics

SELECT A.id, A.name
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
GROUP BY A.id, A.name
HAVING count(distinct M.year) = 1

Sometimes… “count = 1” means “all”
“Find actors whose movies are all in the same year” [CIDR’09]
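The first challenge (equivalent expressions should yield the same translation) can be illustrated with a small normalization step. The sketch below is not the authors' method; it only shows one plausible way to make commutative variants of a conjunctive WHERE clause coincide before translation. The tuple encoding of predicates and both function names are assumptions made for illustration.

```python
def canonical_predicate(left, op, right):
    """Order the operands of a symmetric operator so that a = b and b = a coincide."""
    if op in {"=", "!="} and right < left:
        left, right = right, left
    return (left, op, right)


def canonical_conjunction(predicates):
    """AND is commutative and associative, so sort and deduplicate the conjuncts."""
    return tuple(sorted({canonical_predicate(*p) for p in predicates}))


# Two spellings of the same join conditions (hypothetical encoding).
q1 = [("M.id", "=", "C.mid"), ("C.aid", "=", "A.id")]
q2 = [("A.id", "=", "C.aid"), ("C.mid", "=", "M.id")]

print(canonical_conjunction(q1) == canonical_conjunction(q2))
```

Both inputs reduce to the same tuple of conjuncts, so a translator working on the canonical form would see them as one query.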


DB Graph
• Nodes
  − relations, attributes
• Edges
  − membership
  − selection
  − predicate
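A minimal sketch of how such a graph could be held in memory, assuming a plain edge list. The class names, the `attributes_of` helper, and the direction chosen for membership edges (relation to attribute) are assumptions for illustration; the slides name only the node and edge kinds.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # "relation" or "attribute"
    name: str


@dataclass
class DBGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, target, edge_kind)

    def add_edge(self, source, target, kind):
        # The three edge kinds listed on the slide.
        if kind not in {"membership", "selection", "predicate"}:
            raise ValueError(f"unknown edge kind: {kind}")
        self.nodes.update((source, target))
        self.edges.append((source, target, kind))

    def attributes_of(self, relation):
        """Attributes attached to a relation through membership edges."""
        return [t for s, t, k in self.edges if s == relation and k == "membership"]


students = Node("relation", "STUDENTS")
name = Node("attribute", "STUDENTS.name")
cls = Node("attribute", "STUDENTS.class")

g = DBGraph()
g.add_edge(students, name, "membership")
g.add_edge(students, cls, "membership")
print([a.name for a in g.attributes_of(students)])
```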

DB Graph - Example Schema


Query Graph
• Examples
  − SPJ
  − Group-by, functions, …
  − Nested queries

SPJ:
select s.name, s.GPA, c.title, i.name, co.text
from students s, comments co, studenthistory h, courses c,
     departments d, coursesched cs, instructors i
where s.suid = co.suid and s.suid = h.suid
  and h.courseid = c.courseid and c.depid = d.depid
  and c.courseid = cs.courseid and cs.instrid = i.instrid
  and s.class = 2011 and co.rating > 3
  and cs.term = 'spring' and d.name = 'CS'

Group-by, functions:
select year, term, max(grade)
from studenthistory
group by year, term
having avg(grade) > 3

Nested query:
select s.name
from students s
where NOT EXISTS (select * from students s2 where s2.GPA > s.GPA)

Capturing Semantics
• Labels
  − each node v has a conceptual meaning l(v)
    • relation STUDENTS → “students”
    • attribute NAME → “name”
    • function MAX → “the greatest”
    • operators: = → “is”, ≤ → “does not exceed”, like → “looks like”
  − each edge can be annotated by a label
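The label function l(v) can be pictured as a lookup table. The sketch below uses the example labels from the slide; the fallback to the lower-cased node name and the `describe_selection` helper are assumptions added for illustration, with the “whose” phrasing borrowed from the translations shown later in the deck.

```python
# Labels taken from the slide; "<=" stands in for the ≤ operator.
LABELS = {
    "STUDENTS": "students",
    "NAME": "name",
    "MAX": "the greatest",
    "=": "is",
    "<=": "does not exceed",
    "like": "looks like",
}


def l(v):
    """Return the label of node v; unknown nodes fall back to their lower-cased name."""
    return LABELS.get(v, v.lower())


def describe_selection(relation, attribute, op, value):
    """Phrase a selection predicate using node and operator labels."""
    return f"{l(relation)} whose {l(attribute)} {l(op)} {value}"


print(describe_selection("STUDENTS", "CLASS", "=", 2011))
```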

Capturing Semantics
• Templates
  − template label:
  − language with variables, loops, functions, and macros
    • e.g.,
  − generic templates
    • e.g., for
  − specific templates


Example Query Graph

select s.name, s.GPA, c.title, i.name, co.text
from students s, comments co, studenthistory h, courses c,
     departments d, coursesched cs, instructors i
where s.suid=co.suid and s.suid=h.suid
  and h.courseid=c.courseid and c.depid=d.depid
  and c.courseid=cs.courseid and cs.instrid=i.instrid
  and s.class = 2011 and co.rating > 3
  and cs.term = 'spring' and d.name = 'CS'

Query Subject (QS)

[Figure: query graph with numeric annotations on its relations; primary / secondary relations]

• QS is the starting point of the translation
  − a “central” primary relation w/ attributes projected in the query
  − …or the closest primary relation

Traversal Strategies
• BST Algorithm
  − BST composes separate clauses for each query part, in the following order
    • pStr: translate the edges connecting relations to their attributes
    • fStr: connect all query relations to the subject through the query joins
    • wStr: translate the paths connecting relations to value nodes

‘Find ’ + pStr + ‘ for ’ + fStr + ‘.’ + ‘ Return results only for ’ + wStr + ‘.’
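The composition step above can be sketched as plain string assembly. This is only an illustration of the final pattern: it assumes the phrases for projections, joins, and selections have already been extracted from the query graph, which is where the actual traversal work happens. The function names and input shapes are hypothetical.

```python
def join_phrases(items):
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def bst_translate(subject, projections, joins, selections):
    # pStr: one phrase per relation with projected attributes.
    p_str = join_phrases(
        f"the {join_phrases(attrs)} of {rel}" for rel, attrs in projections.items()
    )
    # fStr: the query subject followed by the join phrases that reach the other relations.
    f_str = f"{subject} that " + join_phrases(joins)
    # wStr: one phrase per selection predicate.
    w_str = join_phrases(f"{rel} whose {attr} is {cond}" for rel, attr, cond in selections)
    return "Find " + p_str + " for " + f_str + "." + " Return results only for " + w_str + "."


sentence = bst_translate(
    subject="courses",
    projections={
        "courses": ["title"],
        "instructors": ["name"],
        "students": ["gpa", "name"],
        "comments": ["description"],
    },
    joins=[
        "are taught by instructors",
        "are taken by students that gave comments",
        "are offered by departments",
    ],
    selections=[
        ("courses", "term", "spring"),
        ("students", "class", "2011"),
        ("comments", "rating", "greater than 3"),
        ("departments", "name", "CS"),
    ],
)
print(sentence)
```

With these inputs the sketch reproduces the BST sentence shown for the example query in the deck.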

Traversal Strategies: BST

“Find the title of courses, the name of instructors, the gpa and name of students, and the description of comments for courses that are taught by instructors, are taken by students that gave comments, and are offered by departments. Return results only for courses whose term is spring, students whose class is 2011, comments whose rating is greater than 3, and departments whose name is CS.”

(pStr is the clause after “Find”, fStr the clause after “for”, and wStr the clause after “Return results only for”.)

Traversal Strategies
• Multi-Reference Point Graph Traversal (MRP) Algorithm
  − MRP avoids the creation of complex and lengthy phrases
    • the translation is semantically split at multiple points, called reference points (RPs)
    • a reference point (RP) is
      − a relation with projections or
      − a branching point or
      − a leaf

“Find the names of students and the titles of the courses taken by these students and the names of the instructors that taught courses taken by these students”

“Find the names of students and the titles of the courses taken by these students and the names of the instructors that taught these courses”
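The three RP conditions can be checked with a simple outward traversal. The sketch below assumes the join structure of the query has already been turned into a tree rooted at the query subject; that tree, the choice of `courses` as subject, and the function name are illustrative assumptions, not the authors' implementation.

```python
def reference_points(children, projected, root):
    """Collect reference points in breadth-first order, starting from the query subject."""
    rps, queue = [], [root]
    while queue:
        node = queue.pop(0)
        kids = children.get(node, [])
        # RP: a relation with projections, a branching point, or a leaf.
        if node in projected or len(kids) > 1 or not kids:
            rps.append(node)
        queue.extend(kids)
    return rps


# Hypothetical join tree for the running example, rooted at the query subject.
children = {
    "courses": ["departments", "studenthistory", "coursesched"],
    "studenthistory": ["students"],
    "students": ["comments"],
    "coursesched": ["instructors"],
}
projected = {"courses", "students", "comments", "instructors"}

print(reference_points(children, projected, "courses"))
```

Under these assumptions, `studenthistory` and `coursesched` are passed through without becoming RPs, since they have no projections and exactly one child each.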

Traversal Strategies: MRP

[Figure: query graph with numbered traversal steps and five reference points (RP) marked; traversing from QS outwards, translating from RPs inwards]

“Find the title of courses for courses that are offered by departments whose name is CS, and also, the gpa and name of students for students whose class is 2011 and that have taken these courses, and also, the description of comments for comments whose rating is greater than 3 and that are given by these students, and also, the name of instructors that teach courses whose term is spring.”

Traversal Strategies
• Template-based (TS & TMT) approach
  − This strategy makes use of templates
    • find the minimum number of composable templates
    • two templates are composable if they share reference points
    • combine templates over the query graph in the right order
  − Example templates:
    • ga: S – H – C – T – I
          l(S) + “ have been in classes of ” + l(I)
    • gb: C – D, with D.name = <val>
          <val> + l(C)
    • gc: C – T – I, with T.term = <val>
          l(I) + “ ’s lectures on ” + l(C) + “ in ” + <val>

Legend: S: Students, H: StudentHistory, T: CourseSched, C: Courses, D: Department, I: Instructors
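A small sketch of how the three example template labels might be instantiated and tested for composability. The list encoding of a template, the `instantiate` and `composable` helpers, and the space inserted between `<val>` and l(C) in gb are assumptions; the slides give only the label expressions and the rule that composable templates share reference points.

```python
LABELS = {"S": "students", "C": "courses", "I": "instructors"}

# Template labels as sequences of parts: "l(X)" is a node label, "<val>" a value slot.
TEMPLATES = {
    "ga": ["l(S)", " have been in classes of ", "l(I)"],
    "gb": ["<val>", " ", "l(C)"],
    "gc": ["l(I)", " 's lectures on ", "l(C)", " in ", "<val>"],
}

# Relations covered by each template, following the legend on the slide.
TEMPLATE_NODES = {
    "ga": {"S", "H", "C", "T", "I"},
    "gb": {"C", "D"},
    "gc": {"C", "T", "I"},
}


def instantiate(name, val=None):
    """Replace label references and the value slot in a template label."""
    out = []
    for part in TEMPLATES[name]:
        if part == "<val>":
            out.append(str(val))
        elif part.startswith("l(") and part.endswith(")"):
            out.append(LABELS[part[2:-1]])
        else:
            out.append(part)
    return "".join(out)


def composable(a, b, reference_points):
    """Two templates are composable if they share at least one reference point."""
    return bool(TEMPLATE_NODES[a] & TEMPLATE_NODES[b] & reference_points)


print(instantiate("ga"))
print(instantiate("gb", "CS"))
print(composable("ga", "gb", {"S", "C", "I", "D"}))
```

The raw instantiations are cruder than the final TMT sentence in the deck, which suggests further smoothing happens when templates are combined.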

Traversal Strategies: TMT

[Figure: query graph covered by three templates, applied in steps 1 to 3]

“Find the gpa and name of students whose class is 2011 and have been in classes of instructors and find the name of these instructors, whose lectures on courses are in spring and find the title of these CS courses and the description of comments whose rating is greater than 3 given by these students.”

Templates used:
• l(I) + “ ’s lectures on ” + l(C) + “ in ” + <val>
• l(S) + “ have been in classes of ” + l(I)
• <val> + l(C)


Experiments: Effectiveness
• SQL → NL
  − BST
    • is not good for queries with many joins
    • behaves better for large projection and selection lists
  − TMT
    • works best for bigger graphs with many joins
    • the templates cannot help for large projection/selection lists
  − MRP
    • is a good compromise between BST & TMT
  − USER
    • resemble BST for small queries and MRP for larger ones
    • further processing follows a TMT-like approach for simplifying query parts

Experiments: Effectiveness
• NL → SQL
  − BST
    • better serves the purpose of finding which SQL query is described
  − TMT
    • provides a higher-level description of the query
    • makes it easier to catch the meaning, but it is harder to reproduce the actual SQL query
  − MRP
    • makes it fairly easy to reproduce the SQL query, but that task is harder in the presence of many projections

Experiments: Performance



Summary
• Translating queries and other commands to natural-language constructs is
  − useful in many applications
  − challenging and difficult in many ways
  − mostly ignored
• Goals include using minimal technology from AI and NLP
• Three methods for translating queries
  − TMT generates better translations, but we need to invest some effort on designing templates
  − BST or MRP are fair methods for explaining a query or helping a user write the query himself

Thank You!


Discussion
• Not just SPJ
  − Work also with group-by, order-by, functions, …
  − Conjunctive and disjunctive predicates
• Corporate queries
  − Large chains of joins, numerous projected attributes, …
  − Usefulness of the translation
    • Result is not pretty (but neither are the queries themselves)
    • Use templates for “query summaries”
      − Limit the projected attributes in the result
      − Use them as starting points: start with heading attributes and expand as needed
• “Impossible Queries” [CIDR09]
  − Only with templates

Relatively Easy Queries
• Simple, single path on the schema
• More or less using data translation mechanisms

SELECT title
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
  and A.name = 'Brat Pitt'

“Find the titles of movies where Brat Pitt plays”

Somewhat Easy Queries
• Simple subgraph of the schema

SELECT A.name, title
FROM MOVIES M, CAST C, ACTOR A, DIRECTED R, DIRECTOR D, GENRE G
WHERE M.id = C.mid and C.aid = A.id
  and M.id = R.mid and R.did = D.id
  and M.id = G.mid
  and D.name = 'G. Loucas' and G.genre = 'action'

“Find the actors and titles of action movies directed by G. Loucas”

Difficult Queries
• Multiple tuple variables from the same table

SELECT A1.name, A2.name
FROM MOVIES M, CAST C1, ACTOR A1, CAST C2, ACTOR A2
WHERE M.id = C1.mid and C1.aid = A1.id
  and M.id = C2.mid and C2.aid = A2.id
  and A1.id > A2.id

“Find pairs of actor names who have played in the same movie”

Difficult Queries
• Cyclic queries not based on key-foreign key joins

SELECT title
FROM MOVIES, CAST
WHERE id = mid and role = title

“Find movies whose title is one of their roles”

Difficult Queries
• Nested queries

SELECT title
FROM MOVIES
WHERE id in (SELECT mid
             FROM CAST
             WHERE aid in (SELECT id
                           FROM ACTOR
                           WHERE name = 'Brat Pitt'))

“Find the titles of movies where Brat Pitt plays”

Equivalence of Easy & Difficult Queries
• Un-nested equivalent

SELECT title
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
  and A.name = 'Brat Pitt'

“Find the titles of movies where Brat Pitt plays”
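The equivalence of the nested and un-nested forms can be checked on a toy database. The sketch below uses Python's built-in `sqlite3` module; the table contents are made up for illustration, and `CAST` is written as a quoted identifier because it is also an SQL keyword. Note that the join form can return duplicate titles if an actor has several cast rows for one movie; the sample data avoids that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MOVIES (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ACTOR  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "CAST" (mid INTEGER, aid INTEGER);
    INSERT INTO MOVIES VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
    INSERT INTO ACTOR  VALUES (10, 'Brat Pitt'), (11, 'Someone Else');
    INSERT INTO "CAST" VALUES (1, 10), (2, 11), (3, 10), (3, 11);
""")

nested = """
    SELECT title FROM MOVIES
    WHERE id IN (SELECT mid FROM "CAST"
                 WHERE aid IN (SELECT id FROM ACTOR WHERE name = 'Brat Pitt'))
"""
unnested = """
    SELECT title FROM MOVIES M, "CAST" C, ACTOR A
    WHERE M.id = C.mid AND C.aid = A.id AND A.name = 'Brat Pitt'
"""

# Sort both result lists so the comparison ignores row order.
nested_titles = sorted(row[0] for row in conn.execute(nested))
unnested_titles = sorted(row[0] for row in conn.execute(unnested))
print(nested_titles == unnested_titles)
```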

Difficult Queries
• Nested queries without equivalent un-nested forms

SELECT title
FROM MOVIES M
WHERE not exists (SELECT *
                  FROM GENRE G1
                  WHERE not exists (SELECT *
                                    FROM GENRE G2
                                    WHERE G2.mid = M.id
                                      and G2.genre = G1.genre))

“Find movies that have all genres”

Difficult Queries
• Aggregate queries

SELECT id, title, count(*)
FROM MOVIES, CAST
WHERE id = mid
GROUP BY id, title
HAVING 1 < (SELECT count(*)
            FROM GENRE
            WHERE mid = id)

“Find the number of actors in movies of more than one genre”

“Impossible” Queries
• “Count = 1” means “all” … sometimes

SELECT A.id, A.name
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
GROUP BY A.id, A.name
HAVING count(distinct M.year) = 1

“Find actors whose movies are all in the same year”
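To see why “count = 1” reads as “all” here, the query can be run on a toy database. This is a sketch using Python's built-in `sqlite3` module; the sample rows and actor names are invented for illustration, and `CAST` is quoted because it is also an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MOVIES (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE ACTOR  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "CAST" (mid INTEGER, aid INTEGER);
    INSERT INTO MOVIES VALUES (1, 'Movie A', 1999), (2, 'Movie B', 1999), (3, 'Movie C', 2005);
    INSERT INTO ACTOR  VALUES (10, 'Actor X'), (11, 'Actor Y');
    -- Actor X plays only in 1999 movies; Actor Y plays in 1999 and 2005.
    INSERT INTO "CAST" VALUES (1, 10), (2, 10), (2, 11), (3, 11);
""")

query = """
    SELECT A.id, A.name
    FROM MOVIES M, "CAST" C, ACTOR A
    WHERE M.id = C.mid AND C.aid = A.id
    GROUP BY A.id, A.name
    HAVING count(DISTINCT M.year) = 1
"""
same_year_actors = conn.execute(query).fetchall()
print(same_year_actors)
```

Only the actor whose movies share a single distinct year survives the HAVING clause, which is the “all in the same year” reading.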

“Impossible” Queries
• “≤ all” means “earliest” … sometimes

SELECT A.name
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
  and year <= all (SELECT M1.year
                   FROM MOVIES M1, MOVIES M2
                   WHERE M1.id != M2.id
                     and M1.title = M.title and M2.title = M.title)

“Find the actors who have played in the earliest versions of movies that have had sequels”