© 2008 Hewlett-Packard
© 2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Explaining Structured Queries in Natural Language
Georgia Koutrika, Stanford Univ., USA, [email protected] (now with IBM Almaden)
Alkis Simitsis, HP Labs, USA, [email protected]
Yannis Ioannidis, University of Athens, [email protected]
G. Koutrika, A. Simitsis, Y. Ioannidis, ICDE’10 – Long Beach, CA, USA
Outline
• Introduction
• Modeling
  − Graph model
  − Template mechanism
• Traversal strategies
  − Graph-traversal
  − Template-based
• Experiments
• Summary
Role of Natural Language
[Figure: Data Management System; diagram labels: NL, SQL, XQuery, SPARQL, VisUI; NL, Tables, XML, RDF]
From Natural Language to Databases
[Figure: diagram labels: NL, SQL; NL, Tables]
From Databases to Natural Language?
[Figure: diagram labels: NL, SQL; NL, Tables]
mainly: EDBT’08, CIDR’09
related: ICDE’06/’07, VLDB J. 17(1) ’08
… this paper
Motivation
• Interesting applications
  − education and training on query languages
  − query debugging
    • e.g., (sub)queries responsible for empty results, (sub)queries responsible for too many results
  − query explanation and automatic commenting
  − do-it-yourself applications
  − query-by-form applications
Motivation
• Gartner analysis on technologies that will have a broad impact on all aspects of people’s lives [2008]
  − seven most important IT challenges for the next 25 years
  − one of them: automating computer-to-human speech translation
Example Query
What courses has Andreas taught?
SELECT title
FROM Instructors I, CourseSched S, Courses C
WHERE I.name = “Andreas” and I.instrID = S.instrID and S.courseID = C.courseID
Example Query #2
WHAT????
SELECT S.name, count(distinct CO.CourseID)
FROM STUDENTS S, STUDENTHISTORY SH, COURSES CO, COURSESCHED R, INSTRUCTOR I
WHERE S.name=I.name and S.class=SH.year and S.SuID=SH.SuID
  and SH.CourseID=CO.CourseID and CO.CourseID = R.CourseID and R.InstrID=I.InstrID
  and R.year > all (SELECT CO.year
                    FROM COURSES CO, COURSESCHED R, INSTRUCTORS I
                    WHERE CO.CourseID = R.CourseID and R.InstrID = I.InstrID and I.name = ‘Baeza’
                      and CourseID NOT IN (SELECT C.CourseID
                                           FROM COURSES CO, COURSESCHED R, INSTRUCTORS I
                                           WHERE CO.CourseID=R.CourseID and R.InstrID=I.InstrID and R.Year>2000
                                           GROUP BY C.CourseID
                                           HAVING COUNT(distinct R.Year) > 3) )
  and not exists (SELECT *
                  FROM COURSES C1, STUDENTHISTORY SH1
                  WHERE SH1.SuID = S.SuID and SH1.CourseID = C1.CourseID
                    and C1.DepID = D.DepID and D.name = ‘EE’)
GROUP BY S.name
HAVING count(distinct CO.CourseID) <= ALL (SELECT count(C.CourseID)
                                           FROM STUDENTS S, STUDENTHISTORY SH, COURSES CO
                                           WHERE S.SuID = SH.SuID and SH.CourseID = CO.CourseID and S.CLASS >2008
                                           GROUP BY S.SuID)
Query Translation Challenges
• Equivalent query expressions that should be translated into the same natural-language expression
  − commutativity, associativity, and other algebraic properties
• Equivalent natural-language expressions among which one should be chosen
• Choice between declarative and procedural translations
• Natural translations that don’t follow the query form but are based on mathematical semantics
SELECT A.id, A.name
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
GROUP BY A.id, A.name
HAVING count(distinct M.year) = 1
Sometimes… “count = 1” means “all”
“Find actors whose movies are all in the same year” [CIDR’09]
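The "count = 1 means all" observation can be checked on a toy database. The sketch below is only an illustration using Python's built-in sqlite3 module: the sample rows and the second query (a literal rendering of "all movies are in the same year") are assumptions, and the CAST table is renamed to casting because CAST is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies  (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE casting (mid INTEGER, aid INTEGER);  -- stands in for CAST
INSERT INTO movies  VALUES (1, 'M1', 1999), (2, 'M2', 1999), (3, 'M3', 2005);
INSERT INTO actor   VALUES (10, 'Ann'), (20, 'Bob');
INSERT INTO casting VALUES (1, 10), (2, 10), (1, 20), (3, 20);
""")

# Form used on the slide: one distinct year per actor group.
count_form = """
SELECT A.id, A.name
FROM movies M, casting C, actor A
WHERE M.id = C.mid AND C.aid = A.id
GROUP BY A.id, A.name
HAVING count(DISTINCT M.year) = 1
"""

# Form that mirrors the sentence: no two movies of the actor differ in year.
all_form = """
SELECT DISTINCT A.id, A.name
FROM actor A, casting C
WHERE C.aid = A.id
  AND NOT EXISTS (
    SELECT * FROM casting C1, movies M1, casting C2, movies M2
    WHERE C1.aid = A.id AND C1.mid = M1.id
      AND C2.aid = A.id AND C2.mid = M2.id
      AND M1.year <> M2.year)
"""

by_count = sorted(conn.execute(count_form).fetchall())
by_all = sorted(conn.execute(all_form).fetchall())
print(by_count, by_all)
```

On this data both queries return only the actor whose movies share a year, which is why a translator may prefer the "all in the same year" wording over a literal reading of the HAVING clause.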
DB Graph
• Nodes
  − relations, attributes
• Edges
  − membership
  − selection
  − predicate
select s.name, s.GPA, c.title, i.name, co.text
from students s, comments co, studenthistory h, courses c,
     departments d, coursesched cs, instructors i
where s.suid = co.suid and
      s.suid = h.suid and h.courseid = c.courseid and
      c.depid = d.depid and
      c.courseid = cs.courseid and cs.instrid = i.instrid and
      s.class = 2011 and co.rating > 3 and
      cs.term = ‘spring’ and d.name = ‘CS’
Query Graph
• Examples
  − SPJ
  − Group-by, functions, …
  − Nested queries
select year, term, max(grade)
from studenthistory
group by year, term
having avg(grade) > 3
select s.name
from students s
where NOT EXISTS (select * from students s2 where s2.GPA > s.GPA)
Capturing Semantics
• Labels
  − each node v has a conceptual meaning l(v)
    • relation STUDENTS → “students”
    • attribute NAME → “name”
    • function MAX → “the greatest”
    • operators: = → “is”, ≤ → “does not exceed”, like → “looks like”
  − each edge can be annotated by a label
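A label function l(v) can be pictured as a simple lookup table. The sketch below is an assumption about one possible encoding, not the authors' implementation: the dictionary holds the examples from the slide, and the helper describe_selection and its "whose" phrasing are hypothetical.

```python
# Hypothetical label table; the entries are the examples given on the slide.
LABELS = {
    "STUDENTS": "students",
    "NAME": "name",
    "MAX": "the greatest",
    "=": "is",
    "<=": "does not exceed",
    "like": "looks like",
}


def l(node: str) -> str:
    """Return the conceptual meaning of a node, defaulting to its lowercased name."""
    return LABELS.get(node, node.lower())


def describe_selection(relation: str, attribute: str, op: str, value: str) -> str:
    """Render a selection predicate on a relation as a noun phrase (assumed wording)."""
    return f"{l(relation)} whose {l(attribute)} {l(op)} {value}"


phrase = describe_selection("STUDENTS", "NAME", "=", "Andreas")
print(phrase)
```

Keeping the labels outside the translation logic means the wording can be tuned per schema without touching the traversal code.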
Capturing Semantics
• Templates
  − template label: [definition not preserved]
  − language with variables, loops, functions, and macros
    • e.g., [example not preserved]
  − generic templates
    • e.g., for [example not preserved]
  − specific templates
Example Query Graph
select s.name, s.GPA, c.title, i.name, co.text
from students s, comments co, studenthistory h, courses c,
     departments d, coursesched cs, instructors i
where s.suid=co.suid and s.suid=h.suid and h.courseid=c.courseid
  and c.depid=d.depid and c.courseid=cs.courseid and cs.instrid=i.instrid
  and s.class = 2011 and co.rating > 3 and cs.term = ‘spring’ and d.name = ‘CS’
Query Subject (QS)
[Figure: query graph with numeric annotations on its relations; primary / secondary relations]
• QS is the starting point of the translation
  − a “central” primary relation with attributes projected in the query
  − … or the closest primary relation
Traversal Strategies
• BST Algorithm
  − BST composes separate clauses for each query part, in the following order
    • pStr: translate the edges connecting relations to their attributes
    • fStr: connect all query relations to the subject through the query joins
    • wStr: translate the paths connecting relations to value nodes
‘Find ’ + pStr + ‘ for ’ + fStr + ‘.’ + ‘ Return results only for ’ + wStr + ‘.’
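The three-clause composition can be sketched as plain string assembly. This is a minimal illustration under assumptions, not the authors' algorithm: the function names, the input shapes, and the pre-labeled join phrases are hypothetical, and the graph traversal that would produce those inputs is left out.

```python
def join_list(items):
    """Join phrases as 'a, b, and c' (or 'a and b' for two items)."""
    items = list(items)
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def bst_translate(projections, subject, join_phrases, selections):
    # pStr: edges connecting relations to their projected attributes
    p_str = join_list(
        f"the {join_list(attrs)} of {rel}" for rel, attrs in projections.items()
    )
    # fStr: query relations connected to the subject through the joins
    f_str = f"{subject} that " + join_list(join_phrases)
    # wStr: paths from relations to value nodes (selection predicates)
    w_str = join_list(
        f"{rel} whose {attr} {op} {val}" for rel, attr, op, val in selections
    )
    return ("Find " + p_str + " for " + f_str + "."
            + " Return results only for " + w_str + ".")


sentence = bst_translate(
    projections={
        "courses": ["title"],
        "instructors": ["name"],
        "students": ["gpa", "name"],
        "comments": ["description"],
    },
    subject="courses",
    join_phrases=[
        "are taught by instructors",
        "are taken by students that gave comments",
        "are offered by departments",
    ],
    selections=[
        ("courses", "term", "is", "spring"),
        ("students", "class", "is", "2011"),
        ("comments", "rating", "is greater than", "3"),
        ("departments", "name", "is", "CS"),
    ],
)
print(sentence)
```

Because every projection, join, and selection ends up in one of only two sentences, the output grows long quickly as the query gains joins.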
Traversal Strategies: BST
“Find the title of courses, the name of instructors, the gpa and name of students, and the description of comments for courses that are taught by instructors, are taken by students that gave comments, and are offered by departments. Return results only for courses whose term is spring, students whose class is 2011, comments whose rating is greater than 3, and departments whose name is CS.”
[Figure callouts: pStr, fStr, wStr]
Traversal Strategies
• Multi-Reference Point Graph Traversal (MRP) Algorithm
  − MRP avoids the creation of complex and lengthy phrases
    • the translation is semantically split at multiple points, called reference points (RPs)
    • a reference point (RP) is
      − a relation with projections, or
      − a branching point, or
      − a leaf
“Find the names of students and the titles of the courses taken by these students and the names of the instructors that taught courses taken by these students”
“Find the names of students and the titles of the courses taken by these students and the names of the instructors that taught these courses”
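The three RP conditions can be tested mechanically once the query's join graph is traversed from the query subject. The sketch below is an assumption-laden illustration: it uses a breadth-first traversal, treats a branching point as a relation with more than one child in the traversal tree, and the function name and input format are hypothetical.

```python
from collections import deque


def reference_points(joins, projected, subject):
    """Return the relations that qualify as RPs, in traversal order from the subject."""
    adjacency = {}
    for a, b in joins:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first traversal from the query subject, recording tree children.
    children = {subject: []}
    order = []
    queue = deque([subject])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in children:
                children[nxt] = []
                children[node].append(nxt)
                queue.append(nxt)

    rps = []
    for node in order:
        is_branching = len(children[node]) > 1
        is_leaf = not children[node]
        if node in projected or is_branching or is_leaf:
            rps.append(node)
    return rps


# Join graph of the example query used throughout the talk.
joins = [
    ("students", "comments"),
    ("students", "studenthistory"),
    ("studenthistory", "courses"),
    ("courses", "departments"),
    ("courses", "coursesched"),
    ("coursesched", "instructors"),
]
projected = {"students", "courses", "instructors", "comments"}
rps = reference_points(joins, projected, subject="courses")
print(rps)
```

Relations that only connect two others, such as studenthistory and coursesched here, are not reference points, so their joins get absorbed into the phrase of a neighboring RP.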
Traversal Strategies: MRP
[Figure: query graph with numbered traversal steps and five reference points (RP) marked]
• traversing from QS outwards
• translating from RPs inwards
“Find the title of courses for courses that are offered by departments whose name is CS, and also, the gpa and name of students for students whose class is 2011 and that have taken these courses, and also, the description of comments for comments whose rating is greater than 3 and that are given by these students, and also, the name of instructors that teach courses whose term is spring.”
Traversal Strategies
• Template-based (TS & TMT) approach
  − This strategy makes use of templates
    • find the minimum number of composable templates
    • two templates are composable if they share reference points
    • combine templates over the query graph in the right order
  − Example templates:
    • ga: S – H – C – T – I; label: l(S) + “ have been in classes of ” + l(I)
    • gb: C – D, name <val>; label: <val> + l(C)
    • gc: C – T – I, term <val>; label: l(I) + “’s lectures on ” + l(C) + “ in ” + <val>
  − Legend: S: Students, H: StudentHistory, T: CourseSched, C: Courses, D: Department, I: Instructors
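The composability test and the search for a minimum set of templates can be sketched with set operations. Everything below is an assumption made for illustration: templates are encoded as the sets of graph elements they cover, the value nodes are attached to D.name and T.term, and a brute-force search stands in for whatever selection procedure the authors use.

```python
from itertools import combinations

# Hypothetical encoding of the three example templates as covered graph elements.
TEMPLATES = {
    "ga": {"S", "H", "C", "T", "I"},
    "gb": {"C", "D", "D.name"},
    "gc": {"C", "T", "I", "T.term"},
}


def composable(t1, t2, reference_points):
    """Two templates are composable if they share at least one reference point."""
    return bool(TEMPLATES[t1] & TEMPLATES[t2] & reference_points)


def min_cover(elements, reference_points):
    """Smallest group of templates that covers `elements` and hangs together via RPs."""
    names = sorted(TEMPLATES)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            covered = set().union(*(TEMPLATES[n] for n in combo))
            linked = k == 1 or all(
                any(composable(a, b, reference_points) for b in combo if b != a)
                for a in combo
            )
            if elements <= covered and linked:
                return combo
    return None


query_elements = {"S", "H", "C", "T", "I", "D", "D.name", "T.term"}
rps = {"C", "S", "I", "D"}
chosen = min_cover(query_elements, rps)
print(chosen)
```

Exhaustive search is only practical for a handful of templates; parts of the query that no template covers (the comments relation in the running example) would presumably fall back to a graph-traversal translation.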
Traversal Strategies: TMT
[Figure: query graph with three numbered template regions (1, 2, 3)]
l(I) + “’s lectures on ” + l(C) + “ in ” + <val>
l(S) + “ have been in classes of ” + l(I)
<val> + l(C)
“Find the gpa and name of students whose class is 2011 and have been in classes of instructors and find the name of these instructors, whose lectures on courses are in spring and find the title of these CS courses and the description of comments whose rating is greater than 3 given by these students.”
Experiments: Effectiveness
• SQL → NL
  − BST
    • is not good for queries with many joins
    • behaves better for large projection and selection lists
  − TMT
    • works best for bigger graphs with many joins
    • the templates cannot help for large projection/selection lists
  − MRP
    • is a good compromise between BST and TMT
  − USER
    • resemble BST for small queries and MRP for larger ones
    • further processing follows a TMT-like approach for simplifying query parts
Experiments: Effectiveness
• NL → SQL
  − BST
    • better serves the purpose of finding which SQL query is described
  − TMT
    • provides a higher-level description of the query
    • makes it easier to catch the meaning, but it is harder to reproduce the actual SQL query
  − MRP
    • allows reproducing the SQL query fairly easily, but that task is harder in the presence of many projections
Summary
• Translating queries and other commands to natural-language constructs is
  − useful in many applications
  − challenging and difficult in many ways
  − mostly ignored
• Goals include using minimal technology from AI and NLP
• Three methods for translating queries
  − TMT generates better translations, but we need to invest some effort in designing templates
  − BST and MRP are fair methods for explaining a query or helping users write the query themselves
Thank You!
Discussion
• Not just SPJ
  − Works also with group-by, order-by, functions, …
  − Conjunctive and disjunctive predicates
• Corporate queries
  − Large chains of joins, numerous projected attributes, …
  − Usefulness of the translation
    • Result is not pretty (but neither are the queries themselves)
    • Use templates for “query summaries”
      − Limit the projected attributes in the result
      − Use them as starting points: start with heading attributes and expand as needed
• “Impossible Queries” [CIDR’09]
  − Only with templates
Relatively Easy Queries
• Simple, single path on the schema
• More or less using data translation mechanisms
SELECT title
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
  and A.name = “Brat Pitt”
“Find the titles of movies where Brat Pitt plays”
Somewhat Easy Queries
• Simple subgraph of the schema
SELECT A.name, title
FROM MOVIES M, CAST C, ACTOR A, DIRECTED R, DIRECTOR D, GENRE G
WHERE M.id = C.mid and C.aid = A.id
  and M.id = R.mid and R.did = D.id
  and M.id = G.mid
  and D.name = “G. Loucas” and G.genre = “action”
“Find the actors and titles of action movies directed by G. Loucas”
Difficult Queries
• Multiple tuple variables from the same table
SELECT A1.name, A2.name
FROM MOVIES M, CAST C1, ACTOR A1, CAST C2, ACTOR A2
WHERE M.id = C1.mid and C1.aid = A1.id
  and M.id = C2.mid and C2.aid = A2.id
  and A1.id > A2.id
“Find pairs of actor names who have played in the same movie”
Difficult Queries
• Cyclic queries not based on key-foreign key joins
SELECT title
FROM MOVIES, CAST
WHERE id = mid and role = title
“Find movies whose title is one of their roles”
Difficult Queries
• Nested queries
SELECT title
FROM MOVIES
WHERE id in (SELECT mid
             FROM CAST
             WHERE aid in (SELECT id
                           FROM ACTOR
                           WHERE name = “Brat Pitt”))
“Find the titles of movies where Brat Pitt plays”
Equivalence of Easy & Difficult Queries
• Un-nested equivalent
SELECT title
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
  and A.name = “Brat Pitt”
“Find the titles of movies where Brat Pitt plays”
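The nested and un-nested forms can be compared on a toy database. The sketch below uses Python's built-in sqlite3 module; the sample rows are invented, and the CAST table is renamed to casting because CAST is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies  (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE casting (mid INTEGER, aid INTEGER);  -- stands in for CAST
INSERT INTO movies  VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO actor   VALUES (10, 'Brat Pitt'), (20, 'Someone Else');
INSERT INTO casting VALUES (1, 10), (2, 20), (3, 10), (3, 20);
""")

nested = """
SELECT title
FROM movies
WHERE id IN (SELECT mid
             FROM casting
             WHERE aid IN (SELECT id FROM actor WHERE name = 'Brat Pitt'))
"""

flat = """
SELECT title
FROM movies M, casting C, actor A
WHERE M.id = C.mid AND C.aid = A.id AND A.name = 'Brat Pitt'
"""

nested_rows = sorted(conn.execute(nested).fetchall())
flat_rows = sorted(conn.execute(flat).fetchall())
print(nested_rows, flat_rows)
```

Both queries return the same titles here, so a translator would ideally emit the same sentence for both. The join form can return duplicate titles if an actor is listed twice for one movie or two actors share the name, which the IN form never does; adding DISTINCT to the join form removes that difference.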
Difficult Queries
• Nested queries without equivalent un-nested forms
SELECT title
FROM MOVIES M
WHERE not exists (SELECT *
                  FROM GENRE G1
                  WHERE not exists (SELECT *
                                    FROM GENRE G2
                                    WHERE G2.mid = M.id and G2.genre = G1.genre))
“Find movies that have all genres”
Difficult Queries
• Aggregate queries
SELECT id, title, count(*)
FROM MOVIES, CAST
WHERE id=mid
GROUP BY id, title
HAVING 1 < (SELECT count(*)
            FROM GENRE
            WHERE mid=id)
“Find the number of actors in movies of more than one genre”
“Impossible” Queries
• “Count = 1” means “all” … sometimes
SELECT A.id, A.name
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
GROUP BY A.id, A.name
HAVING count(distinct M.year) = 1
“Find actors whose movies are all in the same year”
“Impossible” Queries
• “≤ all” means “earliest” … sometimes
SELECT A.name
FROM MOVIES M, CAST C, ACTOR A
WHERE M.id = C.mid and C.aid = A.id
  and year <= all (SELECT M1.year
                   FROM MOVIES M1, MOVIES M2
                   WHERE M1.id != M2.id and M1.title = M.title and M2.title = M.title)
“Find the actors who have played in the earliest versions of movies that have had sequels”