Formal Semantics of SQL (and Cypher) Paolo Guagliardo

Formal Semantics of SQL - Amazon S3 · 2019-10-18 · Syntax The syntax of core SQL is given in Figure 1. Below we describe abbreviations and naming conven-tions we use. Each letter


Formal Semantics of SQL (and Cypher)

Paolo Guagliardo


SQL

• Standard query language for relational databases

• $30B/year business

• Implemented in all major RDBMSs (free and commercial)

• First standardized in 1986 (ANSI) and 1987 (ISO)

• Several revisions afterwards (SQL-89, SQL-92, SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011, SQL:2016)

“The nice thing about standards is that you have so many to choose from” — Andrew S. Tanenbaum


How standard is SQL?

SELECT * FROM R WHERE EXISTS ( SELECT * FROM ( SELECT R.A, R.A FROM R ) S )

Both PostgreSQL and Oracle output R

SELECT * FROM ( SELECT R.A, R.A FROM R ) S

PostgreSQL outputs a table with two columns named “A”

Oracle throws an ERROR: reference to column “A” is ambiguous


Who is right?

Let’s have a look at the standard!

A. If the <select list> * is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.

B. Otherwise, the <select list> * is equivalent to a <value expression> sequence in which each <value expression> is a column reference that references a column of T and each column of T is referenced exactly once. The columns are referenced in the ascending sequence of their ordinal position within T.


… which means

SELECT * FROM R WHERE EXISTS ( SELECT * FROM ( SELECT R.A, R.A FROM R ) S )

is equivalent (by rule A for the inner *, rule B for the outer *) to

SELECT R.A FROM R WHERE EXISTS ( SELECT 1 FROM ( SELECT R.A, R.A FROM R ) S )

and

SELECT * FROM ( SELECT R.A, R.A FROM R ) S

is equivalent (by rule B) to

SELECT S.A, S.A FROM ( SELECT R.A, R.A FROM R ) S


The Need for a Formal Semantics

• Avoid ambiguity of natural language

• Clearly defined and not subject to interpretation

• Easy to understand and implement

Previous attempts

• Many simplifying assumptions: no bags, no nulls

• No justification of correctness


R (column A): 1, NULL

S (column A): NULL

SELECT R.A FROM R WHERE R.A NOT IN ( SELECT S.A FROM S )

Answer (column A): empty

SELECT R.A FROM R EXCEPT SELECT S.A FROM S

Answer (column A): 1

SELECT R.A FROM R WHERE NOT EXISTS ( SELECT S.A FROM S WHERE S.A=R.A )

Answer (column A): 1, NULL
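The three queries on this slide can be tried directly. The following is a minimal sketch using Python's built-in sqlite3 module; SQLite is only a convenient stand-in here (the slides do not say which system produced the answers), and the dictionary keys are hypothetical labels.

```python
import sqlite3

# In-memory database holding the two one-column tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (A INTEGER);
    CREATE TABLE S (A INTEGER);
    INSERT INTO R VALUES (1), (NULL);
    INSERT INTO S VALUES (NULL);
""")

# The three queries that look interchangeable but are not once NULLs appear.
queries = {
    "not_in": "SELECT R.A FROM R WHERE R.A NOT IN (SELECT S.A FROM S)",
    "except": "SELECT R.A FROM R EXCEPT SELECT S.A FROM S",
    "not_exists": "SELECT R.A FROM R WHERE NOT EXISTS "
                  "(SELECT S.A FROM S WHERE S.A = R.A)",
}

# Run each query and collect its rows (Python None stands for SQL NULL).
answers = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}

for name, rows in answers.items():
    print(name, rows)
```

On SQLite the NOT IN query returns no rows, EXCEPT returns the single row 1, and NOT EXISTS returns both 1 and NULL.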


Core SQL fragment

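\[
\begin{aligned}
\tau &:= (T_1, \dots, T_k), & \beta &:= (N_1, \dots, N_k), & k &> 0 \\
\alpha &:= (A_1, \dots, A_m), & \beta' &:= (N'_1, \dots, N'_m), & m &> 0
\end{aligned}
\]

Queries:

\[
\begin{aligned}
Q := {} & \texttt{SELECT [DISTINCT] } (\alpha : \beta' \mid \texttt{*}) \texttt{ FROM } \tau : \beta \texttt{ WHERE } \theta \\
\mid {} & Q \ (\texttt{UNION} \mid \texttt{INTERSECT} \mid \texttt{EXCEPT}) \ [\texttt{ALL}] \ Q
\end{aligned}
\]

Conditions:

\[
\begin{aligned}
\theta := {} & \texttt{TRUE} \mid \bar{t} \ (= \mid \neq) \ \bar{t} \mid t \texttt{ IS [NOT] NULL} \\
\mid {} & \bar{t} \texttt{ [NOT] IN } Q \mid \texttt{EXISTS } Q \\
\mid {} & \theta \texttt{ AND } \theta \mid \theta \texttt{ OR } \theta \mid \texttt{NOT } \theta
\end{aligned}
\]

Figure 1: Syntax of core SQL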

tribute), or a pair of names (e.g., R1.A), and values are either constants in $\mathcal{C}$ or NULL. The field name $f_i$ and the field value $v_i$ of the $i$-th element (field) of $r$ are denoted by $\ell_i(r)$ and $\nu_i(r)$. The tuple of field names in a record $r$ is denoted by $\ell(r)$, and the tuple of field values by $\nu(r)$.

Two records $r$ and $s$ are uniform if they have precisely the same number of fields and $\ell(r) = \ell(s)$. A table $T$ is a bag of uniform records; we write $\ell(T)$ for the tuple of field names across all of its records.

Syntax. The syntax of core SQL is given in Figure 1. Below we describe abbreviations and naming conventions we use. Each letter may appear with subscripts.

• $N$ ranges over names in $\mathcal{N}$.

• $A$ ranges over elements of $\mathcal{N}^2$ (think of them as full names of attributes, e.g., S.B; in SQL style, we separate these names with a dot).

• $\alpha$ ranges over tuples of elements of $\mathcal{N}^2$ (full names).

• $\beta$ ranges over tuples of elements of $\mathcal{N}$ (names).

• $R$ ranges over relation names.

• $c$ ranges over elements of $\mathcal{C}$.

• $T$ ranges over relation names or queries.

• $\tau$ ranges over tuples $(T_1, \dots, T_k)$.

When $\alpha = (A_1, \dots, A_k)$, $\beta = (N_1, \dots, N_k)$, and $\tau = (T_1, \dots, T_k)$, we use abbreviations

$\alpha : \beta$ for $A_1 \texttt{ AS } N_1, \dots, A_k \texttt{ AS } N_k$

$\tau : \beta$ for $T_1 \texttt{ AS } N_1, \dots, T_k \texttt{ AS } N_k$.

As explained earlier, these give explicit names to attributes in SELECT and relations/queries in FROM.

A term $t$ is either a constant $c \in \mathcal{C}$, a full name $A \in \mathcal{N}^2$, or NULL. We let $\bar{t}$ stand for tuples of terms.

For queries SELECT $\alpha : \beta'$ FROM $\tau : \beta$, for each $A = N_1.N_2 \in \alpha$, the name $N_1$ must occur in $\beta$. That is, attributes come from relations defined in FROM. For instance, in the fully annotated query shown earlier, we use full names R.A and U.B, and of course R and U are names given to the relation and the subquery in FROM.

Features of SQL that we model include nulls, IN and EXISTS subqueries, operations of union, intersection, and difference (both set and bag versions), arbitrary Boolean combinations of conditions, duplicate elimination, and subqueries in FROM. For now, we use equality as the only comparison, but we shall later explain how arbitrary comparisons can be added for free. The WHERE clause is compulsory (for queries that do not have it, we can simply attach WHERE TRUE).

2.2 Semantics

The semantics of a query is given with respect to a database $D$ and an environment $\eta$. An environment is a partial map from elements of $\mathcal{N}^2$ to values. Intuitively, $\eta$ provides the binding for each pair of relation/query name and attribute name (e.g., S.B) on which it is defined. One starts with the empty environment, but as the semantics of subqueries needs to be given, the environment gets populated with bindings. We shall use the notation $\llbracket Q \rrbracket_{D,\eta}$ for the semantics of $Q$ with respect to $D$ and $\eta$; the semantics of an SQL query $Q$ is then $\llbracket Q \rrbracket_{D,\varnothing}$.

To explain the semantics in Figure 2, we need a few definitions related to names and bindings, and operations on relations.

Names and bindings. Given a table $T$ and a tuple $\beta$ of names of the same length as the arity of $T$, by $T : \beta$ we mean $T$ in which field names are renamed to be $\beta$. The operation $N.r$ prefixes the field names of a record by $N$; that is, the label of the $i$-th field of $N.r$ is $N.\ell_i(r)$. This operation extends to tables. We also define a special renaming $\beta^{-1}$: for a table where all field names are pairs of the form $N_1.N_2$, applying $\beta^{-1}$ removes $N_1$ and only keeps $N_2$ as field name.

Given two environments $\eta$ and $\eta'$, by $\eta; \eta'$ we mean $\eta$ overridden by $\eta'$. That is, $(\eta; \eta')(A) = \eta(A)$ if $\eta$ is defined on $A$ and $\eta'$ is not; otherwise $(\eta; \eta')(A) = \eta'(A)$.

Given a record $r$, we define an environment $\eta^r$ that maps each non-repeated field name of $r$ to its value; if a field name occurs more than once in $r$, then $\eta^r$ is not defined on it.
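The two environment operations just defined can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: environments are modelled as plain dicts keyed by full names such as "S.B", and the function names override and env_of_record are hypothetical.

```python
from collections import Counter


def override(eta, eta_prime):
    """Return eta; eta': eta overridden by eta' wherever eta' is defined."""
    merged = dict(eta)
    merged.update(eta_prime)  # bindings of eta' win on shared names
    return merged


def env_of_record(names, values):
    """Return the environment of a record given as parallel name/value tuples.

    Only field names that occur exactly once are bound; a repeated
    field name is left undefined, as in the definition above.
    """
    occurrences = Counter(names)
    return {n: v for n, v in zip(names, values) if occurrences[n] == 1}


outer = {"R.A": 1, "R.B": 2}
inner = {"R.A": 5}
combined = override(outer, inner)

ambiguous = env_of_record(("S.A", "S.A", "S.B"), (1, 2, 3))
```

Here combined binds R.A to 5 and R.B to 2, while ambiguous binds only S.B, because S.A occurs twice in the record.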

Operations on tables. The interpretation of relation name $R$ in database $D$ is denoted by $R^D$. We use the standard operations of RA under their bag interpretation [3, 11, 14]. The projection operation $\pi_\alpha$ can only be applied to tables $T$ in which elements of $\alpha$ do not repeat (otherwise the meaning of such a projection is ambiguous). A tuple $t$ occurs $m$ times in $\pi_\alpha(T)$ iff for tuples $t_1, \dots, t_k$ in $T$ satisfying $\pi_\alpha(t_i) = t$, we have $m = m_1 + \cdots + m_k$, where $m_i$ is the number of occurrences of $t_i$ in $T$. We write $\varepsilon$ for the duplicate elimination operation that keeps one occurrence of each tuple in the bag. For union, intersection, and difference we use their standard bag interpretations: if $t$ occurs $n$ times in $T_1$ and $m$ times in $T_2$, then it occurs $n + m$ times in $T_1 \cup T_2$, $\min(n, m)$ times in $T_1 \cap T_2$, and $\max(n - m, 0)$ times in $T_1 - T_2$.

When we write $\{t \in T \mid \theta(t)\}$ for a condition $\theta$, we mean the bag in which tuple $t$ has as many occurrences as in $T$ if $\theta(t)$ is true, and zero occurrences otherwise.
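The bag operations described above map closely onto Python's collections.Counter, which stores a multiplicity per element. The following is a small sketch under that assumption; the helper names dedup, project, and select are hypothetical, tuples stand for records, and None stands for NULL.

```python
from collections import Counter

# Two bags of one-field records: value -> number of occurrences.
T1 = Counter({(1,): 2, (None,): 1})
T2 = Counter({(1,): 1, (2,): 3})

bag_union = T1 + T2          # n + m occurrences
bag_intersection = T1 & T2   # min(n, m) occurrences
bag_difference = T1 - T2     # max(n - m, 0) occurrences


def dedup(table):
    """Duplicate elimination: keep one occurrence of each tuple."""
    return Counter(dict.fromkeys(table, 1))


def project(table, positions):
    """Bag projection: multiplicities of tuples that collapse are added up."""
    out = Counter()
    for record, count in table.items():
        out[tuple(record[i] for i in positions)] += count
    return out


def select(table, condition):
    """Keep every occurrence of the tuples satisfying the condition."""
    return Counter({t: n for t, n in table.items() if condition(t)})


pairs = Counter({(1, "x"): 2, (1, "y"): 1})
first_column = project(pairs, (0,))
```

In this example first_column contains the tuple (1,) three times, since the two distinct records of pairs project onto the same value.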

Explanation of the semantics. We now explain the key elements of the semantics. Of course the semantics of a relation is just the interpretation of that relation

Essentially SQL without arithmetic, grouping and aggregation


Formal Semantics: Challenges

Data model

• Base relations / query outputs / intermediate results

• Primitive data manipulation operations

Attribute references

• Binding rules in subqueries

• Environment collects and propagates bindings


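Proposed Semantics

\[
\begin{aligned}
\llbracket R \rrbracket_{D,\eta} &= R^D \\
\llbracket \tau : \beta \rrbracket_{D,\eta} &= \llbracket (T_1, \dots, T_k) : (N_1, \dots, N_k) \rrbracket_{D,\eta} = N_1.\llbracket T_1 \rrbracket_{D,\eta} \times \cdots \times N_k.\llbracket T_k \rrbracket_{D,\eta} \\
\llbracket \texttt{FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} &= \bigl\{ a \in \llbracket \tau : \beta \rrbracket_{D,\eta} \bigm| \llbracket \theta \rrbracket_{D,\eta;\eta^a} = \mathbf{t} \bigr\} \\
\llbracket \texttt{SELECT * FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} &= \llbracket \texttt{FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} : \beta^{-1} \\
\llbracket \texttt{SELECT } \alpha : \beta' \texttt{ FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} &= \pi_\alpha\bigl( \llbracket \texttt{FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} \bigr) : \beta' \\
\llbracket \texttt{SELECT DISTINCT } \alpha : \beta' \mid \texttt{*} \texttt{ FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} &= \varepsilon\bigl( \llbracket \texttt{SELECT } \alpha : \beta' \mid \texttt{*} \texttt{ FROM } \tau : \beta \texttt{ WHERE } \theta \rrbracket_{D,\eta} \bigr)
\end{aligned}
\]

\[
\begin{aligned}
\llbracket \texttt{TRUE} \rrbracket_{D,\eta} &= \mathbf{t} \\
\llbracket t \rrbracket_{D,\eta} &= \begin{cases} \eta(A) & \text{if } t = A \\ t & \text{if } t \in \mathcal{C} \text{ or } t = \texttt{NULL} \end{cases} \\
\llbracket t_1 = t_2 \rrbracket_{D,\eta} &= \begin{cases}
\mathbf{t} & \text{if } \llbracket t_1 \rrbracket_{D,\eta} = \llbracket t_2 \rrbracket_{D,\eta} \text{ and } \llbracket t_1 \rrbracket_{D,\eta} \neq \texttt{NULL} \text{ and } \llbracket t_2 \rrbracket_{D,\eta} \neq \texttt{NULL} \\
\mathbf{f} & \text{if } \llbracket t_1 \rrbracket_{D,\eta} \neq \llbracket t_2 \rrbracket_{D,\eta} \text{ and } \llbracket t_1 \rrbracket_{D,\eta} \neq \texttt{NULL} \text{ and } \llbracket t_2 \rrbracket_{D,\eta} \neq \texttt{NULL} \\
\mathbf{u} & \text{if } \llbracket t_1 \rrbracket_{D,\eta} = \texttt{NULL} \text{ or } \llbracket t_2 \rrbracket_{D,\eta} = \texttt{NULL}
\end{cases} \\
\llbracket t \texttt{ IS NULL} \rrbracket_{D,\eta} &= \begin{cases} \mathbf{t} & \text{if } \llbracket t \rrbracket_{D,\eta} = \texttt{NULL} \\ \mathbf{f} & \text{if } \llbracket t \rrbracket_{D,\eta} \neq \texttt{NULL} \end{cases} \\
\llbracket t \texttt{ IS NOT NULL} \rrbracket_{D,\eta} &= \neg \llbracket t \texttt{ IS NULL} \rrbracket_{D,\eta} \\
\llbracket (t_1, \dots, t_n) = (t'_1, \dots, t'_n) \rrbracket_{D,\eta} &= \bigwedge_{i=1}^{n} \llbracket t_i = t'_i \rrbracket_{D,\eta} \\
\llbracket (t_1, \dots, t_n) \neq (t'_1, \dots, t'_n) \rrbracket_{D,\eta} &= \bigvee_{i=1}^{n} \llbracket t_i \neq t'_i \rrbracket_{D,\eta} \\
\llbracket \bar{t} \texttt{ IN } Q \rrbracket_{D,\eta} &= \begin{cases}
\mathbf{t} & \text{if } \exists r \in \llbracket Q \rrbracket_{D,\eta} : \llbracket \bar{t} = \nu(r) \rrbracket_{D,\eta} = \mathbf{t} \\
\mathbf{f} & \text{if } \forall r \in \llbracket Q \rrbracket_{D,\eta} : \llbracket \bar{t} = \nu(r) \rrbracket_{D,\eta} = \mathbf{f} \\
\mathbf{u} & \text{if } \nexists r \in \llbracket Q \rrbracket_{D,\eta} : \llbracket \bar{t} = \nu(r) \rrbracket_{D,\eta} = \mathbf{t} \text{ and } \exists r \in \llbracket Q \rrbracket_{D,\eta} : \llbracket \bar{t} = \nu(r) \rrbracket_{D,\eta} \neq \mathbf{f}
\end{cases} \\
\llbracket \bar{t} \texttt{ NOT IN } Q \rrbracket_{D,\eta} &= \neg \llbracket \bar{t} \texttt{ IN } Q \rrbracket_{D,\eta} \\
\llbracket \texttt{EXISTS } Q \rrbracket_{D,\eta} &= \begin{cases} \mathbf{t} & \text{if } \llbracket Q \rrbracket_{D,\eta} \neq \varnothing \\ \mathbf{f} & \text{if } \llbracket Q \rrbracket_{D,\eta} = \varnothing \end{cases} \\
\llbracket \theta_1 \texttt{ AND } \theta_2 \rrbracket_{D,\eta} &= \llbracket \theta_1 \rrbracket_{D,\eta} \wedge \llbracket \theta_2 \rrbracket_{D,\eta} \\
\llbracket \theta_1 \texttt{ OR } \theta_2 \rrbracket_{D,\eta} &= \llbracket \theta_1 \rrbracket_{D,\eta} \vee \llbracket \theta_2 \rrbracket_{D,\eta} \\
\llbracket \texttt{NOT } \theta \rrbracket_{D,\eta} &= \neg \llbracket \theta \rrbracket_{D,\eta}
\end{aligned}
\]

\[
\begin{aligned}
\llbracket Q_1 \texttt{ UNION ALL } Q_2 \rrbracket_{D,\eta} &= \llbracket Q_1 \rrbracket_{D,\eta} \cup \llbracket Q_2 \rrbracket_{D,\eta} : \ell(\llbracket Q_1 \rrbracket) \\
\llbracket Q_1 \texttt{ INTERSECT ALL } Q_2 \rrbracket_{D,\eta} &= \llbracket Q_1 \rrbracket_{D,\eta} \cap \llbracket Q_2 \rrbracket_{D,\eta} : \ell(\llbracket Q_1 \rrbracket) \\
\llbracket Q_1 \texttt{ EXCEPT ALL } Q_2 \rrbracket_{D,\eta} &= \llbracket Q_1 \rrbracket_{D,\eta} - \llbracket Q_2 \rrbracket_{D,\eta} : \ell(\llbracket Q_1 \rrbracket) \\
\llbracket Q_1 \star Q_2 \rrbracket_{D,\eta} &= \varepsilon\bigl( \llbracket Q_1 \star \texttt{ALL } Q_2 \rrbracket_{D,\eta} \bigr), \quad \star \in \{\texttt{UNION}, \texttt{INTERSECT}\} \\
\llbracket Q_1 \texttt{ EXCEPT } Q_2 \rrbracket_{D,\eta} &= \varepsilon\bigl( \llbracket Q_1 \rrbracket_{D,\eta} \bigr) - \llbracket Q_2 \rrbracket_{D,\eta} : \ell(\llbracket Q_1 \rrbracket)
\end{aligned}
\]

Figure 2: Semantics of core SQL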
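To make the condition rules of Figure 2 concrete, here is a minimal Python sketch of the three-valued equality, tuple equality, and IN rules. It is an illustration under stated assumptions, not the implementation mentioned in the talk: Python None stands for NULL, query results are plain lists of value tuples, and the names TV, eq, tuple_eq, in_query, conj, and neg are all hypothetical.

```python
from enum import Enum
from functools import reduce

NULL = None  # assumption of this sketch: None plays the role of SQL NULL


class TV(Enum):
    """The three truth values t, f, u of the semantics."""
    T = "t"
    F = "f"
    U = "u"


def neg(x):
    """Three-valued negation: unknown stays unknown."""
    return {TV.T: TV.F, TV.F: TV.T, TV.U: TV.U}[x]


def conj(x, y):
    """Three-valued conjunction."""
    if TV.F in (x, y):
        return TV.F
    if x is TV.T and y is TV.T:
        return TV.T
    return TV.U


def eq(v1, v2):
    """Equality of two values: unknown as soon as one side is NULL."""
    if v1 is NULL or v2 is NULL:
        return TV.U
    return TV.T if v1 == v2 else TV.F


def tuple_eq(ts, vs):
    """Tuple equality as the conjunction of componentwise equalities."""
    return reduce(conj, (eq(a, b) for a, b in zip(ts, vs)), TV.T)


def in_query(ts, rows):
    """The IN rule: t if some row matches, f if all comparisons are f, else u."""
    results = [tuple_eq(ts, row) for row in rows]
    if TV.T in results:
        return TV.T
    if all(x is TV.F for x in results):
        return TV.F  # also covers the empty query result
    return TV.U


# The NOT IN query from the earlier slide, with R = {1, NULL} and S = {NULL}.
R = [(1,), (NULL,)]
S = [(NULL,)]
not_in_answer = [r for r in R if neg(in_query(r, S)) is TV.T]
```

With these inputs every comparison against S is unknown, so no record of R passes the WHERE filter and not_in_answer is empty, matching the answer shown for the NOT IN query.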

• Fits in one page

• Non-ambiguous

• Easy to understand

• Easy to implement

• Easy to modify


Formal Semantics: Validation

• Cannot prove that semantics is correct

• Provide sufficient experimental evidence

• Implemented in Python

• Validated on 100000+ random SQL queries


Formal Semantics of Cypher

• Collaboration between Neo Technology and the University of Edinburgh

• Preliminary meeting in December

• Legal agreements finalized recently

• Neo Technology sponsors a researcher (Nadime Francis)


Challenges

• Getting the (abstract) data model right

• Intermediate representation (QUIL?)

• Identify core fragment

• Language constantly evolving

• Follow the footsteps of SQL? (nulls)