Universidade Federal do Rio Grande do Norte
Centro de Ciências Exatas e da Terra
Departamento de Informática e Matemática Aplicada
Programa de Pós-Graduação em Sistemas e Computação
Master’s thesis

A mechanism to evaluate context-free queries inspired in LR(1) parsers over graph databases
Fred de Castro Santos
Natal-RN
March 22, 2018
Fred de Castro Santos
A mechanism to evaluate context-free queries inspired in LR(1) parsers over graph databases
Master’s thesis presented at the Graduate Program in Systems and Computation (PPgSC) of the Federal University of Rio Grande do Norte (UFRN), under the supervision of Professor Umberto S. da Costa and the co-supervision of Professor Martin A. Musicante, as a requirement for obtaining a Master’s Degree in Systems and Computing.
Natal-RN
March 22, 2018
Santos, Fred de Castro. A mechanism to evaluate context-free queries inspired in LR(1) parsers over graph databases / Fred de Castro Santos. - 2018. 84 f.: il.
Dissertação (mestrado) - Universidade Federal do Rio Grande do Norte, Centro de Ciências Exatas e da Terra, Programa de Pós-Graduação em Sistemas e Computação. Natal, RN, 2018. Orientador: Umberto Souza da Costa. Coorientador: Martin Alejandro Musicante.
1. Computação - Dissertação. 2. Bancos de dados em grafo - Dissertação. 3. Expressividade de linguagens de consulta - Dissertação. 4. RDF - Dissertação. 5. Linguagens LR(1) - Dissertação. I. Costa, Umberto Souza da. II. Musicante, Martin Alejandro. III. Título.
RN/UF/CCET CDU 004
Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET
Elaborado por JOSENEIDE FERREIRA DANTAS - CRB-15/324
Abstract
The World Wide Web is an ever-increasing collection of information. This information is spread among different documents, which are made available by using HTTP. Even though this information is accessible to users in the form of news articles, audio broadcasts, images and videos, software agents often cannot classify it. The lack of semantic information about these documents in a machine-readable format usually makes the analysis inaccurate. A significant number of entities have adopted Linked Data as a way to add semantic information to their data, rather than just publishing it on the Web. The result is a global data collection, called the Web of Data, which forms a global graph, consisting of RDF [22] statements from numerous sources, covering all sorts of topics. To find specific information in this graph, queries are performed starting at a subject and analyzing its predicates in the RDF statements. These predicates are the connections between the subject and the object, and a set of traces forms an information path.
The use of HTTP as a standardized data access mechanism and RDF as a standard data model simplifies data access, but accessing heterogeneous data in distinct locations may have an increased time complexity, and current query languages have reduced query expressiveness, which motivates us to research alternatives in how this data is queried. This reduced expressiveness exists because most query languages belong to the class of Regular Languages. The main goal of this work is to use LR(1) context-free grammar processing techniques to search for context-free paths over RDF graph databases, providing, as a result, a tool which allows better expressiveness, efficiency and scalability in such queries than what is proposed today. To achieve that, we implemented an algorithm based on the LR(1) parsing technique that uses the GSS [30] structure instead of a stack, and we give means for the user to input queries with an LR(1) context-free grammar. We also analyze our algorithm’s complexity and perform experiments, comparing our solution to other proposals present in the literature, showing that ours can have better performance in given scenarios.
Keywords: Graph databases; Query language expressiveness; RDF; LR(1) languages.
Resumo
A World Wide Web é uma coleção de informações sempre crescente. Esta informação é distribuída entre documentos diferentes, disponibilizados através do HTTP. Mesmo que essa informação seja acessível aos usuários na forma de artigos de notícias, transmissões de áudio, imagens e vídeos, os agentes de software geralmente não podem classificá-la. A falta de informações semânticas sobre esses documentos em um formato legível por máquina geralmente faz com que a análise seja imprecisa. Um número significativo de entidades adotaram Linked Data como uma forma de adicionar informações semânticas aos seus dados, e não apenas publicá-los na Web. O resultado é uma coleção global de dados, chamada Web of Data, que forma um grafo global, composto por declarações no formato RDF [22] de diversas fontes, cobrindo todos os tipos de tópicos. Para encontrar informações específicas nesses grafos, as consultas são realizadas começando em um sujeito e analisando seus predicados nas instruções RDF. Esses predicados são as conexões entre o sujeito e o objeto, e um conjunto de trilhas forma um caminho de informação.
O uso de HTTP como mecanismo padrão de acesso a dados e RDF como modelo de dados padrão simplifica o acesso a dados, o que nos motiva a pesquisar alternativas na forma como esses dados são buscados. Uma vez que a maioria das linguagens de consulta de banco de dados em grafo estão na classe de Linguagens Regulares, nós propomos seguir um caminho diferente e tentar usar uma classe de gramática menos restritiva, chamada Gramática Livre de Contexto Determinística, para aumentar a expressividade das consultas no banco de dados em grafo; mais especificamente, aplicando o método de análise LR(1) para encontrar caminhos em um banco de dados em grafo RDF. O principal objetivo deste trabalho é prover meios para se permitir a utilização de técnicas de reconhecimento de gramáticas livres de contexto LR(1) para fazer consultas por caminhos formados pelas etiquetas das arestas em um banco de dados RDF, fornecendo, como resultado, uma ferramenta que permita atingir melhor expressividade, eficiência e escalabilidade nestas consultas do que o que existe atualmente.
Para atingir este objetivo, nós implementamos um algoritmo baseado nas técnicas de reconhecimento LR(1), usando o GSS [30] ao invés de uma pilha, e permitimos ao usuário fazer consultas com uma gramática livre de contexto LR(1). Também analisamos a complexidade do nosso algoritmo e executamos alguns experimentos, comparando nossa solução com as outras propostas na literatura, mostrando que a nossa pode ter melhor desempenho em alguns cenários.
Palavras-Chave: Bancos de Dados em Grafo; Expressividade de linguagens de consulta;RDF; Linguagens LR(1).
Contents
1 Introduction
  1.1 Motivation
  1.2 Problems
  1.3 Goals

2 Theoretical Foundation
  2.1 Databases
    2.1.1 RDF and RDFS
    2.1.2 Graph queries
    2.1.3 SPARQL
  2.2 Language Specifications
    2.2.1 Regular Languages
    2.2.2 Context-Free Grammars
    2.2.3 LR Parsing

3 Related Work
  3.1 nSPARQL: A navigational language for RDF
  3.2 Conjunctive Context-Free Path Queries
  3.3 Context-Free Path Queries on RDF Graphs
  3.4 Context-Free Path Querying with Structural Representation of Result
  3.5 Top-Down Evaluation of Context-Free Path Queries in Graphs
  3.6 Tomita-Style Generalized LR Parsers

4 The GrLR Query Processing Algorithm Approach
  4.1 The algorithm
    4.1.1 Algorithm execution example
  4.2 Complexity
    4.2.1 Runtime complexity
    4.2.2 Space complexity
  4.3 Discussion about correctness

5 Experiments
  5.1 Ontologies stored as RDF graphs
  5.2 Binary trees
  5.3 String graphs

6 Conclusions
List of Figures
2.1 Visual representation of an RDF graph.
2.2 RDF database.
2.3 Path found via path query.
2.4 SPARQL query to find the books written by a:Author1.
2.5 Graph with opening and closing parenthesis as labels for the edges.
2.6 Visual representation of a Context-Free Grammar that defines strings that start with n opening parenthesis and end with n closing parenthesis, with n > 0.
2.7 Visual representation of the derivation trees.
2.8 Hierarchy of Context-Free Grammar classes [3].
2.9 Context-Free Grammar as described in Figure 2.6, extended with the start symbol S′ and end symbol $.
2.10 Visual representation of the LR automaton generated by the extended grammar in Figure 2.9.
3.1 RDF graph containing information about available transport services between cities.
3.2 Forward and backward axes for an RDF triple (a, p, b) [19].
3.3 Path connecting a1 to a6 [19].
3.4 Adding extra edges to the graph.
3.5 Context-Free Grammar extended with nSPARQL regular expressions.
3.6 Comparison between the languages.
3.7 Representation of a shift transition in a GSS.
3.8 Representation of a reduce transition in a GSS.
3.9 Representation of a reduce transition for an ε-transition in a GSS.
3.10 GSS generated for parsing the input string ( ( ) ) in the grammar in Figure 5.
4.1 Paths identified in a graph.
4.2 Execution of a GSS_Up function call.
4.3 Graph with a loop.
4.4 Input data for the algorithm example.
4.5 Initialization of the GSS for the graph in Figure 4.4a.
4.6 Resulting GSS after processing level U0.
4.7 Resulting GSS after processing level U1.
4.8 Resulting GSS after processing level U2.
4.9 Resulting GSS after processing level U4.
4.10 Complete graph with three vertices.
5.1 Grammars for Queries Q1 (a) and Q2 (b).
5.2 Visualization of the results for the Query Q1 on RDF databases.
5.3 Visualization of the results for the Query Q2 on RDF databases.
5.4 Top-down (a) and Bottom-up (b) tree patterns used in the experiment.
5.5 Grammars for queries Q3 (a) and Q4 (b).
5.6 Visualization of the top-down binary tree experiment results.
5.7 Visualization of the bottom-up binary tree experiment results.
5.8 String graph pattern used in the experiments.
5.9 Visualization of the strings experiment results.
List of Tables
2.1 Comparison between RDBMS and NoSQL databases [16].
2.2 LR(1) Parsing table for the extended Context-Free Grammar described in Figure 2.9.
2.3 Parsing of the input string ( ( ) ) according to the LR(1) grammar in 2.6b.
3.1 RDFS inference rules [19].
5.1 Performance evaluation for Query Q1 on RDF databases.
5.2 Performance evaluation for Query Q2 on RDF databases.
5.3 Execution time for the grammars Q3 and Q4 on top-down binary trees.
5.4 Execution time for the grammars Q3 and Q4 on bottom-up binary trees.
Glossary
CNF Chomsky Normal Form.
DBMS Database Management System.
DDL Data Definition Language.
DML Data Manipulation Language.
GSS Graph Structured Stack.
HTTP Hypertext Transfer Protocol.
IoT Internet of Things.
IRI International Resource Identifier.
JIT Just-in-Time.
NoSQL Not only SQL.
NRE Nested Regular Expression.
RDBMS Relational Database Management System.
RDF Resource Description Framework.
RDFS RDF Schema.
SQL Structured Query Language.
URI Uniform Resource Identifier.
W3C World Wide Web Consortium.
1 Introduction
The World Wide Web is an always increasing collection of information. This information
is spread among different documents, which are made available by using HTTP. These
documents are identified by their International Resource Identifiers (IRIs), and may be
connected to each other by hyperlinks. Even though this information is accessible to users
in the form of news articles, audio broadcasts, images and videos, software agents often
cannot classify it. These software agents need to analyze the contents of the documents
to identify their meanings, but, because of the lack of semantic information about these
documents in a machine-readable format, the result of this automated analysis is often
inaccurate. In order to make the semantic information on these documents available in
a machine-readable format, Linked Data was proposed as a complementary approach to
the World Wide Web [25].
Linked Data uses the same protocol as the World Wide Web to access and retrieve
information. In this context, IRIs denote "things", called resources. These resources may
then be referenced by a standard mechanism, like the RDF, to make statements. In the
RDF, a set of statements forms a graph and a collection of graphs forms a dataset. This
organization schema allows the association between resources from distinct datasets. To
keep the systems available and scalable, NoSQL databases comprise an alternative to traditional relational databases, capable of handling huge volumes of data by leveraging the capabilities of cloud environments [16].
A significant number of entities have adopted Linked Data as a way to add semantic
information to their data, rather than just publishing it on the Web. The result is a global data
collection, called the Web of Data. The Web of Data forms a global graph, consisting
of RDF statements from numerous sources, covering all sorts of topics. To find specific
information in this data, queries are performed starting in a subject and analyzing its
predicates in the RDF statements. These predicates are the connections between the
subject and object, and a set of traces forms an information path. Given that a trace
is a set of predicates in an information path, one may tell there is a connection between
subject1 and object1 if there is a trace between them in the RDF statements [4].
The use of HTTP as a standardized data access mechanism and RDF as a standard
data model simplifies data access compared to Web APIs, which rely on heterogeneous
data models and access interfaces [11], but accessing heterogeneous data on distinct lo-
cations may have an increased time complexity and reduced query expressiveness, which
motivates us to research alternatives in how this data is queried.
1.1 Motivation
With the increase in size of the databases, new technologies were proposed to satisfy
the need of retrieving the stored data at acceptable speeds. These proposed technologies
range from changes in the database structure to the implementation of new query lan-
guages. Query languages vary in how expressive the queries written in them are allowed to be. The World Wide Web Consortium (W3C) has proposed SPARQL [31], a query
language which allows the use of regular expressions to query the RDF database. Even
though SPARQL is the proposed standard language for querying graph databases, it has
limitations in expressiveness.
These query languages essentially search for paths formed by the labels of the edges between nodes. Regular expressions have some known limitations: for example, they do not allow counting the parsed symbols [1].
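The counting limitation can be made concrete with a small Python illustration (not taken from the thesis): a regular expression can demand "some opening then some closing parentheses", but cannot require the counts to match, while a context-free (here, recursive) check can.

```python
import re

# The regex accepts any run of '(' followed by any run of ')': it cannot
# enforce that both runs have the same length.
regular = re.compile(r"^\(+\)+$")

def balanced(s: str) -> bool:
    """Context-free check: accepts exactly the strings '(' * n + ')' * n, n > 0."""
    if s == "()":
        return True
    return len(s) >= 2 and s[0] == "(" and s[-1] == ")" and balanced(s[1:-1])

assert regular.match("(())") and balanced("(())")   # both accept the balanced string
assert regular.match("(()") and not balanced("(()")  # only the regex accepts the unbalanced one
```

The unbalanced string "(()" slips past the regular expression precisely because finite automata cannot count, which is the limitation cited above.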
Since most of the specified graph database query languages use Regular Expressions,
we propose to take a different path and aim at using Context-Free Grammars to increase
the expressiveness of graph database queries. More specifically, we propose to apply the
LR(1) parsing method to find paths in an RDF graph database.
1.2 Problems
This research aims to solve two main problems:
(i) Given a set of vertices from a graph and an LR(1) context-free grammar G, return
all the vertices in the graph which can be reached from them by following paths
formed by the edges where the sequence of symbols formed by their labels belongs
to the language of G;
Input: A graph DG ⊆ V × E × V , a set of vertices {v′|v′ ∈ V } and an LR(1)
context-free grammar G.
Output: A set of pairs of nodes {(v′, v)|v′, v ∈ V } where there exists a trace
between v′ and v whose edges form a string that belongs to the language formed by
G.
(ii) Given two nodes v′ and v from a graph and an LR(1) context-free grammar, identify whether there exists a path between v′ and v which respects the given grammar, using the same criteria as Problem (i).
Input: A graph DG ⊆ V × E × V , two vertices {v′, v|v′, v ∈ V } and an LR(1)
context-free grammar G.
Output: True if there exists a trace between v′ and v whose edges’ labels form a
string that belongs to the language of G. False otherwise.
Problem (ii) is a sub-problem of Problem (i), since one can search for the vertices which can be reached from the given vertex v′ and then verify whether the answer contains v.
1.3 Goals
The main goal of this work is to provide means to enable the usage of LR(1) context-free
grammar processing techniques to search for paths formed by the labels of the edges in
an RDF graph database, providing, as result, a tool which allows better expressiveness,
efficiency and scalability in such queries than what is proposed today.
To achieve that, we implemented an algorithm based on the LR(1) parsing technique
and give means for the user to input queries with an LR(1) context-free grammar. We also evaluated the algorithm’s runtime and space complexity, and compared the expressiveness of the queries it enables against the solutions proposed by the related works, presented in Chapter 3.
The remainder of this dissertation is organized as follows: in Chapter 2, we introduce some of the concepts needed to better understand the given problems and how to solve them; in Chapter 3, we analyze works related to our research; in Chapter 4, we present a new solution, using LR(1) parsing concepts to query the graph database; in Chapter 5, we conduct experiments comparing our solution to the related works; finally, in Chapter 6, we give our conclusions and suggest a set of future work.
2 Theoretical Foundation
In this chapter, we give insight into a few concepts that the work in this dissertation is based on. First, we introduce databases and graph databases, explaining what kind of query expressiveness can be achieved with the current state-of-the-art solutions. Second, we explore graph theory in detail, to show how to give graph databases a navigation mechanism similar to the ones available for Context-Free Grammars. Lastly, we introduce Regular Languages and Context-Free Grammars, comparing notations, query expressiveness and their associated query mechanisms, as possible candidates to solve our problem.
2.1 Databases
A database is a collection of organized, related data which is used as a source of infor-
mation to answer user queries or to facilitate other data processing activities. The basic
database management problem is how to store and organize data efficiently to meet the
data processing needs of the applications which use the data [28].
Since the mid 70s, the database field has experienced rapid growth and seen major
advances in applications, technology and research. One of the proposed technologies was a new way to structure the data, called the Relational Model. It forms a sound
basis for treating derivability, redundancy and consistency of data by organizing it in a
set of relations [5].
To correctly manage the data, a Database Management System (DBMS) is needed.
By definition, a DBMS acts as an interface between the application program and the data stored in the database, and has five basic functions: (i) establishing the logical relationships among different data elements in a database and defining schemas using the data definition language (DDL); (ii) entering data into the database; (iii) manipulating and processing the data stored in the database using the data manipulation language (DML); (iv) maintaining data integrity and security by allowing only authorized users limited access to the database; and (v) querying the stored data using the structured query language (SQL) [6].
By the year 2000, the increasing number of applications involving cloud deployment,
mobile presence, social networking and the Internet of Things (IoT) started to demand
database technologies that include but are not limited to relational systems [10]. These
technologies weakened the relational DBMS patterns by sacrificing some of its principles
to ensure a gain in speed and scalability.
In a graph database, queries are localized to a portion of the graph, taking execution
time proportional only to the traversed portion of the graph for each query, rather than
the size of the overall dataset. And, since graphs are naturally additive, they do not
require the domain to be modeled in exhaustive detail up front. The application domain
can be designed incrementally as the business requires it. Governance can be applied in
a programmatic fashion, using tests to assert the business rules and maintain the data
model and queries.
Among these new technologies are Not only SQL (NoSQL), NewSQL, Big Data and
RDF. Although the first three are vaguely defined terms, they are the most commonly used expressions for referring to next-generation database technologies [27]. We will present
RDF in more detail in Section 2.1.1. Table 2.1 shows a comparison between Relational
Database Management Systems (RDBMS) and NoSQL databases. The main difference
between them is that the relational databases require a fixed schema and enforce the
validity of the data while the NoSQL databases do not.
2.1.1 RDF and RDFS
The World Wide Web Consortium (W3C), an international community that develops open
standards to ensure the long-term growth of the Web, has defined a standard model for
data interchange on the web by extending the linking structure of the Web. This model
is called RDF [22] and it is the only standardized NoSQL solution. RDF uses uniform
resource identifiers (URI) to name entities’ relationships, which are represented by triples
composed of a subject, a predicate and an object. Each resource must have a unique URI. A collection of instances of this model forms a directed labeled graph, with the subject and object entities being the vertices. This graph has a semantics that facilitates data merging even between distinct schemas, supporting the evolution of schemas without requiring data consumers to be changed. With this model, it is possible to mix, expose and share structured data across different applications [20].

Data Validity:
  RDBMS: Higher guarantees.
  NoSQL: Lower guarantees.
Query Language:
  RDBMS: Structured Query Language (SQL).
  NoSQL: No declarative query language.
Data type:
  RDBMS: Supports relational data, whose relationships are stored in separate tables.
  NoSQL: Supports unstructured and unpredictable data.
Data Storage:
  RDBMS: Stored in a relational model, with rows and columns. Rows contain all of the information about one specific entry/entity, and columns are all the separate data points.
  NoSQL: The term "NoSQL" encompasses a host of databases, each with different data storage models. The main ones are: document-based-store, graph-based, key-value-store and column-based-store.
Schemas and Flexibility:
  RDBMS: Each record conforms to a fixed schema.
  NoSQL: Schemas are dynamic. Each "row" does not have to contain data for each "column".
DBMS Compliance:
  RDBMS: The vast majority of relational databases comply with all the DBMS functions.
  NoSQL: Sacrifice some DBMS functions for performance and scalability.

Table 2.1: Comparison between RDBMS and NoSQL databases [16].
Since we will be referencing directed graphs in the rest of this document, a short
introduction to graphs is required. A convenient way to define a graph D is as a set of
tuples (v, e, w), where v, w ∈ V are vertices of the graph and e ∈ E is an edge label. A tuple (v, e, w) means that there is an edge labeled e from vertex v to vertex w in the graph.
A graph has received this name because it can be easily represented graphically, which
helps understand many of its properties. Most of the definitions and concepts in graph
theory are suggested by the graphical representation. Two vertices which are incident
with an edge are adjacent; an edge with identical ends is a loop. A graph is finite if both its vertex set and edge set are finite, and simple if it has no loops and no two of its edges join the same pair of vertices.
It is possible to follow a path in a graph, departing from a vertex through one of the edges connected to it and reaching a destination vertex. If the path contains no cycles, it is called a simple path.
A trace in D is a finite non-null sequence W = v0 e1 v1 e2 ... ek vk, whose terms are alternately vertices and edges, starting and ending with a vertex. The integer k is the length of the trace. The vertices v0 and vk are called the origin and terminus of W, respectively, and v1, v2, ..., vk−1 its internal vertices. In the context of this research, we will use the labels of the edges e ∈ E to identify the connections between two vertices.
Example 1 Figure 2.1 shows an RDF graph, containing information about authors and
books. Note that a:Author1, a:Author2, b:Book1 and b:Book2 are the labels of the graph
vertices. The arrows pointing from the authors to the books represent the directed edges.
library:wrote is the label of the edges. In this context, by looking at the graph, one can
tell a:Author1 has written both books, while a:Author2 has written b:Book1.
Figure 2.1: Visual representation of an RDF graph.
The definition of RDF itself does not assure the data consistency within a graph. For
this, a semantic extension is required [22].
In order to extend the semantics and assure the consistency of the RDF graphs, RDFS
was proposed. This schema provides a vocabulary and mechanisms for describing groups of related resources and for modeling the data within an RDF graph [20].
Following the architectural principles of the Web, the RDF Schema uses a property-
centric approach, which allows the description of existing resources to be extended. Its
class and property system is similar to the type systems of object-oriented programming
languages. The difference is that the properties are described in terms of the classes of
resources to which they apply, instead of defining the classes by their properties [24].
Example 2 Figure 2.2 lists the text required to store library data in an RDF database
containing information on which author wrote which books. The definition of Library,
Author and Book are contained in their respective URIs. As in Example 1, it is possible to see that a:Author1 wrote b:Book1 and b:Book2, while a:Author2 wrote b:Book1.
@prefix library: <http://example.org/library/1.0/> .
@prefix b: <http://example.org/library/1.0/book/> .
@prefix a: <http://example.org/library/1.0/author/> .
a:Author1 library:wrote b:Book1 .
a:Author1 library:wrote b:Book2 .
a:Author2 library:wrote b:Book1 .
Figure 2.2: RDF database.
At first, the RDF data was intended to be stored in textual form and various storage
methods were proposed, such as Turtle [35], TriG [34] and JSON-LD [14], but, due to
the nature of the web, the amount of data to be stored in such way may cause loss of
performance when querying. To work with this problem, a DBMS is more suited in this
case [23].
2.1.2 Graph queries
Given that the data used in this work is stored in a graph database, we need to discuss the types of queries that can be executed over it.
Subgraph Isomorphism This type of query is a decision problem and is usually NP-complete. Given two graphs D and D′, identify whether D contains a subgraph that is isomorphic to D′, that is, a subgraph of D with the same structure as D′. In some cases it may be processed in polynomial time in terms of the number of vertices in the graph.
Frequent Subgraph Mining A subgraph may be considered interesting if it appears
multiple times in a graph. Given a threshold σ, this type of query consists in finding the
subgraphs that appear at least σ times in a given graph D.
Path Query Path queries consist in, given a graph and a path specification, returning the pairs of nodes from the graph that are connected by edges that form the given path. The dashed path in Figure 2.3 represents a path found between the nodes a and b in the graph, according to the path query ( ).
Figure 2.3: Path found via path query.
In Chapter 4, we show that, similarly to related approaches, our algorithm uses path queries to search for nodes connected by paths where the labels of the edges form a valid string in the given Context-Free Grammar.
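A context-free path query can be made concrete with the classic worklist algorithm over a grammar in Chomsky Normal Form, in the spirit of the related works discussed in Chapter 3. This is a sketch for illustration only, not the LR(1)-based GrLR algorithm of Chapter 4; the grammar and vertex names are assumed examples.

```python
def cfpq(graph, unit_rules, binary_rules):
    """Worklist evaluation of a context-free path query for a CNF grammar.
    Returns triples (N, v, w) meaning: some path from v to w spells a
    string derivable from nonterminal N."""
    # Seed with the unit rules N -> terminal, one fact per matching edge.
    rel = {(n, v, w) for (v, e, w) in graph for (n, t) in unit_rules if t == e}
    changed = True
    while changed:
        changed = False
        for (n, b, c) in binary_rules:                 # rule N -> B C
            for (b2, v, u) in list(rel):
                if b2 != b:
                    continue
                for (c2, u2, w) in list(rel):
                    if c2 == c and u2 == u and (n, v, w) not in rel:
                        rel.add((n, v, w))
                        changed = True
    return rel

# CNF grammar for { '('^n ')'^n }: S -> A B | A C, C -> S B, A -> '(', B -> ')'
units = [("A", "("), ("B", ")")]
binaries = [("S", "A", "B"), ("S", "A", "C"), ("C", "S", "B")]
g = {("a", "(", "b"), ("b", "(", "c"), ("c", ")", "d"), ("d", ")", "e")}
rel = cfpq(g, units, binaries)
assert ("S", "b", "d") in rel and ("S", "a", "e") in rel  # "()" and "(())"
assert ("S", "a", "d") not in rel                         # "((" + ")" is unbalanced
```

The fixpoint iteration terminates because the relation over (nonterminal, vertex, vertex) is finite and only ever grows; the thesis's LR(1)-based approach avoids the CNF conversion that this style requires.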
2.1.3 SPARQL
Along with the RDF specification, a query language called SPARQL was defined [31]; it can query the data whether it is stored natively as RDF or viewed as RDF via a middleware. A query in this language contains a set of triple patterns called a basic graph pattern, which matches a subgraph of the RDF data when the terms from that subgraph may be substituted for the pattern variables [31].
Example 3 The RDF graph in Figure 2.1 contains the library data. In this case, to find
the books for a given author, a SPARQL query can be as follows.
PREFIX library: <http://example.org/library/1.0/>
PREFIX b: <http://example.org/library/1.0/book/>
PREFIX a: <http://example.org/library/1.0/author/>
SELECT ?book
WHERE { a:Author1 library:wrote ?book }
Figure 2.4: SPARQL query to find the books written by a:Author1
The query in Figure 2.4 starts by specifying the URI prefixes so that the terms used in the
query may be abbreviated. SELECT ?book means that we want to retrieve the values bound
to the ?book variable in the tuples matched by the WHERE clause. We use the tuple
{a:Author1 library:wrote ?book}, which means that we are looking for all RDF triples
in the database where the subject is a:Author1 and the predicate is library:wrote.
In this case, ?book is the object, whose value we do not know.
It has been noted that, although RDF is a directed labeled graph data format,
SPARQL provides only limited navigational functionality. This is most noticeable when
one considers the RDFS vocabulary, which the current SPARQL specification does not cover,
where testing conditions like being a subclass of or a sub-property of requires navigating
the RDF data [19]. As a solution, nSPARQL was proposed [19]; we discuss it and other
related work in Chapter 3.

This limited navigational functionality leads us to investigate how languages are built
and how they may help us provide mechanisms for more expressive queries.
2.2 Language Specifications
A formal language is a possibly infinite set of sentences which can be formed by a grammar.
These sentences consist of a finite number of characters belonging to the alphabet of the
grammar. A formal grammar consists of a finite set of symbols and some production
rules, and may be used to generate sentences. A formal grammar may be defined by a
quadruple G = (N, T, S, P ), where:
• N is a finite set of nonterminal symbols, representing the elements which may be
replaced by applying a production rule. The nonterminal symbols are usually
represented by one upper case letter;
• T is a finite set of terminal symbols, representing the alphabet of the grammar.
The terminal symbols are the characters that appear in the sentences generated by
the grammar;
• S is the start symbol, and consists of one nonterminal from which all sentences
composing the language can be generated; and
• P is a finite set of production rules.
A production rule in P has the form α → β, with α and β being sequences of symbols,
where β may be empty (ε). For each nonterminal in N, there must be at least
one production rule [15]; a nonterminal may also have as many production rules as
needed. In this context, α is called the Left-Hand Side (LHS) of the production rule and
β is the Right-Hand Side (RHS) of the production rule.
There is a hierarchy between the different classes of languages, known as the Chomsky
Hierarchy, in which each grammar is classified by the form of its productions. Each
category represents a class of languages that can be recognized by a different kind of
automaton [32].
The classification begins with the least restrictive grammar class and adds restrictions
for the inner classes. The most general class, called Recursively Enumerable, allows any
production rule of the form α → β, meaning that any sequence of symbols can be
rewritten into any other. The languages represented by grammars in this class are
the ones recognizable by Turing Machines and, together with the Context-Sensitive
Grammars, are too generic for the scope of this work. For now, we will focus on the two
classes most relevant for programming languages: Regular Languages and Context-Free
Languages.
2.2.1 Regular Languages
The grammar class that can be accepted by a finite automaton is the one called Regular
Grammars. Production rules P for the Regular Grammars must either be left or right
linear, meaning that in the production rule A → α, α can only have a nonterminal in
the beginning or in the end, and each nonterminal can have exactly one production rule
[21]. The language that can be formed by some regular grammar is known as a regular
language, and can be used to define lexical structures in a declarative way [3].
Given an alphabet T to be considered as an input set, the regular expressions over
T may be defined as follows: (i) φ is a regular expression which denotes the empty set;
(ii) ε is a regular expression and denotes the set {ε}, containing only the empty string;
(iii) for each a in T, a is a regular expression and denotes the set {a}; (iv) if r and s are
regular expressions denoting the languages L1 and L2 respectively, then rs denotes L1L2,
i.e. concatenation; (v) r* denotes L1*, i.e. closure, which indicates the occurrence of r
zero or more times; (vi) r+ denotes the positive closure, which requires at least
one occurrence of r; (vii) regular expressions can contain the choice symbol |, which,
when used between expressions as in r|s, allows either L1 or L2 to be matched.
Example 4 Consider that a path is formed by the labels of the edges connecting vertices.
All strings that start with a number of opening parentheses followed by a number of
closing parentheses can be described by means of the regular expression (+)+. If this
regular expression is used to query the graph in Figure 2.5 for paths starting at the
vertex a, the answer would be the vertices c, e and f.
Figure 2.5: Graph with opening and closing parenthesis as labels for the edges.
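The point of Example 4 can be checked directly with a standard regular-expression engine. The sketch below uses Python's re module on the label strings of the three paths from a; the strings follow the edge labels described in the example, and the last case shows that the pattern accepts an unbalanced string, which motivates the move to Context-Free Grammars below.

```python
import re

# The regular expression from Example 4: one or more '(' followed by one or
# more ')'. It only checks the *shape* of the string, not that the
# parentheses balance.
pattern = re.compile(r"\(+\)+")

print(bool(pattern.fullmatch("()")))    # path a -> c: True
print(bool(pattern.fullmatch("(()")))   # path a -> e: True, although unbalanced
print(bool(pattern.fullmatch("(())")))  # path a -> f: True
print(bool(pattern.fullmatch(")(")))    # wrong shape: False
```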
Every sentence that may be described by a regular expression may also be described by a
Context-Free Grammar, but not the other way around. As seen in Example 4, even though
Regular Expressions are powerful enough to provide a pattern matching mechanism, they
will not allow us to use a pattern like (n)n to match sentences with the same number of
opening and closing parentheses [1]. To achieve this level of expressiveness, we have to
investigate the next, less restrictive, language class: the Context-Free Languages.
2.2.2 Context-Free Grammars
Context-Free Grammars are the class of grammars whose production rules have the form
A → β, where A ∈ N and β is a sequence of symbols from T ∪ N, or the ε symbol,
meaning an empty rule. Context-Free Grammars define syntactic structure
declaratively. They may also be used to describe the structure of lexical tokens, although
Regular Grammars are more adequate, and more concise, for that purpose. Context-Free
Grammars may be recognized by pushdown (stack) automata [3].
S → (P)
P → (P)
P → ε

(a) Extended grammar

S → (P)
P → ε | (P)

(b) Compact grammar
Figure 2.6: Visual representation of a Context-Free Grammar that defines strings that start with n opening parentheses and end with n closing parentheses, with n > 0.
The grammar represented in Figure 2.6 indicates that the symbols S, P ∈ N are symbols
that must be derived. We will discuss more about derivation and parsing techniques
later in this section and in Section 2.2.3.
Example 5 A grammar to define all strings formed by a number of opening parenthesis
followed by the same number of closing parenthesis, (), (()), ((())), etc, may be defined as
seen in Figure 2.6a. In this case, ( and ) are the terminal symbols, and S and P are the
nonterminal symbols. If used to recognize the paths in the graph of Figure 2.5, starting in
vertex a, the answer would be the vertices c and f. Note that, to make it easier to write
the grammar, the | symbol can be used to join multiple definitions for a nonterminal. In
this example, P → (P) and P → ε could be written as P → ε | (P), as shown in Figure
2.6b.
A technique called parsing may be used to reveal the grammatical structure of a sequence
of symbols and to identify whether it belongs to the language of a Context-Free Grammar,
by consecutively matching the nonterminals with their production rules and building a
structure called a Derivation Tree. The Derivation Tree has one vertex for each symbol
used in the derivation: the start symbol at the top, or root; nonterminal symbols in the
inner vertices; and only terminals in the leaves, or outer vertices. This process derives
strings by beginning with the start symbol and repeatedly replacing a nonterminal by the
RHS of one of its productions until the end of the sequence of symbols is reached [3].
The Derivation Tree may then be traversed from the leaves to the root or from the root
to the leaves, according to the chosen parsing method.
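The derivation process can be sketched as a small hand-written recognizer for the grammar of Figure 2.6, with one function per nonterminal that consumes input as it applies a production. This is an illustrative recognizer, not the mechanism proposed in this thesis; it merely shows how repeatedly expanding nonterminals decides membership in the language.

```python
def recognize(s):
    """Recognizer for S -> (P), P -> epsilon | (P): accepts exactly the
    strings of n opening parentheses followed by n closing ones, n > 0."""
    pos = 0

    def expect(ch):
        # Consume one expected terminal, if present.
        nonlocal pos
        if pos < len(s) and s[pos] == ch:
            pos += 1
            return True
        return False

    def parse_P():
        # P -> (P) when the next symbol is '(', otherwise P -> epsilon.
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1
            return parse_P() and expect(")")
        return True

    ok = expect("(") and parse_P() and expect(")")   # S -> (P)
    return ok and pos == len(s)                       # all input consumed
```

For instance, recognize("(())") succeeds while recognize("(()") fails, matching the vertices reached in Example 5.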
Example 6 Consider the graph in Figure 2.5 and the grammar in Figure 2.6. Figure
2.7a describes the derivation tree used to identify whether the path ( ) between the
vertices a and c belongs to the grammar, while Figure 2.7b describes the derivation tree
for the path ( ( ) ) between the vertices a and f.

(a) Path ( ):          (b) Path ( ( ) ):
      S                      S
    / | \                  / | \
   (  P  )                (  P  )
      |                     /|\
      ε                    ( P )
                             |
                             ε

Figure 2.7: Visual representation of the derivation trees.
There are several methods, called parsing methods, for identifying whether a given string
belongs to a language defined by a Context-Free Grammar; they differ from each other
in how they build and traverse the derivation tree. The top-down approach starts at the
root vertex and moves towards the leaves, investigating which production rule should be
used as each new symbol is read, while the bottom-up approach starts at the leaves and
moves towards the root of the tree, identifying which production rule should be applied
to the given input string and replacing the RHS with the LHS of the production.
The parsers defined for Context-Free Grammars make use of a stack to help decide
which rule should be applied when processing the current input symbol. Some grammars
may be written in a way that introduces a conflict, leading the parser to reach a state
where it is possible to apply more than one rule during parsing.
Figure 2.8 defines the hierarchy between the ambiguous and unambiguous Context-
Free Grammars. The unambiguous grammars are divided into LL(k) and LR(k). The first
"L" stands for left-to-right scanning of the input; the second "L" in LL stands for
constructing a leftmost derivation; the "R" in LR stands for constructing a rightmost
derivation in reverse; and k is the number of lookahead symbols, i.e., the symbols at the
front of the input stream that aid analysis decisions [3]. Usually, programming languages
are parsed using k = 1. When k is omitted, it is assumed to be 1, which is of practical interest
and will be used in this dissertation.
LL grammars are those which do not have left recursion, meaning that in a
production rule A → β, β cannot start with the symbol A. Also, if multiple production
rules for A are given, their right-hand sides cannot start with the same symbol.
The class of grammars that may be parsed using LR methods is a superset of the class
of grammars that may be parsed with predictive or LL methods. To parse an LL grammar,
the parser must be able to recognize the appropriate production rule to be applied to a
nonterminal seeing only the first k symbols that its right-hand side derives. Less rigorously,
for LR(k) grammars, the parser only needs to recognize the occurrence of the
right-hand side of a production rule in a rightmost sentential form, with k input symbols
of lookahead, allowing this method to describe more languages than the former.
Like the non-recursive LL parsers, the LR parsers are table-driven. A grammar is
considered to be LR if a left-to-right shift-reduce parser is able to recognize handles of
right-sentential forms when they appear on top of the stack. Syntactic errors are detected
as soon as possible during a left-to-right scan of the input. We will continue investigating
the LR parsing method in Section 2.2.3.
Figure 2.8: Hierarchy of Context-Free Grammar classes [3]
2.2.3 LR Parsing
Among the existing bottom-up parsers, LR(k) is the most prevalent technique. If a
Context-Free Grammar can be written for a programming language, it will probably be
recognizable by an LR parser. To guide its steps, the LR parser makes use of a Parsing
Table (defined below) and a stack of states. By doing so, it may detect
syntactic errors as soon as it is possible to do so on a left-to-right scan of the input string.
S′ → S$
S → (P)
P → ε | (P)
Figure 2.9: Context-Free Grammar as described in Figure 2.6, extended with the start symbol S′ and end symbol $.
To be parsed by an LR parser, the grammar must be extended with a new start symbol,
called here S′, and the end symbol $. Given the grammar in Figure 2.6b, it must be
augmented with the production rule S′ → S$, which means that the start symbol S′
generates all the sentences as before, now finished with the end symbol $. The resulting
grammar may be seen in Figure 2.9.
I0: S′ → ·S, $          I1: S′ → S·, $
    S → ·(P), $

I2: S → (·P), $         I3: S → (P·), $
    P → ·, )
    P → ·(P), )

I4: P → (·P), )         I5: S → (P)·, $
    P → ·, )
    P → ·(P), )

I6: P → (P·), )         I7: P → (P)·, )

Transitions: I0 --S--> I1; I0 --(--> I2; I2 --P--> I3; I2 --(--> I4;
I3 --)--> I5; I4 --(--> I4; I4 --P--> I6; I6 --)--> I7.
Figure 2.10: Visual representation of the LR automaton generated by the extended grammar in Figure 2.9.
The next step is to construct the LR automaton, which is used to direct decisions
during analysis. Figure 2.10 shows the LR automaton for the grammar given in Figure
2.9. In the automaton, each state represents a set of items. An LR(0) item of a grammar
D is a production of D with a dot (·) at some position of the production's RHS. As the
automaton progresses through its states, the productions are represented as if the dot
were moving to the right, so that the elements at its left indicate how much of a
production has already been seen by the parser. In this way, the production A → XYZ
yields the four items A → ·XYZ, A → X·YZ, A → XY·Z and A → XYZ·. The production
A → ε generates only one item, A → ·. The item A → XYZ· indicates that the parser has
read XYZ and it may be time to reduce XYZ to A.
Since we target the LR(1) method, we incorporate the lookahead into the item sets
by redefining items to include one terminal symbol as a second component. The general
form of an item becomes [A → α·β, a], where A → αβ is a production rule and a is a
terminal or the right-end symbol $.
Algorithm 1: Construction of the LR(1) item sets for an LR(1) grammar.

Function ITEMS(G′) : SetOfItems
    C ← CLOSURE({[S′ → ·S, $]})
    repeat
        foreach I ∈ C do
            foreach grammar symbol X do
                C ← C ∪ GOTO(I, X)
    until no new sets of items are added to C
    return C
Algorithm 1 shows the function ITEMS(G′), which receives the augmented grammar
and builds the set of states of the automaton for the given LR(1) grammar. Being I a
set of LR(1) items for a grammar G, the function CLOSURE(I), shown in Algorithm
2, is the set of items constructed from I as follows: add every item in I; then, if
[A → α·Bβ, a] has already been added and B → γ is a production, add [B → ·γ, b] for
each b in FIRST(βa), if not yet added; repeat until no more new items can be added to
CLOSURE(I).
The GOTO(I, X) function, shown in Algorithm 3, is used to define the transitions in
the LR(1) automaton for a grammar. The automaton states correspond to sets of items
and the function specifies the transition from the state for I when given the input X.
The LR parsing algorithm will try to identify one of three possible actions to take
for each new input symbol. The action can either be shift, when the symbol does not
yet complete the RHS of a production rule; reduce, when the symbol completes the RHS
of a production rule; or accept, when we reach the end of the sentence and it belongs to
the language formed by the grammar.

Algorithm 2: Identifying the closure of a set of items for an LR(1) grammar.

Function CLOSURE(I : SetOfItems) : SetOfItems
    repeat
        foreach [A → α·Bβ, a] ∈ I do
            foreach production B → γ do
                foreach b ∈ FIRST(βa) do
                    I ← I ∪ {[B → ·γ, b]}
    until no more items are added to I
    return I
Algorithm 3: Identifying the destination state, for a given grammar symbol, from a set of
items of an LR(1) grammar.

Function GOTO(I : SetOfItems, X : Symbol) : SetOfItems
    J ← ∅
    foreach [A → α·Xβ, a] ∈ I do
        J ← J ∪ {[A → αX·β, a]}
    return CLOSURE(J)
To identify which action to take, the parser uses a stack where it stores the states the
automaton has passed through. For each new symbol, it looks into the parsing table,
reading the value stored in the row for the current state and the column for the new
symbol.
The parsing table for the grammar in Figure 2.9 can be seen in Table 2.2. Shift
actions are represented by si, where i is the number of the new state to shift to. Reduce
actions are represented by ri, where i is the number of the production rule to reduce by.
If the action is shift, the parser pushes this new state onto the stack and continues
to the next symbol. If the action is reduce, the parser pops as many states from the
stack as there are elements in the RHS of the production rule being reduced, pushes the
state indicated by the goto entry for the LHS of the rule, and continues to the next
symbol. If the action is accept, the sentence has reached its end and is a valid sentence
in the language. If the parser does not find a valid action for the current symbol, the
sentence is invalid and should be rejected.
Example 7 Given the augmented LR(1) grammar represented in Figure 2.9 and the
sequence of input symbols ( ( ) ) to be parsed by the LR(1) algorithm, the parser
starts the stack with the state i0. By receiving the first input symbol, (, the parser
identifies that the action to take is to shift to state i2, consuming the input symbol and
storing state i2 in the stack. This can be verified by looking at the parsing Table 2.2.
The next symbol is another (. The parser identifies the next action as shift i4, consumes
the input symbols and stores the state i4 in the stack. The next symbol is ). By looking
at the parsing table, the parser identifies the next action as a reduce by the rule P → ε.
Since this rule does not have elements in its RHS, there are no states to be removed from
the stack. The parser, then, identifies the next action as a goto state i6 and adds the
state i6 to the stack. The next element is ) and the action is shift i7, which triggers a
reduce by rule P → (P) and a goto action to state i3. The current stack is i0 i2 i3 i5,
which represents the symbols (P). The input string has only the last input symbol, ),
which requires a shift action to state i5, triggering a reduce by production rule S → (P)
and a goto state i1, where we reach the end of the input string. The parser identifies in
the parsing table that the input symbols form a valid sentence in the given grammar and
accepts it. The step-by-step execution can be seen on Table 2.3.
State     Action                Goto
          (      )      $      S′    S    P
 i0       s2                         1
 i1                     acc
 i2       s4     r2                       3
 i3              s5
 i4       s4     r2                       6
 i5                     r1
 i6              s7
 i7              r3
Table 2.2: LR(1) Parsing table for the extended Context-Free Grammar described inFigure 2.9.
While the LR parsing algorithm can be implemented as efficiently as more primitive
shift-reduce methods, its main drawback is the amount of work needed to construct the
parsing table for a typical programming-language grammar by hand. For that, a
specialized LR parser generator, like Yacc, is used. Such generators take a Context-Free
Grammar and automatically produce a parser for it, locating and diagnosing the
constructs which are difficult to parse in a left-to-right scan of the input [1].

#     Stack            Symbols    Input        Action
(1)   i0                          ( ( ) ) $    shift i2
(2)   i0 i2            (          ( ) ) $      shift i4
(3)   i0 i2 i4         ( (        ) ) $        reduce P → ε
(4)   i0 i2 i4         ( ( P      ) ) $        goto i6
(5)   i0 i2 i4 i6      ( ( P      ) ) $        shift i7
(6)   i0 i2 i4 i6 i7   ( ( P )    ) $          reduce P → (P)
(7)   i0 i2            ( P        ) $          goto i3
(8)   i0 i2 i3         ( P        ) $          shift i5
(9)   i0 i2 i3 i5      ( P )      $            reduce S → (P)
(10)  i0               S          $            goto i1
(11)  i0 i1            S          $            accept

Table 2.3: Parsing of the input string ( ( ) ) according to the LR(1) grammar in Figure 2.6b.
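To illustrate the driver loop, the sketch below is a minimal table-driven LR(1) parser for the toy grammar, with the ACTION and GOTO entries transcribed from Table 2.2. It is an illustrative sketch, not a generated parser: it only reports acceptance or rejection, without error diagnostics or tree building.

```python
# ACTION and GOTO tables transcribed from Table 2.2.
# A reduce entry carries the LHS nonterminal and the length of the RHS.
ACTION = {
    (0, "("): ("shift", 2),
    (1, "$"): ("accept",),
    (2, "("): ("shift", 4), (2, ")"): ("reduce", "P", 0),   # P -> epsilon
    (3, ")"): ("shift", 5),
    (4, "("): ("shift", 4), (4, ")"): ("reduce", "P", 0),
    (5, "$"): ("reduce", "S", 3),                            # S -> (P)
    (6, ")"): ("shift", 7),
    (7, ")"): ("reduce", "P", 3),                            # P -> (P)
}
GOTO = {(0, "S"): 1, (2, "P"): 3, (4, "P"): 6}

def parse(tokens):
    """Run the LR(1) driver loop; return True iff the input is accepted."""
    stack = [0]                           # stack of automaton states
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:
            return False                  # no valid action: reject
        if entry[0] == "accept":
            return True
        if entry[0] == "shift":
            stack.append(entry[1])
            pos += 1
        else:                             # reduce: pop |RHS| states, then goto
            _, lhs, rhs_len = entry
            del stack[len(stack) - rhs_len:]
            stack.append(GOTO[(stack[-1], lhs)])
```

Running parse("(())") reproduces the trace of Table 2.3 and accepts, while parse("(()") is rejected as soon as $ is seen in state i7.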
This chapter presented relevant concepts needed to understand the context of our
research as well as our approach to the problem of querying graph databases with context-
free queries. In the next chapter, we discuss some related works, identifying pros and cons
of their approaches.
3 Related Work
During the research for this work, we came across several works aiming at increasing the
expressiveness of graph database queries by extending Regular Expressions or adding
Context-Free Language concepts to their query mechanisms. Some also include pre-
processing of the graph. Four of them seem most related to what we are trying
to achieve and deserve some analysis. The first defines an extension to SPARQL,
adding nested regular expressions to its query mechanism; the second proposes adding
Context-Free Grammar notions to graph database querying; the third combines both
approaches, extending SPARQL with notions of Context-Free Grammars and regular
expressions; and the fourth introduces, among other things, a structure intended to help
the parsing algorithm, linking the language's automaton with the current processing
state.
3.1 nSPARQL: A navigational language for RDF
Even though the W3C has specified and recommended SPARQL as the standard graph
database query language, it is known that SPARQL has limited query
expressiveness, especially when trying to navigate through RDFS subclasses and
sub-properties.

Consider the graph shown in Figure 3.1, which contains information about cities and
transportation services between them. According to the RDFS specification, it should be
possible to identify whether a pair of cities a and b is connected by a sequence of trans-
portation services, without knowing in advance which services provide those connections,
but SPARQL does not provide the means to create such a query. To overcome these limi-
tations, a language extending SPARQL, called nSPARQL, was proposed in (J. Pérez,
M. Arenas, C. Gutierrez, 2010) [19]; it adds regular expressions and a concept of expres-
sion branching, or nesting (NRE). The resulting language is then evaluated in terms of
query-time efficiency, and the authors prove that if the appropriate data structure is used
Figure 3.1: RDF graph containing information about available transport services between cities.
(1) Sub-property:
    (a) (A,sp,B), (B,sp,C) ⊢ (A,sp,C)
    (b) (A,sp,B), (X,A,Y) ⊢ (X,B,Y)
(2) Subclass:
    (a) (A,sc,B), (B,sc,C) ⊢ (A,sc,C)
    (b) (A,sc,B), (X,type,A) ⊢ (X,type,B)
(3) Typing:
    (a) (A,dom,B), (X,A,Y) ⊢ (X,type,B)
    (b) (A,range,B), (X,A,Y) ⊢ (Y,type,B)

Table 3.1: RDFS inference rules [19].
to store an RDF graph D, then it is possible to use a regular expression E to check in
time O(|D| · |E|) whether a vertex w is reachable from v.
Note that in Figure 3.1, a triple (s, p, o) is depicted as an edge from s to o labeled p,
where s and o are represented as nodes and p is represented as the label of the edge. For
example, (Paris, TGV, Calais) is a triple stating that there is a TGV transport service
connecting Paris and Calais.
Based on what is proposed by Muñoz et al. [18], the authors decided to use the
system of rules defined in Table 3.1 and to consider only the RDFS subset composed
of rdfs:subClassOf, rdfs:subPropertyOf, rdfs:range, rdfs:domain and rdf:type, denoted by
sc, sp, range, dom and type, respectively. In every rule, the letters A, B, C, X and Y stand
for variables occurring in the triples. A triple t is deduced from D if t ∈ D or there exists
a graph D′ such that t ∈ D′ and D′ is obtained from D by successively applying the rules
in Table 3.1.
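As a concrete illustration, the sketch below applies only rule (1b) of Table 3.1 (from (A,sp,B) and (X,A,Y), deduce (X,B,Y)) to a two-triple graph mirroring the transport example; it is a deliberate simplification of full RDFS deduction, which would iterate all six rules to a fixpoint.

```python
# Two triples echoing Figure 3.1: TGV is a sub-property of train, and a TGV
# service connects Paris to Calais.
triples = {
    ("TGV", "sp", "train"),
    ("Paris", "TGV", "Calais"),
}

def deduce(triples):
    """Close a triple set under rule (1b) only:
    (A, sp, B) and (X, A, Y) yield (X, B, Y)."""
    closed = set(triples)
    while True:
        new = {(x, b, y)
               for (a, p, b) in closed if p == "sp"
               for (x, q, y) in closed if q == a}
        if new <= closed:
            return closed      # fixpoint reached: nothing new to add
        closed |= new
```

Deduction adds (Paris, train, Calais): the cities are connected by a train service even though no triple says so explicitly, which is exactly the kind of condition plain SPARQL cannot test without navigation.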
The queries performed by nSPARQL use this set of inferences to identify connections
between vertices when a triple is found. The query is in the form (?X, query, ?Y ) and its
answer corresponds to pairs of vertices (X, Y ), which contain a path between them in the
graph, according to what was specified in the query.
Some navigation rules were specified to allow regular expressions in nSPARQL.
Figure 3.2: Forward and backward axes for an RDF triple (a, p, b) [19].
The navigation of a graph is done by using the axes next, edge and node, and their
inverses next−1, edge−1 and node−1, to move through an RDF triple. There is also a
special axis self, which is not used to navigate to another vertex, but to reference the
current vertex for logical purposes. As can be seen in Figure 3.2, b is the next of a; p is the
edge of a; b is the node of p. Similarly, a is the next−1 of b; a is the edge−1 of p; and
p is the node−1 of b.
Example 8 Consider the graph in Figure 3.3. It is possible to find the connection be-
tween nodes a1 and a4, and a1 and a6 by using the regular expression as defined below:
next/next/edge/next/next−1/node
Starting in the node a1, next returns a2, then next returns a3, then edge returns p3, then
next returns p4, then next−1 returns p3 and p5 and then node returns the nodes a4 and
a6. The dashed lines are shown just to help the user to see the steps to recognize the path
from a1 to a6.
Figure 3.3: Path connecting a1 to a6 [19].
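The axes can be read as simple set-valued steps over the triples. The sketch below is a plain-Python illustration over a hypothetical three-triple fragment whose node and predicate names are placeholders, loosely shaped like Figure 3.3 but not its exact data; a real nSPARQL evaluator works over the indexed RDF store described by the authors.

```python
# Hypothetical fragment: each triple (s, p, o) is an edge s --p--> o.
triples = {("a1", "p1", "a2"), ("a2", "p2", "a3"), ("a3", "p3", "a4")}

def step(axis, x):
    """All values reachable from x in one step along the given axis."""
    if axis == "next":      # subject -> object
        return {o for (s, p, o) in triples if s == x}
    if axis == "edge":      # subject -> predicate of a leaving edge
        return {p for (s, p, o) in triples if s == x}
    if axis == "node":      # predicate -> object
        return {o for (s, p, o) in triples if p == x}
    if axis == "next-1":    # object -> subject (inverse of next)
        return {s for (s, p, o) in triples if o == x}
    raise ValueError(f"unknown axis: {axis}")

def follow(path, start):
    """Evaluate a sequence of axes from a starting value."""
    values = {start}
    for axis in path:
        values = set().union(*(step(axis, v) for v in values))
    return values
```

For instance, follow(["next", "next", "edge"], "a1") walks two next steps and then reads the leaving edge, the same shape of movement used in Example 8.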
It is also possible to use Nested Regular Expressions (NRE) with nSPARQL. These
regular expressions can be used to test for the existence of certain paths starting at a
given axis and their syntax is defined by the following grammar:
exp := axis | axis :: a (a ∈ U) | axis :: [exp] | exp/exp | exp|exp | exp∗
With axis being one of self, next, next−1, edge, edge−1, node or node−1. As usual
for regular expressions, exp+ is used as a shorthand for exp/exp∗. Given an RDF graph
D, the expression next :: a identifies the pairs of nodes (x, y) such that (x, a, y) ∈ D.
Given that a node can also be the label of an edge, it is also possible to navigate from
a node to one of its leaving edges using the edge axis. The interpretation of edge :: a
is the pairs of nodes (x, y) such that (x, y, a) ∈ D. To express the regular expression
nesting, the nesting construction [exp] is used to check for the existence of a path defined
by expression exp. The evaluation of the expression next :: [exp] in a graph D, retrieves
the pairs of nodes (x, y) such that there exists a node z with (x, z, y) ∈ D, and that there
is a path in D that follows the expression exp starting in z.
Example 9 Considering the graph D in Figure 3.1, and the expression next :: [next ::
sp/self :: train], the path evaluation will start from the inner expression next :: sp/self ::
train, defining the pairs of nodes (z, w) such that it is possible to follow an edge labeled
sp from z and reach a node w labeled train. Only the node TGV satisfies this criterion,
so the external expression can be read as next :: TGV, defining the pairs of nodes that
are connected by an edge labeled TGV. The result of the expression evaluation in D is
{(Paris, Calais), (Paris,Dijon)}.
These navigation rules make the algorithm start in a vertex and follow a specific
path. The regular expressions can be nested in a way to allow the query to verify the
current vertex's hierarchy without losing the context. The SPARQL operators AND,
OPT, UNION and FILTER can also be used to increase the expressiveness of the
queries.
The authors also defined an algorithm to perform the search in the graph, and prove
its efficiency, considering two problems: (i) verify if a given pair of vertices is in the result
of the expression evaluation in a graph; and (ii) given a vertex a, find which are the pairs
(a, b) which match what was specified in the expression in the given graph.
Note that, for both problems, the algorithm receives at least one vertex as input. The
authors opted not to answer queries of the kind "return all pairs which match a given
expression in the graph" because the algorithm would have quadratic complexity in the
worst case, in terms of the number of vertices in the graph, only to return the result.
For the proposed problems, the authors managed to prove their algorithm has complexity
O(|G| · |exp|), where |G| is the size of the input graph and |exp| is the size of the nested
regular expression being evaluated, and, to return the result to problem (i), the time
complexity is constant, in terms of the number of vertices in the graph; to problem (ii),
it is linear in terms of the number of vertices in the graph.
Even though nSPARQL improves the expressiveness of graph query languages by
adding the NREs, it still relies on Regular Expressions to perform the queries, which
describe a stricter class of languages than the Context-Free Grammars. Since we intend
to use an LR(1) parsing mechanism, our solution allows higher expressiveness in the
queries, such as element counting.
3.2 Conjunctive Context-Free Path Queries
With the purpose of increasing the expressiveness of Conjunctive Regular Path Queries,
a new query mechanism for searching for paths in directed graphs, called Conjunctive
Context-Free Path Queries (CCFPQ), was proposed in (J. Hellings, 2014) [12]. CCFPQ
updates the Conjunctive Regular Path Queries by replacing the Regular Expressions with
Context-Free Grammars in Chomsky Normal Form.
Given a graph D, formed by a tuple (V, E, ψ), where V is the set of all the graph
vertices, E is the set of all the graph edges and ψ is a function connecting two vertices
via one edge, a path π = (n1 e1 ... ni−1 ei−1 ni) in D is a non-empty finite sequence of
vertices connected by edges, where n ∈ V and e ∈ E, respecting the rule that there is
always only one edge between consecutive vertices in the path sequence. The expression
nπm is used to identify a path π between vertices n and m, and the trace of the path π is
defined by T = (l1 ... ln), where l is the label of each edge in the path.
A Context-Free Path Query over a grammar G = (N, T, S, P) is defined as follows:

    Q(v) ← ∃µ ∧_{i∈I} Ni(ni, mi)

where Q is the name of the query, v is a tuple of vertex variables, µ is a tuple of distinct
vertex variables that do not occur in v, i ranges over a finite index set I, Ni ∈ N is a non-
terminal, and ni and mi are vertex variables from v or µ. A CCFPQ built with a regular
grammar is called a Conjunctive Regular Path Query (CRPQ).
The author provides an algorithm to query the graph databases with the proposed
mechanism and proves its execution time to be O(|G| · n^5) and O(n^3 · m^2), where |G|
depends on the size of the provided Context-Free Grammar, n is the number of vertices
in the graph and m is the maximum path length.
The proposed algorithm uses an adaptation of the CYK parsing method, presented
in [8], to pre-process the graph, adding edges that connect vertices which form a path
according to the given Context-Free Grammar. For each path found that matches a
production rule of the Context-Free Grammar, an edge connecting the first vertex to the
last vertex of the path is added to the graph, so that queries through that path follow
this new edge instead of searching through the whole path again.
Example 10 Figure 3.4 shows the edges that are added to the graph of Figure 2.5 when
the grammar in Figure 2.6 is used with this parsing method. Later, the provided algorithm
only needs to search the added edges to identify the correct paths, which, in this case,
would be a → c, a → f and b → e.
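The edge-addition idea can be sketched as a fixpoint computation specialized to the grammar of Figure 2.6, rather than the paper's CNF-based CYK adaptation. The edge set below is reconstructed from Examples 4 and 10, so it should be read as an assumption about the exact shape of the graph of Figure 2.5.

```python
# Assumed edges of Figure 2.5, reconstructed from the examples in the text.
edges = {("a", "(", "b"), ("b", ")", "c"), ("b", "(", "d"),
         ("d", ")", "e"), ("e", ")", "f")}
vertices = {v for (u, _, w) in edges for v in (u, w)}

def s_pairs():
    """Pairs of vertices joined by a path whose labels derive S in the
    grammar S -> (P), P -> epsilon | (P)."""
    # P -> epsilon: every vertex is P-related to itself.
    P = {(v, v) for v in vertices}
    changed = True
    while changed:
        changed = False
        # P -> (P): an open edge, a P-pair, then a close edge.
        for (u, l1, x) in edges:
            for (y, l2, w) in edges:
                if l1 == "(" and l2 == ")" and (x, y) in P and (u, w) not in P:
                    P.add((u, w))
                    changed = True
    # S -> (P): same shape, but at least one parenthesis pair is required.
    return {(u, w) for (u, l1, x) in edges if l1 == "("
            for (y, l2, w) in edges if l2 == ")" and (x, y) in P}
```

Under these assumed edges, the computed pairs are exactly the new edges of Figure 3.4: a → c, a → f and b → e.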
Some variants of CRPQ also allow the definition of explicit path variables. Path
variables are used in the form:
Q() ← ∃n,m ∃nπm . r1(π) ∧ r2(π)
Figure 3.4: Adding extra edges to the graph.
where r1 and r2 are regular expressions and place conditions on the trace of a path. These
regular expressions can be used to specify which paths should be returned by a query.
In CCFPQ, the expressiveness is increased by allowing Context-Free Grammars in the
query, at the cost of pre-processing the graph. In our work, we aim at achieving
expressiveness close to that of CCFPQ queries but, by restricting ourselves to LR(1)
grammars, we intend to reduce the complexity of the algorithm.
3.3 Context-Free Path Queries on RDF Graphs
In (X. Zhang, Z. Feng, X. Wang, G. Rao, W. Wu, 2016) [36], the authors proposed
Context-Free Path Queries to navigate through an RDF graph, and the Context-Free
SPARQL query language (cfSPARQL), built on the context-free path queries introduced
by [12] and extended with the standard SPARQL operations and nested regular
expressions, uniting the research of the two previous related works on Conjunctive
Context-Free Path Queries (CCFPQ).
A Conjunctive Context-Free Path Query in cfSPARQL has the form:

q(?x, ?y) := α_1 ∧ α_2 ∧ · · · ∧ α_m
where q is the name of the query, each α_i is a triple in the form (?x, ?y, ?z) or in the form
v(?x, ?y), with ?x being the subject, ?y the predicate and ?z the object in the set of edges
in the graph, allowing a query to include both nSPARQL nested regular expressions and
context-free path queries; and {?x, ?y} is a subset of the variables occurring in the body
of q.
Example 11 Consider a Context-Free Grammar D = (N, T, S, P) where N = {S, R},
T = {next ::, next :: sp, self :: train}, S = {S} and P is the set of productions defined
in Figure 3.5. The cfSPARQL query Q based on the grammar D, applied to the graph
defined in Figure 3.1, will return {(Paris, Calais), (Paris, Dijon)}.
S → next :: R
R → [next :: sp / self :: train]
Figure 3.5: Context-Free Grammar extended with nSPARQL regular expressions
It is also possible to merge more than one CCFPQ and capture more expressive power,
such as disjunctive capability. A union of conjunctive context-free path queries (UCCFPQ)
has the form:

q(?x, ?y) := q_1(?x, ?y) ∨ q_2(?x, ?y) ∨ · · · ∨ q_m(?x, ?y)

where q_i(?x, ?y) is a CCFPQ for all i = 1, ..., m.
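Evaluation of a UCCFPQ therefore reduces to evaluating each conjunctive query independently and taking the union of their answer sets. A schematic sketch, where each evaluator function stands in for an arbitrary CCFPQ engine (the function name is ours):

```python
def evaluate_uccfpq(ccfpq_evaluators, graph):
    """The answer set of a UCCFPQ q(?x, ?y) is the union of the answer
    sets of its member conjunctive queries q_1, ..., q_m."""
    answers = set()
    for evaluate in ccfpq_evaluators:
        answers |= evaluate(graph)  # each returns a set of (?x, ?y) bindings
    return answers

# Two stub evaluators standing in for CCFPQs over the graph of Figure 3.1:
q1 = lambda g: {("Paris", "Calais")}
q2 = lambda g: {("Paris", "Dijon")}
assert evaluate_uccfpq([q1, q2], None) == {("Paris", "Calais"),
                                           ("Paris", "Dijon")}
```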
The authors of [36] then give algorithms to parse the query, namely recognize and
convert, and analyze their complexity, claiming it to be O(|D|) for converting and
O((|N| · |D|)^3) for recognizing, where |D| is the size of the graph D and |N| is the
number of non-terminal symbols in the given Context-Free Grammar.
In addition to the query algorithm, the authors introduce a language called Context-
free SPARQL, which extends SPARQL to use context-free triple patterns and SPARQL
basic operations, like UNION, AND, OPT, FILTER and SELECT.
Figure 3.6 shows the general relation between the variants of CFPQ and nested regular
expressions in terms of expressiveness, where a → b means that a is expressible in b.
According to the authors, UCCFPQ can express queries in CCFPQ, and their extensions.
In CFPQ and its extensions, the query structure and query evaluation are similar to those
proposed in [12], but the query expressiveness is increased with the addition of the NREs.
Figure 3.6: Comparison between the languages.
3.4 Context-Free Path Querying with Structural Representation of Result
In (S. Grigorev, A. Ragozina, 2016) [7], the authors propose an approach to recognize
context-free paths in RDF graphs using a top-down method based on the GLL [29] parsing
algorithm. The algorithm allows one to build such a structural representation with respect
to a given grammar in polynomial time and space for an arbitrary context-free grammar
and graph, in terms of the number of vertices in the graph. The authors state that the
proposed algorithm's runtime complexity is O(|V|^3 · max_{v∈V} deg^+(v)), where V is the
set of vertices and deg^+(v) is the out-degree of vertex v. For complete graphs, the proposal
has runtime complexity O(|V|^4).
3.5 Top-Down Evaluation of Context-Free Path Queries
in Graphs
The work in [17] was developed in parallel to our research and tries to solve problems
similar to ours. The author proposes an algorithm that evaluates context-free path queries
using top-down [1] parsing techniques. The proposed algorithm requires the input LL
Context-Free Grammar to be in the Chomsky Normal Form and receives, as parameters,
a data graph, the parsing table of said grammar and a set of query pairs in the form
(a, X). The algorithm parses the paths by populating three sets of processed pairs, which
help the algorithm to identify when it is time to stop processing, returning a set of
tuples (a, X, b), each representing that there is a valid path, according to the non-terminal
X in the grammar, from a to b in the data graph.
The main difference in functionality between our work and [17] is that its query pa-
rameter allows the user to specify any non-terminal in the grammar to be returned in the
response, giving the user more flexibility.
Given that V is the set of vertices in the graph and P the set of production rules
in the given grammar, the author calculates the time complexity of the algorithm to be
O(|V |3|P |). Since this work is being developed in parallel to ours, we will be able to run
some experiments and make comparisons between the performance of both proposals.
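The query-pair interface of [17] can be imitated by deriving a relation per non-terminal and filtering it against the query pairs (a, X). The sketch below is our own naive bottom-up stand-in for the author's top-down procedure, under the same CNF assumption (unit rules A → t, binary rules A → B C); names are ours, not the author's:

```python
def answer_query_pairs(edges, unit_rules, binary_rules, query_pairs):
    """Return the tuples (a, X, b) such that (a, X) is a query pair and some
    path from a to b spells a string derivable from the non-terminal X."""
    rel = {}                                   # non-terminal -> set of (u, w)
    for (A, t) in unit_rules:
        rel.setdefault(A, set()).update(
            (u, w) for (u, lbl, w) in edges if lbl == t)
    changed = True
    while changed:                             # fixpoint: rel[A] |= rel[B];rel[C]
        changed = False
        for (A, B, C) in binary_rules:
            composed = {(u, w) for (u, v) in rel.get(B, set())
                        for (v2, w) in rel.get(C, set()) if v2 == v}
            if not composed <= rel.setdefault(A, set()):
                rel[A] |= composed
                changed = True
    return {(a, X, b) for (a, X) in query_pairs
            for (u, b) in rel.get(X, set()) if u == a}
```

For a query pair (a, S) over a graph with a balanced-parentheses path from a, the result contains one triple (a, S, b) per reachable endpoint b.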
3.6 Tomita-Style Generalized LR Parsers
In [30], an LR parsing algorithm is introduced, which takes an input string, a1 a2 ... an,
and uses it to traverse a DFA, constructing a Graph Structured Stack (GSS) during the
process. We found this structure to be of interest to this research, since it is well suited
to storing the data manipulated by the algorithm we are implementing.
A GSS is formed by state nodes, labeled after the DFA states, and a set of symbol
nodes, which are labeled after the grammar symbols. The state nodes are grouped together
into disjoint sets: an initial set, U0, and one set, Ui, for each element ai of the input string.
U0 is the input related reduction-closure of the start state of the DFA and, for 1 ≤ i ≤ n,
Ui is the input related reduction-closure of the set of all states which can be reached
from a state in Ui−1 along a transition labeled ai. A node is at level i if it is in Ui. A
reduction via the rule A ::= α is valid for a state node v which is at level i and has label h
if the DFA state h contains the item (A ::= α·, ai+1), which means that valid reductions
are those which can be applied when the input a1...ai has been read and the lookahead
input symbol is ai+1.
In the GSS, all successors and predecessors of a symbol node are state nodes and all
successors and predecessors of a state node are symbol nodes. The GSS constructed from
Figure 3.7: Representation of a shift transition in a GSS.
input a1...an contains a subgraph as represented in Figure 3.7, which shows that there is
a node in Ui−1 labeled k, a node in Ui labeled h, and a transition labeled ai from k to h in
the DFA. This step corresponds to the shift action in the LR Parser, where the input
symbol is read and added to the top of the stack.
Figure 3.8: Representation of a reduce transition in a GSS.
Similarly, the reduce action in the LR Parser, which consists of removing from the
stack as many elements as there are symbols in the RHS of the rule for the given
non-terminal, is represented in the GSS in the form shown in Figure 3.8, where
u ∈ Uj, w ∈ Ui, A ::= x1...xm is a grammar rule and there is a transition from w to v
labeled A in the DFA. This way, the node v is reduction related to the node w via a path
of length 2m and a symbol node labeled A.
Figure 3.9: Representation of a reduce transition for a ε-transition in a GSS.
Since the reduction removes |RHS| elements from the stack, reducing a rule of the
form A ::= ε will create a link between v and u, because the |RHS| is zero. This transition
may be seen in Figure 3.9.
When the GSS is complete, the final set Un of state nodes is examined. The input
string is in the language if, and only if, Un contains the accepting state of the DFA.
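If the levels Ui are represented simply as sets of DFA state labels, this acceptance condition reads as follows (a schematic fragment of our own; the state-node and symbol-node details are omitted):

```python
def accepts(levels, accepting_states):
    """Tomita-style acceptance test: the input a1...an is in the language
    if, and only if, the final level Un contains an accepting DFA state."""
    return bool(levels[-1] & accepting_states)

# Levels built while parsing some input, with DFA accepting state 1:
assert accepts([{0}, {2}, {4}, {3, 5}, {1}], {1}) is True
assert accepts([{0}, {2}], {1}) is False
```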
The GSS in Figure 3.10 is obtained by parsing the string ( ( ) ) for the Context-Free
Grammar {S → ( P ), P → ( P ), P → ε}. For the initial state, a node v0, labeled 0,
is added to the GSS in the U0 set. The first input symbol, (, is read. The state machine
Figure 3.10: GSS generated for parsing the input string ( ( ) ) for the grammar {S → ( P ), P → ( P ), P → ε}.
for the DFA moves to the DFA state i2, which is the target of the transition labeled (
from the current state. Since this is a shift operation, a node v1, labeled 2, is added to
the U1 set, together with a symbol node labeled (, this symbol node being the successor
of v0 and the predecessor of v1.
The next step is to look at state 2 in the DFA to identify whether it has any valid
reductions on the next input symbol (. Since only one symbol was consumed, the items
being searched for are of the form (A ::= a1·, a2). For each one of these items, a new node
is added to U1, connected to v1 by the terminal symbol used in the transition in the
DFA, if not yet present in the GSS.
The process continues reading input symbols until either it finds a symbol for which
no transition is defined, meaning the input string is not in the language and a failure is
reported, or it reaches an accepting state of the DFA in Un, reporting success.
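Before generalizing to graphs, it is worth seeing the degenerate case: ordinary LR(1) recognition of a single string with one stack, which is what the GSS construction above generalizes. A minimal sketch, with a hand-written table for the grammar {S → ( P ), P → ( P ), P → ε} used in the Figure 3.10 example (augmented with S' → S $; the state numbering is ours):

```python
# Hand-written LR(1) table for: S' -> S $ ; S -> ( P ) ; P -> ( P ) | eps
ACTION = {
    0: {"(": ("s", 2)},
    1: {"$": ("acc",)},
    2: {"(": ("s", 4), ")": ("r", "P", 0)},  # reduce P -> eps
    3: {")": ("s", 5)},
    4: {"(": ("s", 4), ")": ("r", "P", 0)},  # reduce P -> eps
    5: {"$": ("r", "S", 3)},                 # reduce S -> ( P )
    6: {")": ("s", 7)},
    7: {")": ("r", "P", 3)},                 # reduce P -> ( P )
}
GOTO = {0: {"S": 1}, 2: {"P": 3}, 4: {"P": 6}}

def lr_recognize(tokens):
    """Classic LR(1) recognition of a single finite string with one stack
    (the degenerate, single-path case of the GSS construction)."""
    stack = [0]                  # stack of DFA states
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        act = ACTION[stack[-1]].get(tokens[pos])
        if act is None:
            return False         # unexpected symbol: not in the language
        if act[0] == "acc":
            return True
        if act[0] == "s":        # shift: push the target state, consume input
            stack.append(act[1])
            pos += 1
        else:                    # reduce A -> alpha: pop |alpha| states, goto
            _, lhs, rhs_len = act
            del stack[len(stack) - rhs_len:]
            stack.append(GOTO[stack[-1]][lhs])

assert lr_recognize("(())") is True
assert lr_recognize("(()") is False
```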
After analyzing the related works, we have enough information to start developing our
solution to the proposed problems. To be able to use GSS in our research, we need to
extend it so we can store data for the multiple traces that may be found in a graph. These
adaptations to the GSS are presented in Chapter 4. Also, if we were using the conventional
LR parsing algorithm, we would have to run the parser and create one parsing stack for
each vertex and each path, but the GSS data structure allows us to identify all paths at
once, with a single instance of the GSS. Our algorithm allows increased expressiveness
for the queries when compared to the related works whose path queries are described by
Regular Grammars. Another contribution of our algorithm is that it uses a technique
traditionally used for parsing strings of Context-Free Grammars to process the queries.
4 The GrLR Query Processing Algorithm Approach
Given a graph D formed by a set of tuples (v, e, w), where v, w ∈ V are vertices of the
graph and e ∈ E is the edge connecting both vertices, containing the data to be queried,
and a Context-Free Grammar G = (N, T, S, P) to define the valid paths, in this chapter
we use some of the concepts presented during this research to explain how we solve the
proposed problems. In Section 4.1, we present and detail the final version of the algorithm,
showing an example execution. In Section 4.2, we calculate our algorithm's time and
space complexity and, in Section 4.3, prove that our algorithm is correct.
The LR Parsing algorithm was originally designed to parse a single finite string of
symbols. The algorithm continues parsing the string while it is still considered to be a
valid sentence in the language formed by the input grammar, stopping when reaching the
last symbol of the string or when finding an unexpected symbol.
Since the graph may have multiple paths that form valid sentences, we may have to
parse multiple strings. We start at a set of vertices in the data graph and use the labels of
the edges forming traces W = v0 e0 v1 e1 ... e(n−1) vn to other nodes as the strings to be
parsed. When identifying the traces, if the algorithm reaches a vertex in the graph that
has more than one outgoing edge, it tries to parse one path for each edge. Given a
Context-Free Grammar G = (N, T, S, P), if the algorithm were parsing the graph shown
in Figure 4.1, it would find the traces a -(-> b -)-> c and a -(-> b -)-> d, provided the
string ( ) belongs to the language generated by G. The answer, in this case, would be a
set containing (a, S, c) and (a, S, d), with S being the start symbol of the grammar.
We replace the ordinary stack used by the original LR parsing algorithm with the
GSS structure. We modified the GSS node structure to also store the vertex in the graph
which was just processed. This allows us to easily identify the vertices when calculating
the answers for the query.

Figure 4.1: Paths identified in a graph.
Before presenting the algorithm itself, let us define some auxiliary functions specific
to the manipulation of the GSS and the parsing table. The first function is
CreateParsingTable(G), which receives the grammar and builds a table with |I| rows
and |T| columns, with I being the set of possible states of the DFA. This function also
returns the initial state s0 of the DFA. Each element in the table may contain a valid
action to be performed by the parsing algorithm when reaching a state i ∈ I and receiving
an input symbol t ∈ T. Since the creation of the parsing table is not in the scope of this
research, this function uses an external tool, called JSMachines [13], for building the
LR(1) parsing table.
Example 12 Let’s suppose that our algorithm has built a GSS as shown in Figure 4.2
and is ready to process the GSS pair labeled v3. Looking at the parsing table, it sees
that the next action to perform is to reduce the rule P → a a, whose RHS size is 2. The
result of a call to GSS_Up in this node, passing 2 as third parameter is the set of nodes
{v0, v1}.
Figure 4.2: Execution of a GSS_Up function call.
For the GSS specific operations, we have the function CreateGSS(Q, s0), which
initializes the GSS with a set of pairs in the form (v, s0), associating each v ∈ Q with the
initial state s0 of the DFA. We also have the function GSS_Pairs(GSS, level), a
convenient way to retrieve all the pairs in a given level of the GSS. The third function
is called GSS_Up(GSS, (a, si), |α|) and returns all GSS pairs that originated the path
being processed for the current rule, as shown in Example 12. This function follows the
same implementation as in the original GSS algorithm, with the difference that it detects
the multiple paths being followed. It starts at the (a, si) pair in the current GSS level
and traverses the GSS from there |α| steps back, returning the pairs reached when the
traversal stops. The fourth and last function is called GSS_Insert_Pair(GSS, level,
GSSPair), which inserts the given pair in the GSS at the given level.
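These helpers can be sketched as follows. This is an illustration under our own representation choices (a GSS pair is a (vertex, state) tuple, each level is a set of pairs, and predecessor links record one grammar symbol per step), not the thesis implementation [9]:

```python
class GSS:
    """Levels of (vertex, state) pairs plus predecessor links."""
    def __init__(self):
        self.levels = []   # levels[i] = set of (vertex, state) pairs in Ui
        self.preds = {}    # (level, pair) -> set of predecessor (level, pair)

def create_gss(Q, s0):
    """CreateGSS(Q, s0): level U0 holds (v, s0) for each start vertex v."""
    gss = GSS()
    gss.levels.append({(v, s0) for v in Q})
    for pair in gss.levels[0]:
        gss.preds[(0, pair)] = set()
    return gss

def gss_pairs(gss, level):
    """GSS_Pairs(GSS, level): all pairs stored at the given level."""
    return set(gss.levels[level]) if level < len(gss.levels) else set()

def gss_insert_pair(gss, level, pair, pred=None):
    """GSS_Insert_Pair(GSS, level, GSSPair): add a pair at a level and,
    optionally, link it to the node it was reached from."""
    while len(gss.levels) <= level:
        gss.levels.append(set())
    gss.levels[level].add(pair)
    gss.preds.setdefault((level, pair), set())
    if pred is not None:
        gss.preds[(level, pair)].add(pred)

def gss_up(gss, level, pair, steps):
    """GSS_Up(GSS, (a, si), |alpha|): walk |alpha| predecessor links back,
    following every path at once, and return the nodes reached."""
    frontier = {(level, pair)}
    for _ in range(steps):
        frontier = {p for node in frontier for p in gss.preds.get(node, set())}
    return frontier

# Rebuilding the GSS of Figure 4.2: v0 = (a, i0) and v1 = (b, i0) at U0,
# v2 = (b, i2) at U1 (reached from both v0 and v1), v3 = (d, i4) at U2.
g = create_gss({"a", "b"}, 0)
gss_insert_pair(g, 1, ("b", 2), pred=(0, ("a", 0)))
gss_insert_pair(g, 1, ("b", 2), pred=(0, ("b", 0)))
gss_insert_pair(g, 2, ("d", 4), pred=(1, ("b", 2)))
# Reducing P -> a a (|alpha| = 2) from v3 yields {v0, v1}, as in Example 12.
assert gss_up(g, 2, ("d", 4), 2) == {(0, ("a", 0)), (0, ("b", 0))}
```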
4.1 The algorithm
The function for querying the graph is called GrLR, defined in Algorithm 4, which takes
as input a data graph DG ⊆ V × E × V to query from, a Context-Free Grammar G =
(N, T, S, P ) and a Context-Free Path Query Q ⊆ V , which contains the vertices in the
data graph to start the query from.
The first part of the algorithm initializes the parsing table for the given grammar and
the GSS by filling its U0 level with all the starting vertices for the query, associated with
the starting DFA state for the grammar (lines 2 to 7).
With the GSS initialized, it is time to perform the actual query. The algorithm starts
processing the valid paths in the main while loop, starting at line 8. Each iteration of
the loop parses the current symbol of each path being analyzed and tries to process three
kinds of operations. First, it tries to process reduce operations (lines 10 to 25), then it
tries to identify valid answers by processing accept operations (lines 26 to 32) and then
the shift operations (lines 33 to 41). If no valid operation is defined for the current input
symbol, our algorithm simply abandons the path being followed (we do not return an
error as done in the original LR parsing algorithm).
This follows the same logic of the LR parsing algorithm using GSS instead of a stack.
The variable changed is set to FALSE in the beginning of each iteration, at line 9. If any of
these three operations is performed and produces a newly discovered element in the GSS
structure, then the algorithm sets the changed variable to TRUE (lines 25, 32 and 41). The
main loop iterates until the changed variable is not set to TRUE in a given iteration. The
verification for this is done at line 42.
We need to discuss the stop condition of the algorithm before explaining the three
specific operations in depth. Given the grammar G = {S → (S), S → ε}, it is possible
that the graph has paths that create a valid loop according to the grammar, as seen in
Figure 4.3.
Figure 4.3: Graph with a loop.
We defined that, in this case, a good stop condition would be to have three sets of
values that should be checked after each loop iteration, which we called ReductionEdges,
Answers and VisitedPairs (lines 4 to 6). The ReductionEdges set includes tuples in the
format (a, N, b), storing the information about all reductions made between vertices a
and b by a rule whose LHS is N. Whenever a new tuple is found during a reduction, we
set changed to TRUE (lines 23 to 25). The Answers set includes pairs in the format (a, b),
representing all valid answers to the query, meaning that there is a valid path, according
to the given grammar, from a to b. Whenever a new answer is found, the changed variable
is set to TRUE (lines 30 to 32). The VisitedPairs set includes pairs in the format (a, i),
meaning that the algorithm has already visited a vertex a of the data graph in the state
i of the DFA. If any new pair is found, then the changed variable is set to TRUE (lines
39 to 41). At the end of the main loop the algorithm checks whether the changed variable
is set to TRUE, increases the level by one and continues (line 43). If it remains FALSE,
the algorithm stops (line 42).
Now let us discuss the three specific operations. Given a data graph D ⊆ V × E × V,
the GSS contains pairs (a, i) at each level, where a ∈ V and i is a state in the DFA of
the given grammar.
Processing Reduces. First, the algorithm retrieves all the pairs (a, i) at the current
level of the GSS (lines 12 and 13). For each edge in the data graph in the format (a, t, b),
with a, b ∈ V and t ∈ E, the algorithm looks in the parsing table if there is a valid reduce
Algorithm 4: GrLR Query Processing Algorithm.

input : - a data graph DG ⊆ V × E × V;
        - a Context-Free Grammar G = (N, T, S, P);
        - a Context-Free Path Query Q ⊆ V.
output: - AnswersG(Q).

 1  Function GrLR(DG, G, Q) : AnswersG(Q)
 2      (ParsingTable, s0) ← CreateParsingTable(G)
 3      GSS ← CreateGSS(Q, s0)
 4      VisitedPairs ← ∅
 5      ReductionEdges ← ∅
 6      Answers ← ∅
 7      level ← 0
 8      while TRUE do
 9          changed ← FALSE
            // processing reduces
10          PairsToProcess ← GSS_Pairs(GSS, level)
11          while PairsToProcess ≠ ∅ do
12              choose (a, si) ∈ PairsToProcess
13              PairsToProcess ← PairsToProcess \ {(a, si)}
14              NextTerminals ← {terminal | (a, terminal, b) ∈ DG} ∪ {$}
15              for each terminal ∈ NextTerminals do
16                  for each ParsingTable[si][terminal] do
17                      if ParsingTable[si][terminal] = REDUCE A → α then
18                          Ancestors ← GSS_Up(GSS, (a, si), |α|)
19                          for each (c, sj) ∈ Ancestors do
20                              GSSPair ← (a, ParsingTable[sj][A])
21                              GSS_Insert_Pair(GSS, level, GSSPair)
22                              PairsToProcess ← PairsToProcess ∪ {GSSPair}
23                              if (c, A, a) ∉ ReductionEdges then
24                                  ReductionEdges ← ReductionEdges ∪ {(c, A, a)}
25                                  changed ← TRUE
            // processing accept states
26          for each (a, si) ∈ GSS_Pairs(GSS, level) do
27              if ParsingTable[si][$] = ACCEPT then
28                  Ancestors ← GSS_Up(GSS, (a, si), 1)
29                  for each (c, sj) ∈ Ancestors do
30                      if (c, a) ∉ Answers then
31                          Answers ← Answers ∪ {(c, a)}
32                          changed ← TRUE
            // processing shifts
33          for each (a, si) ∈ GSS_Pairs(GSS, level) do
34              for each (a, terminal, b) ∈ DG do
35                  for each ParsingTable[si][terminal] do
36                      if ParsingTable[si][terminal] = SHIFT sj then
37                          GSSPair ← (b, sj)
38                          GSS_Insert_Pair(GSS, level + 1, GSSPair)
39                          if (b, sj) ∉ VisitedPairs then
40                              VisitedPairs ← VisitedPairs ∪ {(b, sj)}
41                              changed ← TRUE
            // has VisitedPairs or ReductionEdges changed at this level?
42          if not (changed) then break
43          level ← level + 1
44      return Answers
action for the DFA state i when looking at the symbol t (lines 15 and 16). If any is
found, then the algorithm calls the GSS_Up function, which returns the GSS pairs at the
beginning of the path where the parsing started for the rule currently being reduced,
and adds new pairs in the current GSS level connected to them by the non-terminal
represented in the LHS of the rule being processed (lines 18 to 22).
Processing Accepts. After the reductions, the algorithm fetches all the GSS pairs
(a, i) (including the ones generated by the reductions on the same level) in the current
GSS level (line 26) and consults the parsing table for accept actions for the DFA state i
(line 27). If one is found, the algorithm calls the GSS_Up function to identify the pairs
(c, j) which originated the valid paths that reached a, storing the pair (c, a) in the
Answers set (lines 28 to 31), meaning that there is a valid path from c to a following the
start symbol S, according to the given grammar.
Processing Shifts. The last action to process is the shift. The algorithm fetches all
the GSS pairs (a, i) (including the ones generated by the reductions on the same level)
in the current GSS level. For each edge in the data graph in the format (a, t, b), the
algorithm looks in the parsing table if there is a valid shift action to the DFA state j
from state i when looking at the symbol t (lines 35 and 36). If any is found, then the
algorithm adds a new pair (b, j) on the next level of the GSS (line 38).
4.1.1 Algorithm execution example
In this section, we present an execution example for the algorithm. A JavaScript imple-
mentation of our algorithm is available in [9] and may be used to reproduce the steps
described in this section. Consider the graph D in Figure 4.4a as the data graph to
be queried with the input grammar G represented in Figure 4.4b. We want to identify
nodes connected by paths that correspond to expressions with matching parentheses,
possibly nested. Examples of paths that correspond to the query are a -(-> b -)-> c,
a -(-> b -(-> d -)-> e -)-> f and b -(-> d -)-> e.
To do so, we initialize the parsing table for the given grammar, represented in Figure
4.4c, and initialize a GSS containing nodes for all the vertices in the given graph on the
(a) Data graph D: vertices a, b, c, d, e and f, with edges a -(-> b, b -)-> c, b -(-> d,
d -)-> e and e -)-> f.

(b) LR Grammar G:

S' → S $
S → ( P )
P → ε | ( P )

(c) Parsing table for the input grammar:

State   Action            Goto
        (     )     $     S'    S     P
i0      s2                      1
i1                  acc
i2      s4    r2                      3
i3            s5
i4      s4    r2                      6
i5                  r1
i6            s7
i7            r3
Figure 4.4: Input data for the algorithm example.
level U0. This means that we plan to identify all paths starting at each vertex in the
graph at once, according to the given grammar, by passing Q = {a, b, c, d, e, f} to the
algorithm.
The GSS in Figure 4.5 represents the initial state for the parsing method, containing
one node for each vertex in the graph, all of them pointing to the state i0 in the DFA.
Figure 4.5: Initialization of the GSS for the graph in Figure 4.4a.
To start querying, we call GrLR(D,G,Q) and start analyzing the level U0. For each
node (v, i) in this level, we identify its vertex v in the graph and read the labels of outgoing
edges (v, e, w). For each label found, we look at the parsing table to identify the actions
to take at state i, according to the outgoing edge’s label e. The algorithm identifies that
there are no reduce or accept actions to make in this level, but there are two shift actions
from node v0 when reading the edge a -(-> b and from node v1 when reading the edge
b -(-> d, both leading to state i2, and adds the two pairs (b, i2) and (d, i2) to level U1.
These pairs are also added to the VisitedPairs set. Since this set was previously empty, the changed
resulting GSS after processing level U0.
Figure 4.6: Resulting GSS after processing level U0.
Starting to process the U1 level, the algorithm identifies in the parsing table that there
is a reduce action from node v6 when reading the edge b -)-> c and another reduce action
from node v7 when reading the edge d -)-> e. Both reductions happen for the rule P → ε,
which has |RHS| = 0. The algorithm then calls the GSS_Up function to identify the GSS
pairs where the parsing of the current rule started. In this case, jumping zero times, the
reduction root returned is the GSS node being processed itself. The algorithm adds two
pairs to the same level being processed, v8 pointing to v6 and v9 pointing to v7, connected
by the LHS of the rule, P in this case.
The algorithm verifies that the tuples (b, P, b) and (d, P, d) did not yet exist in the
ReductionEdges set and adds them to it, also setting the changed variable to TRUE. After
this, the algorithm finds three shift actions to make from the nodes v6, v8 and v9, looks
at the parsing table to see which state each of them should take and adds the resulting
pairs to the next level.
The changed variable was set to TRUE, so the level variable is increased by one. Figure
44
4.7 shows the resulting GSS after processing the level U1.
a, i0
v0
b, i0
v1
...
f, i0
v5
( b, i2
v6
P b, i3
v8
( d, i2
v7
P d, i3
v9
( d, i4
v10
) c, i5
v11
) e, i5
v12
U0 U1 U2
Figure 4.7: Resulting GSS after processing level U1.
Since the changed variable was set to TRUE, we need to process the current GSS level.
Starting on the level U2, the algorithm first identifies three reduce actions from the GSS
nodes v10, v11 and v12 and calls the GSS_Up function to figure out where the parsing for
the rule being reduced started. The node v10 with the edge d -)-> e is reduced by the rule
P → ε. The nodes v11 and v12, with the end of the string, are reduced by the rule S → (P),
finding the nodes v0 and v1, respectively, as reduction roots. The tuples (d, P, d), (a, S, c)
and (b, S, e) are added to the ReductionEdges set and the changed variable is set to TRUE.
Next, it tries to identify the accept actions, and finds two of them, on nodes v14 and v15.
Looking at the results of GSS_Up, the algorithm adds the tuples (a, c) and (b, e) to the
Answers set. After finding the accept actions, the algorithm identifies that there is one
shift action from the GSS node v13 with the edge d -)-> e to the state i7 and adds the
(e, i7) pair to the VisitedPairs set. Figure 4.8 shows the resulting GSS after processing
the level U2.
Since a new node was added to a new level, we need to process it, moving on to level
U3. Here, the algorithm finds one reduce action from the node v16 with the edge
e -)-> f. This time, the reduction rule is P → (P), which means the algorithm needs
to go back three steps to find the reduction root v6. After this, there is also a shift action
on the node v17 with the edge e -)-> f.
After processing the level U3 and finding a shift action, the algorithm starts a new
Figure 4.8: Resulting GSS after processing level U2.
iteration, this time for level U4. Here, the algorithm only finds the node v18, which only
accepts a reduction by the rule S → (P) when given the end of the string. The reduction
root is v0. While searching for accepts, the algorithm finds that the newly added GSS
node v19 has an acceptance state in the parsing table and adds the pair (a, f) to the answers.
The algorithm stops because there were no shift actions performed while parsing the
current GSS level, so the level U5 is empty. The algorithm, then, returns the answers {(a,
c), (a, f), (b, e)}.
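The whole run above can be reproduced with a compact Python transcription of Algorithm 4. This is a sketch under our own data-structure choices (the parsing table of Figure 4.4c is written by hand; the reference implementation remains the JavaScript one in [9]):

```python
# Graph of Figure 4.4a and a hand-coded version of the LR(1) table of
# Figure 4.4c for the grammar  S' -> S $ ; S -> ( P ) ; P -> eps | ( P ).
EDGES = {("a", "(", "b"), ("b", ")", "c"), ("b", "(", "d"),
         ("d", ")", "e"), ("e", ")", "f")}
ACTION = {0: {"(": ("shift", 2)},
          1: {"$": ("accept",)},
          2: {"(": ("shift", 4), ")": ("reduce", "P", 0)},   # P -> eps
          3: {")": ("shift", 5)},
          4: {"(": ("shift", 4), ")": ("reduce", "P", 0)},   # P -> eps
          5: {"$": ("reduce", "S", 3)},                      # S -> ( P )
          6: {")": ("shift", 7)},
          7: {")": ("reduce", "P", 3)}}                      # P -> ( P )
GOTO = {0: {"S": 1}, 2: {"P": 3}, 4: {"P": 6}}

def grlr(edges, action, goto, Q, s0=0):
    levels = [{(v, s0) for v in Q}]              # U0: one pair per start vertex
    preds = {(0, p): set() for p in levels[0]}   # pred links, one symbol each
    out = {}
    for (u, t, w) in edges:
        out.setdefault(u, []).append((t, w))
    reduction_edges, answers, visited = set(), set(), set()

    def up(node, steps):                         # GSS_Up: |alpha| links back
        frontier = {node}
        for _ in range(steps):
            frontier = {p for n in frontier for p in preds[n]}
        return frontier

    def insert(lvl, pair, pred):                 # GSS_Insert_Pair + pred link
        while len(levels) <= lvl:
            levels.append(set())
        levels[lvl].add(pair)
        preds.setdefault((lvl, pair), set()).add(pred)

    level = 0
    while level < len(levels):
        changed = False
        todo = list(levels[level])
        while todo:                              # process reduces
            a, si = todo.pop()
            lookaheads = {t for (t, _) in out.get(a, [])} | {"$"}
            for t in sorted(lookaheads):
                act = action[si].get(t)
                if act and act[0] == "reduce":
                    _, lhs, size = act
                    for lc, (c, sj) in up((level, (a, si)), size):
                        new_pair = (a, goto[sj][lhs])
                        if new_pair not in levels[level]:
                            todo.append(new_pair)
                        insert(level, new_pair, (lc, (c, sj)))
                        if (c, lhs, a) not in reduction_edges:
                            reduction_edges.add((c, lhs, a))
                            changed = True
        for (a, si) in set(levels[level]):       # process accepts
            act = action[si].get("$")
            if act and act[0] == "accept":
                for _, (c, _) in up((level, (a, si)), 1):
                    if (c, a) not in answers:
                        answers.add((c, a))
                        changed = True
        for (a, si) in set(levels[level]):       # process shifts
            for (t, b) in out.get(a, []):
                act = action[si].get(t)
                if act and act[0] == "shift":
                    insert(level + 1, (b, act[1]), (level, (a, si)))
                    if (b, act[1]) not in visited:
                        visited.add((b, act[1]))
                        changed = True
        if not changed:
            break
        level += 1
    return answers

print(sorted(grlr(EDGES, ACTION, GOTO, {"a", "b", "c", "d", "e", "f"})))
# -> [('a', 'c'), ('a', 'f'), ('b', 'e')]
```

The run builds the same levels U0 to U4 as the figures above and converges after level U4, returning exactly the answer set of the example.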
4.2 Complexity
We dedicate this section to calculate our algorithm’s runtime and space complexity in the
worst case scenario, which corresponds to queries over a complete data graph. In this
context, as seen in Figure 4.10, a complete graph contains edges connecting every node
to every node with each terminal in the given grammar. This means that there is a path
to every node from every node, causing the algorithm to execute the maximum number
of shift and reduce operations. In addition to that, we consider that the queries will be
Figure 4.9: Resulting GSS after processing level U4.
made starting on all vertices in the graph, so that every node has to be investigated as
the source of a path.
Figure 4.10: Complete graph with three vertices.
4.2.1 Runtime complexity
Given a data graph D ⊆ V × E × V to query from, a Context-Free Grammar G =
(N, T, S, P ) and a Context-Free Path Query Q ⊆ V , we call ρ the size of the largest RHS
of the rules in P , and I the set of states in the DFA generated for the given grammar.
We can consider the algorithm as being composed of two main parts: initialization and
iterations. During the initialization, the algorithm initializes the GSS and generates |Q|
pairs on the U0 level. In the worst case scenario, the user is querying from all vertices of
the data graph, so |Q| = |V|. Considering that the creation of each GSS node has a fixed
cost k, we consider the runtime complexity in the worst case scenario for the initialization
to be k · |Q|, which gives us O(|V|).
After the initialization, the algorithm enters the main loop and iterates until no new
elements may be added in the ReductionEdges, Answers and VisitedPairs sets. The
ReductionEdges set contains tuples (v′, n, v), where v′, v ∈ V, which represent the
connections by a non-terminal n ∈ N found between the vertices v′, v of the data graph
during the processing of a reduction. The maximum number of elements which may be
added to the ReductionEdges set is |V|^2 · |N|. The Answers set contains tuples (v′, v),
where v′, v ∈ V, which are considered answers to the query. It contains all the connections
found between the starting vertices v′ ∈ Q of the query and other vertices v ∈ V in
the data graph which can be connected via the starting non-terminal S, and can have, at
maximum, |V|^2 elements. The VisitedPairs set contains pairs (v, i), where v ∈ V and
i ∈ I, which represent the states of the DFA that the algorithm found when reaching a given
vertex in the data graph. At maximum, it can contain |V| · |I| elements. Our algorithm
stops running whenever those three sets have converged to fixed points.
In order to calculate the runtime complexity of the main loop of the algorithm, first
we analyze the complexity of the reduce, accept and shift operations in separate. Also,
we have to compute the runtime complexity of the GSS_Up function, which is called at
lines 18 and 28.
The GSS_Up function is responsible for finding the ancestor GSS nodes for a reduc-
tion in a given GSS node, which is located some levels behind in the GSS, according to
the |RHS| of the rule being reduced. The GSS_Up function receives the GSS node which
enabled the reduction of the production rule and the number of steps to move back. In the
worst case scenario, the number of steps is ρ. At each GSS level, the function GSS_Up
finds at most |V| · |I| GSS nodes to investigate. For each GSS node found at a given
level, GSS_Up is called recursively, ρ times in the worst case. In this manner, the
worst case complexity of the GSS_Up function is O((|V| · |I|)^ρ).
Process Reduces. At line 10, the maximum number of pairs (v, i) returned by
GSS_Pairs is |V| · |I|, which is the maximum amount of GSS nodes that can be added
to any GSS level. The algorithm iterates through this set (line 11) and, for each pair
found, iterates through all triples (v, e, w) in the data graph (line 15), which, in the worst
case scenario, is exactly |T| · |V| per GSS pair. The algorithm then calls the GSS_Up
function in all of these cases, giving us |V|^(2+ρ) · |I|^(1+ρ) · |T| operations. The algorithm
then adds the tuples (v′, A, v) to the ReductionEdges set if they were not yet present. Since
the algorithm will, in the worst case scenario, process enough reduce operations to fill the
ReductionEdges set, we need to multiply the complexity of a single reduce operation by
|V|^2 · |N|. The complete formula for the runtime complexity of the reduce operations is
O( |V|^(4+ρ) · |I|^(1+ρ) · |T| · |N| ).
Process Accepts. At line 26, the maximum number of pairs retrieved by GSS_Pairs
is |V| · |I|, which is the maximum amount of GSS nodes that can be added to any GSS
level. In the worst case scenario, the algorithm calls GSS_Up for each one of them,
giving us (|V| · |I|)^(1+ρ) operations. Since the algorithm will, in the worst case scenario,
process enough accept operations to completely fill the Answers set, we need to multiply
this number by |V|^2, which is the maximum amount of elements that can be added to the
Answers set. The complete formula for the runtime complexity of the accept operations
is O( |V|^(3+ρ) · |I|^(1+ρ) ).
Process Shifts. At line 33, the maximum number of pairs (v, i) retrieved by GSS_Pairs
is |V||I|, which is the maximum number of GSS nodes that can be added to any GSS
level. For each one of them, the algorithm iterates through all edges (v, e, w) (line 34),
which takes |T||V| operations, and executes |I| shift operations (line 35). This gives us
(|V||I|)^2|T| operations for a single pass of shift processing. Since the algorithm will, in
the worst-case scenario, process enough shift operations to completely fill the VisitedPairs
set, we need to multiply this number by |V||I|, which is the maximum number of elements
that can be added to the VisitedPairs set. The complete formula for the runtime complexity
of the shift operations is O( (|V||I|)^3|T| ).
The complete formula for the runtime complexity of our algorithm, in terms of operations, is:

O( |V| + |V|^{4+ρ}|I|^{1+ρ}|T||N| + |V|^{3+ρ}|I|^{1+ρ} + (|V||I|)^3|T| )
Since the processing of the reduce operations in the algorithm has the highest runtime
complexity, the complexity of our algorithm in the worst case scenario is:
O( |V|^{4+ρ}|I|^{1+ρ}|T||N| )
By looking at this formula, one might think that converting the grammar to the
Chomsky Normal Form (CNF) would improve the algorithm's runtime complexity, since
the largest RHS (ρ) among the production rules of a grammar in this form is 2. However,
our experiments indicated that, while there would be fewer symbols to process per
production rule, an increased number of reductions would be required to parse the strings
of the grammar, increasing the execution time.
4.2.2 Space complexity
The level U_0 of the GSS contains at most |V| tuples (v, i_0). During the processing, each
level of the GSS may have up to |V||I| pairs, connected by |V|^2|T| edges. The number of
GSS levels to explore depends on the maximum number of elements that may be stored
in the ReductionEdges, Answers and VisitedPairs sets. So, we need to store up to
|V||I| · (|V|^2|N| + |V|^2 + |V||I|) pairs and |V|^2|T| · (|V|^2|N| + |V|^2 + |V||I|) edges. Thus,
the space complexity of our algorithm is O( |V|^3|N||I| + |V|^2|I|^2 + |V|^4|N||T| + |V|^3|T||I| ).
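Expanding the two products, and then dropping the terms dominated by others (|V|^3|I| is dominated by |V|^3|N||I|, and |V|^4|T| by |V|^4|N||T|), yields exactly this bound:

```latex
\begin{aligned}
|V||I|\,\bigl(|V|^{2}|N| + |V|^{2} + |V||I|\bigr) &= |V|^{3}|N||I| + |V|^{3}|I| + |V|^{2}|I|^{2} \\
|V|^{2}|T|\,\bigl(|V|^{2}|N| + |V|^{2} + |V||I|\bigr) &= |V|^{4}|N||T| + |V|^{4}|T| + |V|^{3}|T||I|
\end{aligned}
```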
4.3 Discussion about correctness
In this section, we discuss the correctness of the algorithm. The algorithm starts
with VisitedPairs, ReductionEdges and Answers empty; these sets are filled as the
parsing proceeds and the algorithm processes the connections between the vertices
in the data graph according to the grammar rules. The first step of the algorithm is
to create the GSS nodes referring to the starting vertices of the query, on the GSS level
U_0. From this point, the algorithm reaches the main loop and iterates until no new
information is added to those three sets. To find new information, the algorithm
consults the parsing table of the given grammar for valid actions according to the labels
of the edges leaving the vertices currently being parsed.
In the main loop, the algorithm identifies which operations may be executed by the
parser for each GSS node in the current GSS level. To process reductions, the algorithm
first fetches all GSS nodes in the current GSS level by calling GSS_Pairs (line 10). For
each node (a, i) found, the algorithm iterates through all tuples (a, e, b) in the data graph
starting with a and checks whether there is a reduce action allowed by the parsing
table of the given grammar for the symbol e. If the action is allowed, then a production
rule n → RHS will be reduced. The algorithm adds a new GSS node in the current
level, pointing to the reduction roots (c, j) as ancestor nodes. The reduction roots are
the GSS nodes where the parsing of the string being reduced was initiated, with c being
the vertex in the data graph and j the state the algorithm was in when the parsing of
the rule began. This reduction is made in the same way as in the original GSS algorithm.
After this, for each reduction node found, if not yet present, the algorithm adds one tuple
(c, n, a) to ReductionEdges and sets the changed variable to TRUE.
The next step in the main loop is to identify whether there are accept actions allowed by
the grammar for any of the GSS nodes in the current GSS level. The algorithm calls
GSS_Pairs (line 26) to get the (a, i) tuples of GSS nodes in the current level. For each
one of them, the algorithm verifies whether there is an accept action enabled by the
parsing table for the end symbol $. If the accept is defined in the parsing table, then the
algorithm has found a valid answer: the vertex in the GSS node being parsed can be
reached from one vertex of the query via S, the start symbol of the grammar. The
algorithm calls GSS_Up passing 1 as the number of steps to look back and finds GSS
nodes (c, i) in the U_0 level. If not yet present, the pair (c, a) is added to the Answers set
and the changed variable is set to TRUE. This new element is considered a valid answer
to the query and means that there is a connection between the query vertex c and a
vertex a in the data graph via the starting non-terminal S.
The last step in the main loop is to identify whether there are shift actions allowed by
the grammar for any of the GSS nodes in the current GSS level. The algorithm calls
GSS_Pairs (line 33) to get the (a, i) tuples of GSS nodes in the current level. For
each pair found, the algorithm iterates through all tuples (a, e, b) in the data graph
starting with a and checks whether there is a shift action to state j over e allowed by
the parsing table of the given grammar. If the action is allowed, then the shift operation
is processed in the same way as in the original GSS algorithm. After this, for each
shift operation performed, the algorithm adds one pair (b, j) to VisitedPairs if not yet
present and sets the changed variable to TRUE.
The requirement for the algorithm to keep running (line 42) is that, if there are still
symbols to be parsed, parsing them must result in new elements being added to
VisitedPairs, ReductionEdges or Answers, setting changed to TRUE (lines 25, 32 and
41). The VisitedPairs set may have at most |V||I| elements; ReductionEdges may
contain |V|^2|N| elements; and Answers may contain |V|^2 elements. Since these three
sets are built from finite collections of elements, they are also finite, and the algorithm
will reach a state where there is either nothing else to parse or the three sets no longer
change, keeping the changed variable FALSE at the end of the loop iteration, at which
point the algorithm stops.
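The termination argument mirrors the shape of the main loop. The sketch below is a simplification with hypothetical step functions (each returning the new elements it discovered for the three sets), not the actual prototype:

```python
# Fixed-point skeleton of the main loop: keep iterating while any of the
# three monotonically growing, finite sets gains a new element.

def run_until_fixpoint(process_reduces, process_accepts, process_shifts):
    visited_pairs, reduction_edges, answers = set(), set(), set()
    changed = True
    while changed:
        changed = False
        for step in (process_reduces, process_accepts, process_shifts):
            new_pairs, new_edges, new_answers = step(
                visited_pairs, reduction_edges, answers)
            for target, new in ((visited_pairs, new_pairs),
                                (reduction_edges, new_edges),
                                (answers, new_answers)):
                if not new <= target:      # something genuinely new was found
                    target |= new
                    changed = True
    return visited_pairs, reduction_edges, answers
```

Because the three sets are bounded (by |V||I|, |V|^2|N| and |V|^2 respectively) and only ever grow, the loop must eventually complete an iteration with `changed` still FALSE and stop.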
In this chapter, we presented our algorithm to query graph databases by exploring
paths described by LR(1) grammars. We discussed the runtime and space complexity
of our proposal and presented an example of its usage. In the next chapter, we report
some experiments intended to measure how our algorithm behaves in specific scenarios,
comparing the results with those obtained by the related works.
5 Experiments
We implemented our algorithm in the Python language in order to execute some
experiments and evaluate how it performs. We also compare the results with some of
the related works presented in Chapter 3. The execution time for each experiment was
obtained by measuring the average time of five executions. The experiments were executed
on a computer with 7.3 GB of RAM and an AMD Phenom II X4 B97 processor, running
Ubuntu 16.04 (x64) and Python 2.7. We also benefit from the speed gains provided by
the PyPy Python compiler [26], which uses Just-in-Time (JIT) compilation techniques [2].
This is the same computer as the one used to execute the experiments of C. M.
Medeiros [17], so we can directly compare our results with theirs without speculation.
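The measurement scheme can be reproduced with a few lines of Python; `run_query` below is a placeholder for any of the query executions, not a function of the prototype:

```python
import time

def average_time_ms(run_query, repeats=5):
    """Average wall-clock time of `repeats` executions, in milliseconds,
    following the five-run averaging described above."""
    total = 0.0
    for _ in range(repeats):
        start = time.time()
        run_query()
        total += time.time() - start
    return 1000.0 * total / repeats
```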
To compare the results of our implementation with the results of the other proposals,
we executed the same queries over the same databases used in their experiments (Section
5.1). Other experiments were executed to evaluate how the algorithm behaves in the
worst-case scenarios, like complete graphs, binary trees (Section 5.2) and string graphs
(Section 5.3).
5.1 Ontologies stored as RDF graphs
Both the works of X. Zhang et al. [36] and S. Grigorev et al. [7] performed two specific
queries, referred to as Q1 (Figure 5.1a) and Q2 (Figure 5.1b), to search for data
in some popular RDF databases, namely Skos, Generations, Travel, Univ-bench, Foaf,
People-pets, Funding, Atom-primitive, Biomedical, Pizza and Wine.
S → subClassOf subClassOf−1
S → subClassOf S subClassOf−1
S → type type−1
S → type S type−1
(a)
S → B subClassOf−1
B → subClassOf B subClassOf−1
B → ε
(b)
Figure 5.1: Grammars for Queries Q1 (a) and Q2 (b).
These databases contain basic information on wineries and pizza places, production
and distribution, social network users' relationships, etc. Q1 returns all pairs of
nodes which reside at the same hierarchy level, while Q2 returns the nodes which reside
one level above other nodes.
In this experiment, as expected, we found that the larger the graph and the number
of answers, the longer the queries take to execute. We achieved a performance
similar to the experiments in S. Grigorev et al. [7], but we cannot directly compare those
results because of the difference in the hardware used. Table 5.1 shows the number of
answers and the time that each algorithm takes to perform Q1 starting at all vertices of
the ontology databases. Figure 5.2 shows a bar chart with these results.
Ontology        #tuples  #results  GSSLR      Zhang [36]  Medeiros [17]  Grigorev [7]
                                   time (ms)  time (ms)   time (ms)      time (ms)

skos                252       810         19        1044             83            10
generations         273      2164         20        6091            173            19
travel              277      2499         32       13971            316            24
univ-bench          293      2540         23       20981            318            25
atom-primitive      425     15454        172      515285           2074           255
biomedical          459     15156        223      420604           2288           261
foaf                631      4118         25        5027            377            39
people-pets         640      9472         51       82081            914            89
funding            1086     17634        112         499           1754           212
wine               1839     66572        415     4075319           6797           819
pizza              1980     56195        436     3233587           7292           697
Table 5.1: Performance evaluation for Query Q1 on RDF databases.
Figure 5.2: Visualization of the results for the Query Q1 on RDF databases.
Table 5.2 shows the number of answers and the time that each algorithm takes to perform
Q2 starting at all vertices of the ontology databases. Figure 5.3 shows a bar chart with
these results.
Ontology        #tuples  #results  GSSLR      Zhang [36]  Medeiros [17]  Grigorev [7]
                                   time (ms)  time (ms)   time (ms)      time (ms)

skos                252         1          0          16              4             1
generations         273         0          0          13              3             1
travel              277        63          4         281             22             1
univ-bench          293        81          3         532             26            11
atom-primitive      425       122          1     4711499             45            66
biomedical          459      2871         39     1068851            486            45
foaf                631        10          1        1154             10             2
people-pets         640        37          5         247             23             3
funding            1086      1158         21         125            254            23
wine               1839       133          9         273             70             8
pizza              1980      1262         28      255853            335            29
Table 5.2: Performance evaluation for Query Q2 on RDF databases.
Figure 5.3: Visualization of the results for the Query Q2 on RDF databases.
5.2 Binary trees
S. Grigorev et al. [7] proposed the use of two different grammars, Q3 (Figure 5.5a)
and Q4 (Figure 5.5b), which define the same language; the first is ambiguous,
containing shift/reduce conflicts, while the second is unambiguous. Both grammars may
be used to detect paths in which every a is eventually matched by a corresponding b, as
in "ababab", "aaabbbab", etc.
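Both grammars generate the strings over {a, b} in which every a is matched by a later b, in balanced, possibly nested pairs (e.g. "ab", "aabb", "abab"). A quick membership check, treating a and b like open and close brackets, illustrates the language:

```python
# Membership test for the language of the grammars in Figure 5.5:
# S -> epsilon | a S b | S S  (equivalently S -> a S b S | epsilon).

def in_language(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == 'a' else -1
        if depth < 0:          # a 'b' with no earlier unmatched 'a'
            return False
    return depth == 0          # every 'a' was matched
```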
The trees in the experiment are binary, which means that every node has exactly zero
or two children. All the edges for each level of the tree have one terminal symbol of the
Figure 5.4: Top-down (a) and Bottom-up (b) tree patterns used in the experiment.
S → ε
S → a S b
S → S S

(a)

S → a S b S
S → ε

(b)
Figure 5.5: Grammars for queries Q3 (a) and Q4 (b).
grammar. The paths in the trees follow the pattern v0 −a→ v1 −b→ v2. In the experiment,
there are two terminals, a and b.
The experiment starts by executing the queries for both the Q3 and Q4 grammars on a
tree of height 1, consecutively increasing the tree height. Following what was done by C.
M. Medeiros [17], we also used two tree patterns in our experiments: the first is a tree
where the paths start at the root and spread towards the children (top-down), as seen in
Figure 5.4a, and the second has paths starting at the leaves, directed towards the root
(bottom-up), as seen in Figure 5.4b.
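The top-down pattern can be generated as a list of labelled triples. The alternation of a-levels and b-levels below is an assumption consistent with Figure 5.4a and the path pattern v0 −a→ v1 −b→ v2; this helper is illustrative, not code from the prototype:

```python
# Generate the top-down complete binary tree: node i has children 2i and
# 2i+1, and all edges leaving one level carry the same terminal,
# alternating between 'a' and 'b' from the root down.

def top_down_tree(height):
    """Yield (parent, label, child) triples for a tree of the given height."""
    triples = []
    for parent in range(1, 2 ** (height - 1)):
        level = parent.bit_length()              # root (node 1) is at level 1
        label = 'a' if level % 2 == 1 else 'b'
        triples.append((parent, label, 2 * parent))
        triples.append((parent, label, 2 * parent + 1))
    return triples
```

The bottom-up variant would simply reverse each triple's endpoints, as in Figure 5.4b.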
Table 5.3 and Figure 5.6 show the results of executing the experiment on top-down
binary tree graphs for both the Q3 and Q4 grammars. In this case, our algorithm takes less
time to find the paths in the graph than the algorithm proposed by C. M. Medeiros [17],
but it is notable that, using the ambiguous grammar, our approach takes a considerable
amount of extra time.

This difference starts to show up when using the tree with height 13, which has 8191
triples and produces 39139 results. The sudden increase in execution time at every odd
height value, more noticeable at heights 13 and 15, is due to the increase in the number
of valid paths in these cases, while at the even height values the last level of the tree only
produces invalid paths, requiring fewer operations to be performed.
Height  #Vertices  #Results  Medeiros [17]  Medeiros [17]  GSSLR         GSSLR
                             Q3 time (ms)   Q4 time (ms)   Q3 time (ms)  Q4 time (ms)

 1          0           0            0              0             0            0
 2          3           3            0              0             0            0
 3          7          11            3              3             4            0
 4         15          19            5              4            14            1
 5         31          67           17             17            33           18
 6         63          99           26             25            24            7
 7        127         355           86             90            27           28
 8        255         483          125            123            30           10
 9        511        1763          418            453           127           39
10       1023        2275          582            597           187           44
11       2047        8419         2010           2235           780          254
12       4095       10467         2674           2760          1216          293
13       8191       39139        10158          11057          4663         1534
14      16383       47331        13254          13570          7412         1802
15      32767      178403        55579          59414         27558         8725
Table 5.3: Execution time for the grammars Q3 and Q4 on top-down binary trees.
Figure 5.6: Visualization of the top-down binary tree experiment results.
The second part of this experiment is to perform the queries on the bottom-up trees.
By analyzing Table 5.4 and Figure 5.7, we notice that using the unambiguous grammar
Q4 allows our algorithm to achieve considerably better performance than the algorithm
proposed by C. M. Medeiros [17], but there is a great degradation in performance when
using the ambiguous grammar Q3. This degradation is easily noticeable at height 15,
where the algorithm takes 3,613 milliseconds to execute the query using Q4, against
395,223 milliseconds using Q3.
Even though using both grammars produces the same results, it is notable that there
Height  #Vertices  #Results  Medeiros [17]  Medeiros [17]  GSSLR         GSSLR
                             Q3 time (ms)   Q4 time (ms)   Q3 time (ms)  Q4 time (ms)

 1          0           0            0              0             0            0
 2          3           3            1              0             0            0
 3          7          11            3              2             3            0
 4         15          23            7              5            14            1
 5         31          67           18             16            27            7
 6         63         135           37             29            24           19
 7        127         355           90             71            25           16
 8        255         711          183            143            34           14
 9        511        1763          434            344            87           21
10       1023        3527          882            691           197           28
11       2047        8419         2151           1570          1346           61
12       4095       16839         4189           3225          2803          140
13       8191       39139        10549           7586         20833          396
14      16383       78279        21065          15996         42851          821
15      32767      178403        56744          37527        395223         3613
Table 5.4: Execution time for the grammars Q3 and Q4 on bottom-up binary trees.
Figure 5.7: Visualization of the bottom-up binary tree experiment results.
is an increase in execution time when using the ambiguous grammar to execute the
experiments. This happens because, at each step where there is a shift/reduce conflict,
our algorithm tries to parse a path via all actions defined in the parsing table for the
current state and the given input string, creating more pairs and connections in the GSS.
5.3 String graphs
In another experiment, we execute the same queries as the ones used in the binary tree
experiment, but this time over string graphs. In this case, the graph contains one valid
path from the first vertex to the last vertex, in the form shown in Figure 5.8.
1 −a→ 2 −a→ 3 −b→ 4 −b→ 5
Figure 5.8: String graph pattern used in the experiments.
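A string graph of this shape can be produced as a triple list. The helper below is illustrative, assuming the first half of the chain carries a-edges and the second half b-edges, as in Figure 5.8:

```python
# Generate the string graph of Figure 5.8: a single chain of vertices
# 1..n where the first half of the edges is labelled 'a' and the
# second half 'b', forming exactly one balanced path.

def string_graph(n_vertices):
    """Return the (v, label, v + 1) triples of the chain."""
    edges = n_vertices - 1
    return [(v, 'a' if v <= edges // 2 else 'b', v + 1)
            for v in range(1, n_vertices)]
```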
Analyzing the execution times in Figure 5.9, we observe that the algorithm
proposed by C. M. Medeiros [17] manages to take almost linear time to find the valid
paths, while our proposal tends to take polynomial time.
Figure 5.9: Visualization of the strings experiment results.
This happens because a graph like this requires multiple reductions for each path being
parsed, and the reductions are the most complex operations in our algorithm. We are also
querying from all nodes in the graph, which increases the number of valid paths found.
According to these experiments, our algorithm manages to execute the queries in
acceptable time, even though its asymptotic worst-case runtime complexity is high. In
the experiments, our algorithm outperformed all the other algorithms when executing
the suggested queries on the ontology databases. We also tried to execute the
experiments on complete graphs, where every node is connected to all the other nodes
in the graph by all of the terminals in the grammar, but the experiments took too long
to run. This increase in execution time happens because, in complete graphs, there is a
valid path from every node to every node, which causes the GSS to keep increasing the
number of pairs and connections at each of its levels. In the next chapter, we present
concluding remarks and suggestions for future work.
6 Conclusions
In this work we introduced a context-free path query algorithm for graph databases. The
proposed algorithm is inspired by the LR parsing algorithm [1, 3] and uses a variant of
the GSS structure, introduced in [30, 33], to enable the derivation of multiple paths at
the same time. A Python prototype was implemented and experiments were conducted
to validate and compare the results of our algorithm with those obtained by similar
approaches. We conducted three experiments, using four queries to evaluate our
algorithm's execution times.
In the first experiment, the ontologies used in [36], [7] and [17] were used as databases.
The main goal of this experiment was to investigate the feasibility of our method as well
as to compare our results with those works. In this experiment, our algorithm outperforms
all the other approaches, which indicates that it can be used to query data from real
applications with an acceptable execution time.
In the second and third experiments, synthetic data of different sizes were used to
investigate the scalability of our approach and compare it to [17]. With these experiments,
we discovered that our algorithm scales well when the user provides an unambiguous
grammar and/or when the user knows the subject from which the query will begin,
instead of starting the query over the whole database. We also need to consider that a
complete graph is unlikely to occur in real applications, due to its lack of information
value: a complete graph is simply all information connected to everything else. Even
though we were not able to execute the experiment on complete graphs, the overall
results of the experiments suggest that it is viable to use our proposed algorithm to
perform context-free queries on graphs in most of the existing scenarios.
Experimental results show that our algorithm behaves well and outperforms the related
works in real application scenarios, but it is costly in cases where the graph is complete or
the paths are too long. These cases require an elevated number of reduce operations, which
are the heaviest part of our algorithm's execution. Our algorithm is best suited to perform
queries that require few reductions and where the given grammar does not include
production rules with a large RHS.
The most important contributions of our work are:

(i) Analysis of the state of the art related to databases, graph databases and query
languages;

(ii) Adaptation of the GSS structure to manage information about multiple strings
simultaneously;

(iii) Proposal of an algorithm that allows querying graph databases using LR(1)
grammars;

(iv) Prototypes of the proposed algorithm.
During the execution of the experiments, we discovered that, even though our al-
gorithm needs improvements in order to achieve good scalability when processing large
complete graphs, it managed to perform well compared to the related works.
As future work, we suggest some improvements to our algorithm and the data struc-
tures used by it:
(i) The GSS_Up function can be modified to have a decreased time complexity,
significantly improving our algorithm’s performance;
(ii) Since the GSS structure builds and keeps much information about the connections
deduced between the data graph's vertices, one improvement to our algorithm may be
to allow the user to query for valid paths between nodes for any given non-terminal of
the grammar, as is done by C. M. Medeiros [17]. Currently, we only allow querying for
the start symbol of the grammar;

(iii) Our algorithm has not been optimized or refined in any way. Its scalability may
be improved, allowing it to parse even bigger graphs, by improving memory management
and allowing the concurrent processing of paths.
Bibliography
[1] A. Aho, M. Lam, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools,
ADDISON WESLEY Publishing Company Incorporated, 2007 (cit. on pp. 2, 13,
21, 31, 61).
[2] D. Ancona, C. F. Bolz, A. Cuni, A. Rigo, Automatic generation of JIT compilers
for dynamic languages in .NET, tech. rep., DISI, University of Genova and Institut
für Informatik, Heinrich-Heine-Universität Düsseldorf, 2008 (cit. on p. 53).
[3] A. Appel, M. Ginsburg, Modern Compiler Implementation in C: Basic Techniques,
Cambridge University Press, 1997 (cit. on pp. 12–16, 61).
[4] S. Bechhofer, M. Hauswirth, J. Hoffmann, M. Koubarakis, The Semantic Web: Re-
search and Applications: 5th European Semantic Web Conference, ESWC 2008,
Tenerife, Canary Islands, Spain, Springer Berlin Heidelberg, 2008 (cit. on p. 2).
[5] E. F. Codd, Commun. ACM June 1970, 13, 377–387 (cit. on p. 5).
[6] R. Elmasri, S. Navathe, Fundamentals of database systems, Benjamin/Cummings,
1989 (cit. on p. 6).
[7] S. Grigorev, A. Ragozina, arXiv preprint arXiv:1612.08872 2016 (cit. on pp. 31,
53–55, 61).
[8] D. Grune, C. Jacobs, Parsing Techniques: A Practical Guide, Springer New York,
2007 (cit. on p. 28).
[9] GSSLR JavaScript prototype, http://htmlpreview.github.io/?https://github.com/freddcs/gsslr/blob/master/index.htm,
Last access: February, 20th 2018, 2018 (cit. on p. 42).
[10] G. Harrison, Next Generation Databases: NoSQL and Big Data, Apress, 2015 (cit.
on p. 6).
[11] T. Heath, C. Bizer, Linked Data: Evolving the Web Into a Global Data Space, Morgan
& Claypool, 2011 (cit. on p. 2).
[12] J. Hellings, Conjunctive Context-Free Path Queries, (Eds.: N. Schweikardt, V. Christophides,
V. Leroy), OpenProceedings.org, 2014, pp. 119–130 (cit. on pp. 27, 29, 30).
[13] JSMachines: Collection of Javascript applications illustrating parsing algorithms,
http://jsmachines.sourceforge.net/machines/lr1.html, Last access: January
22nd, 2018, 2012 (cit. on p. 38).
[14] JSON-LD - A JSON-based Serialization for Linked Data, https://www.w3.org/TR/json-ld/,
Last access: February, 25th 2017, 2014 (cit. on p. 9).
[15] P. Linz, An Introduction to Formal Languages and Automata, Jones & Bartlett
Learning, 2016 (cit. on p. 12).
[16] A. Makris, K. Tserpes, V. Andronikou, D. Anagnostopoulos, Procedia Computer Sci-
ence 2016, 97, 2nd International Conference on Cloud Forward: From Distributed
to Complete Computing, 94–103 (cit. on pp. 1, 7).
[17] C. M. Medeiros, MA thesis, Universidade Federal do Rio Grande do Norte, 2018
(cit. on pp. 31, 32, 53–59, 61, 62).
[18] S. Muñoz, J. Pérez, C. Gutierrez in Proceedings of the 4th European Conference
on The Semantic Web: Research and Applications, Springer-Verlag, Innsbruck,
Austria, 2007, pp. 53–67 (cit. on p. 24).
[19] J. Pérez, M. Arenas, C. Gutierrez, Web Semantics: Science, Services and Agents on
the World Wide Web 2010, 8 (Semantic Web Challenge 2009; User Interaction in
Semantic Web research), 255–270 (cit. on pp. 11, 23–25).
[20] PRIMER RDF 1.1 Primer, https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/,
Last access: February, 8th 2017, 2014 (cit. on pp. 7, 8).
[21] A. Puntambekar, Formal Languages And Automata Theory, Technical Publications,
2009 (cit. on p. 12).
[22] RDF - Semantics Web Standards, https://www.w3.org/RDF/, Last access: Febru-
ary, 8th 2017, 2014 (cit. on pp. i, iii, 6, 8).
[23] RDFC - Concepts and Abstract Syntax, https://www.w3.org/TR/rdf11-concepts/,
Last access: February, 8th 2017, 2014 (cit. on p. 9).
[24] RDFS RDF Schema 1.1, https://www.w3.org/TR/rdf-schema/, Last access:
February, 8th 2017, 2014 (cit. on p. 8).
[25] L. Rietveld, Publishing and Consuming Linked Data: Optimizing for the Unknown,
IOS Press, 2016 (cit. on p. 1).
[26] A. Rigo, S. Pedroni, JIT Compiler Architecture, tech. rep. D08.2, PyPy, May 2007
(cit. on p. 53).
[27] I. Robinson, J. Webber, E. Eifrem, Graph Databases: New Opportunities for Con-
nected Data, O’Reilly Media, 2015 (cit. on p. 6).
[28] A. Satinder Bal Gupta, Introduction to Database Management System, Laxmi Pub-
lications, 2009 (cit. on p. 5).
[29] E. Scott, A. Johnstone, Electronic Notes in Theoretical Computer Science 2010,
253, Proceedings of the Ninth Workshop on Language Descriptions Tools and Ap-
plications (LDTA 2009), 177–189 (cit. on p. 31).
[30] E. Scott, A. Johnstone, S. S. Hussain, Tomita-Style Generalised LR Parsers, tech.
rep., Dec. 2000 (cit. on pp. i, iii, 32, 61).
[31] SPARQL 1.1 Overview, https://www.w3.org/TR/rdf-sparql-query/, Last
access: February, 10th 2017, 2013 (cit. on pp. 2, 10).
[32] Theory Of Automata, McGraw-Hill Education (India) Pvt Limited, 2010 (cit. on
p. 12).
[33] M. Tomita, Comput. Linguist. Jan. 1987, 13, 31–46 (cit. on p. 61).
[34] TriG - RDF Dataset Language, https://www.w3.org/TR/trig/, Last access:
February, 25th 2017, 2014 (cit. on p. 9).
[35] Turtle - Terse RDF Triple Language, https://www.w3.org/TR/turtle/, Last
access: February, 25th 2017, 2014 (cit. on p. 9).
[36] X. Zhang, Z. Feng, X. Wang, G. Rao, W. Wu in The Semantic Web – ISWC 2016:
15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016,
Proceedings, Part I, (Eds.: P. Groth, E. Simperl, A. Gray, M. Sabou, M. Krötzsch, F.
Lecue, F. Flöck, Y. Gil), Springer International Publishing, Cham, 2016, pp. 632–
648 (cit. on pp. 29, 30, 53–55, 61).