Universidade Federal do Rio Grande do Norte
Centro de Ciências Exatas e da Terra
Departamento de Informática e Matemática Aplicada
Programa de Pós-Graduação em Sistemas e Computação
Master’s thesis

A mechanism to evaluate context-free queries inspired in LR(1) parsers over graph databases
Fred de Castro Santos
Natal-RN
March 22, 2018
Fred de Castro Santos
A mechanism to evaluate context-free queries inspired in LR(1) parsers over graph databases
Master’s thesis presented at the Graduate Program in Systems and Computation (PPgSC) of the Federal University of Rio Grande do Norte (UFRN), under the supervision of Professor Umberto S. da Costa and the co-supervision of Professor Martin A. Musicante, as a requirement for obtaining a Master’s Degree in Systems and Computing.
Natal-RN
March 22, 2018
Santos, Fred de Castro. A mechanism to evaluate context-free queries inspired in LR(1) parsers over graph databases / Fred de Castro Santos. - 2018. 84 f.: il.
Dissertação (mestrado) - Universidade Federal do Rio Grande do Norte, Centro de Ciências Exatas e da Terra, Programa de Pós-Graduação em Sistemas e Computação. Natal, RN, 2018. Orientador: Umberto Souza da Costa. Coorientador: Martin Alejandro Musicante.
1. Computação - Dissertação. 2. Bancos de dados em grafo - Dissertação. 3. Expressividade de linguagens de consulta - Dissertação. 4. RDF - Dissertação. 5. Linguagens LR(1) - Dissertação. I. Costa, Umberto Souza da. II. Musicante, Martin Alejandro. III. Título.
RN/UF/CCET CDU 004
Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET
Elaborado por JOSENEIDE FERREIRA DANTAS - CRB-15/324
Abstract
The World Wide Web is an ever-increasing collection of information. This information is spread among different documents, which are made available by using HTTP. Even though this information is accessible to users in the form of news articles, audio broadcasts, images and videos, software agents often cannot classify it. The lack of semantic information about these documents in a machine-readable format usually makes the analysis inaccurate. A significant number of entities have adopted Linked Data as a way to add semantic information to their data, rather than just publishing it on the Web. The result is a global data collection, called the Web of Data, which forms a global graph, consisting of RDF [22] statements from numerous sources, covering all sorts of topics. To find specific information in this graph, queries are performed starting at a subject and analyzing its predicates in the RDF statements. These predicates are the connections between the subject and the object, and a set of traces forms an information path.
The use of HTTP as a standardized data access mechanism and RDF as a standard data model simplifies data access, but accessing heterogeneous data in distinct locations may have an increased time complexity, and current query languages have reduced query expressiveness, which motivates us to research alternatives in how this data is queried. This reduced expressiveness exists because most query languages belong to the class of Regular Languages. The main goal of this work is to use LR(1) context-free grammar processing techniques to search for context-free paths over RDF graph databases, providing, as a result, a tool which allows better expressiveness, efficiency and scalability in such queries than what is proposed today. To achieve that, we implemented an algorithm based on the LR(1) parsing technique that uses the GSS [30] structure instead of a stack, and we give means for the user to input queries with an LR(1) context-free grammar. We also analyze our algorithm’s complexity and perform experiments, comparing our solution to other proposals present in the literature, showing that ours can have better performance in given scenarios.
Keywords: Graph databases; Query language expressiveness; RDF; LR(1) languages.
Resumo
A World Wide Web é uma coleção de informações sempre crescente. Esta informação é distribuída entre documentos diferentes, disponibilizados através do HTTP. Mesmo que essa informação seja acessível aos usuários na forma de artigos de notícias, transmissões de áudio, imagens e vídeos, os agentes de software geralmente não podem classificá-la. A falta de informações semânticas sobre esses documentos em um formato legível por máquina geralmente faz com que a análise seja imprecisa. Um número significativo de entidades adotaram Linked Data como uma forma de adicionar informações semânticas aos seus dados, e não apenas publicá-los na Web. O resultado é uma coleção global de dados, chamada Web of Data, que forma um grafo global, composto por declarações no formato RDF [22] de diversas fontes, cobrindo todos os tipos de tópicos. Para encontrar informações específicas nesses grafos, as consultas são realizadas começando em um sujeito e analisando seus predicados nas instruções RDF. Esses predicados são as conexões entre o sujeito e o objeto, e um conjunto de trilhas forma um caminho de informação.
O uso de HTTP como mecanismo padrão de acesso a dados e RDF como modelo de dados padrão simplifica o acesso a dados, o que nos motiva a pesquisar alternativas na forma como esses dados são buscados. Uma vez que a maioria das linguagens de consulta de banco de dados em grafo estão na classe de Linguagens Regulares, nós propomos seguir um caminho diferente e tentar usar uma classe de gramática menos restritiva, chamada Gramática Livre de Contexto Determinística, para aumentar a expressividade das consultas no banco de dados em grafo; mais especificamente, aplicando o método de análise LR(1) para encontrar caminhos em um banco de dados em grafo RDF. O principal objetivo deste trabalho é prover meios para se permitir a utilização de técnicas de reconhecimento de gramáticas livres de contexto LR(1) para fazer consultas por caminhos formados pelas etiquetas das arestas em um banco de dados RDF, fornecendo, como resultado, uma ferramenta que permita atingir melhor expressividade, eficiência e escalabilidade nestas consultas do que o que existe atualmente.
Para atingir este objetivo, nós implementamos um algoritmo baseado nas técnicas de reconhecimento LR(1), usando o GSS [30] ao invés de uma pilha, e permitimos ao usuário fazer consultas com uma gramática livre de contexto LR(1). Também analisamos a complexidade do nosso algoritmo e executamos alguns experimentos, comparando nossa solução com as outras propostas na literatura, mostrando que a nossa pode ter melhor desempenho em alguns cenários.
Palavras-Chave: Bancos de Dados em Grafo; Expressividade de linguagens de consulta;RDF; Linguagens LR(1).
Contents
1 Introduction
  1.1 Motivation
  1.2 Problems
  1.3 Goals

2 Theoretical Foundation
  2.1 Databases
    2.1.1 RDF and RDFS
    2.1.2 Graph queries
    2.1.3 SPARQL
  2.2 Language Specifications
    2.2.1 Regular Languages
    2.2.2 Context-Free Grammars
    2.2.3 LR Parsing

3 Related Work
  3.1 nSPARQL: A navigational language for RDF
  3.2 Conjunctive Context-Free Path Queries
  3.3 Context-Free Path Queries on RDF Graphs
  3.4 Context-Free Path Querying with Structural Representation of Result
  3.5 Top-Down Evaluation of Context-Free Path Queries in Graphs
  3.6 Tomita-Style Generalized LR Parsers

4 The GrLR Query Processing Algorithm Approach
  4.1 The algorithm
    4.1.1 Algorithm execution example
  4.2 Complexity
    4.2.1 Runtime complexity
    4.2.2 Space complexity
  4.3 Discussion about correctness

5 Experiments
  5.1 Ontologies stored as RDF graphs
  5.2 Binary trees
  5.3 String graphs

6 Conclusions
List of Figures
2.1 Visual representation of an RDF graph.
2.2 RDF database.
2.3 Path found via path query.
2.4 SPARQL query to find the books written by a:Author1.
2.5 Graph with opening and closing parenthesis as labels for the edges.
2.6 Visual representation of a Context-Free Grammar that defines strings that start with n opening parenthesis and end with n closing parenthesis, with n > 0.
2.7 Visual representation of the derivation trees.
2.8 Hierarchy of Context-Free Grammar classes [3].
2.9 Context-Free Grammar as described in Figure 2.6, extended with the start symbol S′ and end symbol $.
2.10 Visual representation of the LR automaton generated by the extended grammar in Figure 2.9.
3.1 RDF graph containing information about available transport services between cities.
3.2 Forward and backward axes for an RDF triple (a, p, b) [19].
3.3 Path connecting a1 to a6 [19].
3.4 Adding extra edges to the graph.
3.5 Context-Free Grammar extended with nSPARQL regular expressions.
3.6 Comparison between the languages.
3.7 Representation of a shift transition in a GSS.
3.8 Representation of a reduce transition in a GSS.
3.9 Representation of a reduce transition for an ε-transition in a GSS.
3.10 GSS generated for parsing the input string ( ( ) ) in the grammar in Figure 5.
4.1 Paths identified in a graph.
4.2 Execution of a GSS_Up function call.
4.3 Graph with a loop.
4.4 Input data for the algorithm example.
4.5 Initialization of the GSS for the graph in Figure 4.4a.
4.6 Resulting GSS after processing level U0.
4.7 Resulting GSS after processing level U1.
4.8 Resulting GSS after processing level U2.
4.9 Resulting GSS after processing level U4.
4.10 Complete graph with three vertices.
5.1 Grammars for Queries Q1 (a) and Q2 (b).
5.2 Visualization of the results for the Query Q1 on RDF databases.
5.3 Visualization of the results for the Query Q2 on RDF databases.
5.4 Top-down (a) and Bottom-up (b) tree patterns used in the experiment.
5.5 Grammars for queries Q3 (a) and Q4 (b).
5.6 Visualization of the top-down binary tree experiment results.
5.7 Visualization of the bottom-up binary tree experiment results.
5.8 String graph pattern used in the experiments.
5.9 Visualization of the strings experiment results.
List of Tables
2.1 Comparison between RDBMS and NoSQL databases [16].
2.2 LR(1) Parsing table for the extended Context-Free Grammar described in Figure 2.9.
2.3 Parsing of the input string ( ( ) ) according to the LR(1) grammar in 2.6b.
3.1 RDFS inference rules [19].
5.1 Performance evaluation for Query Q1 on RDF databases.
5.2 Performance evaluation for Query Q2 on RDF databases.
5.3 Execution time for the grammars Q3 and Q4 on top-down binary trees.
5.4 Execution time for the grammars Q3 and Q4 on bottom-up binary trees.
Glossary
CNF Chomsky Normal Form.
DBMS Database Management System.
DDL Data Definition Language.
DML Data Manipulation Language.
GSS Graph Structured Stack.
HTTP Hypertext Transfer Protocol.
IoT Internet of Things.
IRI International Resource Identifier.
JIT Just-in-Time.
NoSQL Not only SQL.
NRE Nested Regular Expression.
RDBMS Relational Database Management System.
RDF Resource Description Framework.
RDFS RDF Schema.
SQL Structured Query Language.
URI Uniform Resource Identifier.
W3C World Wide Web Consortium.
1 Introduction
The World Wide Web is an always increasing collection of information. This information
is spread among different documents, which are made available by using HTTP. These
documents are identified by their International Resource Identifiers (IRIs), and may be
connected to each other by hyperlinks. Even though this information is accessible to users
in the form of news articles, audio broadcasts, images and videos, software agents often
cannot classify it. These software agents need to analyze the contents of the documents
to identify their meanings, but, because of the lack of semantic information about these
documents in a machine-readable format, the result of this automated analysis is often
inaccurate. In order to make the semantic information on these documents available in
a machine-readable format, Linked Data was proposed as a complementary approach to
the World Wide Web [25].
Linked Data uses the same protocol as the World Wide Web to access and retrieve
information. In this context, IRIs denote "things", called resources. These resources may
then be referenced by a standard mechanism, like the RDF, to make statements. In the
RDF, a set of statements forms a graph and a collection of graphs forms a dataset. This
organization schema allows the association between resources from distinct datasets. To
keep the systems available and scalable, NoSQL databases comprise an alternative to traditional relational databases, capable of handling huge volumes of data by leveraging the capabilities of cloud environments [16].
A significant number of entities have adopted Linked Data as a way to add semantic
information to their data, rather than just publishing it on the Web. The result is a global data
collection, called the Web of Data. The Web of Data forms a global graph, consisting
of RDF statements from numerous sources, covering all sorts of topics. To find specific
information in this data, queries are performed starting in a subject and analyzing its
predicates in the RDF statements. These predicates are the connections between the
subject and object, and a set of traces forms an information path. Given that a trace
is a set of predicates in an information path, one may tell there is a connection between
subject1 and object1 if there is a trace between them in the RDF statements [4].
The use of HTTP as a standardized data access mechanism and RDF as a standard
data model simplifies data access compared to Web APIs, which rely on heterogeneous
data models and access interfaces [11], but accessing heterogeneous data on distinct lo-
cations may have an increased time complexity and reduced query expressiveness, which
motivates us to research alternatives in how this data is queried.
1.1 Motivation
With the increase in size of the databases, new technologies were proposed to satisfy
the need of retrieving the stored data at acceptable speeds. These proposed technologies
range from changes in the database structure to the implementation of new query lan-
guages. Query languages vary in how expressive the queries written in them are allowed to be. The World Wide Web Consortium (W3C) has proposed SPARQL [31], a query
language which allows the use of regular expressions to query the RDF database. Even
though SPARQL is the proposed standard language for querying graph databases, it has
limitations in expressiveness.
These query languages essentially search for paths formed by the labels of the edges between nodes. Regular expressions have some known limitations: for example, they do not allow counting the parsed symbols [1].
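The counting limitation can be made concrete with a small Python illustration (not taken from the thesis): a regular expression can demand "some opening then some closing parentheses", but cannot require the counts to match, while a context-free (here, recursive) check can.

```python
import re

# The regex accepts any run of '(' followed by any run of ')': it cannot
# enforce that both runs have the same length.
regular = re.compile(r"^\(+\)+$")

def balanced(s: str) -> bool:
    """Context-free check: accepts exactly the strings '(' * n + ')' * n, n > 0."""
    if s == "()":
        return True
    return len(s) >= 2 and s[0] == "(" and s[-1] == ")" and balanced(s[1:-1])

assert regular.match("(())") and balanced("(())")   # both accept the balanced string
assert regular.match("(()") and not balanced("(()")  # only the regex accepts the unbalanced one
```

The unbalanced string "(()" slips past the regular expression precisely because finite automata cannot count, which is the limitation cited above.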
Since most of the specified graph database query languages use Regular Expressions,
we propose to take a different path and aim at using Context-Free Grammars to increase
the expressiveness of graph database queries. More specifically, we propose to apply the
LR(1) parsing method to find paths in an RDF graph database.
1.2 Problems
This research aims to solve two main problems:
(i) Given a set of vertices from a graph and an LR(1) context-free grammar G, return
all the vertices in the graph which can be reached from them by following paths
formed by the edges where the sequence of symbols formed by their labels belongs
to the language of G;
Input: A graph DG ⊆ V × E × V , a set of vertices {v′|v′ ∈ V } and an LR(1)
context-free grammar G.
Output: A set of pairs of nodes {(v′, v)|v′, v ∈ V } where there exists a trace
between v′ and v whose edges form a string that belongs to the language formed by
G.
(ii) Given two nodes v′ and v from a graph and an LR(1) context-free grammar, identify whether there exists a path between v′ and v which respects the given grammar, using the same criteria as Problem (i).
Input: A graph DG ⊆ V × E × V , two vertices {v′, v|v′, v ∈ V } and an LR(1)
context-free grammar G.
Output: True if there exists a trace between v′ and v whose edges’ labels form a
string that belongs to the language of G. False otherwise.
Problem (ii) is a sub-problem of Problem (i), since one can search for the vertices which can be reached from the given vertex v′ and then verify whether the answer contains v.
1.3 Goals
The main goal of this work is to provide means to enable the usage of LR(1) context-free
grammar processing techniques to search for paths formed by the labels of the edges in
an RDF graph database, providing, as result, a tool which allows better expressiveness,
efficiency and scalability in such queries than what is proposed today.
To achieve that, we implemented an algorithm based on the LR(1) parsing technique
and give means for the user to input queries with an LR(1) context-free grammar. We also evaluated the algorithm’s runtime and space complexity, and compared the expressiveness of the queries it enables against the solutions proposed by the related works, presented in Chapter 3.
The remainder of this dissertation is organized as follows: in Chapter 2, we introduce some of the concepts needed to better understand the given problems and how to solve them; in Chapter 3, we analyze works related to our research; in Chapter 4, we present a new solution, using LR(1) parsing concepts to query the graph database; in Chapter 5, we conduct experiments comparing our solution to the related works; finally, in Chapter 6, we give our conclusions and suggest a set of future work.
2 Theoretical Foundation
In this chapter, we give insight into a few concepts that the work in this dissertation is based on. First, we introduce databases and graph databases, explaining what kind of query expressiveness can be achieved with the current state-of-the-art solutions. Second, we explore graph theory in detail, to show how to give graph databases a navigation mechanism similar to the ones available for Context-Free Grammars. Lastly, we introduce Regular Languages and Context-Free Grammars, comparing notations, query expressiveness and their associated query mechanisms, as possible candidates to solve our problem.
2.1 Databases
A database is a collection of organized, related data which is used as a source of infor-
mation to answer user queries or to facilitate other data processing activities. The basic
database management problem is how to store and organize data efficiently to meet the
data processing needs of the applications which use the data [28].
Since the mid 70s, the database field has experienced rapid growth and seen major
advances in applications, technology and research. One of the proposed technologies was a new way to structure the data, called the Relational Model. It forms a sound
basis for treating derivability, redundancy and consistency of data by organizing it in a
set of relations [5].
To correctly manage the data, a Database Management System (DBMS) is needed.
By definition, a DBMS acts as an interface between the application program and the data stored in the database, and has five basic functions: (i) establishing the logical relationships among different data elements in a database and defining schemas using the data definition language (DDL); (ii) entering data into the database; (iii) manipulating and processing the data stored in the database using the data manipulation language (DML); (iv) maintaining data integrity and security by allowing only authorized users limited access to the database; and (v) querying the stored data using the structured query language (SQL) [6].
By the year 2000, the increasing number of applications involving cloud deployment,
mobile presence, social networking and the Internet of Things (IoT) started to demand
database technologies that include but are not limited to relational systems [10]. These
technologies weakened the relational DBMS patterns by sacrificing some of its principles
to ensure a gain in speed and scalability.
In a graph database, queries are localized to a portion of the graph, taking execution
time proportional only to the traversed portion of the graph for each query, rather than
the size of the overall dataset. And, since graphs are naturally additive, they do not
require the domain to be modeled in exhaustive detail up front. The application domain
can be designed incrementally as the business requires it. Governance can be applied in
a programmatic fashion, using tests to assert the business rules and maintain the data
model and queries.
Among these new technologies are Not only SQL (NoSQL), NewSQL, Big Data and
RDF. Although the first three are vaguely defined terms, they are the most commonly used expressions for referring to next-generation database technologies [27]. We will present
RDF in more detail in Section 2.1.1. Table 2.1 shows a comparison between Relational
Database Management Systems (RDBMS) and NoSQL databases. The main difference
between them is that the relational databases require a fixed schema and enforce the
validity of the data while the NoSQL databases do not.
2.1.1 RDF and RDFS
The World Wide Web Consortium (W3C), an international community that develops open
standards to ensure the long-term growth of the Web, has defined a standard model for
data interchange on the web by extending the linking structure of the Web. This model
is called RDF [22] and it is the only standardized NoSQL solution. RDF uses uniform
resource identifiers (URI) to name entities’ relationships, which are represented by triples
composed of a subject, a predicate and an object. Each resource must have a unique URI. A collection of instances of this model forms a directed labeled graph, with the subject and object entities being the vertices. This graph has a semantics that facilitates data merging even between distinct schemas, supporting the evolution of schemas without requiring data consumers to be changed. With this model, it is possible to mix, expose and share structured data across different applications [20].

Data Validity:
  RDBMS: Higher guarantees.
  NoSQL: Lower guarantees.
Query Language:
  RDBMS: Structured Query Language (SQL).
  NoSQL: No declarative query language.
Data type:
  RDBMS: Supports relational data, whose relationships are stored in separate tables.
  NoSQL: Supports unstructured and unpredictable data.
Data Storage:
  RDBMS: Stored in a relational model, with rows and columns. Rows contain all of the information about one specific entry/entity, and columns are all the separate data points.
  NoSQL: The term "NoSQL" encompasses a host of databases, each with different data storage models. The main ones are: document-based-store, graph-based, key-value-store and column-based-store.
Schemas and Flexibility:
  RDBMS: Each record conforms to a fixed schema.
  NoSQL: Schemas are dynamic. Each "row" does not have to contain data for each "column".
DBMS Compliance:
  RDBMS: The vast majority of relational databases comply with all the DBMS functions.
  NoSQL: Sacrifice some DBMS functions for performance and scalability.

Table 2.1: Comparison between RDBMS and NoSQL databases [16].
Since we will be referencing directed graphs in the rest of this document, a short
introduction to graphs is required. A convenient way to define a graph D is as a set of
tuples (v, e, w), where v, w ∈ V are vertices of the graph and e ∈ E is an edge label. A tuple (v, e, w) means that there is an edge labeled e from vertex v to vertex w in the graph.
A graph has received this name because it can be easily represented graphically, which
helps understand many of its properties. Most of the definitions and concepts in graph
theory are suggested by the graphical representation. Two vertices which are incident
with an edge are adjacent; an edge with identical ends is a loop. A graph is finite if both its vertex set and edge set are finite, and simple if it has no loops and no two of its edges join the same pair of vertices.
It is possible to follow a path in a graph, departing from a vertex through one of the edges connected to it and reaching a destination vertex. If the path contains no cycles, it is called a simple path.
A trace in D is a finite non-null sequence W = v0 e1 v1 e2 ... ek vk, whose terms are alternately vertices and edges, starting and ending with a vertex. The integer k is the length of the trace. The vertices v0 and vk are called the origin and terminus of W, respectively, and v1, v2, ..., vk−1 its internal vertices. In the context of this research, we will use the labels of the edges e ∈ E to identify the connections between two vertices.
Example 1 Figure 2.1 shows an RDF graph, containing information about authors and
books. Note that a:Author1, a:Author2, b:Book1 and b:Book2 are the labels of the graph
vertices. The arrows pointing from the authors to the books represent the directed edges.
library:wrote is the label of the edges. In this context, by looking at the graph, one can
tell a:Author1 has written both books, while a:Author2 has written b:Book1.
Figure 2.1: Visual representation of an RDF graph.
The definition of RDF itself does not assure the data consistency within a graph. For
this, a semantic extension is required [22].
In order to extend the semantics and assure the consistency of the RDF graphs, RDFS
was proposed. This schema provides a vocabulary and mechanisms for describing groups of related resources and for modeling the data within an RDF graph [20].
Following the architectural principles of the Web, the RDF Schema uses a property-
centric approach, which allows the description of existing resources to be extended. Its
class and property system is similar to the type systems of object-oriented programming
languages. The difference is that the properties are described in terms of the classes of
resources to which they apply, instead of defining the classes by their properties [24].
Example 2 Figure 2.2 lists the text required to store library data in an RDF database
containing information on which author wrote which books. The definition of Library,
Author and Book are contained in their respective URIs. As in Example 1, it is possible to see that a:Author1 wrote b:Book1 and b:Book2, while a:Author2 wrote b:Book1.
@prefix library: <http://example.org/library/1.0/> .
@prefix b: <http://example.org/library/1.0/book/> .
@prefix a: <http://example.org/library/1.0/author/> .
a:Author1 library:wrote b:Book1 .
a:Author1 library:wrote b:Book2 .
a:Author2 library:wrote b:Book1 .
Figure 2.2: RDF database.
At first, the RDF data was intended to be stored in textual form and various storage
methods were proposed, such as Turtle [35], TriG [34] and JSON-LD [14], but, due to
the nature of the web, the amount of data to be stored in such way may cause loss of
performance when querying. To work with this problem, a DBMS is more suited in this
case [23].
2.1.2 Graph queries
Given that the data used in this work is stored in a graph database, we need to discuss the types of queries that can be executed over it.
Subgraph Isomorphism This type of query is a decision problem and is usually NP-complete. Given two graphs D and D′, identify whether D contains a subgraph that is isomorphic to D′, that is, a subgraph of D with the same structure as D′. In some cases it may be processed in polynomial time in terms of the number of vertices in the graph.
Frequent Subgraph Mining A subgraph may be considered interesting if it appears
multiple times in a graph. Given a threshold σ, this type of query consists in finding the
subgraphs that appear at least σ times in a given graph D.
Path Query Path queries consist in, given a graph and a path specification, returning the pairs of nodes from the graph that are connected by edges that form the given path. The dashed path in Figure 2.3 represents a path found between the nodes a and b in the graph, according to the path query ( ).
Figure 2.3: Path found via path query.
In Chapter 4, we show that, similarly to related approaches, our algorithm uses path queries to search for nodes connected by paths where the labels of the edges form a valid string in the given Context-Free Grammar.
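A context-free path query can be made concrete with the classic worklist algorithm over a grammar in Chomsky Normal Form, in the spirit of the related works discussed in Chapter 3. This is a sketch for illustration only, not the LR(1)-based GrLR algorithm of Chapter 4; the grammar and vertex names are assumed examples.

```python
def cfpq(graph, unit_rules, binary_rules):
    """Worklist evaluation of a context-free path query for a CNF grammar.
    Returns triples (N, v, w) meaning: some path from v to w spells a
    string derivable from nonterminal N."""
    # Seed with the unit rules N -> terminal, one fact per matching edge.
    rel = {(n, v, w) for (v, e, w) in graph for (n, t) in unit_rules if t == e}
    changed = True
    while changed:
        changed = False
        for (n, b, c) in binary_rules:                 # rule N -> B C
            for (b2, v, u) in list(rel):
                if b2 != b:
                    continue
                for (c2, u2, w) in list(rel):
                    if c2 == c and u2 == u and (n, v, w) not in rel:
                        rel.add((n, v, w))
                        changed = True
    return rel

# CNF grammar for { '('^n ')'^n }: S -> A B | A C, C -> S B, A -> '(', B -> ')'
units = [("A", "("), ("B", ")")]
binaries = [("S", "A", "B"), ("S", "A", "C"), ("C", "S", "B")]
g = {("a", "(", "b"), ("b", "(", "c"), ("c", ")", "d"), ("d", ")", "e")}
rel = cfpq(g, units, binaries)
assert ("S", "b", "d") in rel and ("S", "a", "e") in rel  # "()" and "(())"
assert ("S", "a", "d") not in rel                         # "((" + ")" is unbalanced
```

The fixpoint iteration terminates because the relation over (nonterminal, vertex, vertex) is finite and only ever grows; the thesis's LR(1)-based approach avoids the CNF conversion that this style requires.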
2.1.3 SPARQL
Along with the RDF specification, a query language called SPARQL was defined [31]; it can query the data whether it is stored natively as RDF or viewed as RDF via a middleware. A query in this language contains a set of triple patterns called a basic graph pattern, which matches a subgraph of the RDF data when the terms from that subgraph may be substituted for the pattern variables [31].
Example 3 The RDF graph in Figure 2.1 contains the library data. In this case, to find
the books for a given author, a SPARQL query can be as follows.
PREFIX library: <http://example.org/library/1.0/>
PREFIX b: <http://example.org/library/1.0/book/>
PREFIX a: <http://example.org/library/1.0/author/>
SELECT ?book
WHERE { a:Author1 library:wrote ?book }
Figure 2.4: SPARQL query to find the books written by a:Author1
The query in Figure 2.4 starts by specifying the URI prefixes so that the terms used in the
query may be abbreviated. SELECT ?book means that we want to retrieve the values bound
to the ?book variable in the tuples matched by the WHERE clause. We use the tuple
{a:Author1 library:wrote ?book}, which means that we are looking for all RDF triples
in the database where the subject is a:Author1 and the predicate is library:wrote.
In this case, ?book is the object, whose value we do not know.
It has been noted that, although RDF is a directed labeled graph data format,
SPARQL provides only limited navigational functionality. This is most noticeable when
one considers the RDFS vocabulary, which the current SPARQL specification does not cover,
where testing conditions like being a subclass of or a sub-property of requires navigating
the RDF data [19]. As a solution, nSPARQL was proposed [19]; we discuss it and other
related work in Chapter 3.

This limited navigational functionality leads us to investigate how languages are built
and how they may help us provide mechanisms for more expressive queries.
2.2 Language Specifications
A formal language is a possibly infinite set of sentences which can be formed by a grammar.
These sentences consist of a finite number of characters belonging to the alphabet of the
grammar. A formal grammar consists of a finite set of symbols and some production
rules, and may be used to generate sentences. A formal grammar may be defined by a
quadruple G = (N, T, S, P ), where:
• N is a finite set of nonterminal symbols, representing the elements which may be
replaced by applying a production rule. The nonterminal symbols are usually
represented by one upper case letter;
• T is a finite set of terminal symbols, representing the alphabet of the grammar.
The terminal symbols are the characters that appear in the sentences generated by
the grammar;
• S is the start symbol, and consists of one nonterminal from which all sentences
composing the language can be generated; and
• P is a finite set of production rules.
A production rule in P has the form α → β, with α and β being sequences of symbols,
where β may be empty (ε). For each nonterminal in N, there must be at least
one production rule [15]; a nonterminal may also have as many production rules as
needed. In this context, α is called the Left-Hand Side (LHS) of the production rule and
β is the Right-Hand Side (RHS) of the production rule.
There is a hierarchy between the different classes of languages, known as the Chomsky
Hierarchy, in which each grammar is classified by the form of its productions. Each
category represents a class of languages that can be recognized by a different kind of
automaton [32].
The classification begins with the least restrictive grammar class and adds restrictions
for the inner classes. The most general class, called Recursively Enumerable, allows any
production rule of the form α → β, meaning that any sequence of symbols can be
rewritten into any other. The languages represented by grammars in this class are
the ones recognizable by Turing Machines and, together with the Context-Sensitive
Grammars, are too generic for the scope of this work. For now, we will focus on the two
classes most relevant for programming languages: Regular Languages and Context-Free
Languages.
2.2.1 Regular Languages
The grammar class that can be accepted by a finite automaton is the one called Regular
Grammars. Production rules P for the Regular Grammars must either be left or right
linear, meaning that in the production rule A → α, α can only have a nonterminal in
the beginning or in the end, and each nonterminal can have exactly one production rule
[21]. The language that can be formed by some regular grammar is known as a regular
language, and can be used to define lexical structures in a declarative way [3].
Given an alphabet T to be considered as an input set, the regular expressions over
T may be defined as follows: (i) φ is a regular expression which denotes the empty set;
(ii) ε is a regular expression and denotes the set {ε}, containing only the empty string;
(iii) for each a in T, a is a regular expression and denotes the set {a}; (iv) if r and s are
regular expressions denoting the languages L1 and L2 respectively, then rs denotes L1L2,
i.e. concatenation; (v) r* denotes L1*, i.e. closure, which indicates the occurrence of r
zero or more times; (vi) r+ denotes the positive closure, which requires at least
one occurrence of r; (vii) regular expressions can contain the choice symbol |, which,
when used between expressions as in r|s, allows either L1 or L2 to be matched.
Example 4 Consider that a path is formed by the labels of the edges connecting vertices.
All strings that start with a number of opening parentheses followed by a number of
closing parentheses can be described by means of the regular expression (+)+. If this
regular expression is used to query the graph in Figure 2.5 for paths starting at the
vertex a, the answer would be the vertices c, e and f.
Figure 2.5: Graph with opening and closing parenthesis as labels for the edges.
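The point of Example 4 can be checked directly with a standard regular-expression engine. The sketch below uses Python's re module on the label strings of the three paths from a; the strings follow the edge labels described in the example, and the last case shows that the pattern accepts an unbalanced string, which motivates the move to Context-Free Grammars below.

```python
import re

# The regular expression from Example 4: one or more '(' followed by one or
# more ')'. It only checks the *shape* of the string, not that the
# parentheses balance.
pattern = re.compile(r"\(+\)+")

print(bool(pattern.fullmatch("()")))    # path a -> c: True
print(bool(pattern.fullmatch("(()")))   # path a -> e: True, although unbalanced
print(bool(pattern.fullmatch("(())")))  # path a -> f: True
print(bool(pattern.fullmatch(")(")))    # wrong shape: False
```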
Every sentence that may be described by a regular expression may also be described by a
Context-Free Grammar, but not the other way around. As seen in Example 4, even though
Regular Expressions are powerful enough to provide a pattern matching mechanism, they
will not allow us to use a pattern like (n)n to match sentences with the same number of
opening and closing parentheses [1]. To achieve this level of expressiveness, we have to
investigate the next, less restrictive, language class: the Context-Free Languages.
2.2.2 Context-Free Grammars
Context-Free Grammars are the class of grammars whose production rules have the form
A → β, where A ∈ N and β is a sequence of symbols from T ∪ N, or the ε symbol,
meaning an empty rule. Context-Free Grammars define syntactic structure
declaratively. They may also be used to describe the structure of lexical tokens, although
Regular Grammars are more adequate, and more concise, for that purpose. Context-Free
Grammars may be recognized by pushdown (stack) automata [3].
S → (P)
P → (P)
P → ε

(a) Extended grammar

S → (P)
P → ε | (P)

(b) Compact grammar
Figure 2.6: Visual representation of a Context-Free Grammar that defines strings that start with n opening parentheses and end with n closing parentheses, with n > 0.
The grammar represented in Figure 2.6 indicates that the symbols S, P ∈ N are symbols
that must be derived. We will discuss more about derivation and parsing techniques
later in this section and in Section 2.2.3.
Example 5 A grammar to define all strings formed by a number of opening parenthesis
followed by the same number of closing parenthesis, (), (()), ((())), etc, may be defined as
seen in Figure 2.6a. In this case, ( and ) are the terminal symbols, and S and P are the
nonterminal symbols. If used to recognize the paths in the graph of Figure 2.5, starting in
vertex a, the answer would be the vertices c and f. Note that, to make it easier to write
the grammar, the | symbol can be used to join multiple definitions for a nonterminal. In
this example, P → (P) and P → ε could be written as P → ε | (P), as shown in Figure
2.6b.
A technique called parsing may be used to reveal the grammatical structure of a sequence
of symbols and to identify whether it belongs to the language of a Context-Free Grammar,
by consecutively matching the nonterminals with their production rules and building a
structure called a Derivation Tree. The Derivation Tree has one vertex for each symbol
used in the derivation: the start symbol at the top, or root; nonterminal symbols in the
inner vertices; and only terminals in the leaves, or outer vertices. This process derives
strings by beginning with the start symbol and repeatedly replacing a nonterminal by the
RHS of one of its productions until the end of the sequence of symbols is reached [3].
The Derivation Tree may then be traversed from the leaves to the root or from the root
to the leaves, according to the chosen parsing method.
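The derivation process can be sketched as a small hand-written recognizer for the grammar of Figure 2.6, with one function per nonterminal that consumes input as it applies a production. This is an illustrative recognizer, not the mechanism proposed in this thesis; it merely shows how repeatedly expanding nonterminals decides membership in the language.

```python
def recognize(s):
    """Recognizer for S -> (P), P -> epsilon | (P): accepts exactly the
    strings of n opening parentheses followed by n closing ones, n > 0."""
    pos = 0

    def expect(ch):
        # Consume one expected terminal, if present.
        nonlocal pos
        if pos < len(s) and s[pos] == ch:
            pos += 1
            return True
        return False

    def parse_P():
        # P -> (P) when the next symbol is '(', otherwise P -> epsilon.
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1
            return parse_P() and expect(")")
        return True

    ok = expect("(") and parse_P() and expect(")")   # S -> (P)
    return ok and pos == len(s)                       # all input consumed
```

For instance, recognize("(())") succeeds while recognize("(()") fails, matching the vertices reached in Example 5.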
Example 6 Consider the graph in Figure 2.5 and the grammar in Figure 2.6. Figure
2.7a describes the derivation tree used to identify whether the path ( ) between the
vertices a and c belongs to the grammar, while Figure 2.7b describes the derivation tree
for the path ( ( ) ) between the vertices a and f.

(a) Path ( ):          (b) Path ( ( ) ):
      S                      S
    / | \                  / | \
   (  P  )                (  P  )
      |                     /|\
      ε                    ( P )
                             |
                             ε

Figure 2.7: Visual representation of the derivation trees.
There are several methods, called parsing methods, for identifying whether a given string
belongs to a language defined by a Context-Free Grammar; they differ from each other
in how they build and traverse the derivation tree. The top-down approach starts at the
root vertex and moves towards the leaves, investigating which production rule should be
used as each new symbol is read, while the bottom-up approach starts at the leaves and
moves towards the root of the tree, identifying which production rule should be applied
to the given input string and replacing the RHS with the LHS of the production.
The parsers defined for Context-Free Grammars make use of a stack to help decide
which rule should be applied when processing the current input symbol. Some grammars
may be written in a way that introduces a conflict, leading the parser to reach a state
where it is possible to apply more than one rule during parsing.
Figure 2.8 defines the hierarchy between the ambiguous and unambiguous Context-
Free Grammars. The unambiguous grammars are divided into LL(k) and LR(k). The first
"L" stands for left-to-right scanning of the input; the second "L" in LL stands for
constructing a leftmost derivation; the "R" in LR stands for constructing a rightmost
derivation in reverse; and k is the number of lookahead symbols, i.e., the symbols at the
front of the input stream that aid analysis decisions [3]. Usually, programming languages
are parsed using k = 1. When k is omitted, it is assumed to be 1, which is of practical interest
and will be used in this dissertation.
LL grammars are those which do not have left recursion, meaning that in a
production rule A → β, β cannot start with the symbol A. Also, if multiple production
rules for A are given, their right-hand sides cannot start with the same symbol.
The class of grammars that may be parsed using LR methods is a superset of the class
of grammars that may be parsed with predictive or LL methods. To parse an LL grammar,
the parser must be able to recognize the appropriate production rule to be applied to a
nonterminal seeing only the first k symbols that its right-hand side derives. Less rigorously,
for LR(k) grammars, the parser only needs to recognize the occurrence of the
right-hand side of a production rule in a rightmost sentential form, with k input symbols
of lookahead, allowing this method to describe more languages than the former.
Like the non-recursive LL parsers, the LR parsers are table-driven. A grammar is
considered to be LR if a left-to-right shift-reduce parser is able to recognize handles of
right-sentential forms when they appear on top of the stack. Syntactic errors are detected
as soon as possible during a left-to-right scan of the input. We will continue investigating
the LR parsing method in Section 2.2.3.
Figure 2.8: Hierarchy of Context-Free Grammar classes [3]
2.2.3 LR Parsing
Among the existing bottom-up parsers, LR(k) is the most prevalent technique. If a
Context-Free Grammar can be written for a programming language, it will probably be
recognizable by an LR parser. To guide its steps, the LR parser makes use of a Parsing
Table (defined below) and a stack of states. By doing so, it may detect
syntactic errors as soon as it is possible to do so on a left-to-right scan of the input string.
S′ → S$
S → (P)
P → ε | (P)
Figure 2.9: Context-Free Grammar as described in Figure 2.6, extended with the start symbol S′ and end symbol $.
To be parsed by an LR parser, the grammar must be extended with a new start symbol,
called here S′, and the end symbol $. Given the grammar in Figure 2.6b, it must be
augmented with the production rule S′ → S$, which means that the start symbol S′
generates all the sentences as before, now finished with the end symbol $. The resulting
grammar may be seen in Figure 2.9.
I0: S′ → ·S, $          I1: S′ → S·, $
    S → ·(P), $

I2: S → (·P), $         I3: S → (P·), $
    P → ·, )
    P → ·(P), )

I4: P → (·P), )         I5: S → (P)·, $
    P → ·, )
    P → ·(P), )

I6: P → (P·), )         I7: P → (P)·, )

Transitions: I0 --S--> I1; I0 --(--> I2; I2 --P--> I3; I2 --(--> I4;
I3 --)--> I5; I4 --(--> I4; I4 --P--> I6; I6 --)--> I7.
Figure 2.10: Visual representation of the LR automaton generated by the extended grammar in Figure 2.9.
The next step is to construct the LR automaton, which is used to direct decisions
during analysis. Figure 2.10 shows the LR automaton for the grammar given in Figure
2.9. In the automaton, each state represents a set of items. An LR(0) item of a grammar
D is a production of D with a dot (·) at some position of the production's RHS. As the
automaton progresses through its states, the productions are represented as if the dot
were moving to the right, so that the elements at its left indicate how much of a
production has already been seen by the parser. In this way, the production A → XYZ
yields the four items A → ·XYZ, A → X·YZ, A → XY·Z and A → XYZ·. The production
A → ε generates only one item, A → ·. The item A → XYZ· indicates that the parser has
read XYZ and it may be time to reduce XYZ to A.
Since we target the LR(1) method, we incorporate the lookahead into the item sets
by redefining items to include one terminal symbol as a second component. The general
form of an item becomes [A → α·β, a], where A → αβ is a production rule and a is a
terminal or the right-end symbol $.
Algorithm 1: Construction of the LR(1) item sets for an LR(1) grammar.

Function ITEMS(G′) : SetOfItems
    C ← CLOSURE({[S′ → ·S, $]})
    repeat
        foreach I ∈ C do
            foreach grammar symbol X do
                C ← C ∪ GOTO(I, X)
    until no new sets of items are added to C
    return C
Algorithm 1 shows the function ITEMS(G′), which receives the augmented grammar
and builds the set of states of the automaton for the given LR(1) grammar. Being I a
set of LR(1) items for a grammar G, the function CLOSURE(I), shown in Algorithm
2, is the set of items constructed from I as follows: add every item in I; then, if
[A → α·Bβ, a] has already been added and B → γ is a production, add [B → ·γ, b] for
each b in FIRST(βa), if not yet added; repeat until no more new items can be added to
CLOSURE(I).
The GOTO(I, X) function, shown in Algorithm 3, is used to define the transitions in
the LR(1) automaton for a grammar. The automaton states correspond to sets of items
and the function specifies the transition from the state for I when given the input X.
The LR parsing algorithm will try to identify one of three possible actions to take
for each new input symbol. The action can either be shift, when the symbol does not
yet complete the RHS of a production rule; reduce, when the symbol completes the RHS
of a production rule; or accept, when we reach the end of the sentence and it belongs to
the language formed by the grammar.

Algorithm 2: Identifying the closure of a set of items for an LR(1) grammar.

Function CLOSURE(I : SetOfItems) : SetOfItems
    repeat
        foreach [A → α·Bβ, a] ∈ I do
            foreach production B → γ do
                foreach b ∈ FIRST(βa) do
                    I ← I ∪ {[B → ·γ, b]}
    until no more items are added to I
    return I
Algorithm 3: Identifying the destination state, for a given grammar symbol, from a set of
items of an LR(1) grammar.

Function GOTO(I : SetOfItems, X : Symbol) : SetOfItems
    J ← ∅
    foreach [A → α·Xβ, a] ∈ I do
        J ← J ∪ {[A → αX·β, a]}
    return CLOSURE(J)
To identify which action to take, the parser uses a stack where it stores the states the
automaton has passed through. For each new symbol, it looks into the parsing table,
reading the value stored in the row for the current state and the column for the new
symbol.
The parsing table for the grammar in Figure 2.9 can be seen in Table 2.2. Shift
actions are represented by si, where i is the number of the new state to shift to. Reduce
actions are represented by ri, where i is the number of the production rule to reduce by.
If the action is shift, the parser pushes this new state onto the stack and continues
to the next symbol. If the action is reduce, the parser pops as many states from the
stack as there are elements in the RHS of the production rule being reduced, pushes the
state indicated by the goto entry for the LHS of the rule, and continues to the next
symbol. If the action is accept, the sentence has reached its end and is a valid sentence
in the language. If the parser does not find a valid action for the current symbol, the
sentence is invalid and should be rejected.
Example 7 Given the augmented LR(1) grammar represented in Figure 2.9 and the
sequence of input symbols ( ( ) ) to be parsed by the LR(1) algorithm, the parser
starts the stack with the state i0. By receiving the first input symbol, (, the parser
identifies that the action to take is to shift to state i2, consuming the input symbol and
storing state i2 in the stack. This can be verified by looking at the parsing Table 2.2.
The next symbol is another (. The parser identifies the next action as shift i4, consumes
the input symbols and stores the state i4 in the stack. The next symbol is ). By looking
at the parsing table, the parser identifies the next action as a reduce by the rule P → ε.
Since this rule does not have elements in its RHS, there are no states to be removed from
the stack. The parser, then, identifies the next action as a goto state i6 and adds the
state i6 to the stack. The next element is ) and the action is shift i7, which triggers a
reduce by rule P → (P) and a goto action to state i3. The current stack is i0 i2 i3 i5,
which represents the symbols (P). The input string has only the last input symbol, ),
which requires a shift action to state i5, triggering a reduce by production rule S → (P)
and a goto state i1, where we reach the end of the input string. The parser identifies in
the parsing table that the input symbols form a valid sentence in the given grammar and
accepts it. The step-by-step execution can be seen on Table 2.3.
State     Action                Goto
          (      )      $      S′    S    P
 i0       s2                         1
 i1                     acc
 i2       s4     r2                       3
 i3              s5
 i4       s4     r2                       6
 i5                     r1
 i6              s7
 i7              r3
Table 2.2: LR(1) Parsing table for the extended Context-Free Grammar described inFigure 2.9.
While the LR parsing algorithm can be implemented as efficiently as more primitive
shift-reduce methods, its main drawback is the amount of work needed to construct the
parsing table for a typical programming-language grammar by hand. For that, a
specialized LR parser generator, like Yacc, is used. Such generators take a Context-Free
Grammar and automatically produce a parser for it, locating and diagnosing the
constructs which are difficult to parse in a left-to-right scan of the input [1].

#     Stack            Symbols    Input        Action
(1)   i0                          ( ( ) ) $    shift i2
(2)   i0 i2            (          ( ) ) $      shift i4
(3)   i0 i2 i4         ( (        ) ) $        reduce P → ε
(4)   i0 i2 i4         ( ( P      ) ) $        goto i6
(5)   i0 i2 i4 i6      ( ( P      ) ) $        shift i7
(6)   i0 i2 i4 i6 i7   ( ( P )    ) $          reduce P → (P)
(7)   i0 i2            ( P        ) $          goto i3
(8)   i0 i2 i3         ( P        ) $          shift i5
(9)   i0 i2 i3 i5      ( P )      $            reduce S → (P)
(10)  i0               S          $            goto i1
(11)  i0 i1            S          $            accept

Table 2.3: Parsing of the input string ( ( ) ) according to the LR(1) grammar in Figure 2.6b.
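To illustrate the driver loop, the sketch below is a minimal table-driven LR(1) parser for the toy grammar, with the ACTION and GOTO entries transcribed from Table 2.2. It is an illustrative sketch, not a generated parser: it only reports acceptance or rejection, without error diagnostics or tree building.

```python
# ACTION and GOTO tables transcribed from Table 2.2.
# A reduce entry carries the LHS nonterminal and the length of the RHS.
ACTION = {
    (0, "("): ("shift", 2),
    (1, "$"): ("accept",),
    (2, "("): ("shift", 4), (2, ")"): ("reduce", "P", 0),   # P -> epsilon
    (3, ")"): ("shift", 5),
    (4, "("): ("shift", 4), (4, ")"): ("reduce", "P", 0),
    (5, "$"): ("reduce", "S", 3),                            # S -> (P)
    (6, ")"): ("shift", 7),
    (7, ")"): ("reduce", "P", 3),                            # P -> (P)
}
GOTO = {(0, "S"): 1, (2, "P"): 3, (4, "P"): 6}

def parse(tokens):
    """Run the LR(1) driver loop; return True iff the input is accepted."""
    stack = [0]                           # stack of automaton states
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:
            return False                  # no valid action: reject
        if entry[0] == "accept":
            return True
        if entry[0] == "shift":
            stack.append(entry[1])
            pos += 1
        else:                             # reduce: pop |RHS| states, then goto
            _, lhs, rhs_len = entry
            del stack[len(stack) - rhs_len:]
            stack.append(GOTO[(stack[-1], lhs)])
```

Running parse("(())") reproduces the trace of Table 2.3 and accepts, while parse("(()") is rejected as soon as $ is seen in state i7.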
This chapter presented relevant concepts needed to understand the context of our
research as well as our approach to the problem of querying graph databases with context-
free queries. In the next chapter, we discuss some related works, identifying pros and cons
of their approaches.
3 Related Work
During the research for this work, we came across several works aiming at increasing the
expressiveness of graph database queries by extending Regular Expressions or adding
Context-Free Language concepts to their query mechanisms. Some also include pre-
processing of the graph. Four of them seem most related to what we are trying
to achieve and deserve some analysis. The first defines an extension to SPARQL,
adding nested regular expressions to its query mechanism; the second proposes adding
Context-Free Grammar notions to graph database querying; the third combines both
approaches, extending SPARQL with notions of Context-Free Grammars and regular
expressions; and the fourth introduces, among other things, a structure intended to help
the parsing algorithm, linking the language's automaton with the current processing
state.
3.1 nSPARQL: A navigational language for RDF
Even though the W3C has specified and recommended SPARQL as the standard graph
database query language, it is known that SPARQL has limited query
expressiveness, especially when trying to navigate through RDFS subclasses and
sub-properties.

Consider the graph shown in Figure 3.1, which contains information about cities and
transportation services between them. According to the RDFS specification, it should be
possible to identify whether a pair of cities a and b is connected by a sequence of trans-
portation services, without knowing in advance which services provide those connections,
but SPARQL does not provide the means to create such a query. To overcome these limi-
tations, a language extending SPARQL, called nSPARQL, was proposed in (J. Pérez,
M. Arenas, C. Gutierrez, 2010) [19]; it adds regular expressions and a concept of expres-
sion branching, or nesting (NRE). The resulting language is then evaluated in terms of
query-time efficiency, and the authors prove that if the appropriate data structure is used
Figure 3.1: RDF graph containing information about available transport services between cities.
(1) Sub-property:
    (a) (A,sp,B), (B,sp,C) ⊢ (A,sp,C)
    (b) (A,sp,B), (X,A,Y) ⊢ (X,B,Y)
(2) Subclass:
    (a) (A,sc,B), (B,sc,C) ⊢ (A,sc,C)
    (b) (A,sc,B), (X,type,A) ⊢ (X,type,B)
(3) Typing:
    (a) (A,dom,B), (X,A,Y) ⊢ (X,type,B)
    (b) (A,range,B), (X,A,Y) ⊢ (Y,type,B)

Table 3.1: RDFS inference rules [19].
to store an RDF graph D, then it is possible to use a regular expression E to check in
time O(|D| · |E|) whether a vertex w is reachable from v.
Note that in Figure 3.1, a triple (s, p, o) is depicted as an edge from s to o labeled p,
where s and o are represented as nodes and p is represented as the label of the edge. For
example, (Paris, TGV, Calais) is a triple stating that there is a TGV transport service
connecting Paris and Calais.
Based on what is proposed by Muñoz et al. [18], the authors decided to use the
system of rules defined in Table 3.1 and to consider only the RDFS subset composed
of rdfs:subClassOf, rdfs:subPropertyOf, rdfs:range, rdfs:domain and rdf:type, denoted by
sc, sp, range, dom and type, respectively. In every rule, the letters A, B, C, X and Y stand
for variables occurring in the triples. A triple t is deduced from D if t ∈ D or there exists
a graph D′ such that t ∈ D′ and D′ is obtained from D by successively applying the rules
in Table 3.1.
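As a concrete illustration, the sketch below applies only rule (1b) of Table 3.1 (from (A,sp,B) and (X,A,Y), deduce (X,B,Y)) to a two-triple graph mirroring the transport example; it is a deliberate simplification of full RDFS deduction, which would iterate all six rules to a fixpoint.

```python
# Two triples echoing Figure 3.1: TGV is a sub-property of train, and a TGV
# service connects Paris to Calais.
triples = {
    ("TGV", "sp", "train"),
    ("Paris", "TGV", "Calais"),
}

def deduce(triples):
    """Close a triple set under rule (1b) only:
    (A, sp, B) and (X, A, Y) yield (X, B, Y)."""
    closed = set(triples)
    while True:
        new = {(x, b, y)
               for (a, p, b) in closed if p == "sp"
               for (x, q, y) in closed if q == a}
        if new <= closed:
            return closed      # fixpoint reached: nothing new to add
        closed |= new
```

Deduction adds (Paris, train, Calais): the cities are connected by a train service even though no triple says so explicitly, which is exactly the kind of condition plain SPARQL cannot test without navigation.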
The queries performed by nSPARQL use this set of inferences to identify connections
between vertices when a triple is found. The query is in the form (?X, query, ?Y ) and its
answer corresponds to pairs of vertices (X, Y ), which contain a path between them in the
graph, according to what was specified in the query.
Some navigation rules were specified to allow regular expressions in nSPARQL.
Figure 3.2: Forward and backward axes for an RDF triple (a, p, b) [19].
The navigation of a graph is done by using the axes next, edge and node, and their
inverses next−1, edge−1 and node−1, to move through an RDF triple. There is also a
special axis self, which is not used to navigate to another vertex, but to reference the
current vertex for logical purposes. As can be seen in Figure 3.2, b is the next of a; p is the
edge of a; b is the node of p. Similarly, a is the next−1 of b; a is the edge−1 of p; and
p is the node−1 of b.
Example 8 Consider the graph in Figure 3.3. It is possible to find the connection be-
tween nodes a1 and a4, and a1 and a6 by using the regular expression as defined below:
next/next/edge/next/next−1/node
Starting in the node a1, next returns a2, then next returns a3, then edge returns p3, then
next returns p4, then next−1 returns p3 and p5 and then node returns the nodes a4 and
a6. The dashed lines are shown just to help the user to see the steps to recognize the path
from a1 to a6.
Figure 3.3: Path connecting a1 to a6 [19].
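The axes can be read as simple set-valued steps over the triples. The sketch below is a plain-Python illustration over a hypothetical three-triple fragment whose node and predicate names are placeholders, loosely shaped like Figure 3.3 but not its exact data; a real nSPARQL evaluator works over the indexed RDF store described by the authors.

```python
# Hypothetical fragment: each triple (s, p, o) is an edge s --p--> o.
triples = {("a1", "p1", "a2"), ("a2", "p2", "a3"), ("a3", "p3", "a4")}

def step(axis, x):
    """All values reachable from x in one step along the given axis."""
    if axis == "next":      # subject -> object
        return {o for (s, p, o) in triples if s == x}
    if axis == "edge":      # subject -> predicate of a leaving edge
        return {p for (s, p, o) in triples if s == x}
    if axis == "node":      # predicate -> object
        return {o for (s, p, o) in triples if p == x}
    if axis == "next-1":    # object -> subject (inverse of next)
        return {s for (s, p, o) in triples if o == x}
    raise ValueError(f"unknown axis: {axis}")

def follow(path, start):
    """Evaluate a sequence of axes from a starting value."""
    values = {start}
    for axis in path:
        values = set().union(*(step(axis, v) for v in values))
    return values
```

For instance, follow(["next", "next", "edge"], "a1") walks two next steps and then reads the leaving edge, the same shape of movement used in Example 8.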
It is also possible to use Nested Regular Expressions (NRE) with nSPARQL. These
regular expressions can be used to test for the existence of certain paths starting at a
given axis and their syntax is defined by the following grammar:
exp := axis | axis :: a (a ∈ U) | axis :: [exp] | exp/exp | exp|exp | exp∗
With axis being one of self, next, next−1, edge, edge−1, node or node−1. As usual
for regular expressions, exp+ is used as a shorthand for exp/exp∗. Given an RDF graph
D, the expression next :: a identifies the pairs of nodes (x, y) such that (x, a, y) ∈ D.
Given that a node can also be the label of an edge, it is also possible to navigate from
a node to one of its leaving edges using the edge axis. The interpretation of edge :: a
is the pairs of nodes (x, y) such that (x, y, a) ∈ D. To express the regular expression
nesting, the nesting construction [exp] is used to check for the existence of a path defined
by expression exp. The evaluation of the expression next :: [exp] in a graph D, retrieves
the pairs of nodes (x, y) such that there exists a node z with (x, z, y) ∈ D, and that there
is a path in D that follows the expression exp starting in z.
Example 9 Considering the graph D in Figure 3.1, and the expression next :: [next ::
sp/self :: train], the path evaluation will start from the inner expression next :: sp/self ::
train, defining the pairs of nodes (z, w) such that it is possible to follow an edge labeled
sp from z and reach a node w labeled train. Only the node TGV satisfies this criterion,
so the external expression can be read as next :: TGV, defining the pairs of nodes that
are connected by an edge labeled TGV. The result of the expression evaluation in D is
{(Paris, Calais), (Paris,Dijon)}.
These navigation rules make the algorithm start in a vertex and follow a specific
path. The regular expressions can be nested in a way to allow the query to verify the
current vertex's hierarchy without losing the context. The SPARQL operators AND,
OPT, UNION and FILTER can also be used to increase the expressiveness of the
queries.
The authors also defined an algorithm to perform the search in the graph, and prove
its efficiency, considering two problems: (i) verify if a given pair of vertices is in the result
of the expression evaluation in a graph; and (ii) given a vertex a, find which are the pairs
(a, b) which match what was specified in the expression in the given graph.
Note that, for both problems, the algorithm receives at least one vertex as input. The
authors opted not to answer queries of the kind "return all pairs which match a given
expression in the graph" because the algorithm would have quadratic complexity in the
worst case, in terms of the number of vertices in the graph, only to return the result.
For the proposed problems, the authors managed to prove their algorithm has complexity
O(|G| · |exp|), where |G| is the size of the input graph and |exp| is the size of the nested
regular expression being evaluated, and, to return the result to problem (i), the time
complexity is constant, in terms of the number of vertices in the graph; to problem (ii),
it is linear in terms of the number of vertices in the graph.
Even though nSPARQL improves the expressiveness of graph query languages by
adding the NREs, it still relies on Regular Expressions to perform the queries, which
describe a stricter class of languages than the Context-Free Grammars. Since we intend
to use an LR(1) parsing mechanism, our solution allows higher expressiveness in the
queries, such as element counting.
3.2 Conjunctive Context-Free Path Queries
With the purpose of increasing the expressiveness of Conjunctive Regular Path Queries,
a new query mechanism for searching for paths in directed graphs, called Conjunctive
Context-Free Path Queries (CCFPQ), was proposed in (J. Hellings, 2014) [12]. CCFPQ
updates the Conjunctive Regular Path Queries by replacing the Regular Expressions with
Context-Free Grammars in Chomsky Normal Form.
Given a graph D, formed by a tuple (V, E, ψ), where V is the set of all the graph
vertices, E is the set of all the graph edges and ψ is a function connecting two vertices
via one edge, a path π = (n1 e1 ... ni−1 ei−1 ni) in D is a non-empty finite sequence of
vertices connected by edges, where n ∈ V and e ∈ E, respecting the rule that there is
always only one edge between consecutive vertices in the path sequence. The expression
nπm is used to identify a path π between vertices n and m, and the trace of the path π is
defined by T = (l1 ... ln), where l is the label of each edge in the path.
A Context-Free Path Query over a grammar G = (N, T, S, P) is defined as follows:

    Q(v) ← ∃µ ∧_{i∈I} Ni(ni, mi)

where Q is the name of the query, v is a tuple of vertex variables, µ is a tuple of distinct
vertex variables that do not occur in v, i ranges over a finite index set I, Ni ∈ N is a non-
terminal, and ni and mi are vertex variables from v or µ. A CCFPQ built with a regular
grammar is called a Conjunctive Regular Path Query (CRPQ).
The author provides an algorithm to query the graph databases with the proposed
mechanism and proves its execution time to be O(|G| · n^5) and O(n^3 · m^2), where |G|
depends on the size of the provided Context-Free Grammar, n is the number of vertices
in the graph and m is the maximum path length.
The proposed algorithm uses an adaptation of the CYK parsing method, presented
in [8], to pre-process the graph, adding edges that connect vertices which form a path
according to the given Context-Free Grammar. For each path found that matches a
production rule of the Context-Free Grammar, an edge connecting the first vertex to the
last vertex of the path is added to the graph, so that queries through that path follow
this new edge instead of searching through the whole path again.
Example 10 Figure 3.4 shows the edges that are added to the graph of Figure 2.5 when
the grammar in Figure 2.6 is used with this parsing method. Later, the provided algorithm
only needs to search the added edges to identify the correct paths, which, in this case,
would be a → c, a → f and b → e.
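The edge-addition idea can be sketched as a fixpoint computation specialized to the grammar of Figure 2.6, rather than the paper's CNF-based CYK adaptation. The edge set below is reconstructed from Examples 4 and 10, so it should be read as an assumption about the exact shape of the graph of Figure 2.5.

```python
# Assumed edges of Figure 2.5, reconstructed from the examples in the text.
edges = {("a", "(", "b"), ("b", ")", "c"), ("b", "(", "d"),
         ("d", ")", "e"), ("e", ")", "f")}
vertices = {v for (u, _, w) in edges for v in (u, w)}

def s_pairs():
    """Pairs of vertices joined by a path whose labels derive S in the
    grammar S -> (P), P -> epsilon | (P)."""
    # P -> epsilon: every vertex is P-related to itself.
    P = {(v, v) for v in vertices}
    changed = True
    while changed:
        changed = False
        # P -> (P): an open edge, a P-pair, then a close edge.
        for (u, l1, x) in edges:
            for (y, l2, w) in edges:
                if l1 == "(" and l2 == ")" and (x, y) in P and (u, w) not in P:
                    P.add((u, w))
                    changed = True
    # S -> (P): same shape, but at least one parenthesis pair is required.
    return {(u, w) for (u, l1, x) in edges if l1 == "("
            for (y, l2, w) in edges if l2 == ")" and (x, y) in P}
```

Under these assumed edges, the computed pairs are exactly the new edges of Figure 3.4: a → c, a → f and b → e.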
Some variants of CRPQ also allow the definition of explicit path variables. Path
variables are used in the form:
Q() ← ∃n,m ∃nπm . r1(π) ∧ r2(π)
Figure 3.4: Adding extra edges to the graph.
where r1 and r2 are regular expressions and place conditions on the trace of a path. These
regular expressions can be used to specify which paths should be returned by a query.
In CCFPQ, the expressiveness is increased by allowing Context-Free Grammars in the
query, at the cost of pre-processing the graph. In our work, we aim at achieving
expressiveness close to that of CCFPQ queries but, by restricting ourselves to LR(1)
grammars, we intend to reduce the complexity of the algorithm.
3.3 Context-Free Path Queries on RDF Graphs
In (X. Zhang, Z. Feng, X. Wang, G. Rao, W. Wu, 2016) [36], the authors proposed
Context-Free Path Queries to navigate through an RDF graph, and the Context-Free
SPARQL query language (cfSPARQL), built on the context-free path queries introduced
by [12] and extended with the standard SPARQL operations and nested regular
expressions, uniting the research of the two previous related works on Conjunctive
Context-Free Path Queries (CCFPQ).
A Conjunctive Context-Free Path Query in cfSPARQL has the form:

q(?x, ?y) := α_1 ∧ α_2 ∧ · · · ∧ α_m
where q is the name of the query, each α_i is a triple in the form (?x, ?y, ?z) or in the form
v(?x, ?y), with ?x being the subject, ?y the predicate and ?z the object in the set of edges
in the graph, allowing a query to include both nSPARQL nested regular expressions and
context-free path queries; and {?x, ?y} is a subset of the variables occurring in the body
of q.
Example 11 Consider a Context-Free Grammar D = (N, T, S, P) where N = {S, R},
T = {next ::, next :: sp, self :: train}, S = {S} and P is the set of productions defined
in Figure 3.5. The cfSPARQL query Q based on the grammar D, applied to the graph
defined in Figure 3.1, will return {(Paris, Calais), (Paris, Dijon)}.
S → next :: R
R → [next :: sp / self :: train]
Figure 3.5: Context-Free Grammar extended with nSPARQL regular expressions
It is also possible to merge more than one CCFPQ and capture more expressive power,
such as disjunctive capability. A union of conjunctive context-free path queries (UCCFPQ)
has the form:

q(?x, ?y) := q_1(?x, ?y) ∨ q_2(?x, ?y) ∨ · · · ∨ q_m(?x, ?y)

where q_i(?x, ?y) is a CCFPQ for all i = 1, ..., m.
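Evaluation of a UCCFPQ therefore reduces to evaluating each conjunctive query independently and taking the union of their answer sets. A schematic sketch, where each evaluator function stands in for an arbitrary CCFPQ engine (the function name is ours):

```python
def evaluate_uccfpq(ccfpq_evaluators, graph):
    """The answer set of a UCCFPQ q(?x, ?y) is the union of the answer
    sets of its member conjunctive queries q_1, ..., q_m."""
    answers = set()
    for evaluate in ccfpq_evaluators:
        answers |= evaluate(graph)  # each returns a set of (?x, ?y) bindings
    return answers

# Two stub evaluators standing in for CCFPQs over the graph of Figure 3.1:
q1 = lambda g: {("Paris", "Calais")}
q2 = lambda g: {("Paris", "Dijon")}
assert evaluate_uccfpq([q1, q2], None) == {("Paris", "Calais"),
                                           ("Paris", "Dijon")}
```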
The authors of [36] then give algorithms to parse the query, namely recognize and
convert, and analyze their complexity, claiming it to be O(|D|) for converting and
O((|N| · |D|)^3) for recognizing, where |D| is the size of the graph D and |N| is the
number of non-terminal symbols in the given Context-Free Grammar.
In addition to the query algorithm, the authors introduce a language called Context-
free SPARQL, which extends SPARQL to use context-free triple patterns and SPARQL
basic operations, like UNION, AND, OPT, FILTER and SELECT.
Figure 3.6 shows the general relation between the variants of CFPQ and nested regular
expressions in terms of expressiveness, where a → b means that a is expressible in b.
According to the authors, UCCFPQ can express queries in CCFPQ, and their extensions.
In CFPQ and its extensions, the query structure and query evaluation are similar to those
proposed in [12], but the query expressiveness is increased with the addition of the NREs.
Figure 3.6: Comparison between the languages.
3.4 Context-Free Path Querying with Structural Representation of Result
In (S. Grigorev, A. Ragozina, 2016) [7], the authors propose an approach to recognize
context-free paths in RDF graphs using a top-down method based on the GLL [29] parsing
algorithm. The algorithm allows one to build such a structural representation with respect
to a given grammar in polynomial time and space for an arbitrary context-free grammar
and graph, in terms of the number of vertices in the graph. The authors state that the
proposed algorithm's runtime complexity is O(|V|^3 · max_{v∈V} deg^+(v)), where V is the
set of vertices and deg^+(v) is the out-degree of vertex v. For complete graphs, the proposal
has runtime complexity O(|V|^4).
3.5 Top-Down Evaluation of Context-Free Path Queries
in Graphs
The work in [17] was developed in parallel to our research and tries to solve problems
similar to ours. The author proposes an algorithm that evaluates context-free path queries
using top-down [1] parsing techniques. The proposed algorithm requires the input LL
Context-Free Grammar to be in the Chomsky Normal Form and receives, as parameters,
a data graph, the parsing table of said grammar and a set of query pairs in the form
(a, X). The algorithm parses the paths by populating three sets of processed pairs, which
help the algorithm to identify when it is time to stop processing, returning a set of
tuples (a, X, b), each representing that there is a valid path, according to the non-terminal
X in the grammar, from a to b in the data graph.
The main difference in functionality between our work and [17] is that its query pa-
rameter allows the user to specify any non-terminal in the grammar to be returned in the
response, giving the user more flexibility.
Given that V is the set of vertices in the graph and P the set of production rules
in the given grammar, the author calculates the time complexity of the algorithm to be
O(|V |3|P |). Since this work is being developed in parallel to ours, we will be able to run
some experiments and make comparisons between the performance of both proposals.
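The query-pair interface of [17] can be imitated by deriving a relation per non-terminal and filtering it against the query pairs (a, X). The sketch below is our own naive bottom-up stand-in for the author's top-down procedure, under the same CNF assumption (unit rules A → t, binary rules A → B C); names are ours, not the author's:

```python
def answer_query_pairs(edges, unit_rules, binary_rules, query_pairs):
    """Return the tuples (a, X, b) such that (a, X) is a query pair and some
    path from a to b spells a string derivable from the non-terminal X."""
    rel = {}                                   # non-terminal -> set of (u, w)
    for (A, t) in unit_rules:
        rel.setdefault(A, set()).update(
            (u, w) for (u, lbl, w) in edges if lbl == t)
    changed = True
    while changed:                             # fixpoint: rel[A] |= rel[B];rel[C]
        changed = False
        for (A, B, C) in binary_rules:
            composed = {(u, w) for (u, v) in rel.get(B, set())
                        for (v2, w) in rel.get(C, set()) if v2 == v}
            if not composed <= rel.setdefault(A, set()):
                rel[A] |= composed
                changed = True
    return {(a, X, b) for (a, X) in query_pairs
            for (u, b) in rel.get(X, set()) if u == a}
```

For a query pair (a, S) over a graph with a balanced-parentheses path from a, the result contains one triple (a, S, b) per reachable endpoint b.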
3.6 Tomita-Style Generalized LR Parsers
In [30], an LR parsing algorithm is introduced, which takes an input string, a1 a2 ... an,
and uses it to traverse a DFA, constructing a Graph Structured Stack (GSS) during the
process. We found this structure to be of interest to this research, since it is well suited
to storing the data manipulated by the algorithm we are implementing.
A GSS is formed by state nodes, labeled after the DFA states, and a set of symbol
nodes, which are labeled after the grammar symbols. The state nodes are grouped together
into disjoint sets: an initial set, U0, and one set, Ui, for each element ai of the input string.
U0 is the input related reduction-closure of the start state of the DFA and, for 1 ≤ i ≤ n,
Ui is the input related reduction-closure of the set of all states which can be reached
from a state in Ui−1 along a transition labeled ai. A node is at level i if it is in Ui. A
reduction via the rule A ::= α is valid for a state node v which is at level i and has label h
if the DFA state h contains the item (A ::= α·, ai+1), which means that valid reductions
are those which can be applied when the input a1...ai has been read and the lookahead
input symbol is ai+1.
In the GSS, all successors and predecessors of a symbol node are state nodes and all
successors and predecessors of a state node are symbol nodes. The GSS constructed from
Figure 3.7: Representation of a shift transition in a GSS.
input a1...an contains a subgraph as represented in Figure 3.7, which shows that there is
a node in Ui−1 labeled k, a node in Ui labeled h, and a transition labeled ai from k to h in
the DFA. This step corresponds to the shift action in the LR Parser, where the input
symbol is read and added to the top of the stack.
Figure 3.8: Representation of a reduce transition in a GSS.
Similarly, the reduce action in the LR Parser, which consists of removing from the
stack as many elements as there are symbols in the RHS of the rule for the given
non-terminal, is represented in the GSS in the form shown in Figure 3.8, where
u ∈ Uj, w ∈ Ui, A ::= x1...xm is a grammar rule and there is a transition from w to v
labeled A in the DFA. This way, the node v is reduction related to the node w via a path
of length 2m and a symbol node labeled A.
Figure 3.9: Representation of a reduce transition for a ε-transition in a GSS.
Since the reduction removes |RHS| elements from the stack, reducing a rule of the
form A ::= ε will create a link between v and u, because the |RHS| is zero. This transition
may be seen in Figure 3.9.
When the GSS is complete, the final set Un of state nodes is examined. The input
string is in the language if, and only if, Un contains the accepting state of the DFA.
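If the levels Ui are represented simply as sets of DFA state labels, this acceptance condition reads as follows (a schematic fragment of our own; the state-node and symbol-node details are omitted):

```python
def accepts(levels, accepting_states):
    """Tomita-style acceptance test: the input a1...an is in the language
    if, and only if, the final level Un contains an accepting DFA state."""
    return bool(levels[-1] & accepting_states)

# Levels built while parsing some input, with DFA accepting state 1:
assert accepts([{0}, {2}, {4}, {3, 5}, {1}], {1}) is True
assert accepts([{0}, {2}], {1}) is False
```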
The GSS in Figure 3.10 is obtained by parsing the string ( ( ) ) for the Context-Free
Grammar {S → ( P ), P → ( P ), P → ε}. For the initial state, a node v0, labeled 0,
is added to the GSS in the U0 set. The first input symbol, (, is read. The state machine
Figure 3.10: GSS generated for parsing the input string ( ( ) ) for the grammar {S → ( P ), P → ( P ), P → ε}.
for the DFA moves to the DFA state i2, which is the target of the transition labeled (
from the current state. Since this is a shift operation, a node v1, labeled 2, is added to
the U1 set, together with a symbol node labeled (, this symbol node being the successor
of v0 and the predecessor of v1.
The next step is to look at state 2 in the DFA to identify whether it has any valid
reductions on the next input symbol (. Since only one symbol was consumed, the items
being searched for are of the form (A ::= a1·, a2). For each one of these items, a new node
is added to U1, connected to v1 by the terminal symbol used in the transition in the
DFA, if not yet present in the GSS.
The process continues reading input symbols until either it finds a symbol for which
no transition is defined, meaning the input string is not in the language and a failure is
reported, or it reaches an accepting state of the DFA in Un, reporting success.
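Before generalizing to graphs, it is worth seeing the degenerate case: ordinary LR(1) recognition of a single string with one stack, which is what the GSS construction above generalizes. A minimal sketch, with a hand-written table for the grammar {S → ( P ), P → ( P ), P → ε} used in the Figure 3.10 example (augmented with S' → S $; the state numbering is ours):

```python
# Hand-written LR(1) table for: S' -> S $ ; S -> ( P ) ; P -> ( P ) | eps
ACTION = {
    0: {"(": ("s", 2)},
    1: {"$": ("acc",)},
    2: {"(": ("s", 4), ")": ("r", "P", 0)},  # reduce P -> eps
    3: {")": ("s", 5)},
    4: {"(": ("s", 4), ")": ("r", "P", 0)},  # reduce P -> eps
    5: {"$": ("r", "S", 3)},                 # reduce S -> ( P )
    6: {")": ("s", 7)},
    7: {")": ("r", "P", 3)},                 # reduce P -> ( P )
}
GOTO = {0: {"S": 1}, 2: {"P": 3}, 4: {"P": 6}}

def lr_recognize(tokens):
    """Classic LR(1) recognition of a single finite string with one stack
    (the degenerate, single-path case of the GSS construction)."""
    stack = [0]                  # stack of DFA states
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        act = ACTION[stack[-1]].get(tokens[pos])
        if act is None:
            return False         # unexpected symbol: not in the language
        if act[0] == "acc":
            return True
        if act[0] == "s":        # shift: push the target state, consume input
            stack.append(act[1])
            pos += 1
        else:                    # reduce A -> alpha: pop |alpha| states, goto
            _, lhs, rhs_len = act
            del stack[len(stack) - rhs_len:]
            stack.append(GOTO[stack[-1]][lhs])

assert lr_recognize("(())") is True
assert lr_recognize("(()") is False
```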
After analyzing the related works, we have enough information to start developing our
solution to the proposed problems. To be able to use GSS in our research, we need to
extend it so we can store data for the multiple traces that may be found in a graph. These
adaptations to the GSS are presented in Chapter 4. Also, if we were using the conventional
LR parsing algorithm, we would have to run the parser and create one parsing stack for
each vertex and each path, but the GSS data structure allows us to identify all paths at
once, with a single instance of the GSS. Our algorithm allows increased expressiveness
for the queries when compared to the related works whose path queries are described by
Regular Grammars. Another contribution of our algorithm is that it uses a technique
traditionally used for parsing strings of Context-Free Grammars to process the queries.
4 The GrLR Query Processing Algorithm Approach
Given a graph D formed by a set of tuples (v, e, w), where v, w ∈ V are vertices of the
graph and e ∈ E is the edge connecting both vertices, containing the data to be queried,
and a Context-Free Grammar G = (N, T, S, P) to define the valid paths, in this chapter
we use some of the concepts presented during this research to explain how we solve the
proposed problems. In Section 4.1, we present and detail the final version of the algorithm,
showing an example execution. In Section 4.2, we calculate our algorithm's time and
space complexity and, in Section 4.3, prove that our algorithm is correct.
The LR Parsing algorithm was originally designed to parse a single finite string of
symbols. The algorithm continues parsing the string while it is still considered to be a
valid sentence in the language formed by the input grammar, stopping when reaching the
last symbol of the string or when finding an unexpected symbol.
Since the graph may have multiple paths that form valid sentences, we may have to
parse multiple strings. We start at a set of vertices in the data graph and use the labels of
the edges forming traces W = v0 e0 v1 e1 ... e(n−1) vn to other nodes as the strings to be
parsed. When identifying the traces, if the algorithm reaches a vertex in the graph that
has more than one outgoing edge, it tries to parse one path for each edge. Given a
Context-Free Grammar G = (N, T, S, P), if the algorithm were parsing the graph shown
in Figure 4.1, it would find the traces a -(-> b -)-> c and a -(-> b -)-> d, provided the
string ( ) belongs to the language generated by G. The answer, in this case, would be a
set containing (a, S, c) and (a, S, d), with S being the start symbol of the grammar.
We replace the ordinary stack used by the original LR parsing algorithm with the
GSS structure. We modified the GSS node structure to also store the vertex in the graph
which was just processed. This allows us to easily identify the vertices when calculating
the answers for the query.

Figure 4.1: Paths identified in a graph.
Before presenting the algorithm itself, let us define some auxiliary functions specific
to the manipulation of the GSS and the parsing table. The first function is
CreateParsingTable(G), which receives the grammar and builds a table with |I| rows
and |T| columns, with I being the set of possible states of the DFA. This function also
returns the initial state s0 of the DFA. Each element in the table may contain a valid
action to be performed by the parsing algorithm when reaching a state i ∈ I and receiving
an input symbol t ∈ T. Since the creation of the parsing table is not in the scope of this
research, this function uses an external tool, called JSMachines [13], for building the
LR(1) parsing table.
Example 12 Let’s suppose that our algorithm has built a GSS as shown in Figure 4.2
and is ready to process the GSS pair labeled v3. Looking at the parsing table, it sees
that the next action to perform is to reduce the rule P → a a, whose RHS size is 2. The
result of a call to GSS_Up in this node, passing 2 as third parameter is the set of nodes
{v0, v1}.
Figure 4.2: Execution of a GSS_Up function call.
For the GSS specific operations, we have the function CreateGSS(Q, s0), which
initializes the GSS with a set of pairs in the form (v, s0), associating each v ∈ Q with the
initial state s0 of the DFA. We also have the function GSS_Pairs(GSS, level), a
convenient way to retrieve all the pairs in a given level of the GSS. The third function
is called GSS_Up(GSS, (a, si), |α|) and returns all GSS pairs that originated the path
being processed for the current rule, as shown in Example 12. This function follows the
same implementation as in the original GSS algorithm, with the difference that it detects
the multiple paths being followed. It starts at the (a, si) pair in the current GSS level
and traverses the GSS from there |α| steps back, returning the pairs reached when the
traversal stops. The fourth and last function is called GSS_Insert_Pair(GSS, level,
GSSPair), which inserts the given pair in the GSS at the given level.
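These helpers can be sketched as follows. This is an illustration under our own representation choices (a GSS pair is a (vertex, state) tuple, each level is a set of pairs, and predecessor links record one grammar symbol per step), not the thesis implementation [9]:

```python
class GSS:
    """Levels of (vertex, state) pairs plus predecessor links."""
    def __init__(self):
        self.levels = []   # levels[i] = set of (vertex, state) pairs in Ui
        self.preds = {}    # (level, pair) -> set of predecessor (level, pair)

def create_gss(Q, s0):
    """CreateGSS(Q, s0): level U0 holds (v, s0) for each start vertex v."""
    gss = GSS()
    gss.levels.append({(v, s0) for v in Q})
    for pair in gss.levels[0]:
        gss.preds[(0, pair)] = set()
    return gss

def gss_pairs(gss, level):
    """GSS_Pairs(GSS, level): all pairs stored at the given level."""
    return set(gss.levels[level]) if level < len(gss.levels) else set()

def gss_insert_pair(gss, level, pair, pred=None):
    """GSS_Insert_Pair(GSS, level, GSSPair): add a pair at a level and,
    optionally, link it to the node it was reached from."""
    while len(gss.levels) <= level:
        gss.levels.append(set())
    gss.levels[level].add(pair)
    gss.preds.setdefault((level, pair), set())
    if pred is not None:
        gss.preds[(level, pair)].add(pred)

def gss_up(gss, level, pair, steps):
    """GSS_Up(GSS, (a, si), |alpha|): walk |alpha| predecessor links back,
    following every path at once, and return the nodes reached."""
    frontier = {(level, pair)}
    for _ in range(steps):
        frontier = {p for node in frontier for p in gss.preds.get(node, set())}
    return frontier

# Rebuilding the GSS of Figure 4.2: v0 = (a, i0) and v1 = (b, i0) at U0,
# v2 = (b, i2) at U1 (reached from both v0 and v1), v3 = (d, i4) at U2.
g = create_gss({"a", "b"}, 0)
gss_insert_pair(g, 1, ("b", 2), pred=(0, ("a", 0)))
gss_insert_pair(g, 1, ("b", 2), pred=(0, ("b", 0)))
gss_insert_pair(g, 2, ("d", 4), pred=(1, ("b", 2)))
# Reducing P -> a a (|alpha| = 2) from v3 yields {v0, v1}, as in Example 12.
assert gss_up(g, 2, ("d", 4), 2) == {(0, ("a", 0)), (0, ("b", 0))}
```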
4.1 The algorithm
The function for querying the graph is called GrLR, defined in Algorithm 4, which takes
as input a data graph DG ⊆ V × E × V to query from, a Context-Free Grammar G =
(N, T, S, P ) and a Context-Free Path Query Q ⊆ V , which contains the vertices in the
data graph to start the query from.
The first part of the algorithm initializes the parsing table for the given grammar and
the GSS by filling its U0 level with all the starting vertices for the query, associated with
the starting DFA state for the grammar (lines 2 to 7).
With the GSS initialized, it is time to perform the actual query. The algorithm starts
processing the valid paths in the main while loop, starting at line 8. Each iteration of
the loop parses the current symbol of each path being analyzed and tries to process three
kinds of operations. First, it tries to process reduce operations (lines 10 to 25), then it
tries to identify valid answers by processing accept operations (lines 26 to 32) and then
the shift operations (lines 33 to 41). If no valid operation is defined for the current input
symbol, our algorithm simply abandons the path being followed (we do not return an
error as done in the original LR parsing algorithm).
This follows the same logic of the LR parsing algorithm using GSS instead of a stack.
The variable changed is set to FALSE in the beginning of each iteration, at line 9. If any of
these three operations is performed and produces a newly discovered element in the GSS
structure, then the algorithm sets the changed variable to TRUE (lines 25, 32 and 41). The
main loop iterates until the changed variable is not set to TRUE in a given iteration. The
verification for this is done at line 42.
We need to discuss the stop condition of the algorithm before explaining the three
specific operations in depth. Given the grammar G = {S → (S), S → ε}, it is possible
that the graph has paths that create a valid loop according to the grammar, as seen in
Figure 4.3.
Figure 4.3: Graph with a loop.
We defined that, in this case, a good stop condition would be to have three sets of
values that should be checked after each loop iteration, which we called ReductionEdges,
Answers and VisitedPairs (lines 4 to 6). The ReductionEdges set includes tuples in the
format (a, N, b), storing the information about all reductions made between vertices a
and b by a rule whose LHS is N. Whenever a new tuple is found during a reduction, we
set changed to TRUE (lines 23 to 25). The Answers set includes pairs in the format (a, b),
representing all valid answers to the query, meaning that there is a valid path, according
to the given grammar, from a to b. Whenever a new answer is found, the changed variable
is set to TRUE (lines 30 to 32). The VisitedPairs set includes pairs in the format (a, i),
meaning that the algorithm has already visited a vertex a of the data graph in the state
i of the DFA. If any new pair is found, then the changed variable is set to TRUE (lines
39 to 41). At the end of the main loop the algorithm checks whether the changed variable
is set to TRUE, increases the level by one and continues (line 43). If it remains FALSE,
the algorithm stops (line 42).
Now let us discuss the three specific operations. Given a data graph D ⊆ V × E × V,
the GSS contains pairs (a, i) at each level, where a ∈ V and i is a state in the DFA of
the given grammar.
Processing Reduces. First, the algorithm retrieves all the pairs (a, i) at the current
level of the GSS (lines 12 and 13). For each edge in the data graph in the format (a, t, b),
with a, b ∈ V and t ∈ E, the algorithm looks in the parsing table if there is a valid reduce
Algorithm 4: GrLR Query Processing Algorithm.

input : - a data graph DG ⊆ V × E × V;
        - a Context-Free Grammar G = (N, T, S, P);
        - a Context-Free Path Query Q ⊆ V.
output: - AnswersG(Q).

 1  Function GrLR(DG, G, Q) : AnswersG(Q)
 2      (ParsingTable, s0) ← CreateParsingTable(G)
 3      GSS ← CreateGSS(Q, s0)
 4      VisitedPairs ← ∅
 5      ReductionEdges ← ∅
 6      Answers ← ∅
 7      level ← 0
 8      while TRUE do
 9          changed ← FALSE
            // processing reduces
10          PairsToProcess ← GSS_Pairs(GSS, level)
11          while PairsToProcess ≠ ∅ do
12              choose (a, si) ∈ PairsToProcess
13              PairsToProcess ← PairsToProcess \ {(a, si)}
14              NextTerminals ← {terminal | (a, terminal, b) ∈ DG} ∪ {$}
15              for each terminal ∈ NextTerminals do
16                  for each ParsingTable[si][terminal] do
17                      if ParsingTable[si][terminal] = REDUCE A → α then
18                          Ancestors ← GSS_Up(GSS, (a, si), |α|)
19                          for each (c, sj) ∈ Ancestors do
20                              GSSPair ← (a, ParsingTable[sj][A])
21                              GSS_Insert_Pair(GSS, level, GSSPair)
22                              PairsToProcess ← PairsToProcess ∪ {GSSPair}
23                              if (c, A, a) ∉ ReductionEdges then
24                                  ReductionEdges ← ReductionEdges ∪ {(c, A, a)}
25                                  changed ← TRUE
            // processing accept states
26          for each (a, si) ∈ GSS_Pairs(GSS, level) do
27              if ParsingTable[si][$] = ACCEPT then
28                  Ancestors ← GSS_Up(GSS, (a, si), 1)
29                  for each (c, sj) ∈ Ancestors do
30                      if (c, a) ∉ Answers then
31                          Answers ← Answers ∪ {(c, a)}
32                          changed ← TRUE
            // processing shifts
33          for each (a, si) ∈ GSS_Pairs(GSS, level) do
34              for each (a, terminal, b) ∈ DG do
35                  for each ParsingTable[si][terminal] do
36                      if ParsingTable[si][terminal] = SHIFT sj then
37                          GSSPair ← (b, sj)
38                          GSS_Insert_Pair(GSS, level + 1, GSSPair)
39                          if (b, sj) ∉ VisitedPairs then
40                              VisitedPairs ← VisitedPairs ∪ {(b, sj)}
41                              changed ← TRUE
            // has VisitedPairs or ReductionEdges changed at this level?
42          if not (changed) then break
43          level ← level + 1
44      return Answers
action for the DFA state i when looking at the symbol t (lines 15 and 16). If any is
found, then the algorithm calls the GSS_Up function, which returns the GSS pairs at the
beginning of the path where the parsing started for the rule currently being reduced,
and adds new pairs in the current GSS level connected to them by the non-terminal
represented in the LHS of the rule being processed (lines 18 to 22).
Processing Accepts. After the reductions, the algorithm fetches all the GSS pairs
(a, i) (including the ones generated by the reductions on the same level) in the current
GSS level (line 26) and consults the parsing table for accept actions for the DFA state i
(line 27). If one is found, the algorithm calls the GSS_Up function to identify the pairs
(c, j) which originated the valid paths that reached a, storing the pair (c, a) in the
Answers set (lines 28 to 31), meaning that there is a valid path from c to a following the
start symbol S, according to the given grammar.
Processing Shifts. The last action to process is the shift. The algorithm fetches all
the GSS pairs (a, i) (including the ones generated by the reductions on the same level)
in the current GSS level. For each edge in the data graph in the format (a, t, b), the
algorithm looks in the parsing table if there is a valid shift action to the DFA state j
from state i when looking at the symbol t (lines 35 and 36). If any is found, then the
algorithm adds a new pair (b, j) on the next level of the GSS (line 38).
4.1.1 Algorithm execution example
In this section, we present an execution example for the algorithm. A JavaScript imple-
mentation of our algorithm is available in [9] and may be used to reproduce the steps
described in this section. Consider the graph D in Figure 4.4a as the data graph to
be queried with the input grammar G represented in Figure 4.4b. We want to identify
nodes connected by paths that correspond to expressions with matching parentheses,
possibly nested. Examples of paths that correspond to the query are a -(-> b -)-> c,
a -(-> b -(-> d -)-> e -)-> f and b -(-> d -)-> e.
To do so, we initialize the parsing table for the given grammar, represented in Figure
4.4c, and initialize a GSS containing nodes for all the vertices in the given graph on the
(a) Data graph D: vertices a, b, c, d, e and f, with edges a -(-> b, b -)-> c, b -(-> d,
d -)-> e and e -)-> f.

(b) LR Grammar G:

S' → S $
S → ( P )
P → ε | ( P )

(c) Parsing table for the input grammar:

State   Action            Goto
        (     )     $     S'    S     P
i0      s2                      1
i1                  acc
i2      s4    r2                      3
i3            s5
i4      s4    r2                      6
i5                  r1
i6            s7
i7            r3
Figure 4.4: Input data for the algorithm example.
level U0. This means that we plan to identify all paths starting at each vertex in the
graph at once, according to the given grammar, by passing Q = {a, b, c, d, e, f} to the
algorithm.
The GSS in Figure 4.5 represents the initial state for the parsing method, containing
one node for each vertex in the graph, all of them pointing to the state i0 in the DFA.
Figure 4.5: Initialization of the GSS for the graph in Figure 4.4a.
To start querying, we call GrLR(D,G,Q) and start analyzing the level U0. For each
node (v, i) in this level, we identify its vertex v in the graph and read the labels of outgoing
edges (v, e, w). For each label found, we look at the parsing table to identify the actions
to take at state i, according to the outgoing edge’s label e. The algorithm identifies that
there are no reduce or accept actions to make in this level, but there are two shift actions
from node v0 when reading the edge a -(-> b and from node v1 when reading the edge
b -(-> d, both leading to state i2, and adds the two pairs (b, i2) and (d, i2) to level U1.
These pairs are also added to the VisitedPairs set. Since this set was previously empty, the changed
resulting GSS after processing level U0.
Figure 4.6: Resulting GSS after processing level U0.
Starting to process the U1 level, the algorithm identifies in the parsing table that there
is a reduce action from node v6 when reading the edge b -)-> c and another reduce action
from node v7 when reading the edge d -)-> e. Both reductions happen for the rule P → ε,
which has |RHS| = 0. The algorithm then calls the GSS_Up function to identify the GSS
pairs where the parsing of the current rule started. In this case, jumping zero times, the
reduction root returned is the GSS node being processed itself. The algorithm adds two
pairs to the same level being processed, v8 pointing to v6 and v9 pointing to v7, connected
by the LHS of the rule, P in this case.
The algorithm verifies that the tuples (b, P, b) and (d, P, d) did not yet exist in the
ReductionEdges set and adds them to it, also setting the changed variable to TRUE. After
this, the algorithm finds three shift actions to make from the nodes v6, v8 and v9, looks
at the parsing table to see which state each of them should take and adds the resulting
pairs to the next level.
The changed variable was set to TRUE, so the level variable is increased by one. Figure
44
4.7 shows the resulting GSS after processing the level U1.
a, i0
v0
b, i0
v1
...
f, i0
v5
( b, i2
v6
P b, i3
v8
( d, i2
v7
P d, i3
v9
( d, i4
v10
) c, i5
v11
) e, i5
v12
U0 U1 U2
Figure 4.7: Resulting GSS after processing level U1.
Since the changed variable was set to TRUE, we need to process the current GSS level.
Starting on the level U2, the algorithm first identifies three reduce actions from the GSS
nodes v10, v11 and v12 and calls the GSS_Up function to figure out where the parsing for
the rule being reduced started. The node v10 with the edge d -)-> e is reduced by the rule
P → ε. The nodes v11 and v12, with the end of the string, are reduced by the rule S → (P),
finding the nodes v0 and v1, respectively, as reduction roots. The tuples (d, P, d), (a, S, c)
and (b, S, e) are added to the ReductionEdges set and the changed variable is set to TRUE.
Next, it tries to identify the accept actions, and finds two of them, on nodes v14 and v15.
Looking at the results of GSS_Up, the algorithm adds the tuples (a, c) and (b, e) to the
Answers set. After finding the accept actions, the algorithm identifies that there is one
shift action from the GSS node v13 with the edge d -)-> e to the state i7 and adds the
(e, i7) pair to the VisitedPairs set. Figure 4.8 shows the resulting GSS after processing
the level U2.
Since a new node was added to a new level, we need to process it, moving on to level
U3. Here, the algorithm finds one reduce action from the node v16 with the edge
e -)-> f. This time, the reduction rule is P → (P), which means the algorithm needs
to go back three steps to find the reduction root v6. After this, there is also a shift action
on the node v17 with the edge e -)-> f.
After processing the level U3 and finding a shift action, the algorithm starts a new
Figure 4.8: Resulting GSS after processing level U2.
iteration, this time for level U4. Here, the algorithm only finds the node v18, which only
accepts a reduction by the rule S → (P) when given the end of the string. The reduction
root is v0. While searching for accepts, the algorithm finds that the newly added GSS
node v19 has an acceptance state in the parsing table and adds the pair (a, f) to the answers.
The algorithm stops because there were no shift actions performed while parsing the
current GSS level, so the level U5 is empty. The algorithm, then, returns the answers {(a,
c), (a, f), (b, e)}.
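The whole run above can be reproduced with a compact Python transcription of Algorithm 4. This is a sketch under our own data-structure choices (the parsing table of Figure 4.4c is written by hand; the reference implementation remains the JavaScript one in [9]):

```python
# Graph of Figure 4.4a and a hand-coded version of the LR(1) table of
# Figure 4.4c for the grammar  S' -> S $ ; S -> ( P ) ; P -> eps | ( P ).
EDGES = {("a", "(", "b"), ("b", ")", "c"), ("b", "(", "d"),
         ("d", ")", "e"), ("e", ")", "f")}
ACTION = {0: {"(": ("shift", 2)},
          1: {"$": ("accept",)},
          2: {"(": ("shift", 4), ")": ("reduce", "P", 0)},   # P -> eps
          3: {")": ("shift", 5)},
          4: {"(": ("shift", 4), ")": ("reduce", "P", 0)},   # P -> eps
          5: {"$": ("reduce", "S", 3)},                      # S -> ( P )
          6: {")": ("shift", 7)},
          7: {")": ("reduce", "P", 3)}}                      # P -> ( P )
GOTO = {0: {"S": 1}, 2: {"P": 3}, 4: {"P": 6}}

def grlr(edges, action, goto, Q, s0=0):
    levels = [{(v, s0) for v in Q}]              # U0: one pair per start vertex
    preds = {(0, p): set() for p in levels[0]}   # pred links, one symbol each
    out = {}
    for (u, t, w) in edges:
        out.setdefault(u, []).append((t, w))
    reduction_edges, answers, visited = set(), set(), set()

    def up(node, steps):                         # GSS_Up: |alpha| links back
        frontier = {node}
        for _ in range(steps):
            frontier = {p for n in frontier for p in preds[n]}
        return frontier

    def insert(lvl, pair, pred):                 # GSS_Insert_Pair + pred link
        while len(levels) <= lvl:
            levels.append(set())
        levels[lvl].add(pair)
        preds.setdefault((lvl, pair), set()).add(pred)

    level = 0
    while level < len(levels):
        changed = False
        todo = list(levels[level])
        while todo:                              # process reduces
            a, si = todo.pop()
            lookaheads = {t for (t, _) in out.get(a, [])} | {"$"}
            for t in sorted(lookaheads):
                act = action[si].get(t)
                if act and act[0] == "reduce":
                    _, lhs, size = act
                    for lc, (c, sj) in up((level, (a, si)), size):
                        new_pair = (a, goto[sj][lhs])
                        if new_pair not in levels[level]:
                            todo.append(new_pair)
                        insert(level, new_pair, (lc, (c, sj)))
                        if (c, lhs, a) not in reduction_edges:
                            reduction_edges.add((c, lhs, a))
                            changed = True
        for (a, si) in set(levels[level]):       # process accepts
            act = action[si].get("$")
            if act and act[0] == "accept":
                for _, (c, _) in up((level, (a, si)), 1):
                    if (c, a) not in answers:
                        answers.add((c, a))
                        changed = True
        for (a, si) in set(levels[level]):       # process shifts
            for (t, b) in out.get(a, []):
                act = action[si].get(t)
                if act and act[0] == "shift":
                    insert(level + 1, (b, act[1]), (level, (a, si)))
                    if (b, act[1]) not in visited:
                        visited.add((b, act[1]))
                        changed = True
        if not changed:
            break
        level += 1
    return answers

print(sorted(grlr(EDGES, ACTION, GOTO, {"a", "b", "c", "d", "e", "f"})))
# -> [('a', 'c'), ('a', 'f'), ('b', 'e')]
```

The run builds the same levels U0 to U4 as the figures above and converges after level U4, returning exactly the answer set of the example.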
4.2 Complexity
We dedicate this section to calculate our algorithm’s runtime and space complexity in the
worst case scenario, which corresponds to queries over a complete data graph. In this
context, as seen in Figure 4.10, a complete graph contains edges connecting every node
to every node with each terminal in the given grammar. This means that there is a path
to every node from every node, causing the algorithm to execute the maximum number
of shift and reduce operations. In addition to that, we consider that the queries will be
Figure 4.9: Resulting GSS after processing level U4.
made starting on all vertices in the graph, so that every node has to be investigated as
the source of a path.
Figure 4.10: Complete graph with three vertices.
4.2.1 Runtime complexity
Given a data graph D ⊆ V × E × V to query from, a Context-Free Grammar G =
(N, T, S, P ) and a Context-Free Path Query Q ⊆ V , we call ρ the size of the largest RHS
of the rules in P , and I the set of states in the DFA generated for the given grammar.
We can consider the algorithm as being composed of two main parts: initialization and
iterations. During the initialization, the algorithm initializes the GSS and generates |Q|
pairs on the U0 level. In the worst case scenario, the user is querying from all vertices of
the data graph, so |Q| = |V|. Considering that the creation of each GSS node has a fixed
cost k, we consider the runtime complexity in the worst case scenario for the initialization
to be k · |Q|, which gives us O(|V|).
After the initialization, the algorithm enters the main loop and iterates until no new
elements may be added in the ReductionEdges, Answers and VisitedPairs sets. The
ReductionEdges set contains tuples (v′, n, v), where v′, v ∈ V, which represent the
connections by a non-terminal n ∈ N found between the vertices v′, v of the data graph
during the processing of a reduction. The maximum number of elements which may be
added to the ReductionEdges set is |V|^2 · |N|. The Answers set contains tuples (v′, v),
where v′, v ∈ V, which are considered answers to the query. It contains all the connections
found between the starting vertices v′ ∈ Q of the query and other vertices v ∈ V in
the data graph which can be connected via the starting non-terminal S, and can have, at
maximum, |V|^2 elements. The VisitedPairs set contains pairs (v, i), where v ∈ V and
i ∈ I, which represent the states of the DFA that the algorithm found when reaching a given
vertex in the data graph. At maximum, it can contain |V| · |I| elements. Our algorithm
stops running whenever those three sets have converged to fixed points.
In order to calculate the runtime complexity of the main loop of the algorithm, first
we analyze the complexity of the reduce, accept and shift operations in separate. Also,
we have to compute the runtime complexity of the GSS_Up function, which is called at
lines 18 and 28.
The GSS_Up function is responsible for finding the ancestor GSS nodes for a reduc-
tion in a given GSS node, which is located some levels behind in the GSS, according to
the |RHS| of the rule being reduced. The GSS_Up function receives the GSS node which
enabled the reduction of the production rule and the number of steps to move back. In the
worst case scenario, the number of steps is ρ. At each GSS level, the function GSS_Up
finds at most |V| · |I| GSS nodes to investigate. For each GSS node found at a given
level, GSS_Up is called recursively, ρ times in the worst case. In this manner, the
worst case complexity of the GSS_Up function is O((|V| · |I|)^ρ).
Process Reduces. At line 10, the maximum number of pairs (v, i) returned by
GSS_Pairs is |V| · |I|, which is the maximum amount of GSS nodes that can be added
to any GSS level. The algorithm iterates through this set (line 11) and, for each pair
found, iterates through all triples (v, e, w) in the data graph (line 15), which, in the worst
case scenario, is exactly |T| · |V| per GSS pair. The algorithm then calls the GSS_Up
function in all of these cases, giving us |V|^(2+ρ) · |I|^(1+ρ) · |T| operations. The algorithm
then adds the tuples (v′, A, v) to the ReductionEdges set if they were not yet present. Since
the algorithm will, in the worst case scenario, process enough reduce operations to fill the
ReductionEdges set, we need to multiply the complexity of a single reduce operation by
|V|^2 · |N|. The complete formula for the runtime complexity of the reduce operations is
O( |V|^(4+ρ) · |I|^(1+ρ) · |T| · |N| ).
Process Accepts. At line 26, the maximum number of pairs retrieved by GSS_Pairs
is |V| · |I|, which is the maximum amount of GSS nodes that can be added to any GSS
level. In the worst case scenario, the algorithm calls GSS_Up for each one of them,
giving us (|V| · |I|)^(1+ρ) operations. Since the algorithm will, in the worst case scenario,
process enough accept operations to completely fill the Answers set, we need to multiply
this number by |V|^2, which is the maximum amount of elements that can be added to the
Answers set. The complete formula for the runtime complexity of the accept operations
is O( |V|^(3+ρ) · |I|^(1+ρ) ).
Process Shifts. At line 33, the maximum number of pairs (v, i) retrieved by GSS_Pairs
is |V||I|, which is the maximum number of GSS nodes that can be added to any GSS
level. For each one of them, the algorithm iterates through all edges (v, e, w) (line 34),
which takes |T||V| operations, and executes |I| shift operations (line 35). This gives us
(|V||I|)^2|T| operations for a single pass of shift processing. Since the algorithm will, in
the worst-case scenario, process enough shift operations to completely fill the VisitedPairs
set, we need to multiply this number by |V||I|, which is the maximum number of elements
that can be added to the VisitedPairs set. The complete formula for the runtime complexity
of the shift operations is O( (|V||I|)^3|T| ).
The complete formula for the runtime complexity of our algorithm, in terms of operations, is:

O( |V| + |V|^{4+ρ}|I|^{1+ρ}|T||N| + |V|^{3+ρ}|I|^{1+ρ} + (|V||I|)^3|T| )
Since the processing of the reduce operations in the algorithm has the highest runtime
complexity, the complexity of our algorithm in the worst case scenario is:
O( |V|^{4+ρ}|I|^{1+ρ}|T||N| )
By looking at this formula, one might think that converting the grammar to the
Chomsky Normal Form (CNF) would improve the algorithm's runtime complexity, since
the largest RHS (ρ) among the production rules of a grammar in this form is 2. However,
our experiments indicated that, while there would be fewer symbols to process per
production rule, an increased number of reductions would be required to parse the strings
of the grammar, increasing the execution time.
4.2.2 Space complexity
The level U_0 of the GSS contains at most |V| tuples (v, i_0). During the processing, each
level of the GSS may have up to |V||I| pairs, connected by |V|^2|T| edges. The number of
GSS levels to explore depends on the maximum number of elements that may be stored
in the ReductionEdges, Answers and VisitedPairs sets. So, we need to store up to
|V||I| · (|V|^2|N| + |V|^2 + |V||I|) pairs and |V|^2|T| · (|V|^2|N| + |V|^2 + |V||I|) edges. Thus,
the space complexity of our algorithm is O( |V|^3|N||I| + |V|^2|I|^2 + |V|^4|N||T| + |V|^3|T||I| ).
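Expanding the two products, and then dropping the terms dominated by others (|V|^3|I| is dominated by |V|^3|N||I|, and |V|^4|T| by |V|^4|N||T|), yields exactly this bound:

```latex
\begin{aligned}
|V||I|\,\bigl(|V|^{2}|N| + |V|^{2} + |V||I|\bigr) &= |V|^{3}|N||I| + |V|^{3}|I| + |V|^{2}|I|^{2} \\
|V|^{2}|T|\,\bigl(|V|^{2}|N| + |V|^{2} + |V||I|\bigr) &= |V|^{4}|N||T| + |V|^{4}|T| + |V|^{3}|T||I|
\end{aligned}
```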
4.3 Discussion about correctness
In this section, we discuss the correctness of the algorithm. The algorithm starts
with VisitedPairs, ReductionEdges and Answers empty; these sets are filled as the
parsing proceeds and the algorithm processes the connections between the vertices
in the data graph according to the grammar rules. The first step of the algorithm is
to create the GSS nodes referring to the starting vertices of the query, on the GSS level
U_0. From this point, the algorithm reaches the main loop and iterates until no new
information is added to those three sets. To find new information, the algorithm
consults the parsing table of the given grammar for valid actions according to the labels
of the edges leaving the vertices currently being parsed.
In the main loop, the algorithm identifies which operations may be executed by the
parser for each GSS node in the current GSS level. To process reductions, the algorithm
first fetches all GSS nodes in the current GSS level by calling GSS_Pairs (line 10). For
each node (a, i) found, the algorithm iterates through all tuples (a, e, b) in the data graph
starting with a and checks whether there is a reduce action allowed by the parsing
table of the given grammar for the symbol e. If the action is allowed, then a production
rule n → RHS will be reduced. The algorithm adds a new GSS node in the current
level, pointing to the reduction roots (c, j) as ancestor nodes. The reduction roots are
the GSS nodes where the parsing of the string being reduced was initiated, with c being
the vertex in the data graph and j the state the algorithm was in when the parsing of
the rule began. This reduction is made in the same way as in the original GSS algorithm.
After this, for each reduction node found, if not yet present, the algorithm adds one tuple
(c, n, a) to ReductionEdges and sets the changed variable to TRUE.
The next step in the main loop is to identify whether there are accept actions allowed by
the grammar for any of the GSS nodes in the current GSS level. The algorithm calls
GSS_Pairs (line 26) to get the (a, i) tuples of GSS nodes in the current level. For each
one of them, the algorithm verifies whether there is an accept action enabled by the
parsing table for the end symbol $. If the accept is defined in the parsing table, then the
algorithm has found a valid answer: the vertex in the GSS node being parsed can be
reached from one vertex of the query via S, the start symbol of the grammar. The
algorithm calls GSS_Up passing 1 as the number of steps to look back and finds GSS
nodes (c, i) in the U_0 level. If not yet present, the pair (c, a) is added to the Answers set
and the changed variable is set to TRUE. This new element is considered a valid answer
to the query and means that there is a connection between the query vertex c and a
vertex a in the data graph via the starting non-terminal S.
The last step in the main loop is to identify whether there are shift actions allowed by
the grammar for any of the GSS nodes in the current GSS level. The algorithm calls
GSS_Pairs (line 33) to get the (a, i) tuples of GSS nodes in the current level. For
each pair found, the algorithm iterates through all tuples (a, e, b) in the data graph
starting with a and checks whether there is a shift action to state j over e allowed by
the parsing table of the given grammar. If the action is allowed, then the shift operation
is processed in the same way as in the original GSS algorithm. After this, for each
shift operation performed, the algorithm adds one pair (b, j) to VisitedPairs if not yet
present and sets the changed variable to TRUE.
The requirement for the algorithm to keep running (line 42) is that, if there are still
symbols to be parsed, parsing them must result in new elements being added to
VisitedPairs, ReductionEdges or Answers, setting changed to TRUE (lines 25, 32 and
41). The VisitedPairs set may have at most |V||I| elements; ReductionEdges may
contain |V|^2|N| elements; and Answers may contain |V|^2 elements. Since these three
sets are built from finite collections of elements, they are also finite, and the algorithm
will reach a state where there is either nothing else to parse or the three sets no longer
change, keeping the changed variable FALSE at the end of the loop iteration, at which
point the algorithm stops.
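The termination argument mirrors the shape of the main loop. The sketch below is a simplification with hypothetical step functions (each returning the new elements it discovered for the three sets), not the actual prototype:

```python
# Fixed-point skeleton of the main loop: keep iterating while any of the
# three monotonically growing, finite sets gains a new element.

def run_until_fixpoint(process_reduces, process_accepts, process_shifts):
    visited_pairs, reduction_edges, answers = set(), set(), set()
    changed = True
    while changed:
        changed = False
        for step in (process_reduces, process_accepts, process_shifts):
            new_pairs, new_edges, new_answers = step(
                visited_pairs, reduction_edges, answers)
            for target, new in ((visited_pairs, new_pairs),
                                (reduction_edges, new_edges),
                                (answers, new_answers)):
                if not new <= target:      # something genuinely new was found
                    target |= new
                    changed = True
    return visited_pairs, reduction_edges, answers
```

Because the three sets are bounded (by |V||I|, |V|^2|N| and |V|^2 respectively) and only ever grow, the loop must eventually complete an iteration with `changed` still FALSE and stop.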
In this chapter, we presented our algorithm to query graph databases by exploring
paths described by LR(1) grammars. We discussed the runtime and space complexity
of our proposal and presented an example of its usage. In the next chapter, we report
some experiments intended to measure how our algorithm behaves in specific scenarios,
comparing the results with those obtained by the related works.
5 Experiments
We implemented our algorithm in the Python language in order to execute some
experiments and evaluate how it performs. We also compare the results with some of
the related works presented in Chapter 3. The execution time for each experiment was
obtained by measuring the average time of five executions. The experiments were executed
on a computer with 7.3 GB of RAM and an AMD Phenom II X4 B97 processor, running
Ubuntu 16.04 (x64) and Python 2.7. We also benefit from the speed gains provided by
the PyPy Python compiler [26], which uses Just-in-Time (JIT) compilation techniques [2].
This is the same computer as the one used to execute the experiments of C. M.
Medeiros [17], so we can directly compare our results with theirs without speculation.
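The measurement scheme can be reproduced with a few lines of Python; `run_query` below is a placeholder for any of the query executions, not a function of the prototype:

```python
import time

def average_time_ms(run_query, repeats=5):
    """Average wall-clock time of `repeats` executions, in milliseconds,
    following the five-run averaging described above."""
    total = 0.0
    for _ in range(repeats):
        start = time.time()
        run_query()
        total += time.time() - start
    return 1000.0 * total / repeats
```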
To compare the results of our implementation with the results of the other proposals,
we executed the same queries over the same databases used in their experiments (Section
5.1). Other experiments were executed to evaluate how the algorithm behaves in the
worst-case scenarios, like complete graphs, binary trees (Section 5.2) and string graphs
(Section 5.3).
5.1 Ontologies stored as RDF graphs
Both the works of X. Zhang et al. [36] and S. Grigorev et al. [7] performed two specific
queries, referred to as Q1 (Figure 5.1a) and Q2 (Figure 5.1b), to search for data
in some popular RDF databases, namely Skos, Generations, Travel, Univ-bench, Foaf,
People-pets, Funding, Atom-primitive, Biomedical, Pizza and Wine.
S → subClassOf subClassOf−1
S → subClassOf S subClassOf−1
S → type type−1
S → type S type−1
(a)
S → B subClassOf−1
B → subClassOf B subClassOf−1
B → ε
(b)
Figure 5.1: Grammars for Queries Q1 (a) and Q2 (b).
These databases contain basic information on wineries and pizza places, production
and distribution, social network users' relationships, etc. Q1 returns all pairs of
nodes which reside at the same hierarchy level, while Q2 returns the nodes which reside
one level above other nodes.
In this experiment, as expected, we found that the larger the graph and the number
of answers, the longer the queries take to execute. We achieved a performance
similar to the experiments in S. Grigorev et al. [7], but we cannot directly compare those
results because of the difference in the hardware used. Table 5.1 shows the number of
answers and the time that each algorithm takes to perform Q1 starting at all vertices of
the ontology databases. Figure 5.2 shows a bar chart with these results.
Ontology        #tuples  #results  GSSLR      Zhang [36]  Medeiros [17]  Grigorev [7]
                                   time (ms)  time (ms)   time (ms)      time (ms)

skos                252       810         19        1044             83            10
generations         273      2164         20        6091            173            19
travel              277      2499         32       13971            316            24
univ-bench          293      2540         23       20981            318            25
atom-primitive      425     15454        172      515285           2074           255
biomedical          459     15156        223      420604           2288           261
foaf                631      4118         25        5027            377            39
people-pets         640      9472         51       82081            914            89
funding            1086     17634        112         499           1754           212
wine               1839     66572        415     4075319           6797           819
pizza              1980     56195        436     3233587           7292           697
Table 5.1: Performance evaluation for Query Q1 on RDF databases.
Figure 5.2: Visualization of the results for the Query Q1 on RDF databases.
Table 5.2 shows the number of answers and the time that each algorithm takes to perform
Q2 starting at all vertices of the ontology databases. Figure 5.3 shows a bar chart with
these results.
Ontology        #tuples  #results  GSSLR      Zhang [36]  Medeiros [17]  Grigorev [7]
                                   time (ms)  time (ms)   time (ms)      time (ms)

skos                252         1          0          16              4             1
generations         273         0          0          13              3             1
travel              277        63          4         281             22             1
univ-bench          293        81          3         532             26            11
atom-primitive      425       122          1     4711499             45            66
biomedical          459      2871         39     1068851            486            45
foaf                631        10          1        1154             10             2
people-pets         640        37          5         247             23             3
funding            1086      1158         21         125            254            23
wine               1839       133          9         273             70             8
pizza              1980      1262         28      255853            335            29
Table 5.2: Performance evaluation for Query Q2 on RDF databases.
Figure 5.3: Visualization of the results for the Query Q2 on RDF databases.
5.2 Binary trees
S. Grigorev et al. [7] proposed the use of two different grammars, Q3 (Figure 5.5a)
and Q4 (Figure 5.5b), which define the same language; the first is ambiguous,
containing shift/reduce conflicts, while the second is unambiguous. Both grammars may
be used to detect paths in which every a is eventually matched by a corresponding b, as
in "ababab", "aaabbbab", etc.
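Both grammars generate the strings over {a, b} in which every a is matched by a later b, in balanced, possibly nested pairs (e.g. "ab", "aabb", "abab"). A quick membership check, treating a and b like open and close brackets, illustrates the language:

```python
# Membership test for the language of the grammars in Figure 5.5:
# S -> epsilon | a S b | S S  (equivalently S -> a S b S | epsilon).

def in_language(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == 'a' else -1
        if depth < 0:          # a 'b' with no earlier unmatched 'a'
            return False
    return depth == 0          # every 'a' was matched
```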
The trees in the experiment are binary, which means that every node has exactly zero
or two children. All the edges for each level of the tree have one terminal symbol of the
Figure 5.4: Top-down (a) and Bottom-up (b) tree patterns used in the experiment.
S → ε
S → a S b
S → S S

(a)

S → a S b S
S → ε

(b)
Figure 5.5: Grammars for queries Q3 (a) and Q4 (b).
grammar. The paths in the trees follow the pattern v0 −a→ v1 −b→ v2. In the experiment,
there are two terminals, a and b.
The experiment starts by executing the queries for both the Q3 and Q4 grammars on a
tree of height 1, consecutively increasing the tree height. Following what was done by C.
M. Medeiros [17], we also used two tree patterns in our experiments: the first is a tree
where the paths start at the root and spread towards the children (top-down), as seen in
Figure 5.4a, and the second has paths starting at the leaves, directed towards the root
(bottom-up), as seen in Figure 5.4b.
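The top-down pattern can be generated as a list of labelled triples. The alternation of a-levels and b-levels below is an assumption consistent with Figure 5.4a and the path pattern v0 −a→ v1 −b→ v2; this helper is illustrative, not code from the prototype:

```python
# Generate the top-down complete binary tree: node i has children 2i and
# 2i+1, and all edges leaving one level carry the same terminal,
# alternating between 'a' and 'b' from the root down.

def top_down_tree(height):
    """Yield (parent, label, child) triples for a tree of the given height."""
    triples = []
    for parent in range(1, 2 ** (height - 1)):
        level = parent.bit_length()              # root (node 1) is at level 1
        label = 'a' if level % 2 == 1 else 'b'
        triples.append((parent, label, 2 * parent))
        triples.append((parent, label, 2 * parent + 1))
    return triples
```

The bottom-up variant would simply reverse each triple's endpoints, as in Figure 5.4b.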
Table 5.3 and Figure 5.6 show the results of executing the experiment on top-down
binary tree graphs for both the Q3 and Q4 grammars. In this case, our algorithm takes less
time to find the paths in the graph than the algorithm proposed by C. M. Medeiros [17],
but it is notable that, using the ambiguous grammar, our approach takes a considerable
amount of extra time.

This difference starts to show up when using the tree with height 13, which has 8191
triples and produces 39139 results. The sudden increase in execution time at every odd
height value, more noticeable at heights 13 and 15, is due to the increase in the number
of valid paths in these cases, while at the even height values the last level of the tree only
produces invalid paths, requiring fewer operations to be performed.
Height  #Vertices  #Results  Medeiros [17]  Medeiros [17]  GSSLR         GSSLR
                             Q3 time (ms)   Q4 time (ms)   Q3 time (ms)  Q4 time (ms)

 1          0           0            0              0             0            0
 2          3           3            0              0             0            0
 3          7          11            3              3             4            0
 4         15          19            5              4            14            1
 5         31          67           17             17            33           18
 6         63          99           26             25            24            7
 7        127         355           86             90            27           28
 8        255         483          125            123            30           10
 9        511        1763          418            453           127           39
10       1023        2275          582            597           187           44
11       2047        8419         2010           2235           780          254
12       4095       10467         2674           2760          1216          293
13       8191       39139        10158          11057          4663         1534
14      16383       47331        13254          13570          7412         1802
15      32767      178403        55579          59414         27558         8725
Table 5.3: Execution time for the grammars Q3 and Q4 on top-down binary trees.
Figure 5.6: Visualization of the top-down binary tree experiment results.
The second part of this experiment is to perform the queries on the bottom-up trees.
By analyzing Table 5.4 and Figure 5.7, we notice that using the unambiguous grammar
Q4 allows our algorithm to achieve considerably better performance than the algorithm
proposed by C. M. Medeiros [17], but there is a great degradation in performance when
using the ambiguous grammar Q3. This degradation is easily noticeable at height 15,
where the algorithm takes 3,613 milliseconds to execute the query using Q4, against
395,223 milliseconds using Q3.
Even though using both grammars produces the same results, it is notable that there
Height  #Vertices  #Results  Medeiros [17]  Medeiros [17]  GSSLR         GSSLR
                             Q3 time (ms)   Q4 time (ms)   Q3 time (ms)  Q4 time (ms)

 1          0           0            0              0             0            0
 2          3           3            1              0             0            0
 3          7          11            3              2             3            0
 4         15          23            7              5            14            1
 5         31          67           18             16            27            7
 6         63         135           37             29            24           19
 7        127         355           90             71            25           16
 8        255         711          183            143            34           14
 9        511        1763          434            344            87           21
10       1023        3527          882            691           197           28
11       2047        8419         2151           1570          1346           61
12       4095       16839         4189           3225          2803          140
13       8191       39139        10549           7586         20833          396
14      16383       78279        21065          15996         42851          821
15      32767      178403        56744          37527        395223         3613
Table 5.4: Execution time for the grammars Q3 and Q4 on bottom-up binary trees.
Figure 5.7: Visualization of the bottom-up binary tree experiment results.
is an increase in execution time when using the ambiguous grammar to execute the
experiments. This happens because, at each step where there is a shift/reduce conflict,
our algorithm tries to parse a path via all actions defined in the parsing table for the
current state and the given input string, creating more pairs and connections in the GSS.
5.3 String graphs
In another experiment, we execute the same queries as the ones used in the binary tree
experiment, but this time over string graphs. In this case, the graph contains one valid
path from the first vertex to the last vertex, in the form shown in Figure 5.8.
1 −a→ 2 −a→ 3 −b→ 4 −b→ 5
Figure 5.8: String graph pattern used in the experiments.
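A string graph of this shape can be produced as a triple list. The helper below is illustrative, assuming the first half of the chain carries a-edges and the second half b-edges, as in Figure 5.8:

```python
# Generate the string graph of Figure 5.8: a single chain of vertices
# 1..n where the first half of the edges is labelled 'a' and the
# second half 'b', forming exactly one balanced path.

def string_graph(n_vertices):
    """Return the (v, label, v + 1) triples of the chain."""
    edges = n_vertices - 1
    return [(v, 'a' if v <= edges // 2 else 'b', v + 1)
            for v in range(1, n_vertices)]
```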
Analyzing the execution times in Figure 5.9, we observe that the algorithm
proposed by C. M. Medeiros [17] manages to take almost linear time to find the valid
paths, while our proposal tends to take polynomial time.
Figure 5.9: Visualization of the strings experiment results.
This happens because a graph like this requires multiple reductions for each path being
parsed, and the reductions are the most complex operations in our algorithm. We are also
querying from all nodes in the graph, which increases the number of valid paths found.
According to these experiments, our algorithm manages to execute the queries in
acceptable time, even though its asymptotic worst-case runtime complexity is high. In
the experiments, our algorithm outperformed all the other algorithms when executing
the suggested queries on the ontology databases. We also tried to execute the
experiments on complete graphs, where every node is connected to all the other nodes
in the graph by all of the terminals in the grammar, but the experiments took too long
to run. This increase in execution time happens because, in complete graphs, there is a
valid path from every node to every node, which causes the GSS to keep increasing the
number of pairs and connections at each of its levels. In the next chapter, we present
concluding remarks and suggestions for future work.
6 Conclusions
In this work we introduced a context-free path query algorithm for graph databases. The
proposed algorithm is inspired by the LR parsing algorithm [1, 3] and uses a variant of
the GSS structure, introduced in [30, 33], to enable the derivation of multiple paths at
the same time. A Python prototype was implemented and experiments were conducted
to validate and compare the results of our algorithm with those obtained by similar
approaches. We conducted three experiments, using four queries to evaluate our
algorithm's execution times.
In the first experiment, the ontologies used in [36], [7] and [17] were used as databases.
The main goal of this experiment was to investigate the feasibility of our method as well
as to compare our results with those works. In this experiment, our algorithm outperforms
all the other approaches, which indicates that it can be used to query data from real
applications with an acceptable execution time.
In the second and third experiments, synthetic data of different sizes were used to
investigate the scalability of our approach and compare it to [17]. With these experiments,
we discovered that our algorithm scales well when the user provides an unambiguous
grammar and/or when the user knows the subject from which the query will begin,
instead of starting the query over the whole database. We also need to consider that a
complete graph is unlikely to occur in real applications, due to its lack of information
value: a complete graph is simply all information connected to everything else. Even
though we were not able to execute the experiment on complete graphs, the overall
results of the experiments suggest that it is viable to use our proposed algorithm to
perform context-free queries on graphs in most of the existing scenarios.
Experimental results show that our algorithm behaves well and outperforms the related
works in real application scenarios, but it is costly in cases where the graph is complete or
the paths are too long. These cases require an elevated number of reduce operations, which
are the heaviest part of our algorithm's execution. Our algorithm is best suited to perform
queries that require few reductions and where the given grammar does not include
production rules with a large RHS.
The most important contributions of our work are:

(i) Analysis of the state of the art related to databases, graph databases and query
languages;

(ii) Adaptation of the GSS structure to manage information about multiple strings
simultaneously;

(iii) Proposal of an algorithm that allows querying graph databases using LR(1)
grammars;

(iv) Prototypes of the proposed algorithm.
During the execution of the experiments, we discovered that, even though our al-
gorithm needs improvements in order to achieve good scalability when processing large
complete graphs, it managed to perform well compared to the related works.
As future work, we suggest some improvements to our algorithm and the data struc-
tures used by it:
(i) The GSS_Up function can be modified to have a decreased time complexity,
significantly improving our algorithm’s performance;
(ii) Since the GSS structure builds and keeps much information about the connections
deduced between the data graph's vertices, one improvement to our algorithm may be
to allow the user to query for valid paths between nodes for any given non-terminal of
the grammar, as is done by C. M. Medeiros [17]. Currently, we only allow querying for
the start symbol of the grammar;

(iii) Our algorithm has not been optimized or refined in any way. Its scalability may
be improved, allowing it to parse even bigger graphs, by improving memory management
and allowing the concurrent processing of paths.
Bibliography
[1] A. Aho, M. Lam, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools,
ADDISON WESLEY Publishing Company Incorporated, 2007 (cit. on pp. 2, 13,
21, 31, 61).
[2] D. Ancona, C. F. Bolz, A. Cuni, A. Rigo, Automatic generation of JIT compilers
for dynamic languages in .NET, tech. rep., DISI, University of Genova and Institut
für Informatik, Heinrich-Heine-Universität Düsseldorf, 2008 (cit. on p. 53).
[3] A. Appel, M. Ginsburg, Modern Compiler Implementation in C: Basic Techniques,
Cambridge University Press, 1997 (cit. on pp. 12–16, 61).
[4] S. Bechhofer, M. Hauswirth, J. Hoffmann, M. Koubarakis, The Semantic Web: Re-
search and Applications: 5th European Semantic Web Conference, ESWC 2008,
Tenerife, Canary Islands, Spain, Springer Berlin Heidelberg, 2008 (cit. on p. 2).
[5] E. F. Codd, Commun. ACM June 1970, 13, 377–387 (cit. on p. 5).
[6] R. Elmasri, S. Navathe, Fundamentals of database systems, Benjamin/Cummings,
1989 (cit. on p. 6).
[7] S. Grigorev, A. Ragozina, arXiv preprint arXiv:1612.08872 2016 (cit. on pp. 31,
53–55, 61).
[8] D. Grune, C. Jacobs, Parsing Techniques: A Practical Guide, Springer New York,
2007 (cit. on p. 28).
[9] GSSLR JavaScript prototype, http://htmlpreview.github.io/?https://github.com/freddcs/gsslr/blob/master/index.htm,
Last access: February, 20th 2018, 2018 (cit. on p. 42).
[10] G. Harrison, Next Generation Databases: NoSQL and Big Data, Apress, 2015 (cit.
on p. 6).
[11] T. Heath, C. Bizer, Linked Data: Evolving the Web Into a Global Data Space, Morgan
& Claypool, 2011 (cit. on p. 2).
[12] J. Hellings, Conjunctive Context-Free Path Queries, (Eds.: N. Schweikardt, V. Christophides,
V. Leroy), OpenProceedings.org, 2014, pp. 119–130 (cit. on pp. 27, 29, 30).
[13] JSMachines: Collection of Javascript applications illustrating parsing algorithms,
http://jsmachines.sourceforge.net/machines/lr1.html, Last access: January
22nd, 2018, 2012 (cit. on p. 38).
[14] JSON-LD - A JSON-based Serialization for Linked Data, https://www.w3.org/TR/json-ld/,
Last access: February, 25th 2017, 2014 (cit. on p. 9).
[15] P. Linz, An Introduction to Formal Languages and Automata, Jones & Bartlett
Learning, 2016 (cit. on p. 12).
[16] A. Makris, K. Tserpes, V. Andronikou, D. Anagnostopoulos, Procedia Computer Sci-
ence 2016, 97, 2nd International Conference on Cloud Forward: From Distributed
to Complete Computing, 94–103 (cit. on pp. 1, 7).
[17] C. M. Medeiros, MA thesis, Universidade Federal do Rio Grande do Norte, 2018
(cit. on pp. 31, 32, 53–59, 61, 62).
[18] S. Muñoz, J. Pérez, C. Gutierrez in Proceedings of the 4th European Conference
on The Semantic Web: Research and Applications, Springer-Verlag, Innsbruck,
Austria, 2007, pp. 53–67 (cit. on p. 24).
[19] J. Pérez, M. Arenas, C. Gutierrez, Web Semantics: Science, Services and Agents on
the World Wide Web 2010, 8 (Semantic Web Challenge 2009; User Interaction in
Semantic Web research), 255–270 (cit. on pp. 11, 23–25).
[20] PRIMER RDF 1.1 Primer, https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/,
Last access: February, 8th 2017, 2014 (cit. on pp. 7, 8).
[21] A. Puntambekar, Formal Languages And Automata Theory, Technical Publications,
2009 (cit. on p. 12).
[22] RDF - Semantics Web Standards, https://www.w3.org/RDF/, Last access: Febru-
ary, 8th 2017, 2014 (cit. on pp. i, iii, 6, 8).
[23] RDFC - Concepts and Abstract Syntax, https://www.w3.org/TR/rdf11-concepts/,
Last access: February, 8th 2017, 2014 (cit. on p. 9).
[24] RDFS RDF Schema 1.1, https://www.w3.org/TR/rdf-schema/, Last access:
February, 8th 2017, 2014 (cit. on p. 8).
[25] L. Rietveld, Publishing and Consuming Linked Data: Optimizing for the Unknown,
IOS Press, 2016 (cit. on p. 1).
[26] A. Rigo, S. Pedroni, JIT Compiler Architecture, tech. rep. D08.2, PyPy, May 2007
(cit. on p. 53).
[27] I. Robinson, J. Webber, E. Eifrem, Graph Databases: New Opportunities for Con-
nected Data, O’Reilly Media, 2015 (cit. on p. 6).
[28] A. Satinder Bal Gupta, Introduction to Database Management System, Laxmi Pub-
lications, 2009 (cit. on p. 5).
[29] E. Scott, A. Johnstone, Electronic Notes in Theoretical Computer Science 2010,
253, Proceedings of the Ninth Workshop on Language Descriptions Tools and Ap-
plications (LDTA 2009), 177–189 (cit. on p. 31).
[30] E. Scott, A. Johnstone, S. S. Hussain, Tomita-Style Generalised LR Parsers, tech.
rep., Dec. 2000 (cit. on pp. i, iii, 32, 61).
[31] SPARQL 1.1 Overview, https://www.w3.org/TR/rdf-sparql-query/, Last
access: February, 10th 2017, 2013 (cit. on pp. 2, 10).
[32] Theory Of Automata, McGraw-Hill Education (India) Pvt Limited, 2010 (cit. on
p. 12).
[33] M. Tomita, Comput. Linguist. Jan. 1987, 13, 31–46 (cit. on p. 61).
[34] TriG - RDF Dataset Language, https://www.w3.org/TR/trig/, Last access:
February, 25th 2017, 2014 (cit. on p. 9).
[35] Turtle - Terse RDF Triple Language, https://www.w3.org/TR/turtle/, Last
access: February, 25th 2017, 2014 (cit. on p. 9).
[36] X. Zhang, Z. Feng, X. Wang, G. Rao, W. Wu in The Semantic Web – ISWC 2016:
15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016,
Proceedings, Part I, (Eds.: P. Groth, E. Simperl, A. Gray, M. Sabou, M. Krötzsch, F.
Lecue, F. Flöck, Y. Gil), Springer International Publishing, Cham, 2016, pp. 632–
648 (cit. on pp. 29, 30, 53–55, 61).