38
1 Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph eries Towards Understanding Modern Graph Processing, Storage, and Analytics MACIEJ BESTA, EMANUEL PETER, Department of Computer Science, ETH Zurich ROBERT GERSTENBERGER, Department of Computer Science, ETH Zurich MARC FISCHER, PRODYNA (Schweiz) AG MICHAŁ PODSTAWSKI, Future Processing CLAUDE BARTHELS, GUSTAVO ALONSO, Department of Computer Science, ETH Zurich TORSTEN HOEFLER, Department of Computer Science, ETH Zurich Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 45 graph database systems are presented and compared, including Neo4j, OrientDB, or Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases. CCS Concepts: General and reference Surveys and overviews; Information systems Data management systems; Graph-based database models; Data structures; DBMS engine architectures; Database query processing; Parallel and distributed DBMSs; Database design and models; Distributed database transactions; Theory of computation Data modeling; Data structures and algorithms for data management ; Distributed algorithms; • Computer systems organization → Distributed architectures; Additional Key Words and Phrases: Graphs, Graph Databases, NoSQL Stores, Graph Database Management Systems, Graph Models, Data Layout, Graph Queries, Graph Transactions, Graph Representations, RDF, Labeled Property Graph, Triple Stores, Key-Value Stores, RDBMS, Wide-Column Stores, Document Stores ACM Reference format: Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. 2019. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. 38 pages.

Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1

Demystifying Graph Databases: Analysis and Taxonomyof Data Organization, System Designs, and GraphQueriesTowards Understanding Modern Graph Processing, Storage, and Analytics

MACIEJ BESTA, EMANUEL PETER, Department of Computer Science, ETH ZurichROBERT GERSTENBERGER, Department of Computer Science, ETH ZurichMARC FISCHER, PRODYNA (Schweiz) AGMICHAŁ PODSTAWSKI, Future ProcessingCLAUDE BARTHELS, GUSTAVO ALONSO, Department of Computer Science, ETH ZurichTORSTEN HOEFLER, Department of Computer Science, ETH Zurich

Graph processing has become an important part of multiple areas of computer science, such as machinelearning, computational sciences, medical applications, social network analysis, and many others. Numerousgraphs such as web or social networks may contain up to trillions of edges. Often, these graphs are alsodynamic (their structure changes over time) and have domain-specific rich data associated with vertices andedges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving,and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing,these systems face unique design challenges. To facilitate the understanding of this emerging domain, wepresent the first survey and taxonomy of graph database systems. We focus on identifying and analyzingfundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, orobject-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organizationtechniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspectsof data distribution and query execution (e.g., support for sharding and ACID). 45 graph database systemsare presented and compared, including Neo4j, OrientDB, or Virtuoso. We outline graph database queriesand relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms).Finally, we describe research and engineering challenges to outline the future of graph databases.

CCS Concepts: • General and reference → Surveys and overviews; • Information systems → Datamanagement systems; Graph-based database models; Data structures; DBMS engine architectures;Database query processing; Parallel and distributed DBMSs; Database design and models; Distributeddatabase transactions; • Theory of computation→ Data modeling; Data structures and algorithms for datamanagement; Distributed algorithms; • Computer systems organization → Distributed architectures;

Additional Key Words and Phrases: Graphs, Graph Databases, NoSQL Stores, Graph Database ManagementSystems, Graph Models, Data Layout, Graph Queries, Graph Transactions, Graph Representations, RDF,Labeled Property Graph, Triple Stores, Key-Value Stores, RDBMS, Wide-Column Stores, Document Stores

ACM Reference format:Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels,Gustavo Alonso, and Torsten Hoefler. 2019. Demystifying Graph Databases: Analysis and Taxonomy of DataOrganization, System Designs, and Graph Queries. 38 pages.

Page 2: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:2 M. Besta et al.

1 INTRODUCTIONGraph processing is behind many computational problems in medicine, machine learning, compu-tational sciences, and others [95, 112]. Several properties of graph computations such as irregularcommunication patterns or little locality [112, 165], combined with the sheer size of graph datasets(up to trillions of edges [20, 46, 110, 151]), make graph algorithms inherently difficult to design.The difficulties are increased by the fact that many such graphs are also dynamic (their structurechanges over time) and have rich properties associated with vertices and edges.

Graph databases1 such as Neo4j [148] emerged to enable storing, processing, and analyzing large,evolving, and rich graph datasets. In graph databases, contrarily to traditional relational databases,the underlying data layout usually does not follow the fixed schema based on tables that implementrelations. Instead, the layout in graph databases often uses the fact that a graph consists of verticesand edges between vertices. Although a graph can also be modeled with tables corresponding tovertices and edges, various graph algorithms or queries, for example traversals of a graph, benefitfrom storing a graph as an adjacency list array, where the neighbors of each vertex can be accessedwith a simple memory lookup through a pointer attached to each vertex [148].

Graph databases face unique challenges due to the sheer size of graph datasets, the irregularnature of graph processing, and the demand for low latency and high throughput of graph queriesthat can be both local (i.e., accessing or modifying a small part of the graph, for example a singleedge) and global (i.e., accessing or modifying a large part of the graph, for example all the edges).Many of these challenges belong to the following areas: “general design” (i.e., what is the mostadvantageous general structure of a graph database engine), “data models and organization” (i.e.,how to model and store the underlying graph dataset), “data distribution” (i.e., whether and how todistribute the data across multiple servers), and “transactions and queries” (i.e., how to query theunderlying graph dataset to extract useful information). This distinction is illustrated in Figure 1.In this work, we present the first survey and taxonomy on these system aspects of graph databases.In general, we provide the following contributions:• We provide the first taxonomy of graph databases, identifying and analyzing key dimen-sions in the design of graph databases: (1) general database engine, (2) data model, (3) dataorganization, (4) data distribution, (5) query execution, and (6) type of transactions.• We use our taxonomy to survey, categorize, and compare 45 graph database systems.• We discuss in detail the design of selected graph databases.• We outline related domains, such as queries and workloads in graph databases.• We describe research and engineering challenges to outline the future of graph databases.

Graph Databases vs. NoSQL Stores and Other Database Systems NoSQL stores addressvarious deficiencies of relational database systems. An example such deficiency is little supportfor flexible data models [54]. Graph databases such as Neo4j can be seen as one particular type ofNoSQL stores; these systems are sometimes referred to as “native” graph databases [148]. Othertypes of NoSQL systems include key-value stores, wide-column stores, and document stores [54]. Inthis survey, we focus on any database system that enables storing and processing graphs, includingnative graph databases and other types of NoSQL stores, relational databases, object-orienteddatabases, and others. Figure 2 shows the types of considered systems.

1Lists of graph databases can be found at http://nosql-database.org, https://database.guide,https://www.g2crowd.com/categories/graph-databases, https://www.predictiveanalyticstoday.com/top-graph-databases,and https://db-engines.com/en/ranking/graph+dbms.

Page 3: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:3

This symbol indicatesthat a given categoryis surveyed in another

publication

Integrity constraintsin graph databases...

Query languagesin graph databases...

Data modelsin graph databases...

History ofgraph databases...

Compressinggraph databases...

Relateddomainscoveredin othersurveys

Object-oriented

databases(§ 4.7)

Data hubs(§ 4.9)

Native graphsystems (§ 4.8)

Graphdata

models(§ 2.1)

RDBMSfor graphs

(§ 4.6)Document

stores (§ 4.4)

Key-valuestores(§ 4.3)

Wide-columnstores(§ 4.5)

RDF stores(§ 4.1)

Tuplestores (§ 4.2)Design details

of selected graphdatabases (§ 4)

Queries andworkloads (§ 5)

Graphdatabases

Data models andrepresentations (§ 2)

Graphrepresentations

(§ 2.2)

Non-graphdata models

(§ 2.3)

Datadistribution

(§ 3.4)

Queryexecution

(§ 3.5)

Transactiontypes (§ 3.6)

Datamodel(§ 3.2)

Dataorganization

(§ 3.3)

Generalsystem

category(§ 3.1)

Taxonomy andkey dimensionsof systems (§ 3)

Languagesupport(§ 3.7)

Fig. 1. The illustration of the considered areas of graph databases.

Hierarchicaland networksystems

Relationalsystems

Object-orientedsystems

NoSQLsystems

NewSQLsystems

Key-valuestores

Documentstores

Wide-columnstores

Nativegraphstores

Tuplestores

RDFsystems

No longeractively

developed

The focus of this survey: anysystems used as graph databases.

We consider native graph storesand parts of other domains relatedto storing and processing graphs

RDF systems canbe implementedas NoSQL or as

traditional RDBMsystems

Types of database systems

Fig. 2. The illustration of the considered types of databases.

Graph Databases vs. Graph Streaming Frameworks In graph streaming, the input graph ispassed as a stream of updates, allowing to add and remove edges in a simple way. Graph databasesare related to graph streaming in that they face graph updates of various types. Still, they usuallydeal with complex graph models (such as the Labeled Property Graph [3] or Resource DescriptionFramework [50]) where both vertices and edges may be of different types and may be associatedwith arbitrary properties. Contrarily, graph streaming frameworks such as STINGER [64] focuson simple graph models where edges or vertices may have weights and, in some cases, simpleadditional properties such as time stamps. Moreover, challenges in the design of graph databasesinclude transactional support, a topic unrelated to graph streaming frameworks.

Page 4: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:4 M. Besta et al.

What Is the Scope of Related Surveys? There exist several surveys dedicated to the theory ofgraph databases. In 2008, Angles et al. [5] described the history of graph databases, and, in particular,the used data models, data structures, query languages, and integrity constraints. In 2017, Angleset al. [3] analyzed in more detail query languages for graph databases, taking both an edge-labeledand a property graph model into account and studying queries such as graph pattern matching andnavigational expressions. Moreover, there are surveys that focus on NoSQL stores [54, 71, 81] andRDF [133]. There is no survey dedicated to the systems aspects of graph databases, except for severalbrief papers that cover small parts of the domain [97, 99, 104, 107, 119, 120, 135, 138, 152, 169, 170].

2 GRAPHS AND GRAPH MODELS IN THE LANDSCAPE OF GRAPH DATABASESWe start with graphmodels (§ 2.1) and representations (§ 2.2) used in graph databases.We summarizethe key symbols and abbreviations in Table 1.

G,M A graph G = (V ,E) and its adjacency matrix; V and E are sets of vertices and edges.n,m Numbers of vertices and edges in a graph G; |V | = n, |E | =m.d, d The average degree and the maximum degree in a given graph, respectively.P (S ) = 2S The power set of S (the set of all subsets of S).AM The Adjacency Matrix representation. M ∈ {0, 1}n,n ,Mu,v = 1⇔ (u,v ) ∈ E.AL, Au The Adjacency List representation and the adjacency list of a vertex u; v ∈ Au ⇔ (u,v ) ∈ E.LPG, RDF Labeled Property Graph (§ 2.1.3) and Resource Description Framework (§ 2.1.5).KV, RDBMS Key-Value store (§ 4.3) and Relational Database Management Systems (§ 4.6).OODBMS Object-Oriented Database Management Systems (§ 4.7).OLTP, OLAP Online Transaction Processing (§ 3.6) and Online Analytics Processing (§ 3.6).ACID Transaction guarantees (Atomicity, Consistency, Isolation, Durability).

Table 1. The most important symbols and abbreviations used in the paper.

2.1 Graph ModelsWe start with graph models used by the surveyed systems.

2.1.1 Simple Graph Model. A graphG can be modeled as a tuple (V ,E) (denoted also asG (V ,E));V is a set of vertices and E ⊆ V ×V is a set of edges; |V | = n and |E | =m. If G is directed, an edgee = (u,v ) ∈ E is a tuple of two vertices, where u is the out-vertex (source) and v is the in-vertex(destination). IfG is undirected, an edge e = {u,v} ∈ E is a set of two vertices. IfG is weighted, it ismodeled with a triple (V ,E,w );w : E → R maps edges to weights.A simple graph model is often used in graph processing frameworks such as Pregel [113] or

STINGER [64]. However, it is not commonly used with graph databases. Instead, it is a basis formore complex models, such as the Labeled Property Graph.

2.1.2 Hypergraph Model. A hypergraph H generalizes a simple graph in that any of its edgescan join any number of vertices. Formally, a hypergraph is also modeled as a tuple (V ,E); V is a setof vertices and E ⊆ (P (V ) \ ∅) is a set of hyperedges: non-empty subsets of V .

Hypergraphs are rarely used in graph databases and graph processing systems. In this survey, wedescribe a system called HyperGraphDB (§ 4.3.2) that focuses on storing and querying hypergraphs.

2.1.3 Labeled Property Graph Model. The classical graph model, a tuple G = (V ,E), is adequatefor many problems such as computing vertex centralities [37]. However, it is not rich enough tomodel various real-world problems. This is why graph databases often use the Labeled PropertyGraph Model (LPG), sometimes simply called the property graph [3]. In LPG, one augments thesimple graph model (V ,E) with labels that define different subsets (or classes) of vertices and

Page 5: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:5

edges. Furthermore, every vertex and edge can have an arbitrary number of properties (often alsocalled attributes). A property is a pair (key,value ), where key identifies a property and value is thecorresponding value of this property. Formally, an LPG is defined as a tuple

(V ,E,L, lV , lE ,K ,W ,pV ,pE )

L is the set of labels. lV : V 7→ P (L) and lE : E 7→ P (L) are labeling functions. Note that P (L) isthe power set of L, denoting all the possible subsets of L. Thus, each vertex and edge is mapped toa subset of labels. Next, vertices and edges can have an arbitrary number of properties. We model aproperty as a key-value pair p = (key,value ), where key ∈ K and value ∈W . K andW are setsof all possible keys and values. Finally, pV (u) denotes the set of property key-value pairs of thevertex u, pE (e ) denotes the set of property key-value pairs of the edge e . An example LPG is inFigure 3. All systems considered in this work use some variant of the LPG, with the exception ofRDF systems or when explicitly discussed.

2.1.4 Variants of Labeled Property Graph Model. Several databases support variants of LPG. First,Neo4j [148] (a graph database described in detail in § 4.8.1) supports an arbitrary number of labelsfor vertices. However it only allows for one label, (called edge-type), per edge. Next, ArangoDB [10](a graph database described in detail in § 4.4.2) only allows for one label per vertex (vertex-type) andone label per edge (edge-type). This facilitates the separation of vertices and edges into differentdocument collections. Finally, edge-labeled graphs [3] do not allow for any properties and uselabels in a restricted way. Specifically, only edges have labels and each edge has exactly one label.Formally, G = (V ,E,L), where V is the set of vertices and E ⊆ V × L ×V is the set of edges. Notethat this definition enables two vertices to be connected by multiple edges with different labels.

:Personname = Alice

age = 21:knows

since = 09.08.2007

:Personname = Bob

age = 24

:Message:Post

title = Holidaystext = We had...

:hasCreator

:Message:Comment

text = Wow! ...

:hasCreator

:replyOf

Fig. 3. The illustration of an example Labeled Property Graph (LPG). Vertices and edges can have labels (bold,prefixed with colon) and properties (key = value). We present a subgraph of a social network, where a person can knowother persons, post messages, and comment on others’ messages.

2.1.5 Resource Description Framework (RDF). The Resource Description Framework (RDF) [50] isa collection of specifications for representing information. It was introduced by the World WideWeb Consortium (W3C) in 1999 and the latest version (1.1) of the RDF specification was publishedin 2014. Its goal is to enable a simple format that allows for easy data exchange between differentformats of data. It is especially useful as a description of irregularly connected data. The corepart of the RDF model is a collection of triples, each consisting of a subject, a predicate, and anobject. Thus, RDF databases are also often called triple stores (or triplestores). Subjects can either beidentifiers (called Uniform Resource Identifiers (URIs)) or blank nodes (which are dummy identifiersfor internal use). Objects can be URIs, blank nodes, or literals (which are simple values). Withtriples, one can connect identifiers with identifiers or identifiers with literals. The connections arenamed with another URI (the predicate). RDF triples can be formally described as

(s,p,o) ∈ (URI ∪ blank ) × (URI ) × (URI ∪ blank ∪ literal )

Page 6: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:6 M. Besta et al.

s represents a subject, p models a predicate, and o represents an object. URI is a set of UniformResource Identifiers; blank is a set of blank node identifiers, that substitute internally URIs to allowfor more complex data structures; literal is a set of literal values [84, 133].

2.1.6 Transformations between LPG and RDF. To represent a Labeled Property Graph in the RDFmodel, LPG vertices are mapped to URIs (❶) and then RDF triples are used to link those vertices withtheir LPG properties by representing a property key and a property value with, respectively, an RDFpredicate and an RDF object (❷). For example, for a vertex with an ID vertex-id and a correspondingproperty with a key property-key and a value property-value, one creates an RDF triple (vertex-id,property-key, property-value). Similarly, one can represent edges from the LPG graph model in theRDF model by giving each edge the URI status (❸), and by linking edge properties with specificedges analogously to vertices: (edge-id, property-key, property-value) (❹). Then, one has to use twotriples to connect each edge to any of its adjacent vertices (❺). Finally, LPG labels can also betransformed into RDF triples in a way similar to that of properties [94], by creating RDF triples forvertices (❻) and edges (❼) such that the predicate becomes a “label” URI and contains the stringname of this label. Figure 4 shows an example of transforming an LPG graph into RDF triples.

V-ID

type

from to

21 24Alice Bob09.08.2007

agename name

agesince

knows

label

Personlabellabel

LPG graph

RDF graph

:Personname = Bob

age = 24

:Personname = Alice

age = 21

:knowssince = 09.08.2007

1

2

3

4

57

6

5

Numbers in black circles correspond to thetransformation steps describes in § 2.1.5

vertex vertex

edge

E-IDtype

type

V-ID

Fig. 4. Comparison of an LPG and an RDF graph: a transformation from LPG to RDF. “V-ID”, “E-ID”, “age”, “name”,“type”, “from”, “to”, “since” and “label” are RDF URIs. Numbers in black circles correspond to the transformation stepsdescribes in § 2.1.6.

If all vertices and edges only have one label, one can omit the triples for labels and store the label(e.g., “Person”) together with the vertex or the edge name (“V-ID” and “E-ID”) in the identifier. Weillustrate a corresponding example in Figure 5.

Transforming RDF data into the LPG model is more complex, since RDF predicates, which wouldnormally be translated into edges, are URIs. Thus, while deriving an LPG graph from an RDF graph,one must map edges to vertices and link such vertices, otherwise the resulting LPG graph maybe disconnected. There are several schemes for such an RDF to LPG transformation, for examplederiving an LPG graph which is bipartite, at the cost of an increased graph size [84]. Details andexamples are provided in a report by Hayes [84].

2.2 Fundamental Graph RepresentationsWe also summarize fundamental graph representations. A graphG can be represented in variousways. Three common representations are the adjacency matrix format (AM), the adjacency listformat (AL), and the edge list format (EL). We illustrate these representations in Figure 6.In the AM format, a matrix M ∈ {0, 1}n,n determines the connectivity of vertices:

Page 7: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:7

RDF graph

Person/V-ID knows/E-ID Person/V-IDfrom to

21 24Alice Bob09.08.2007

agename name

agesince

LPG graph

:Personname = Bob

age = 24

:Personname = Alice

age = 21

:knowssince = 09.08.2007

vertextype

vertexedge

type type

Fig. 5. Comparison of an LPG and an RDF graph: a transformation from LPG to RDF, given vertices and edges haveonly one label. “Person/V-ID”, “knows/E-ID”, “age”, “name”, “type”, “from”, “to” and “since” are RDF URIs.

Mu,v = 1⇔ (u,v ) ∈ E.

In the AL format, each vertex u has an associated adjacency listAu . This adjacency list maintainsthe IDs of all vertices adjacent to u. Each adjacency list is often stored as a contiguous array ofvertex IDs. We have:

v ∈ Au ⇔ (u,v ) ∈ E.

AM uses O(n2)space and can check connectivity of two vertices in O (1) time. AL requires

O (n +m) space and it can check connectivity in O ( |Au |) ⊆ O(d)time.

EL is similar to AL in the asymptotic time and space complexity as well as the general design.The main difference is that each edge is stored explicitly, with both its source and destination vertex.In AL and EL, a potential cause for inefficiency is scanning all edges to find neighbors of a givenvertex. To alleviate this, index structures are employed [32].

Many graph databases (e.g., OrientDB (§ 4.4.1), MS Graph Engine (§ 4.3.1), Dgraph [56], Janus-Graph (§ 4.5.1), Neo4j (§ 4.8.1)) use variants of AL since it makes traversing neighborhoods efficientand straightforward [148]. No systems that we analyzed use an uncompressed AM as it is inefficientwith O (n2) space, especially for sparse graphs. Systems using AM focus on compression of theadjacency matrix [27], trying to mitigate storage and query overheads (e.g., GBase § 4.8.3).

2.3 Non-Graph Data Models Used in Graph DatabasesThere exist data models that do not target specifically graphs but are used in various systemsto model and store graphs. These models include collections of key-value pairs, documents, andtuples (used in different types of NoSQL stores), relations and tables (used in traditional relationaldatabases), and objects (used in object-oriented databases). Different details of these models and thedatabase systems based on them are described in other surveys, for example in a recent publicationon NoSQL stores by Davoudian et al. [54]. Thus, we omit extensive discussions and instead offerbrief summaries, focusing on how they are used to model or represent graphs.

2.3.1 Collection of Key-Value Pairs. Key-value stores are the simplest NoSQL stores [54]. Here,the data is stored as a collection of key-value pairs, with the focus on high-performance and highly-scalable lookups based on keys. The exact form of both keys and values depends on a specificsystem or an application. Keys can be simple (e.g., an URI or a hash) or structured. Values are often

Page 8: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:8 M. Besta et al.

n: number of verticesm: number of edgesd: maximum graph degree

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 1 0 0 0 0 00 0 0 0 0 0 0 0 0 0 1 1 0 0 0 00 0 0 0 0 0 0 0 0 0 0 1 0 0 0 00 0 0 0 0 0 0 0 0 1 0 0 0 0 0 00 0 0 0 0 0 0 0 1 1 1 1 0 0 0 10 0 0 0 0 0 0 0 1 1 0 1 0 1 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 1 1 0 0 0 1 0 0 0 0 10 0 0 0 1 1 1 0 0 0 1 1 0 0 0 00 1 1 0 0 1 0 0 1 1 0 0 0 1 0 10 0 1 1 0 1 1 0 0 1 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 1 10 0 0 0 0 0 1 0 0 0 1 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 1 0 0 10 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0

1

n

...

1 ... n

123456789

10111213141516

1

n

...

1

∅11

121112109 101112169 10121416

166 7 11165 6 7 11122 3 6 9 10 16143 4 6 7 1016

16157 1116

16136 7 8 9 11 1312 1514

d...

The number ofall elements inthe adjacencydata structure:2m (undirectedgraph) and m

(directed graph)

...

Numberof tuples:

23

456

7

16

1111

1210910111216910121416

15

3 12

6666

7777

...

2

3

45

6

7

16

11

11

1210

910

11

12

16

910

12

14

16

15

3 12

6

6

6

6

7

7

7

71

...

2m (undirected),m (directed)

123456789

10111213141516

1

n

...

Neighborhoodscan be sortedor unsorted

...

1

2morm

...

An n x n matrix

Unweighted graph: a cell is one bit

Pointers from verticesto their neighborhoods

Neighbor-hoods canbe arbitrarystructures,e.g., arrays

or lists

Weighted graph: a cell is one integer

Pointers fromvertices to theirneighborhoods

2morm

Onetuple

corres-pondsto oneedge

Offs

et a

rray

is o

ptio

nal

Adjacency Matrix

Adjacency List Edge List (sorted & unsorted)

INPUT GRAPH:

Fig. 6. Illustration of fundamental graph representations: Adjacency Matrix, Adjacency List, and Edge List.

encoded as byte arrays (i.e., the structure of values is usually schema-less). However, a key-valuestore can also impose some additional data layout, structuring the schema-less values [54].

Due to the general nature of key-value stores, there can be many ways of representing a graphas a collection of KV values. We describe several concrete example systems in § 4.3. For example,one can use vertex labels as keys and encode the neighborhoods of vertices as values.

2.3.2 Collection of Documents. A document is a fundamental storage unit in a class of NoSQLdatabases called document stores [54]. These documents are stored in collections. Multiple collec-tions of documents constitute a database. Documents are encoded in some standard semi-structuredformat such as XML [39] or JSON [38]. Document stores extend key-value stores in that a documentcan be seen as a value that has a certain flexible schema. This schema consists of attributes, where

Page 9: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:9

each attribute has a name along with one or more values. Such a structure based on documentswith attributes allows for various value types, key-value pair storage, and recursive data storage(attribute values can be lists or key-value dictionaries).

In all surveyed systems, each vertex is stored in a vertex document. The capability of documentsto store key-value pairs is used to store vertex labels and properties within the correspondingvertex document. The details of edge storage, however, is system-dependent: edges can be storedin the document corresponding to the source vertex of each edge, or in the documents of thedestination vertices. As documents do not impose any restriction on what key-value pairs can bestored, vertices and edges may have different sets of properties.

2.3.3 Collection of Tuples. Tuples are a basis of NoSQL stores called tuple stores. A tuple storegeneralizes an RDF store: RDF stores are restricted to triples (or – in some cases – 4-tuples, alsoreferred to as quads) whereas tuple stores can contain tuples of an arbitrary size. Thus, the numberof elements in a tuple is not fixed and can vary, even within a single database. Each tuple has an IDwhich may also be a direct memory pointer.

A collection of tuples can model a graph in different ways. For example, one tuple of size n canstore pointers to other tuples that contain neighborhoods of vertices. The exact mapping betweensuch tuples and graph data is specific to different databases; we describe examples in § 4.2.

2.3.4 Collection of Tables. Tables are the basis of Relational Database Management Systems(RDBMS) [14, 49, 85]. Tables consist of rows and columns. Each row represents a single data element,for example a car. A single column usually defines a certain data attribute, for example the color ofa car. Some columns can define unique IDs of data elements, called primary keys. Primary keys canbe used to implement relations between data elements. A one-to-one or a one-to-many relationcan be implemented with a single additional column that contains the copy of a primary key of therelated data element (such primary key copy is called the foreign key). A many-to-many relationcan be implemented with a dedicated table containing foreign keys of related data elements.

To model a graph as a collection of tables, one can implement vertices and edges as rows in twoseparate tables. Each vertex has a unique primary key that constitutes its ID. Edges can relate totheir source or destination vertices by referring to their primary keys (as foreign keys). LPG labelsand properties, as well as RDF predicates, can be modeled with additional columns [174, 176].

2.3.5 Collection of Objects. Finally, one can use collections of objects in Object-Oriented Data-base Management Systems (OODBMS) [13] to model graphs. Here, data elements and their relationsare implemented as as objects linked with some form of pointers. The details of modeling graphs asobjects heavily depend on specific designs. We provide these details for selected systems in § 4.7.

3 TAXONOMY OF GRAPH DATABASE SYSTEMSWe now describe how we categorize graph database systems considered in this survey. The mainidentified aspects are: (1) general types of database systems (§ 3.1), (2) supported data models (§ 3.2),(3) used forms of data organization (§ 3.3), (4) enabled methods for distributing data (§ 3.4), (5)provided schemes for query execution (§ 3.5), and (6) supported transaction types (§ 3.6). Figure 7illustrates the general types of considered databases together with certain aspects of data modelsand organization. Figure 8 summarizes all elements of the proposed taxonomy.

3.1 Types of Graph Database SystemsWe survey existing graph database systems [2, 8, 10, 11, 15, 34, 40, 42, 52, 56, 68, 70, 77, 92, 100, 109,115, 117, 123, 127, 129, 130, 139, 145–148, 157, 166, 172–175, 175] and we identify types of graphdatabases that differ in their general design. Some systems use a certain backend technology,

Page 10: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:10 M. Besta et al.

adapting this backend to storing graph data, and adding a frontend to query the graph data.Examples of such systems are ❶ RDF systems (also called triple stores), ❷ tuple stores, ❸ docu-ment stores, ❹ key-value stores, ❺ wide-column stores, ❻ Relational Database ManagementSystems (RDBMS), or ❼ Object-Oriented Database Management Systems (OODBMS). Several ofthese categories of systems fall into the domain of NoSQL stores (❶ – ❺; RDF systems can also beimplemented as, e.g., RDBMS). Some graph databases are designed specifically for maintaining andquerying graphs; we call such systems ❽ native graph databases (or native graph stores); they arealso often categorized as NoSQL stores [54]. Finally, we consider a group of designs called the ❾data hub; they enable using many different storage backends, facilitating storing data in differentformats and models. Figure 7 illustrates these systems, they are discussed in more detail in § 4.

3.2 Data ModelsWe also investigate what data models are supported by different graph databases. Here, we use themodels described in § 2. In addition, we call a system Multi Model if it allows for more than onedata model, for example when it directly supports both LPG and RDF.

3.3 Data OrganizationNext, while surveying databases, we consider different aspects of data organization.

3.3.1 Dividing Data into Records. Many systems (called record based) divide data into small unitscalled records. A certain number of records is kept together in one contiguous block in memory ordisk to enhance data access locality. Some systems allow variable sized records (e.g., ArangoDB),others only enable fixed sized records (e.g., Neo4j). Next, key-value stores usually maintain a singlevalue in a single record, while in document stores a document is a record.

Systems that use records for graph storage often use one or more records per vertex (theserecords are sometimes referred to as vertex records). Neo4j uses multiple fixed-size records forvertices, document database use one document per vertex (e.g., ArangoDB). Edges are sometimesstored in the same record together with the associated (source or destination) vertices (e.g., Titanor JanusGraph). Otherwise, edges are stored in separate edge records (e.g., ArangoDB).

3.3.2 Storing Data in Index Structures. Some databases (called index based) store data in indexstructures that are used to maintain locations of different parts of the data for faster data access.In such cases, the index does not point to a certain data record but the index itself contains thedesired data. Example systems with such functionality are Sparksee/DEX and Cray Graph Engine.To maintain indices, the former uses bitmaps and B+ trees while the latter uses hash tables.

3.3.3 Enabling Lightweight Edges. Some systems (e.g., OrientDB) allow edges without labels orproperties to be stored as lightweight edges. Such edges are stored in the records of the correspondingsource and/or destination vertices. These lightweight edges are represented by the ID of theirdestination vertex, or by a pointer to this vertex. This can save storage space and accelerate resolvingdifferent graph queries such as verifying connectivity of two vertices [41].

3.3.4 Linking Records with Direct Pointers. In record based systems, vertices and edges are storedin records. To enable efficient resolution of connectivity queries (i.e., verifying whether two verticesare connected), these records have to point to other records. One option is to store direct pointers(i.e., memory addresses) to the respective connected records. For example, an edge record can storedirect pointers to vertex records with adjacent vertices. Another option is to assign each recorda unique ID and use these IDs instead of direct pointers to refer to other records. On one hand,this requires an additional indexing structure to find the physical location of a record based on its

Page 11: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:11

Vertices and edges are encoded invalues and indexed by keys (IDs)

Subject URIs are linked toobject URIs via predicates

Vertices and edges are stored intuples, linked via pointers or IDs

of other tuples

Vertices and edges are encodedin documents (e.g., JSON) and

linked via pointers or document IDs

Combines multiple models and/or storage schemes

Custom database systems,optimized for graph storage

and traversal queries.

Vertices and edges are storedin Java, C#, ... language objects

Vertices and edges are stored inrows of two row-oriented tables

A vertex is stored in a row andit is indexed by a unique ID; itsproperties, labels, and adjacent

edges are stored in row cells

Examples: Cayley, InfoGrid, MarkLogicOpenLink Virtuoso, Stardog

Examples:AllegroGraph,Cray Graph

Engine

Examples: Dgraph,HyperGraphDB,

MS Graph Engine

Examples:WhiteDB, Graphd

Examples: Titan,JanusGraph, DSE Graph,

Examples: OrientDB,ArangoDB, Azure

Cosmos DB, FaunaDB

Examples:Sparksee/DEX,

TigerGraphGraphBase,Memgraph,

Neo4j

Examples:GraphObjectivity

ThingSpan,Velocity

ID

ID

ID

vertex/edge

vertex/edge

vertex/edge

Keys Values

URIp

p

p

(Subj, Pred, Obj)

(Subj, P

red, O

bj)

URI

URIURI

(Subj, Pred, Obj)

vertex/edge

vertex/edge

vertex/edge

Pointersor IDs

ID

Row (vertex)

prop edge edgeedgeprop

ID

ID

Key

prop edgeedge

prop edgeprop prop

ID

DocumentID

ID

ID

Table withvertices

Table withedges

Wide-Column Store

Row RDBMS

Key-Value Store

Document Store

Native Graph Store

Tuple Store

RDF Store (Triple Store)

Object-Oriented DBMS

Data Hub

Differentrecords

Differentrecords

Model used:

tables (imple-

menting

relations)

Records forming a row are storedcontiguously in memory or on disk

Example: OracleSpatial and Graph

One column can implementa property, a label, or an

ID (primary or foreign key)

Vertices and edges are stored inrows of two column-oriented tables

Table withvertices

Table withedges

Column RDBMS

Differentrecords

Differentrecords

Model used:

tables (imple-

menting

relations)

Records forming a column are storedcontiguously in memory or on disk

Example:SAP HANA

One column can implementa property, a label, or an

ID (primary or foreign key)

An opaque value contains properties,labels, adjacent vertices and edges

Model used:

pairs of

keys and

values

Model

used:

key-value

pairs and

tables

One cell containsa key-value pair

One valueoften formsone record

A documentoften formsone record

vertex /edge

vertex/edge

vertex/edge

Document(JSON / XML)

Attributes implementproperties and labels

attr attr attr

attr

attr attr attr attr

Model

used:

documents

Model

used:

triples

Triples canform records

Model

used:

tuples

Tuplescontain

propertiesand labels

Division intorecords depends

on a system

Model

used:

several

different

ones

+ ++...

Details of dataorganization are

system-dependant

Model

used:

objects

Model used:

Labeled

Property

Graph

Details of data organization aresystem-dependant. Adjacency

information is explicitly maintainedto accelerate graph traversals.

NoSQL

stores

OO

DBM

S

RD

BM

S

Fig. 7. Overview of different categories of graph database systems, with examples.

Page 12: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:12 M. Besta et al.

• Yes (fully)• Yes (partially)• No

• OLAP• OLTP• OLAP & OLTP

Is ACID supported?

ACID

What processing typeis supported?

Processing Type

• Yes• No

Is data shardingsupported?

Data Sharding

Can system run in adistributed mode?

Distributed Mode

• Yes• No

• Yes• No

Is data replicationsupported?

Data Replication

Data Distribution

Query Execution

Can multiple queriesbe run concurrently?

Concurrent Execution

• Yes• No

• Yes• No

Can a single querybe parallelized?

Parallelization

• RDF store• Tuple store• Document store• Key-value store• Wide-column store• RDBMS• OODBMS• Native graph store• Data hub

• RDF triples• LPG• Tuples• Documents• Key-value pairs• Tables• Objects• Hypergraph• Multiple models

What is a generaltype of a database?

Types of Systems

What models of graphdata are supported?

Data Models

Taxonomy of Graph Databases

Transaction Support

• Fixed sized• Variable sized

• With direct pointers• With IDs or references

What types of recordsare supported?

Types of Records

How are recordslinked together?

Linking Records

• Yes• No

Is there support forstoring data withinindex structures?

Index StructuresWhat representationsof graphs are used?

Representations

• Adjacency matrix• Adjacency list• Edge list

• Yes• No

• Within vertex records• Within edge records

Is there support forlightweight edges?

Lightweight Edges

How are edges stored?

Edge Records

Data Organization

• SPARQL• Gremlin• Cypher• SQL• GraphQL• Other

What graph databasequery languageis supported?

Language Support

Fig. 8. Overview of the identified taxonomy of graph databases.

Page 13: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:13

ID. On the other hand, if the physical location changes, it is usually easier to update the indexingstructure instead of changing all associated direct pointers.A given system uses index-free adjacency when it only uses direct pointers and no index to

traverse records and thus to traverse the graph. Note that an index may still be used to find a vertex;using index-free adjacency means that only the structure of the adjacency data is index-free. Usingdirect pointers can accelerate graph traversals [148], as additional index traversals are avoided.However, when the adjacency data needs to be updated, usually a large number of pointers need tobe updated as well, generating additional overhead [11].

3.4 Data DistributionWe call a system distributed or multi-server if it can run on multiple servers (also called computenodes) connected with a network. In such systems, data may be replicated [72] (maintaining copiesof the dataset at each server), or it may allow for sharding [141] (data fragmentation, i.e., storing onlya part of the given dataset on one server). Replication often allows for more fault tolerance [140],sharding reduces the amount of used memory per node and can improve performance [140].

3.5 Query ExecutionQuery execution on multiple servers can be enabled in various ways. We define concurrent executionas the execution of separate queries at the same time. Concurrent execution of queries can lead tohigher throughput. We also define parallel execution as the parallelized execution of a single query,possibly on more than one server or compute node. Parallel execution can lead to lower latenciesfor queries that can be parallelized.

3.6 Types of TransactionsMany graph database systems support transactions.

3.6.1 Support for ACID. ACID [87] (Atomicity, Consistency, Isolation, Durability) is a well-known set of properties that database transactions uphold in many database systems. Differentgraph databases explicitly ensure some or all of the ACID properties.

3.6.2 Support for OLTP vs. OLAP. Some systems are oriented towards the Online TransactionProcessing (OLTP) [137], where focus is on executing many smaller, interactive, transactionalqueries. Other systems are designed for the Online Analytics Processing (OLAP) [137]: they executeanalytics queries that span the whole graphs, usually taking more time than OLTP operations.Analytics queries are often parallelized to minimize their latency. In this survey, we focus on theOLTP systems and not on OLAP graph processing engines. However, some of the systems wesurvey also provide certain OLAP capabilities, in addition to their OLTP functionalities.

3.7 Query Language SupportAlthough we do not focus on graph database languages, we report which query languages aresupported by each considered graph database system. We consider the leading languages such asSPARQL [136], Gremlin [149], Cypher [69, 88], and SQL [53]. We also mention other system-specificlanguages such asGraphQL [83] and support for APIs from languages such as C++ or Java.

4 DATABASE SYSTEMSWe survey and describe selected graph database systems with respect to the proposed taxonomy. Ineach system category, we describe several example systems, focusing on the associated graph model,data and storage organization, data distribution, and query execution. We select the described

Page 14: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:14 M. Besta et al.

Graph Database System Model Records Storing Edges Data Distribution &Query Execution Additional remarksMM LPG RDF FS VS DP AL SE SV LW MN RP SH CE PE TR OLTP OLAP

❶ RDF STORES (TRIPLE STORES) (§ 4.1). The main data model used: RDF triples (§ 2.1.5).

BlazeGraph [34] ∗ ∗ ? ? ? ? ? ? ∗BlazeGraph uses RDF*, an extensionof RDF (details in § 4.1).

Cray Graph Engine [146] ∗ ∗ ∗RDF triples are stored in hashtables.AllegroGraph [70] ∗ ? ∗Triples are stored as integers

(RDF strings are mapped to integers).Amazon Neptune [2] ∗ ? ? ∗LPG is enabled via Gremlin [150].AnzoGraph [42] ∗ ? ? ∗Data model and schema support SPARQL.Apache Marmotta [8] ∗ ? ? ? ∗The structure of data records is based on

that of different RDBMS systems(H2 [126], PostgreSQL [125], MySQL [62]).

BrightstarDB [127] ? ? ? ? ? ? ? —Ontotext GraphDB [129] ? ? ? ? —Profium Sense [139] ∗ ? ? ? ? ? Profium Sense is “AI powered”.

∗The format used is called JSON-LD:JSON for vertices and RDF for edge.

TripleBit [175] ∗ ‡ ? ? The data organization uses compression.∗Strings are mapped to variable size integers.‡Described as future work.

❷ TUPLE STORES (§ 4.2). The main data model used: tuples (§ 2.3.3).

WhiteDB [173] ∗ ∗ ‡ ‡ ‡ ‡ † ? ∗Implicit support for triples of integers.‡Implementable by the user. †Transactionsuse a global shared/exclusive lock.

Graphd [77] ∗ ? ? ? ? ? ‡ ? ? Backend of Google Freebase.∗Implicit support for triples. ‡Subset of ACID.

❸ DOCUMENT STORES (§ 4.4). The main data model used: documents (§ 2.3.2).

ArangoDB [10] ∗ ∗Uses a hybrid index for retrieving edges.OrientDB [40] ∗ ‡ ∗AL contains RIDs (i.e., physical locations)

of edge and vertex records.‡Sharding is defined by the user.

Azure Cosmos DB [123] ? —Bitsy [109] The system is disk based and uses JSON files.

The storage only allows for appending data.FaunaDB [68] ∗ ‡ ∗Document, RDBMS, graph, and “time series”.

‡Adjacency lists are separately precomputed.

❹ KEY-VALUE STORES (§ 4.3). The main data model used: key-value pairs (§ 2.3.1).

HyperGraphDB [92] ∗ ‡ † ∗A Hypergraph model. ‡The system usesan incidence index to retrieve edgesof a vertex. †Support for ACI only.

MS Graph Engine [157] ∗ ‡ ∗Schema is defined by TrinitySpecification Language (TSL). ‡AL containsIDs of edges and/or vertices.

Dgraph [56] Dgraph is based on Badger [55].RedisGraph [142, 145, 160] ∗ ‡ RedisGraph is based on Redis [144].

‡The OLAP part uses GraphBLAS [102].

❺ WIDE-COLUMN STORES (§ 4.5). The main data model used: key-value pairs and tables (§ 2.3.1, § 2.3.4).

JanusGraph [15] JanusGraph is the continuation of Titan.Titan [15] Titan enables various backends

(e.g., Cassandra [108]).DSE Graph (DataStax) [52] ∗ ? ‡ DSE Graph is based on Cassandra [108].

∗Enabled by Gremlin query language.‡Support for AID, Consistency is configurable.

HGraphDB [147] ? ? ∗ HGraphDB uses TinkerPop3 with HBase [73].∗ACID is supported only within a row.

Table 2. Comparison of graph databases. Bolded systems are described in more detail in the corresponding sections.MM: A system is multi model. LPG, RDF: A system supports, respectively, the Labeled Property Graph and RDFwithout prior data transformation. FS, VS: Data records are fixed size and variable size, respectively. DP: A system canuse direct pointers to link records. This enables storing and traversing adjacency data without maintaining indices. AL:Edges are stored in the adjacency list format. SE: Edges can be stored in a separate edge record. SV: Edges can bestored in a vertex record. LW: Edges can be lightweight (containing just a vertex ID or a pointer, both stored in a vertexrecord). MN: A system can operate in a Multi Server (distributed) mode. RP: Given a distributed mode, a system enablesReplication of datasets. SH: Given a distributed mode, a system enables Sharding of datasets. CE: Given a distributedmode, a system enables Concurrent Execution of multiple queries. PE: Given a distributed mode, a system enablesParallel Execution of single queries on multiple nodes/CPUs. TR: Support for ACID Transactions. OLTP: Support forOnline Transaction Processing.OLAP: Support forOnline Analytical Processing.: A system offers a given feature.: A system offers a given feature in a limited way.: A system does not offer a given feature.?: Unknown.

Page 15: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:15

Graph Database System Model Records Storing Edges Data Distribution &Query Execution Additional remarksMM LPG RDF FS VS DP AL SE SV LW MN RP SH CE PE TR OLTP OLAP

❻ RELATIONAL DBMS (RDBMS) (§ 4.6). The main data model used: tables (implementing relations) (§ 2.3.4).

Oracle Spatial & Graph [131] ∗ ?∗ ?∗ ∗ ∗The system stores LPG and RDFin relational tables (row-oriented).

AgensGraph [33] ? ? ? AgensGraph is based on PostgreSQL.FlockDB [168] ? ? ? FlockDB is based on MySQL. The system

focuses on “shallow” graph queries,such as finding mutual friends.

MS SQL Server 2017 [124] ? ? ? The system uses an SQL graph extension.OQGRAPH [114] ?∗ ?∗ ∗ ? OQGRAPH uses MariaDB [18].

∗OQGRAPH uses row-oriented storage.SAP HANA [153] ∗ ∗ ∗ ∗SAP HANA is column-oriented,

edges and vertices are stored in rows.

❼ OBJECT-ORIENTED DATABASES (OODBMS) (§ 4.7). The main data model used: objects (§ 2.3.5).

VelocityGraph [172] The system is based on VelocityDB [171]Objectivity ThingSpan [128] ? ? ? ? ? ? The system is based on ObjectivityDB [80].

❽ NATIVE GRAPH DATABASES (§ 4.8). The main data model used: LPG (§ 2.1.3, § 2.1.4).

Neo4j [148] Neo4j is provided as a cloud serviceby a system called Graph Story [78].

GBase [100] ∗ ‡ ? ? ? ? ? ∗GBase supports simple graphs only(§ 2.1.1). ‡GBase stores the AM sparsely.

Sparksee/DEX [117] ∗ ∗ ‡ ∗The system uses maps only.‡Bitmaps are used for connectivity.

GraphBase [67] ∗ ? ? ? ? ? ? ? ? ∗No support for edge properties,only two types of edges available.

Memgraph [122] ? ? ? ? ? ? ∗ ‡ ∗This feature is under development.‡Available only for some algorithms.

TigerGraph [167] ? ? ? ? ? ? ? —Weaver [61] ? ? ? ? ? ? ? —

❾ DATA HUBS (§ 4.9). The main data model used: several different ones.

MarkLogic [115] ∗ ? ∗ ? Supported storage/models: relationaltables, RDF, various documents.∗Vertices are stored as documents,edges are stored as RDF triples.

OpenLink Virtuoso [130] ∗ ? ? ‡ Supported storage/models: relationaltables and RDF triples. ‡This featurecan be used with relational data only.

Cayley [44] ? ? ? ? ? ? ? ∗ ? ? Supported storage/models: relationaltables, RDF, document, key-value.∗This feature depends on the backend.

InfoGrid [91] ? ? ? ? ∗ ? ? Supported storage/models: relationaltables, Hadoop’s filesystem, grid storage.∗A weaker consistency model is usedinstead of ACID.

Stardog [162] ∗ ∗ ? ∗ ? Supported storage/models: relationaltables, documents. ∗RDF is simulatedon relational tables.

Table 3. Comparison of graph databases. Bolded systems are described in more detail in the corresponding sections.MM: A system is multi model. LPG, RDF: A system supports, respectively, the Labeled Property Graph and RDFwithout prior data transformation. FS, VS: Data records are fixed size and variable size, respectively. DP: A system canuse direct pointers to link records. This enables storing and traversing adjacency data without maintaining indices. AL:Edges are stored in the adjacency list format. SE: Edges can be stored in a separate edge record. SV: Edges can bestored in a vertex record. LW: Edges can be lightweight (containing just a vertex ID or a pointer, both stored in a vertexrecord). MN: A system can operate in a Multi Server (distributed) mode. RP: Given a distributed mode, a system enablesReplication of datasets. SH: Given a distributed mode, a system enables Sharding of datasets. CE: Given a distributedmode, a system enables Concurrent Execution of multiple queries. PE: Given a distributed mode, a system enablesParallel Execution of single queries on multiple nodes/CPUs. TR: Support for ACID Transactions. OLTP: Support forOnline Transaction Processing.OLAP: Support forOnline Analytical Processing.: A system offers a given feature.: A system offers a given feature in a limited way.: A system does not offer a given feature.?: Unknown.

Page 16: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:16 M. Besta et al.

Graph Database System Graph database query language Other languages and additional remarksSPARQL Gremlin Cypher SQL GraphQL Progr. API –

❶ RDF STORES (TRIPLE STORES) (§ 4.1).

AllegroGraph – – – – – –Amazon Neptune – – – – –AnzoGraph – – – – –

Apache Marmotta – – – – –Apache Marmotta also supports its nativeLDP and LDPath languages.

BlazeGraph – – – – –BrightstarDB – – – – – –Cray Graph Engine – – – – – –Ontotext GraphDB – – – – – –Profium Sense – – – – – –TripleBit – – – – – –

❷ TUPLE STORES (§ 4.2).

Graphd – – – – – – Graphd uses MQL [77].WhiteDB – – – – – (C, Python) –

❸ DOCUMENT STORES (§ 4.4).

OrientDB – – – – –ArangoDB – – – – – – ArangoDB uses AQL (ArangoDBQuery Language).Azure Cosmos DB – – – – –

Bitsy – – – – –Bitsy also supports other Tinkerpop-compatible languagessuch as SQL2Gremlin and Pixy.

FaunaDB – – – – – –

❹ KEY-VALUE STORES (§ 4.3).

MS Graph Engine – – – – – – MS Graph Engine uses LINQ [157].HyperGraphDB – – – – – (Java) –Dgraph – – – – ∗ – ∗A variant of GraphQL.RedisGraph – – – – – –

❺ WIDE-COLUMN STORES (§ 4.5).

Titan – – – – – –JanusGraph – – – – – –DSE Graph (DataStax) – – – – – DSE Graph also supports CQL [52].HGraphDB – – – – – –

❻ RELATIONAL DBMS (RDBMS) (§ 4.6).

Oracle Spatial and Graph – – ∗ – – ∗An SQL-like graph query language.MS SQL Server 2017 – – – ∗ – – ∗Transact-SQL.SAP HANA – – – – – –FlockDB – – – – – FlockDB uses the Gizzard framework and MySQL.AgensGraph – – ∗ ∗∗ – – ∗A variant called openCypher [79, 116]. ∗∗ANSI-SQL.OQGRAPH – – – – – –

❼ OBJECT-ORIENTED DATABASES (OODBMS) (§ 4.7).

Objectivity ThingSpan – – – – – – Objectivity ThingSpan uses a native DO query language [128].bjectivity ThingSpan VelocityGraph – – – – – (.NET) –

❽ NATIVE GRAPH DATABASES (§ 4.8).

Neo4j – – – – – –Gbase – – – – – –GraphBase – – – – – – GraphBase uses its native query language.Memgraph – – ∗ – – – ∗openCypher.

Sparksee/DEX – – – – (.NET)∗ –∗Sparksee/DEX also supports C++, Python,Objective-C, and Java APIs.

TigerGraph – – – – – – TigerGraph uses GSQL [167].Weaver – – – – – (Python) –

❾ DATA HUBS (§ 4.9).

Cayley – ∗ – – –∗Cayley supports Gizmo, a Gremlin dialect [44].Cayley also uses MQL [44].

MarkLogic – – – – – – MarkLogic uses XQuery [35].

OpenLink Virtuoso – – – –OpenLink Virtuoso also supports XQuery [35],XPath v1.0 [48], and XSLT v1.0 [101].

Stardog ∗ – – – ∗Stardog supports the PathQuery extension [162].

Table 4. Support for different graph database query languages in different graph database systems. “Progr. API”determines whether a given system supports formulating queries using some native programming language such as C++.“”: A system supports a given language. “”: A system supports a given language in a limited way. “–”: A system doesnot support a given language.

Page 17: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:17

systems based on the DB-Engines Ranking [159] and the availability of technical details about thedesign of the given system. Tables 2 and 3 illustrate the details of different graph database systems,including the ones described in this section. The tables indicate which features are supported bywhich systems. We use symbols “”, “”, and “” to indicate that a given system offers a givenfeature, offers a given feature in a limited way, and does not offer a given feature, respectively.“?” indicates we were unable to infer this information based on the available documentation2. Wereport the support for different graph query languages in Table 4.

4.1 RDF Stores (Triple Stores)RDF stores, also called triple stores, implement the Resource Description Framework (RDF) model(§ 2.1.5). These systems organize data into triples. We now describe in more detail a selectedrecent RDF store, Cray Graph Engine (§ 4.1.1). We also provide more details on two other systems,AllegroGraph and BlazeGraph, focusing on variants of the RDF model used in these systems (§ 4.1.2).

4.1.1 Cray Graph Engine. In a recent paper [146], Cray presented its Cray Graph Engine (CGE)which is able to load, maintain, and query databases with a trillion RDF triples. The architectureof CGE is based on Coarray C++ [164], a Cray’s language and infrastructure that makes memoryresources of multiple compute nodes available as a single global address space. This facilitatesparallel programming in a distributed environment [23, 74, 155].

CGE does not store triples but quads (4-tuples), where the fourth element is a graph ID. Thus, onecan store multiple graphs in one CGE database. CGE optimizes the way in which it stores stringsfrom its triples/quads. Storing multiple long strings per triple/quad would be inefficient, consideringthe fact that many triples/quads may share strings. Therefore, CGE maintains a dictionary thatmaps strings to unique 48-bit integer identifiers (HURIs). For this, two distributed hashtables areused (one for mapping strings to HURIs and one for mapping HURIs to strings). In the loadingprocess, the strings are sorted and then assigned to HURIs. This allows integer comparisons (equal,greater, smaller, etc.) to be used instead of more expensive string comparisons.

Quads in CGE are grouped by their predicate and the identifier of the graph that they are a partof. Thus, only a pair with a subject and an object needs to be stored for one such group of quads.These subject/object pairs are stored in hashtables (one hashtable per group). Since each subjectand object is represented as a 48-bit HURI, the subject/object pairs can be packed into 12 bytes andstored in a 32-bit unsigned integer array, ultimately reducing the amount of needed storage.

The computation in CGE is distributed over the participating processes. To minimize the amountof all-to-all communication, query results are aggregated locally and – whenever possible – eachprocess only communicates with a few peers to avoid network congestion.

CGE uses SPARQL [156] for querying. The CGE authors focused on developing high-performancescan, join, and merge operations needed in many queries. The execution of all these operations isdistributed over the processes. Scans and joins scale well due to their high locality, as each processparticipating in these operations reads mostly local data. Merge needs to access most or all the datafrom all other processes. Therefore, it requires all-to-all communication [86], limiting scalability.CGE can only execute one query at a time. Currently, no parallel transactions are possible

(including parallel reads). If one uses a front-end server that allows multiple queries to be submittedin parallel to the CGE execution engine, the execution of these queries is serialized. Thus, CGEallows for the parallel execution of single queries, but not parallel execution of multiple queries.

2We encourage participation in this survey. In case the reader is in possession of additional information relevant for thetables, the authors would welcome the input.

Page 18: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:18 M. Besta et al.

4.1.2 Other RDF Graph Databases. There exist many other RDF graph databases. We brieflydescribe two systems that extend the original RDF model: AllegroGraph and BlazeGraph.First, some RDF stores allow for attaching attributes to a triple explicitly. AllegroGraph [70]

allows an arbitrary set of attributes to be defined per triple when the triple is created. However,these attributes are immutable. Figure 9 presents an example RDF graph with such attributes. Thisfigure uses the same LPG graph as in previous examples provided in Figure 4 and Figure 5, whichcontain example transformations from the LPG into the original RDF model.

RDF graph, with triple attributes

Person/V-ID Person/V-ID

21 24Alice Bob

agename name

ageknows{since:09.08.2007}

triple attribute

LPG graph

:Personname = Bob

age = 24

:Personname = Alice

age = 21

:knowssince = 09.08.2007

vertex

type

vertex

type

Fig. 9. Comparison of an LPG graph and an RDF graph: a transformation from LPG to RDF with triple attributes.We represent the triple attributes as a set of key value pairs. “Person/V-ID”, “age”, “name”, “type” and “knows” are RDFURIs. The transformation uses the assumption that there is one label per vertex and edge.

Second, BlazeGraph [34] implements RDF* [82], an augmentation of RDF that allows for attachingtriples to triple predicates (see Figure 10). Vertices can use triples for storing labels and properties,analogously as with the plain RDF. However, with RDF*, one can represent LPG edges morenaturally than in the plain RDF. Specifically, edges can be stored as triples, and edge properties canbe linked to the edge triple via other triples.

RDF* graph

Person/V-ID Person/V-ID

21 24Alice Bob

agename name

age

knowsA triple attached

to a triple

09.08.2007

since

LPG graph

:Personname = Bob

age = 24

:Personname = Alice

age = 21

:knowssince = 09.08.2007

vertextype

vertextype

Fig. 10. Comparison of an LPG graph and an RDF* graph: a transformation from LPG to RDF*, that enables attachingtriples to triple predicates. “Person/V-ID”, “age”, “name”, “type”, “since” and “knows” are RDF URIs. The transformationuses the assumption that there is one label per vertex and edge.

4.2 Tuple StoresA tuple store is a generalization of an RDF store. RDF stores are restricted to triples (or quads, as inCGE) whereas tuple stores can maintain tuples of arbitrary sizes, as detailed in § 2.3.3.

Page 19: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:19

4.2.1 WhiteDB. WhiteDB [173] is a tuple store written in C that uses shared-memory parallelism.WhiteDB’s API enables allocating new records (tuples) with an arbitrary tuple length (number oftuple elements). Small values and pointers to other tuples are stored directly in a given field. Largestrings are kept in a separate store. Each large value is only stored once, and a reference counterkeeps track of how many tuples refer to it at any time. WhiteDB only provides database widelocking with a reader-writer lock [143]. As an alternative to locking the whole database, one canalso update fields of tuples atomically (set, compare and set, add). WhiteDB itself does not enforceconsistency, it is up to the user to use locks and atomics correctly. Finally, WhiteDB only enablesaccessing single tuple records, there is no higher level query engine or graph API that would allowto, for example, execute a query that fetches all neighbors of a given vertex. However, one can stilluse tuples as vertex and edge storage, linking them to one another via memory pointers.

4.3 Key-Value StoresOne can also explicitly use key-value (KV) stores for maintaining a graph. We provide details ofusing a collection of key-value pairs to model a graph in § 2.3.1. Here, we describe select KV storesused as graph databases: MS Graph Engine (also called Trinity) and HyperGraphDB.

4.3.1 Microsoft’s Graph Engine (Trinity). Microsoft’s Graph Engine [157] is based on a dis-tributed KV store called Trinity. Trinity implements a globally addressable distributed RAM storage.Processes use one-sided communication to access one another’s data directly [74]. Trinity can bedeployed on InfiniBand [90] to leverage Remote Direct Memory Access (RDMA) [74].

In Trinity, keys are called cell IDs and values are called cells. Cells are associated with a schemathat is defined using the Trinity Specification Language (TSL) [157]. TSL enables defining thestructure of cells similarly to C-structs. For example, a cell can hold data items of different datatypes, including IDs of other cells. MS Graph Engine introduces a graph storage layer on top of theTrinity KV storage layer. Vertices are stored in cells, where a dedicated field contains a vertex ID ora hash of this ID. Edges adjacent to a given vertex v are stored as a list of IDs of v’s neighboringvertices, directly in v’s cell. However, if an edge holds rich data, such an edge (together with theassociated data) can also be stored in a separate dedicated cell.

To keep the size of the exchanged messages small, Trinity maintains special accessors that allowfor accessing single attributes within a cell without needing to load the complete cell. This lowersthe I/O cost for many operations that do not need the whole cells.

4.3.2 HyperGraphDB. HyperGraphDB [92] stores hypergraphs (we define hypergraphs in § 2.1.2).The basic building blocks of HyperGraphDB are atoms, the values of the KV store. Every atom hasa cryptographically strong ID. This reduces a chance of collisions (i.e., creating identical IDs fordifferent graph elements by different peers in a distributed environment). Both hypergraph verticesand hyperedges are atoms. Thus, they have their own unique IDs. An atom of a hyperedge storesa list of IDs corresponding to the vertices connected by this hyperedge. Vertices and hyperedgesalso have a type ID and they can store additional data (such as properties) in a recursive structure(referenced by a value ID). This recursive structure contains value IDs identifying other atoms (withother recursive structures) or binary data. Figure 11 shows an example of how a KV store is used torepresent a hypergraph in HyperGraphDB.

HyperGraphDB maintains an incidence index that maps a vertex ID to the IDs of all hyperedgescontaining this vertex. This allows for an efficient traversal from a vertex to a hyperedge. Next, atype index enables retrieving all vertices and hyperedges with a certain type ID. Finally, a valueindex allows one to quickly find atoms that carry specific payload (identified with a value ID).

Page 20: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:20 M. Besta et al.

key (atom ID) value (ID-list or binary data))

node ID

link ID

value ID

type ID value ID

type ID value ID node ID node-ID...

value ID ... value ID or binary data

Fig. 11. An example utilization of key-value stores for maintaining hypergraphs in HyperGraphDB.

4.4 Document StoresIn document stores, a fundamental storage unit is a document, described in § 2.3.2. We select twodocument stores for a more detailed discussion, OrientDB and ArangoDB.

4.4.1 OrientDB. In OrientDB [40], every document d has a Record ID (RID), consisting of the IDof the collection of documents where d is stored, and the position (also referred to as the offset) withinthis collection. Pointers (called links) between documents are represented using these unique RIDs.OrientDB [40] introduces regular edges and lightweight edges. Regular edges are stored in an

edge document and can have their own associated key/value pairs (e.g., to encode edge propertiesor labels). Lightweight edges, on the other hand, are stored directly in the document of the adjacent(source or destination) vertex. Such edges do not have any associated key/value pairs. They consti-tute simple pointers to other vertices, and they are implemented as document RIDs. Thus, a vertexdocument not only stores the labels and properties of the vertex, but also a list of lightweight edges(as a list of RIDs of the documents associated with neighboring vertices), and a list of pointers tothe attached regular edges (as a list of RIDs of the documents associated with these regular edges).Each regular edge has pointers (RIDs) to the documents storing the source and the destinationvertex. Each vertex stores a list of links (RIDs) to its incoming and the outgoing edges.

Figure 12 contains an example of using documents for representing vertices, regular edges, andlightweight edges in OrientDB. Figure 13 shows example vertex and edge documents.OrientDB provides distributed transactions with ACID semantics. OrientDB offers a form of

sharding, in which not all collections of documents have to be copied on each server. However,OrientDB does not enable sharding of the collections themselves (i.e., distributing one collectionacross many servers). If an individual collection grows large, it is the responsibility of the user topartition the collection to avoid any additional overheads.

vertex 1name: Alice

age: 21

edge 1since: 09.08.2007

vertex 2name: Bob

age: 24

in out

A lightweight edge

A regular edge of type "knows"

Fig. 12. Two vertex documents connected with a lightweight edge and a regular edge (knows) in OrientDB.

4.4.2 ArangoDB. ArangoDB [10, 11] keeps its documents in a binary format called VelocyPack,which is a compacted implementation of JSON documents. Documents can be stored in differentcollections and have a _key attribute which is a unique ID within a given collection. UnlikeOrientDB, these IDs are no direct memory pointers. Documents in these collections are indexedusing a hashtable, where the _key attribute serves as the hashtable key.

Page 21: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:21

attribute (key/value)vertex document incoming edge RID's outgoing edge RID's lightweight edges: vertex RID's

attribute (key/value)regular edge document incoming vertex RID outgoing vertex RID

Fig. 13. An illustration of example OrientDB vertex and edge documents.

For maintaining graphs, ArangoDB uses vertex collections and edge collections. The former areregular document collections with vertex documents. Vertex documents store no information aboutedges attached to the associated vertices. This has the advantage that a vertex document doesnot have to be modified when one adds or removes edges. Second, edge collections store edgedocuments. Edge documents have two particular properties: _from and _to, which are the IDs ofthe documents associated with two vertices connected by a given edge.

To find a particular vertex or an edge document, one can query the respective collection’s hashtable using the document _key. To find the edges of a vertex, one would have to scan all edgedocuments since vertex documents store no information about edges. To avoid this, there exists ahybrid index, which allows to efficiently retrieve a list of edge documents with a particular _fromor _to attribute. This enables fast reconstruction of the adjacency list of a given vertex.A traversal over the neighbors of a given vertex works as follows. First, given the _key of a

vertex v , ArangoDB finds all v’s adjacent edges using the hybrid index. Next, the system retrievesthe corresponding edge documents and fetches all the associated _to properties. Finally, the _toproperties serve as the new _key properties when searching for the neighboring vertices. An opti-mization in ArangoDB’s design prevents reading vertex documents and enables directly accessingone edge document based on the vertex ID within another edge document. This may improve cacheefficiency and thus reduce query execution time [11].ArangoDB’s indexing may accelerate edge removal. If the edges to be removed were stored

in a vertex document, then the document would have to be modified, generating overheads. Itcould become expensive when removing a vertex with a large degree, as the documents of all theneighboring vertices would have to be altered. ArangoDB’s design prevents from such modificationsof vertex documents. Figure 14 contains example ArangoDB’s vertex and edge indexing structures.

edge index (hybrid)

_from_to hash

linked list of edges

vertex index

_key hash

vertex document

Fig. 14. Left: an edge index (hybrid: a hashtable and linked lists). Right: a vertex index (a simple hashtable).

One can use different collections of documents to store different edge types (e.g., “friend_of” or“likes”). When retrieving edges conditioned on some edge type (e.g., “friend_of”), one does not haveto traverse the whole adjacency list (all “friend_of” and “likes” edges). Instead, one can target thecollection with the edges of the specific edge type (“friend_of”).

Page 22: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:22 M. Besta et al.

ArangoDB can be executed in a distributed mode with replication and data sharding. Sharding isdone by partitioning document collections using the _key attribute. Thus, given the _key attribute,one can straightforwardly locate a host with the queried document.

4.5 Wide-Column StoresWide-column stores combine different features of key-value stores and relational tables. On onehand, a wide-column store maps keys to rows (a KV store that maps keys to values). Every row canhave an arbitrary number of cells and every cell constitutes a key-value pair. Thus, a row containsa mapping of cell keys to cell values, effectively making a wide-column store a two-dimensional KVstore (a row key and a cell key both identify a specific value). On the other hand, a wide-columnstore is a table, where cell keys constitute column names. However, unlike in a relational database,the names and the format of columns may differ between rows within the same table. We illustratean example subset of rows and rows cells in a wide-column store in Figure 15.

key cell key | value cell cell

key cell key | value cell cell

key cell key | value cell cell

sorted by cell key

sortedby key

Fig. 15. An illustration of wide-column stores: mapping keys to rows and column-keys to cells within the rows.

4.5.1 Titan and JanusGraph. Titan [15] is a graph database that was built by Aurelius, nowowned by DataStax. The Linux Foundation develops JanusGraph [166], a continuation of Titan.Both are graph databases built on top of wide-column stores. They can use different wide-columnstores as backends, for example Apache Cassandra [6]. They additionally use document-orientedsearch engines (e.g., Elasticsearch [65] or Apache Solr [9]) to accelerate lookups by labels.

In both systems, when storing a graph, each row represents a vertex. Each vertex property andadjacent edge is stored in a separate cell. One edge is thus encoded in a single cell, including all theproperties of this edge. Since cells in each row are sorted by the cell key, this sorting order can beused to find cells efficiently. For graphs, cell keys for properties and edges are chosen such thatafter sorting the cells, the cells storing properties come first, followed by the cells containing edges,see Figure 16. Since rows are ordered by the key, both systems straightforwardly partition tablesinto so called tablets, which can then be distributed over multiple data servers.

vertex id property

vertex id

vertex id

sorted by cell key

sortedby id

property edge edge edge

property edge edge edge edge

property property property edge edge

Fig. 16. An illustration of Titan and JanusGraph: using wide-column stores for storing graphs.

Page 23: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:23

4.6 Relational Database Management SystemsRelational Database Management Systems (RDBMS) store data in two dimensional tables with rowsand columns, described in more detail in the corresponding data model section in § 2.3.4.There are two types of RDBMS: column RDBMS (not to be confused with wide-column stores)

and row RDBMS (also referred to as column-oriented or columnar and row-oriented). They differ inphysical data persistence. Row RDBMS store table rows in consecutive memory blocks. ColumnRDBMS, on the other hand, store table columns contiguously. Row RDBMS are more efficient whenonly a few rows need to be retrieved, but with all their columns. Conversely, column RDBMSare more efficient when many rows need to be retrieved, but only with a few columns. Graphdatabase solutions that use RDBMS as their backends use both row RDBMS (e.g., Oracle Spatialand Graph [131], OQGRAPH built on MariaDB [114]) and column RDBMS (e.g., SAP HANA [153]).Graph analytics based on RDBMS has became a separate and growing area of research. Zhao

et al. [176] study the general use of RDBMS for graphs. They define four new relational algebraoperations for modeling graph operations. They show how to define these four operations withsix smaller building blocks: basic relational algebra operations, such as group-by and aggregation.Xirogiannopoulos et al. [174] describe GraphGen, an end-to-end graph analysis framework thatis built on top of an RDBMS. GraphGen supports graph queries through so called Graph-Viewsthat define graphs as transformations over underlying relational datasets. This provides a graphmodeling abstraction, and the underlying representation can be optimized independently.

4.6.1 Oracle Spatial and Graph. Oracle Spatial and Graph [131] is a database managementplatform built on top of Oracle Database. It provides a rich set of tools for administration andanalysis of graph data. Oracle Spatial and Graph comes with a range of built-in parallel graphalgorithms (e.g. for community detection, path finding, traversals, link prediction, PageRank, etc.).Both LPG and RDF models are supported. The data model is based on RDBMS. Rows of RDBMSconstitute vertices and relationships between them form edges. Associated properties and attributesare stored as key-value pairs in separate structures.Querying graphs in Oracle Spatial and Graph is possible using PGQL, a declarative, SQL-like,

graph pattern matching query language. PGQL allows for querying both data stored on disk inOracle Database as well as in in-memory parts of graph datasets. In-memory analysis is boosted bythe PGX.D graph processing engine [89]. Support for SQL and for SPARQL (for RDF graphs) is alsoincluded. Moreover, the offered Java API also implements Apache Tinkerpop interfaces, includingthe Gremlin API. Additionally, to further accelerate searching using properties, Oracle Spatial andGraph utilizes Apache Lucene [118] and optionally Apache SolrCloud [106] capabilities.

4.7 Object-Oriented DatabasesObject-oriented database management systems (OODBMS) [13] enable modeling, storing, andmanaging data in the form of language objects used in object-oriented programming languages. Wesummarize such objects in § 2.3.5.

4.7.1 VelocityGraph. VelocityGraph [172] is a graph database relying on the VelocityDB [171]distributed object database. VelocityGraph edges, vertices, as well as edge or vertex properties arestored in C# objects that contain references to other objects. To handle this structure, VelocityGraphintroduces abstractions such as VertexType, EdgeType, and PropertyType. Each object has a uniqueobject identifier (Oid), pointing to its location in physical storage. Each vertex and edge has onetype (label). Properties are stored in dictionaries. Vertices keep the attached edges in collections.

Page 24: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:24 M. Besta et al.

4.8 Native Graph DatabasesGraph database systems described in the previous sections are all based on some database backendthat was not originally built just for managing graphs. In what follows, we describe native graphdatabases: systems that were specifically build to maintain and process graphs.

4.8.1 Neo4j: Index-Free Adjacency. Neo4j [148] is the most popular graph database system,according to different database rankings (see the links on page 2). Neo4j implements the LPG modelusing a storage design based on fixed-size records. A vertex v is represented with a vertex record,which stores (1)v’s labels, (2) a pointer to a linked list ofv’s properties, (3) a pointer to the first edgeattached to v , and (4) some flags. An edge e is represented with an edge record, which stores (1) e’sedge type (a label), (2) a pointer to a linked list of e’s properties, (3) a pointer to two vertex recordsthat represent vertices attached to e , (4) pointers to the ALs of both adjacent vertices, and (5) someflags. Each property record can store up to four properties, depending on the size of the propertyvalue. Large values (e.g., long strings) are stored in a separate dynamic store. Storing propertiesoutside vertex and edge records allows those records to be small. Moreover, if no properties areaccessed in a query, they are not loaded at all. The AL of a vertex is implemented as a doubly linkedlist. An edge is stored once, but is part of two such linked lists (one list for each attached vertex).Thus, an edge has two pointers to the previous edges and two pointers to the next edges. Figure 17outlines the Neo4j design; Figure 18 shows the details of vertex and edge records.

Previous edges inthe neighborhoods ofthe attached vertices

vertex 1

name: Alice

age: 21

knows

Next edges in thethe neighborhoods

the attached vertices

vertex 2

name: Bob

age: 24

Vertex properties

Fig. 17. Summary of the Neo4j structure: two vertices linked by a relationship “LIKES”. Both vertices maintain linkedlists of properties. The relationships are parts of two doubly linked lists, one linked list per attached vertex.

A core concept in Neo4j is the index-free adjacency [148]: a vertex stores pointers to the physicallocations of its neighbors. Thus, for neighborhood queries or traversals, one needs no index and caninstead follow direct pointers (except for the root vertices in traversals). Consequently, the querycomplexity is independent of the graph size and only depends on the size of the visited subgraph.

Neo4j supports replication but no sharding. Thus, each server has to store the full database.

4.8.2 Sparksee/DEX: B+ Trees and Bitmaps. Sparksee is a graph database system that wasformerly known as DEX [117]. Sparksee implements the LPG model in the following way. Verticesand edges (both are called objects) are identified by unique IDs. For each property name, thereis an associated B+ tree that maps vertex and edge IDs to the respective property values. Thereverse mapping from a property value to vertex and edge IDs is maintained by a bitmap, wherea bit set to one indicates that the corresponding ID has some property value. Labels and vertices

Page 25: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:25

1 5 9 14

inUsenextEdgeID nextPropID labels

flags

1 5 9 13 17 21 25 29 33

inUsefirstVertex secondVertex relType

firstPrevEdgeID secondPrevEdgeID

firstNextEdgeID secondNextEdgeID

nextPropIDflags

Links to attached vertices A pointer to a doublylinked adjacency list of

the first attached vertex

A pointer to a doublylinked adjacency list of the

second attached vertex

A link to thefirst edge-

recordA vertex record:

An edge record:

A linked list of property-records,each holding four property blocks

Fig. 18. An overview of the Neo4j vertex and edge records.

and edges are mapped to each other in a similar way. Moreover, for each vertex, two bitmaps arestored. One bitmap indicates the in-coming edges, and one indicates the out-going edges attachedto every vertex. Furthermore, two B+ trees maintain the information about what vertices an edgeis connected to (one tree for each edge direction). Figure 19 illustrates example mappings.

Edge orvertex ID

A value ora label

B+ tree ptr0001001000001

Bitmap ptr

Edge orvertex ID

An attribute / A label

B+ tree ptr00100110000010011000

Bitmap ptr

Edge IDVertex/Edge connectivity (in / out directions)

Edge ID Vertex ID

Fig. 19. An illustration of Sparksee maps for attributes, labels, and vertex/edge connectivity. All mappings arebidirectional.

Sparksee is one of the few systems that are not record based. Instead, Sparksee uses mapsimplemented as B+ trees [60] and bitmaps. The use of bitmaps allows for some operations to beperformed as bit-level operations. For example, if one wants to find all vertices with certain valuesof properties such as “age” and “first name”, one can simply find two bitmaps associated with the“age” and the “first name” properties, and then derive a third bitmap that is a result of applying abitwise AND operation to the two input bitmaps.Uncompressed bitmaps could grow unmanageably in size. As most graphs are sparse, bitmaps

indexed by vertices or edges mostly contain zeros. To alleviate large sizes of such sparse bitmaps,they are cut into 32-bit clusters. If a cluster contains a non-zero bit, it is stored explicitly. The bitmapis then represented by a collection of (cluster-id, bit-data) pairs. These pairs are stored in a sortedtree structure to allow for efficient lookup, insertion, and deletion.Heavy use of indexing has its costs. Instead of constant time pointer chasing, index lookups

cost logarithmic time in the size of the stored graph. However, bitmaps in Sparksee can at least bestored in linear space (in the number of bits set to one). Thus, finding all edges attached to a vertexcan be done in linear time with respect to the number of edges to be retrieved. This is possible due

Page 26: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:26 M. Besta et al.

to the bitmap compression used in Sparksee: each stored 32-bit cluster has at least one non-zeroentry (stores at least one edge ID).

4.8.3 GBase: Sparse Adjacency Matrix Format. GBase [100] is a system that can only representthe structure of a directed graph; it stores neither properties nor labels. The goal of GBase is tomaintain a compression of the adjacency matrix of a graph such that one can efficiently retrievalall incoming and outgoing edges of a selected vertex without the prohibitive O (n2) matrix storageoverheads. Simultaneously, using the adjacency matrix enables verifying in O (1) time whethertwo arbitrary vertices are connected. To compress the adjacency matrix, GBase cuts it into K2

quadratic blocks (there are K blocks along each row and column). Thus, queries that fetch in- andout-neighbors of each vertex require only to fetch K blocks. The parameter K can be optimizedfor specific databases. When K becomes smaller, one has to retrieve more small files (assumingone block is stored in one file). If K grows larger, there are fewer files but they become larger,generating overheads. Further optimizations can be made when blocks contain either only zeroesor only ones; this enables higher compression rates.

4.9 Data HubsData hubs are systems that enable using multiple data models and corresponding storage designs.They often combine relational databases with RDF, document, and key-value stores. This can bebeneficial for applications that require a variety of data models, because it provides a variety ofstorage options in a single unified database management system. One can keep using RDBMSfeatures, upon which many companies heavily rely, while also storing graph data.

4.9.1 OpenLink Virtuoso. OpenLink Virtuoso [130] provides RDBMS, RDF, and document capa-bilities by connecting to a variety of storage systems. Graphs are stored in the RDF format only,thus the whole discussion from § 2.1.5 also applies to Virtuoso RDF.

4.9.2 MarkLogic. MarkLogic [115] models graphs with documents for vertices, therefore allow-ing an arbitrary number of properties for vertices. However, it uses RDF triples for edges.

4.10 Discussion and TakeawaysIn this section, we summarize selected aspects of data organization in graph database systems. Fora detailed description and analysis of all the considered aspects, see Section 3 and Tables 2 and 3.

4.10.1 Graph Models. There is no one standard graph model, but two models have proven to bepopular: RDF and LPG. RDF is a well-defined standard. However, it only supports simple triples(subject, predicate, object) representing edges from subject identifiers via predicates to objects. LPGis more popular (cf. Tables 2–3) and it allows vertices and edges to have labels and properties, thusenabling more natural data modeling. However, it is not standardized, and there are many variants(cf. § 2.1.4). For example, some systems limit the number of labels to just one. MarkLogic (§ 4.9.2)allows properties for vertices but none for edges, and thus can be viewed as a combination of LPG(vertices) and RDF (edges). Data stored in the LPG model can be converted to RDF, as describedin § 2.1.6. To benefit from different LPG features while keeping RDF advantages such as simplicity,some researchers proposed and implemented modifications to RDF. Examples are triple attributesor attaching triples to other triples (described in § 4.1.2).

There are very few systems that use neither RDF nor LPG. HyperGraphDB uses the hypergraphmodel and GBase uses a simple directed graph model without any labels or properties.

4.10.2 Storage Designs. Most graph database systems are build upon existing storage designs,such as key-value stores, document stores, wide-column stores, RDBMS, and others. The advantage

Page 27: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:27

of using existing storage designs is that these systems are usually mature and well-tested. Thedisadvantage is that they may not be perfectly optimized for graph data and graph queries. This iswhat native graph databases attempt to address.

One can distinguish record based systems from purely indexing based systems. Most systems arerecord based: they store data in consecutive memory blocks called records. These memory blocksmay be unstructured, as is the case with key-value stores. They can also be structured: documentdatabases often use the JSON format, wide-column stores have a key-value mapping inside eachrow, row-oriented RDBMS divide each row into columns, OODBMS impose some class definition,and tuple stores as well as some RDF stores use tuples. Record based systems store vertices inrecords, most often one vertex per one record. Some systems store edges in separate records, othersstore them together with the attached vertices. If one wants to find a property of a particular vertex,one has to find a record containing the vertex. The searched property is either stored directly inthat record, or its location is accessible via a pointer.Some systems do not use records at all, we call them purely indexing based. Sparksee and some

RDF systems as well as column-oriented RDBMS do not store information about vertices and edgesin dedicated records. Instead, they maintain data structures (usually in the form of indices) for eachproperty or label. The information about a given vertex is thus distributed over different structures.If one wants to find a property of a particular vertex, one has to query the associated data structure(index) for that property and find the value for the given vertex. Examples of such used indexstructures are B+ trees (in Sparksee) or hashtables (in some RDF systems).

4.10.3 Data Layout. Data layout determines how vertices and edges are represented and encodedin records. Some systems enable highly flexible structure inside their records. For example, documentdatabases that use JSON or wide-columns stores such as Titan and JanusGraph allow for differentkey-value mappings for each vertex and edge. Other record based systems are more fixed in theirstructure. For example, in OODBMS, one has to define a class for each configuration of vertex andedge properties. In RDBMS, one has to define tables for each vertex or edge type.Another aspect of a graph data layout is the design of the adjacency between records. One can

either assign each record an ID and then link records to one another via IDs, or one can use directmemory pointers. Using IDs requires an indexing structure to find the physical storage addressof a record associated with a particular ID. Direct memory pointers do not require an index fora traversal from one record to all its adjacent records, thus the use of pointers is often called theindex-free adjacency. Note that an index might still be used, for example to retrieve a vertex with aparticular property value (index-free adjacency only concerns adjacency between vertices).

4.10.4 Data Organization vs. Query Performance. Record based systems usually deliver moreperformance for queries that need to retrieve all or most information about a vertex or an edge.They are more efficient because the required data is stored in consecutive memory blocks. In purelyindexing based systems, one queries a data structure per property, which results in a more randomaccess pattern. On the other hand, if one only wants to retrieve single properties about vertices oredges, purely indexing based systems may only have to retrieve a single value. Contrarily, manyrecord based systems cannot retrieve only parts of records, fetching more data than necessary.Furthermore, a decision on whether to use IDs versus direct memory pointers to link records

depends on the read/write ratio of the workload for the given system. In the former case, onehas to use an indexing structure to find the address of the record. This slows down read queriescompared to following direct pointers. However, write queries can be more efficient with the use ofIDs instead of pointers. For example, when a record has to be moved to a new address, all pointers

Page 28: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:28 M. Besta et al.

to this record need to be updated to reflect this new address. IDs could remain the same, only theindexing structure needs to modify the address of the given record.

5 GRAPH DATABASE QUERIES ANDWORKLOADSWe provide a taxonomy of graph database queries and workloads. First, we categorize them usingthe scope of the accessed graph and thus, implicitly, the amount of accessed data (§ 5.1). We thenoutline the classification from the LDBC Benchmark [4] (§ 5.2). Next, the distinction into OLTP andOLAP is outlined (§ 5.3). Finally, a categorization of graph queries based on the matched patterns(§ 5.4) and loading input datasets into the database (§ 5.5) are discussed. Figure 20 summaries allelements of the proposed taxonomy.We omit detailed discussions and examples as they are provided in different benchmarks and

the associated papers [3, 4, 7, 12, 17, 43, 47, 59, 66, 93, 96, 120, 163]. Instead, our goal is to deliver abroad overview and taxonomy, and point the reader to the detailed material available elsewhere.

Interactive

Business Intelligence

Graph Analytics

Short Read-Only

Complex Read-Only

Transactional UpdateOLTP

OLAP

Simple Pattern Matching

Complex Pattern Matching

Navigational Pattern Matching

Path

Input Loading

Local Queries

Neighborhood Queries

Traversals

Global Analytics Queries

Input Accessing

Taxonomy of GraphDatabase Queries

Scope of Access (§ 5.1)

Online vs. Offline (§ 5.3)

LDBC Workloads (§ 5.2)

Matched Patterns (§ 5.4)General Scenario (§ 5.5)

Fig. 20. Taxonomy of different graph database queries and workloads.

5.1 Scopes of GraphQueriesWe describe queries in the increasing order of their scope. We focus on the LPG model, see § 2.1.3.Figure 21 depicts the scope of graph queries.

5.1.1 Local Queries. Local queries involve single vertices or edges. For example, given a vertexor an edge ID, one may want to retrieve the labels and properties of this vertex or edge. Otherexamples include fetching the value of a given property (given the property key), deriving the set

Page 29: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:29

P

PP

L L

PP

L

V

P

P

L

EV

Single vertices and edges

E

V

VE

V

VE

V

E

V

V

V

VV

V

V

V

V

E

E

E E

E

E

E

E

E

E

Subgraphs Whole graph

Local queries Neighborhood queries Traversals Global analytics

Scope / Complexity

Vertex Edge

Property

Label

Fig. 21. Illustration of different query scopes when accessing a Labeled Property Graph.

of all labels, or verifying whether a given vertex or an edge has a given label (given the label name).These queries are used in social network workloads [12, 17] (e.g., to fetch the profile information ofa user) and in benchmarks [96] (e.g., to measure the vertex look-up time).

5.1.2 NeighborhoodQueries. Neighborhood queries retrieve all edges attached to a given vertex,or the vertices adjacent to a given edge. This query can be further restricted by, for example,retrieving only the edgeswith a specific label. Similarly to local queries, social networks often requirea retrieval of the friends of a given person, which results in querying the local neighborhood [12, 17].

5.1.3 Traversals. In a traversal query, one explores a part of the graph beyond a single neigh-borhood. These queries usually start at a single vertex (or a small set of vertices) and traversesome graph part. We call the initial vertex or the set of vertices the anchor or root of the traversal.Queries can restrict what edges or vertices can be retrieved or traversed. As this is a common graphdatabase task, this query is also used in different performance benchmarks [47, 59, 96].

5.1.4 Global Graph Analytics. Finally, we identify graph analytics queries, often referred toas OLAP, which by definition consider the whole graph (not necessarily every property but allvertices and edges). Different benchmarks [43, 59, 120] take these large-scale queries into accountsince they are used in different fields such as threat detection [63] or computational chemistry [16].As indicated in Tables 2 and 3, many graph databases support such queries. Graph processingsystems such as Pregel [113] or Giraph [7] focus specifically on resolving OLAP. Example queriesinclude resolving global pattern matching [45, 158], shortest paths [58], max-flow or min-cut [51],minimum spanning trees [105], diameter, eccentricity, connected components, PageRank [134],and many others. Some traversals can also be global (e.g., finding all shortest paths of unrestrictedlength), thus falling into the category of global analytics queries.

5.2 Classes of Graph WorkloadsWe also outline an existing taxonomy of graph database workloads that is provided as a part of theLBDC benchmarks [4]. LDBC is an effort by academia and industry to establish a set of standardbenchmarks for measuring the performance of graph databases. The effort currently specifiesinteractive workloads, business intelligence workloads, and graph analytics workloads.

5.2.1 Interactive Workloads. A part of LDBC called the Social Network Benchmark (SNB) [66]identifies and analyzes interactive workloads that can collectively be described as either read-onlyqueries or simple transactional updates. They are divided into three further categories. First, shortread-only queries start with a single graph element (e.g., a vertex) and look up its neighbors orconduct small traversals. Second, complex read-only queries traverse larger parts of the graph; they

Page 30: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:30 M. Besta et al.

are used in the LDBC benchmark to not just assess the efficiency of the data retrieval processbut also the quality of query optimizers. Finally, transactional update queries insert either a singleelement (e.g., a vertex), possibly together with its adjacent edges, or a single edge. This workloadtests common graph database operations such as the look up of a friend profile in a social network.

5.2.2 Business Intelligence Workloads. Next, LDBC identifies business intelligence (BI) work-loads [163], which fetch large data volumes, spanning large parts of a graph. Contrarily to theinteractive workloads, the BI workloads heavily use summarization and aggregation operationssuch as sorting, counting, or deriving minimum, maximum, and average values. They are read-only. The LDBC specification provides an extensive list of BI workloads that were selected so thatdifferent performance aspects of a database are properly stressed when benchmarking.

5.2.3 Graph Analytics Workloads. Finally, the LDBC effort comes with a graph analytics bench-mark [93], where six graph algorithms are proposed as a standard benchmark for a graph analyticspart of a graph database. These algorithms are Breadth-First Search, PageRank [134], weakly con-nected components [76], community detection using label propagation [36], deriving the localclustering coefficient [154], and computing single-source shortest paths [93].

5.2.4 Scope of LDBC Workloads. The LDBC interactive workloads correspond to local, neighbor-hood, and traversals. The LDBC business intelligence workloads range from traversals to global graphanalytics queries. The LDBC graph analytics benchmark corresponds to global graph analytics.

5.3 Interactive TransactionalQuerying (OLTP) and Offline Graph Processing (OLAP)One can also distinguish between Online Transactional Processing (OLTP) and Offline AnalyticalProcessing (OLAP). Typically, OLTP workloads consist of many queries local in scope, such asneighborhood queries, certain restricted traversals, or lookups, inserts, deletes, and updates ofsingle vertices and edges. OLTP corresponds to LDBC’s interactive workloads. They are usuallyexecuted with some transactional guarantees. The goal is to achieve high throughput and answerthe queries at interactive speed (low latency). OLAP workloads have been a subject of numerousresearch efforts in the last decade [19, 28, 29, 98, 121, 161]. They are usually not processed atinteractive speeds, as the queries are inherently complex and global in scope. They correspond toLDBC’s graph analytics and to certain BI workloads.

5.4 Graph Patterns and Navigational ExpressionsAngles et al. [3] inspected in detail the theory of graph queries. In one identified family of graphqueries, called simple graph pattern matching, one prescribes a graph pattern (e.g., a specificationof a class of subgraphs) that is then matched to the graph maintained by the database, searchingfor the occurrences of this pattern. This query can be extended with aggregation and a projectionfunction to so called complex graph pattern matching. Furthermore, path queries allow to search forpaths of arbitrary distances in the graph. One can also combine complex graph pattern matchingand path queries, resulting in navigational graph pattern matching, in which a graph pattern can beapplied recursively on the parts of the path.

5.5 Input LoadingFinally, certain benchmarks also analyze bulk input loading [47, 59, 96]. Specifically, given an inputdataset, they measure the time to load this dataset into a database. This scenario is common whendata is migrated between systems.

Page 31: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:31

6 CHALLENGESThese are numerous research challenges related to the design of graph database systems.

First, establishing a single graph model for these systems is far from being complete. While LPGis used most often, (1) its definition is very broad and it is rarely fully supported, and (2) RDF is alsooften used in the context of storing and managing graphs. Moreover, it is unclear what are preciserelationships between a selected graph model and the corresponding consequences for storage andperformance tradeoffs when executing different types of workloads.

Second, a clear identification of the most advantageous design choices for different use cases isyet to be determined. As illustrated in this survey, existing systems support a plethora of forms ofdata organization, and it is not clear which ones are best for many scenarios. A strongly relatedchallenge is the best design for a system that supports both OLAP and OLTP workloads, ensuringhigh throughput and low latency of both of them.Third, while there exists past research into the impact of the underlying network on the per-

formance of a distributed graph analytics framework [132], little was done into investigating thisperformance relationship in the context of graph database workloads. To the best of our knowledge,there are no efforts into developing a topology-aware or routing-aware data distribution scheme forgraph databases, especially in the context of recently proposed data center and high-performancecomputing network topologies [24, 103] and routing architectures [30, 75, 111].

Moreover, contrarily to the general static graph processing and graph streaming, little researchexists into accelerating graph databases using different types of hardware architectures, accelerators,and hardware-related designs, for example FPGAs [21, 31], designs related to network interfacecards [26, 57], hardware transactions [25], and others [1, 22].

Finally, many research challenges in the design of graph databases are related specifically to thedesign of NoSQL stores. These challenges are discussed in more detail in past recent work [54] andinclude efficient data partitioning, user-friendly query formulation, high-performance transactionprocessing, and ensuring security in the form of authentication and encryption.

7 CONCLUSIONGraph databases constitute an important area of academic research and different industry efforts.They are used to maintain, query, and analyze numerous datasets in different domains in scienceand industry. A large number of graph databases of different types have been developed. They usemany data models and representations, they are constructed using miscellaneous design choices,and they enable a plethora of queries and workloads. We present the first survey and taxonomythat analyze the rich landscape of graph databases. We list and categorize the existing systems anddiscuss key design choices. Our work can be used by architects, developers, and project managerswho want to select the most advantageous graph database system or design for a given use case.

ACKNOWLEDGEMENTS We thank Mark Klein, Hussein Harake, Colin McMurtrie, and thewhole CSCS team granting access to the Ault and Daint machines, and for their excellent technicalsupport. We thank Timo Schneider for his immense help with computing infrastructure at SPCL.

REFERENCES[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A scalable processing-in-memory accelerator for parallel

graph processing. ACM SIGARCH Computer Architecture News, 43(3):105–117, 2016.[2] Amazon. Amazon Neptune. Available at https://aws.amazon.com/neptune/.[3] R. Angles, M. Arenas, P. Barceló, A. Hogan, J. Reutter, and D. Vrgoč. Foundations of Modern Query

Languages for Graph Databases. in ACM Comput. Surv., 50(5):68:1–68:40, 2017.

Page 32: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:32 M. Besta et al.

[4] R. Angles, P. Boncz, J. Larriba-Pey, I. Fundulaki, T. Neumann, O. Erling, P. Neubauer, N. Martinez-Bazan,V. Kotsev, and I. Toma. The linked data benchmark council: a graph and RDF industry benchmarkingeffort. ACM SIGMOD Record, 43(1):27–31, 2014.

[5] R. Angles and C. Gutierrez. Survey of Graph Database Models. in ACM Comput. Surv., pages 1:1–1:39,2008.

[6] Apache. Apache Cassandra. Available at https://cassandra.apache.org/.[7] Apache. Apache Giraph. Available at https://giraph.apache.org/.[8] Apache. Apache Mormotta. Available at http://marmotta.apache.org/.[9] Apache. Apache Solr. Available at https://lucene.apache.org/solr/.[10] ArangoDB Inc. ArangoDB. Available at https://docs.arangodb.com/3.3/Manual/DataModeling/Concepts.

html.[11] ArangoDB Inc. ArangoDB: Index Free Adjacency or Hybrid Indexes for Graph Databases. Available at

https://www.arangodb.com/2016/04/index-free-adjacency-hybrid-indexes-graph-databases/.[12] T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan. Linkbench: a database benchmark

based on the facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conferenceon Management of Data, pages 1185–1196. ACM, 2013.

[13] M. Atkinson, F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier, and S. Zdonik. The Object-OrientedDatabase System Manifesto. in DOOD, pages 223–40, 1989.

[14] P. Atzeni and V. De Antonellis. Relational database theory. Benjamin/Cummings Redwood City, CA,1993.

[15] Aurelius. Titan Data Model. Available at http://s3.thinkaurelius.com/docs/titan/1.0.0/data-model.html.[16] A. T. Balaban. Applications of graph theory in chemistry. Journal of chemical information and computer

sciences, 25(3):334–343, 1985.[17] S. Barahmand and S. Ghandeharizadeh. BG: A Benchmark to Evaluate Interactive Social Networking

Actions. In CIDR. Citeseer, 2013.[18] D. Bartholomew. Mariadb vs. MySQL. Dostopano, 7(10):2014, 2012.[19] O. Batarfi, R. El Shawi, A. G. Fayoumi, R. Nouri, A. Barnawi, S. Sakr, et al. Large scale graph processing

systems: survey and an experimental evaluation. Cluster Computing, 18(3):1189–1213, 2015.[20] M. Besta et al. Slim graph: Practical lossy graph compression for approximate graph processing, storage,

and analytics. 2019.[21] M. Besta, M. Fischer, T. Ben-Nun, J. De Fine Licht, and T. Hoefler. Substream-centric maximum match-

ings on fpga. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-ProgrammableGate Arrays, pages 152–161. ACM, 2019.

[22] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler. Slim noc: Alow-diameter on-chip network topology for high energy efficiency and scalability. In ACM SIGPLANNotices, volume 53, pages 43–55. ACM, 2018.

[23] M. Besta and T. Hoefler. Fault tolerance for remote memory access programming models. In Proceedingsof the 23rd international symposium on High-performance parallel and distributed computing, pages 37–48.ACM, 2014.

[24] M. Besta and T. Hoefler. Slim fly: A cost effective low-diameter network topology. In Proceedings ofthe International Conference for High Performance Computing, Networking, Storage and Analysis, pages348–359. IEEE Press, 2014.

[25] M. Besta and T. Hoefler. Accelerating irregular computations with hardware transactional memoryand active messages. In Proceedings of the 24th International Symposium on High-Performance Paralleland Distributed Computing, pages 161–172. ACM, 2015.

[26] M. Besta and T. Hoefler. Active access: A mechanism for high-performance distributed data-centriccomputations. In Proceedings of the 29th ACM on International Conference on Supercomputing, pages155–164. ACM, 2015.

[27] M. Besta and T. Hoefler. Survey and taxonomy of lossless graph compression and space-efficient graphrepresentations. arXiv preprint arXiv:1806.01799, 2018.

Page 33: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:33

[28] M. Besta, F. Marending, E. Solomonik, and T. Hoefler. Slimsell: A vectorizable graph representation forbreadth-first search. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS),pages 32–41. IEEE, 2017.

[29] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. To push or to pull: On reducingcommunication and synchronization in graph computations. In Proceedings of the 26th InternationalSymposium on High-Performance Parallel and Distributed Computing, pages 93–104. ACM, 2017.

[30] M. Besta, M. Schneider, K. Cynk, M. Konieczny, E. Henriksson, S. Di Girolamo, A. Singla, and T. Hoefler.Fatpaths: Routing in supercomputers, data centers, and clouds with low-diameter networks whenshortest paths fall short. arXiv preprint arXiv:1906.10885, 2019.

[31] M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler. Graph processing on fpgas: Taxonomy,survey, challenges. arXiv preprint arXiv:1903.06697, 2019.

[32] M. Besta, D. Stanojevic, T. Zivic, J. Singh, M. Hoerold, and T. Hoefler. Log (graph): a near-optimalhigh-performance graph representation. In PACT, pages 7–1, 2018.

[33] Bitnine Global Inc. AgensGraph. Available at https://bitnine.net/agensgraph-2/.[34] Blazegraph. BlazeGraph DB. Available at https://www.blazegraph.com/.[35] S. Boag, D. Chamberlin, M. F. Fernández, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. Xquery

1.0: An xml query language. 2002.[36] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free

ordering for compressing social networks. In Proceedings of the 20th international conference on Worldwide web, pages 587–596. ACM, 2011.

[37] U. Brandes. A faster algorithm for betweenness centrality*. Journal of Mathematical Sociology, 25(2):163–177, 2001.

[38] T. Bray. The JavaScript object notation (JSON) data interchange format. 2014.[39] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible markup language (XML)

1.0, 2000.[40] Callidus Software Inc. OrientDB. Available at https://orientdb.com.[41] Callidus Software Inc. OrientDB: Lightweight Edges. Available at https://orientdb.com/docs/3.0.x/java/

Lightweight-Edges.html.[42] Cambridge Semantics. AnzoGraph. Available at https://www.cambridgesemantics.com/product/

anzograph/.[43] M. Capotă, T. Hegeman, A. Iosup, A. Prat-Pérez, O. Erling, and P. Boncz. Graphalytics: A big data

benchmark for graph-processing platforms. In Proceedings of the GRADES’15, page 7. ACM, 2015.[44] Cayley. CayleyGraph. Available at https://cayley.io/ and https://github.com/cayleygraph/cayley.[45] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast Graph Pattern Matching. ICDE, pages 913–922,

2008.[46] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph

processing at facebook-scale. Proceedings of the VLDB Endowment, 8(12):1804–1815, 2015.[47] M. Ciglan, A. Averbuch, and L. Hluchy. Benchmarking traversal operations over graph databases. In

2012 IEEE 28th International Conference on Data Engineering Workshops, pages 186–189. IEEE, 2012.[48] J. Clark, S. DeRose, et al. Xml path language (xpath) version 1.0, 1999.[49] E. F. Codd. Relational database: a practical foundation for productivity. In Readings in Artificial

Intelligence and Databases, pages 60–68. Elsevier, 1989.[50] R. Cyganiak, D. Wood, and M. Lanthaler. RDF 1.1 Concepts and Abstract Syntax. Available at

https://www.w3.org/TR/rdf11-concepts/.[51] G. B. Dantzig and D. R. Fulkerson. On the max flow min cut theorem of networks. RAND CORP, 1955.[52] DataStax, Inc. DSE Graph (DataStax). Available at https://www.datastax.com/.[53] C. J. Date and H. Darwen. A Guide to the SQL Standard, volume 3. Addison-Wesley New York, 1987.[54] A. Davoudian, L. Chen, and M. Liu. A survey on NoSQL stores. ACM Computing Surveys (CSUR),

51(2):40, 2018.[55] Dgraph Labs Inc. BadgerDB. https://dbdb.io/db/badgerdb.[56] Dgraph Labs, Inc. DGraph. Available at https://dgraph.io/, https://docs.dgraph.io/design-concepts.

Page 34: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:34 M. Besta et al.

[57] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek, M. Besta, L. Benini,D. Roweth, and T. Hoefler. Network-accelerated non-contiguous memory transfers. arXiv preprintarXiv:1908.08590, 2019.

[58] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, pages269–271, 1959.

[59] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vanó, S. Gómez-Villamor, N. Martínez-Bazan, and J.-L.Larriba-Pey. Survey of graph database performance on the HPC scalable graph analysis benchmark. InInternational Conference on Web-Age Information Management, pages 37–48. Springer, 2010.

[60] C. Douglas. Ubiquitous B-Tree. in ACM Comput. Surv., pages 121–137, 1979.[61] A. Dubey, G. D. Hill, R. Escriva, and E. G. Sirer. Weaver: A High-performance, Transactional Graph

Database Based on Refinable Timestamps. in VLDB, pages 852–863, 2016. Available at http://weaver.systems/.

[62] P. DuBois and M. Foreword By-Widenius. MySQL. New Riders Publishing, 1999.[63] W. Eberle, J. Graves, and L. Holder. Insider threat detection using a graph-based approach. Journal of

Applied Security Research, 6(1):32–81, 2010.[64] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. Stinger: High performance data structure for streaming

graphs. In 2012 IEEE Conference on High Performance Extreme Computing, pages 1–5. IEEE, 2012.[65] Elastic. Elasticsearch. Available at https://www.elastic.co/products/elasticsearch.[66] O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat, M.-D. Pham, and P. Boncz. The

LDBC Social Network Benchmark: Interactive Workload. in SIGMOD, pages 619–630, 2015.[67] FactNexus. GraphBase. Available at https://graphbase.ai/.[68] Fauna. FaunaDB. Available at https://fauna.com/.[69] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg,

P. Selmer, and A. Taylor. Cypher: An evolving query language for property graphs. In Proceedings ofthe 2018 International Conference on Management of Data, pages 1433–1445. ACM, 2018.

[70] Franz Inc. AllegroGraph. Available at https://franz.com/agraph/allegrograph/.[71] S. K. Gajendran. A survey on NoSQL databases. University of Illinois, 2012.[72] H. Garcia-Molina, J. D. Ullman, and J. Widom. Data Replication. In Database Systems: The Complete

Book, 1st Edition, chapter 19.4.3, page 1021. Pearson, 2002.[73] L. George. HBase: the definitive guide: random access to your planet-size data. O’Reilly Media, Inc., 2011.[74] R. Gerstenberger, M. Besta, and T. Hoefler. Enabling highly-scalable remote memory access program-

ming with mpi-3 one sided. Scientific Programming, 22(2):75–91, 2014.[75] S. Ghorbani, Z. Yang, P. Godfrey, Y. Ganjali, and A. Firoozshahian. Drill: Micro load balancing for

low-latency data center networks. In Proceedings of the Conference of the ACM Special Interest Groupon Data Communication, pages 225–238. ACM, 2017.

[76] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler. Communication-avoiding parallelminimum cuts and connected components. In ACM SIGPLAN Notices, volume 53, pages 219–232. ACM,2018.

[77] Google. Graphd. Available at https://github.com/google/graphd.[78] Graph Story Inc. Graph Story. Available at https://www.graphstory.com/.[79] A. Green, M. Junghanns, M. Kießling, T. Lindaaker, S. Plantikow, and P. Selmer. opencypher: New

directions in property graph querying. In EDBT, pages 520–523, 2018.[80] L. Guzenda. Objectivity/DB-a high performance object database architecture. InWorkshop on High

performance object databases, 2000.[81] J. Han, E. Haihong, G. Le, and J. Du. Survey on NoSQL database. In 2011 6th international conference on

pervasive computing and applications, pages 363–366. IEEE, 2011.[82] O. Hartig. Reconciliation of RDF* and Property Graphs. Available at https://arxiv.org/pdf/1409.3288.pdf .[83] O. Hartig and J. Pérez. Semantics and complexity of graphql. In Proceedings of the 2018 World Wide Web

Conference, pages 1155–1164. International World Wide Web Conferences Steering Committee, 2018.[84] J. Hayes. A Graph Model for RDF. diploma thesis, Technische Universität Darmstadt, Universidad de

Chile, 2004.

Page 35: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:35

[85] J. M. Hellerstein and M. Stonebraker. Readings in database systems. MIT Press, 2005.[86] T. Hoefler, J. Dinan, R. Thakur, B. Barrett, P. Balaji, W. Gropp, and K. Underwood. Remote memory

access programming in MPI-3. ACM Transactions on Parallel Computing, 2(2):9, 2015.[87] J. A. Hoffer, V. Ramesh, and H. Topi. Modern database management. Upper Saddle River, NJ: Prentice

Hall, 2011.[88] F. Holzschuher and R. Peinl. Performance of graph query languages: comparison of cypher, gremlin

and native access in neo4j. In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pages 195–204. ACM,2013.

[89] S. Hong, S. Depner, T. Manhardt, J. Van Der Lugt, M. Verstraaten, and H. Chafi. Pgx.d: a fast distributedgraph processing engine. In Proceedings of the International Conference for High Performance Computing,Networking, Storage and Analysis, page 58. ACM, 2015.

[90] InfiniBand Trade Association. Infiniband: Architecture specification 1.3. 2015.[91] InfoGrid. The InfoGrid Graph Database.[92] B. Iordanov. HyperGraphDB: A Generalized Graph Database. In WAIM, pages 25–36, 2010.[93] A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T. Manhardto, H. Chafio, M. Capotă,

N. Sundaram, M. Anderson, I. G. Tănase, Y. Xia, L. Nai, and P. Boncz. LDBC Graphalytics: A Benchmarkfor Large-scale Graph Analysis on Parallel and Distributed Platforms. in VLDB, pages 1317–1328, 2016.

[94] Jesús Barrasa. RDF Triple Stores vs. Labeled Property Graphs: What’s the Difference? Available athttps://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/.

[95] B. Jiang. A short note on data-intensive geospatial computing. In Information Fusion and GeographicInformation Systems, pages 13–17. Springer, 2011.

[96] S. Jouili and V. Vansteenberghe. An empirical comparison of graph databases. In 2013 InternationalConference on Social Computing, pages 708–715. IEEE, 2013.

[97] M. Junghanns, A. Petermann, M. Neumann, and E. Rahm. Management and analysis of big graph data:current systems and open challenges. In Handbook of Big Data Technologies, pages 457–505. Springer,2017.

[98] V. Kalavri, V. Vlassov, and S. Haridi. High-level programming abstractions for distributed graphprocessing. IEEE Transactions on Knowledge and Data Engineering, 30(2):305–324, 2017.

[99] R. K. Kaliyar. Graph databases: A survey. In ICCCA, pages 785–790, 2015.[100] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos. Gbase: An efficient analysis platform for large

graphs. In PVLDB, pages 637–650, 2012.[101] M. Kay. XSLT: programmer’s reference. Wrox Press Ltd., 2001.[102] J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine,

H. Meyerhenke, et al. Mathematical Foundations of the GraphBLAS. arXiv preprint arXiv:1606.05790,2016.

[103] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In 2008International Symposium on Computer Architecture, pages 77–88. IEEE, 2008.

[104] V. Kolomicenko. Analysis and Experimental Comparison of Graph Databases. Charles University inPrague, 2013. Master Thesis.

[105] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc.Amer. Math. Soc., pages 48–50, 1956.

[106] R. Kuć. Apache Solr 4 Cookbook. Packt Publishing Ltd, 2013.[107] V. Kumar and A. Babu. Domain suitable graph database selection: A preliminary report. In 3rd

International Conference on Advances in Engineering Sciences & Applied Mathematics, London, UK, pages26–29, 2015.

[108] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPSOperating Systems Review, 44(2):35–40, 2010.

[109] LambdaZen LLC. Bitsy. Available at https://github.com/lambdazen/bitsy and https://bitbucket.org/lambdazen/bitsy/wiki/Home.

[110] H. Lin, X. Zhu, B. Yu, X. Tang, W. Xue, W. Chen, L. Zhang, T. Hoefler, X. Ma, X. Liu, et al. Shentu:processing multi-trillion edge graphs on millions of cores in seconds. In Proceedings of the International

Page 36: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:36 M. Besta et al.

Conference for High Performance Computing, Networking, Storage, and Analysis, page 56. IEEE Press,2018.

[111] Y. Lu, G. Chen, B. Li, K. Tan, Y. Xiong, P. Cheng, J. Zhang, E. Chen, and T. Moscibroda. Multi-pathtransport for {RDMA} in datacenters. In 15th {USENIX} Symposium on Networked Systems Design andImplementation ({NSDI} 18), pages 357–371, 2018.

[112] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. W. Berry. Challenges in Parallel Graph Processing.Par. Proc. Let., 17(1):5–20, 2007.

[113] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: asystem for large-scale graph processing. In Proc. of the ACM SIGMOD Intl. Conf. on Manag. of Data,SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM.

[114] MariaDB. OQGRAPH. Available at https://mariadb.com/kb/en/library/oqgraph-storage-engine/.[115] MarkLogic Corporation. MarkLogic. Available at https://www.marklogic.com.[116] J. Marton, G. Szárnyas, and D. Varró. Formalising opencypher graph queries in relational algebra. In

European Conference on Advances in Databases and Information Systems, pages 182–196. Springer, 2017.[117] N. Martínez-Bazan, V. Muntés-Mulero, S. Gómez-Villamor, M. Águila Lorente, D. Dominguez-Sal, and

J.-L. Larriba-Pey. Efficient Graph Management Based On Bitmap Indices. In IDEAS, pages 110–119,2012.

[118] M. McCandless, E. Hatcher, and O. Gospodnetic. Lucene in action: covers Apache Lucene 3.0. ManningPublications Co., 2010.

[119] R. McColl, D. Ediger, J. Poovey, D. Campbell, and D. Bader. A brief study of open source graph databases.arXiv preprint arXiv:1309.2675, 2013.

[120] R. C. McColl, D. Ediger, J. Poovey, D. Campbell, and D. A. Bader. A performance evaluation of opensource graph databases. In Proceedings of the first workshop on Parallel programming for analyticsapplications, pages 11–18. ACM, 2014.

[121] R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: a survey of vertex-centric frameworksfor large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015.

[122] Memgraph Ltd. Memgraph. Available at https://memgraph.com/.[123] Microsoft. Azure Cosmos DB. Available at https://azure.microsoft.com/en-us/services/cosmos-db/.[124] Microsoft. Microsoft SQL Server 2017. Available at https://www.microsoft.com/en-us/sql-server/

sql-server-2017.[125] B. Momjian. PostgreSQL: introduction and concepts, volume 192. Addison-Wesley New York, 2001.[126] T. Mueller. H2 database engine, 2006.[127] Networked Planet Limited. BrightstarDB. Available at http://brightstardb.com/.[128] Objectivity Inc. ThingSpan. Available at https://www.objectivity.com/products/thingspan/.[129] Ontotext. GraphDB. Available at https://www.ontotext.com/products/graphdb/.[130] OpenLink. Virtuoso. Available at https://virtuoso.openlinksw.com/.[131] Oracle. Oracle Spatial and Graph. Available at https://www.oracle.com/database/technologies/

spatialandgraph.html.[132] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun. Making sense of performance in data

analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation(NSDI 15), pages 293–307, 2015.

[133] M. T. Özsu. A Survey of RDF Data Management Systems. CoRR, abs/1601.00707, 2016.[134] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the

web. Technical report, Stanford InfoLab, 1999.[135] N. Patil, P. Kiran, N. Kavya, and K. Naresh Patel. A Survey on Graph Database Management Techniques

for Huge Unstructured Data. International Journal of Electrical and Computer Engineering, 81:1140–1149,2018.

[136] J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of sparql. ACM Transactions onDatabase Systems (TODS), 34(3):16, 2009.

[137] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database.In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 1–2.

Page 37: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

Survey and Taxonomy of Graph Databases 1:37

ACM, 2009.[138] J. Pokorny. Graph databases: their power and limitations. In IFIP International Conference on Computer

Information Systems and Industrial Management, pages 58–69. Springer, 2015.[139] Profium. Profium Sense. Available at https://www.profium.com/en/.[140] S. B. N. Ramez Elmasri. Advantages of Distributed Databases. In Fundamentals of Database Systems,

6th edition, chapter 25.1.5, page 882. Pearson, 2011.[141] S. B. N. Ramez Elmasri. Data Fragmentation. In Fundamentals of Database Systems, 6th edition, chapter

25.4.1, pages 894–897. Pearson, 2011.[142] A. Rana. Detailed Introduction: Redis Modules, from Graphs to Machine Learning (Part 1), 2 2019.[143] M. Raynal. Using Semaphores to Solve the Readers-Writers Problem. In Concurrent Programming:

Algorithms, Principles, and Foundations, chapter 3.2.4, pages 74–78. Springer, 2013.[144] Redis Labs. Redis. https://redis.io/.[145] Redis Labs. RedisGraph. Available at https://oss.redislabs.com/redisgraph/.[146] C. D. Rickett, U.-U. Haus, J. Maltby, and K. J. Maschhoff. Loading andQuerying a Trillion RDF triples

with Cray Graph Engine on the Cray XC. Cray Users Group, 2018.[147] Robert Yokota. HGraphDB. Available at https://github.com/rayokota/hgraphdb.[148] I. Robinson, J. Webber, and E. Eifrem. Graph database internals. In Graph Databases, Second Edition,

chapter 7, pages 149–170. O’Relly, 2015.[149] M. A. Rodriguez. The gremlin graph traversal machine and language (invited talk). In Proceedings of

the 15th Symposium on Database Programming Languages, pages 1–10. ACM, 2015.[150] M. A. Rodriguez. The Gremlin graph traversal machine and language (invited talk). In Proceedings of

the 15th Symposium on Database Programming Languages, pages 1–10. ACM, 2015.[151] A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from

secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 410–424.ACM, 2015.

[152] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. The Ubiquity of Large Graphs and SurprisingChallenges of Graph Processing. in VLDB, 11(4):420–431, 2017.

[153] SAP. SAP HANA. Available at https://www.sap.com/products/hana.html.[154] S. E. Schaeffer. Graph clustering. Computer science review, 1(1):27–64, 2007.[155] P. Schmid, M. Besta, and T. Hoefler. High-performance distributed rma locks. In Proceedings of the 25th

ACM International Symposium on High-Performance Parallel and Distributed Computing, pages 19–30.ACM, 2016.

[156] A. Seaborne and S. Harris. SPARQL 1.1 query language. Available at http://www.w3.org/TR/2013/REC-sparql11-query-20130321/.

[157] B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In SIGMOD, pages505–516, 2013.

[158] K. Singh and V. Singh. Graph pattern matching: A brief survey of challenges and research directions. inINDIACom, pages 199–204, 2016.

[159] solid IT gmbh. DB-Engines: Ranking of Graph DBMS. Available at https://db-engines.com/en/ranking/graph+dbms.

[160] solid IT gmbh. System Properties Comparison: Neo4j vs. Redis. Available at https://db-engines.com/en/system/Neo4j%3BRedis.

[161] E. Solomonik, M. Besta, F. Vella, and T. Hoefler. Scaling betweenness centrality using communication-efficient sparse matrix multiplication. In Proceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis, page 47. ACM, 2017.

[162] Stardog Union. Stardog. Available at https://www.stardog.com/.[163] G. Szárnyas, A. Prat-Pérez, A. Averbuch, J. Marton, M. Paradies, M. Kaufmann, O. Erling, P. Boncz,

V. Haprian, and J. B. Antal. An Early Look at the LDBC Social Network Benchmark’s BusinessIntelligence Workload. GRADES-NDA, pages 9:1–9:11, 2018.

[164] A. T. Tan and H. Kaiser. Extending C++ with Co-array Semantics. in ARRAY, pages 63–68, 2016.

Page 38: Demystifying Graph Databases: Analysis and Taxonomyof …htor.inf.ethz.ch/publications/img/graph-databases.pdfDemystifying Graph Databases: Analysis and Taxonomy of Data Organization,

1:38 M. Besta et al.

[165] A. Tate, A. Kamil, A. Dubey, A. Größlinger, B. Chamberlain, B. Goglin, C. Edwards, C. J. Newburn,D. Padua, D. Unat, et al. Programming abstractions for data locality. PADAL Workshop 2014, April28–29, Swiss National Supercomputing Center . . . , 2014.

[166] The Linux Foundation. JanusGraph. Available at http://janusgraph.org/.[167] TigerGraph. TigerGraph. Available at https://www.tigergraph.com/.[168] Twitter. FlockDB. Available at https://github.com/twitter-archive/flockdb.[169] N. Tyagi and N. Singh. Graph database-an overview of its applications and its types.[170] A. Vaikuntam and V. K. Perumal. Evaluation of contemporary graph databases. In Proceedings of the

7th ACM India Computing Conference, page 6. ACM, 2014.[171] VelocityDB Inc. VelocityDB. Available at https://velocitydb.com/.[172] VelocityDB Inc. VelocityGraph. Available at https://velocitydb.com/VelocityGraph.aspx.[173] WhiteDB Team. WhiteDB. Available at http://whitedb.org/

https://github.com/priitj/whitedb.[174] K. Xirogiannopoulos, V. Srinivas, and A. Deshpande. GraphGen: Adaptive Graph Processing Using

Relational Databases. in GRADES, pages 9:1–9:7, 2017.[175] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. TripleBit: a fast and compact system for large scale

RDF data. Proceedings of the VLDB Endowment, 6(7):517–528, 2013.[176] K. Zhao and J. X. Yu. All-in-One: Graph Processing in RDBMSs Revisited. in SIGMOD, pages 1165–1180,

2017.