
Graph and Timeseries Databases
Databases 2 (VU) (706.711 / 707.030)

Roman Kern

Institute of Interactive Systems and Data Science, Technical University Graz

2018-10-22

Roman Kern (ISDS, TU Graz) Graph/TimeDBs 2018-10-22 1 / 30

Graph Databases
Motivation and Basics of Graph Databases


Introduction - Graph Databases

What is a graph database?

Data storage optimised for graph data structures

▶ i.e., efficient storage and access

Scales gracefully with the amount of data

Additional index for look-ups



Why should graph databases work?

Networks usually have certain properties
▶ Small-world phenomenon
▶ … even in big networks, only a few hops are required on average to reach even distant nodes

Access to data follows certain patterns
▶ Locality of reference
▶ … operations are focused on certain areas



Why should one not use a graph database?

… if all the data is updated at once
▶ E.g. an operation applied to all nodes

… if the query cannot easily be expressed as a graph traversal operation
▶ E.g. a lot of random access, or aggregate functions


Introduction

Graph database vs. relational database

In principle a graph database can be implemented
▶ … via a relational database
▶ … using joins or consecutive queries

But relational databases are not optimised for such graph models
▶ … i.e., a lot of sparse (semi-empty) rows

Additionally, relational databases are not designed
▶ … for changes in the schema, e.g. dynamic types of relations
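
A minimal sketch of the "joins or consecutive queries" point, assuming a hypothetical `edges(src, dst)` table in an in-memory SQLite database (table and node names are made up for illustration). Each additional hop costs one more self-join, and arbitrary depth needs a recursive query.

```python
import sqlite3

# Hypothetical schema for illustration: one row per directed edge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("alice", "bob"), ("bob", "carol"), ("carol", "dave")],
)

# One hop: a plain look-up.
one_hop = conn.execute(
    "SELECT dst FROM edges WHERE src = ? ORDER BY dst", ("alice",)
).fetchall()

# Two hops: the edge table has to be joined with itself.
two_hops = conn.execute(
    """
    SELECT e2.dst
    FROM edges AS e1
    JOIN edges AS e2 ON e1.dst = e2.src
    WHERE e1.src = ?
    ORDER BY e2.dst
    """,
    ("alice",),
).fetchall()

# Arbitrary depth: a recursive common table expression.
reachable = conn.execute(
    """
    WITH RECURSIVE reach(node) AS (
        SELECT ?
        UNION
        SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
    """,
    ("alice",),
).fetchall()
```

This works, but the traversal logic lives in join conditions rather than in the data model, which is the mismatch the slide refers to.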


Introduction

Figure: Comparison of import time for 2 graph databases vs. a relational DB


Main concepts

Many contemporary graph databases are based on property graphs
▶ i.e., each node and each edge is associated with a set of key/value pairs
▶ … where edges are directed (and often carry a label)

Often support ACID properties
▶ i.e., each modification takes place within a transaction
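
The property graph model can be sketched in a few lines. This is an illustrative in-memory structure under assumed names (`Node`, `Edge`, `PropertyGraph`), not the storage layout of any particular graph database, and it leaves out transactions entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)  # key/value pairs


@dataclass
class Edge:
    source: str  # edges are directed: source -> target
    target: str
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}     # node_id -> Node
        self.outgoing = {}  # node_id -> list of outgoing Edge objects

    def add_node(self, node_id, **properties):
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        self.outgoing.setdefault(node_id, [])
        return node

    def add_edge(self, source, target, label, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, label, dict(properties))
        self.outgoing[source].append(edge)
        return edge

    def neighbours(self, node_id, label=None):
        """Targets of outgoing edges, optionally restricted to one label."""
        return [
            e.target
            for e in self.outgoing[node_id]
            if label is None or e.label == label
        ]


# Made-up example data.
graph = PropertyGraph()
graph.add_node("p1", name="Alice", kind="person")
graph.add_node("c1", name="Acme Ltd.", kind="company")
graph.add_edge("p1", "c1", "WORKS_FOR", since=2015)
```

Because edges are directed, `graph.neighbours("c1")` is empty here even though `p1` points to `c1`.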



Query types for graph databases

Lookup of nodes

Traversal of a graph
▶ Start at a node
▶ … continue following edges
▶ … until a stopping criterion has been reached
▶ Breadth-first vs. depth-first

Path finding
▶ Find a path between two nodes (e.g. Dijkstra, A*)

Path matching
▶ Matching patterns in the graph
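
Two of these query types can be sketched on a small adjacency map: a breadth-first traversal with a hop limit as the stopping criterion, and Dijkstra's algorithm for path finding. The graph `GRAPH` and its weights are made up, and Dijkstra assumes non-negative edge weights.

```python
import heapq
from collections import deque

# Hypothetical weighted, directed graph: node -> {neighbour: edge weight}.
GRAPH = {
    "a": {"b": 1, "c": 4},
    "b": {"c": 2, "d": 5},
    "c": {"d": 1},
    "d": {},
}


def bfs(graph, start, max_depth):
    """Breadth-first traversal that stops expanding after max_depth hops."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:  # stopping criterion
            continue
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                visited.append(neighbour)
                queue.append((neighbour, depth + 1))
    return visited


def dijkstra(graph, start, goal):
    """Return (cost, path) of a cheapest path, or None if goal is unreachable."""
    heap = [(0, start, [start])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in settled:
                heapq.heappush(heap, (cost + weight, neighbour, path + [neighbour]))
    return None
```

A depth-first variant would replace the queue with a stack (`pop()` instead of `popleft()`); the stopping criterion stays the same.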


Modelling of Graph Databases
How to represent the data and how to model


Modelling of Graph Databases

Approach

What should a graph database schema look like?
▶ i.e., how is the data represented as nodes and edges

Many different ways of modelling
▶ Graphs are a very flexible data structure
▶ … capable of capturing many domain models

In many cases a direct mapping is possible
▶ Domain model and graph model

Need to review the model
▶ Validate that the graph is suited for the queries being used
▶ E.g. don’t mix entities with relations


Modelling of Graph Databases

How to model for graph databases


Application
Practical aspects of graph databases


Application of Graph Databases

Main software tools for graph databases

Neo4j

OrientDB

TitanDB

… and many others



Panama Papers - Intro

Leaked documents from a firm in Panama (2.6TB of data)
▶ … about offshore activities

Journalists (around the world) worked on analysing the data
▶ … a graph database was the back-end of these activities
▶ Neo4j (plus Apache Solr and Tika)



Panama Papers - Steps

Populate the database
▶ Analyse the documents
⋆ e.g., entity extraction (detect names)
▶ → entities in the graph (entity types: company, officer, client, address, …)
▶ Extract meta-data of documents
▶ → properties for nodes

Detect relationships
▶ e.g. using the connection of sender/receiver of e-mails
▶ → connections between the nodes

Refinement of graph
▶ Manual work conducted by the journalists
▶ More entity types, e.g. money flow, document types
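
The "detect relationships" step can be illustrated with a toy example. This is only a sketch on made-up e-mail metadata (`EMAILS`, `detect_relationships` and all names in the data are hypothetical); it is not the pipeline the journalists actually used.

```python
from collections import Counter

# Made-up e-mail metadata; a real pipeline would extract this from the documents.
EMAILS = [
    {"sender": "officer_a", "receivers": ["client_x", "client_y"]},
    {"sender": "officer_a", "receivers": ["client_x"]},
    {"sender": "client_y", "receivers": ["officer_a"]},
]


def detect_relationships(emails):
    """Count directed sender -> receiver pairs; each pair becomes one edge."""
    edges = Counter()
    for mail in emails:
        for receiver in mail["receivers"]:
            edges[(mail["sender"], receiver)] += 1
    return edges


edges = detect_relationships(EMAILS)
```

The counts could be stored as an edge property (e.g. number of messages), which fits the property graph model described earlier.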



https://neo4j.com/blog/analyzing-panama-papers-neo4j/


Time Series Databases
Motivation and Basics of Time Series Databases


Introduction - Time Series Databases

What is a time series database?

Data storage optimised for temporal data
▶ “Endless” stream of incoming data
▶ … challenge for traditional databases

Often accompanied by
▶ Tools to acquire time series data
▶ Tools to visualise and analyse such data



Typical characteristics of time series databases

Fast write/append operations

Slow update/delete operations

Scales well to huge amounts of data

Retention policy
▶ i.e., to forget old data

Access restrictions

Provide (SQL-like) query languages
▶ Optimised for time range queries
▶ Specialised queries for aggregates

Often rely on other storage mechanisms
▶ e.g., key-value store, wide-column storage
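
Several of these characteristics (append-only writes, a retention policy, time range queries, aggregates) fit into a small in-memory sketch. The class `TinySeries` and its parameters are assumptions for illustration; real time series databases persist data and handle out-of-order writes, which this sketch rejects.

```python
import bisect


class TinySeries:
    """Append-only series of (timestamp, value), kept sorted by timestamp."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.timestamps = []
        self.values = []

    def append(self, timestamp, value):
        # Appending at the end is cheap; updates in the middle are not offered.
        if self.timestamps and timestamp < self.timestamps[-1]:
            raise ValueError("out-of-order writes are not supported in this sketch")
        self.timestamps.append(timestamp)
        self.values.append(value)
        self._apply_retention(now=timestamp)

    def _apply_retention(self, now):
        # Retention policy: forget everything older than the cutoff.
        cutoff = now - self.retention_seconds
        drop = bisect.bisect_left(self.timestamps, cutoff)
        del self.timestamps[:drop]
        del self.values[:drop]

    def range(self, start, end):
        """Values with start <= timestamp < end, found via binary search."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_left(self.timestamps, end)
        return self.values[lo:hi]

    def mean(self, start, end):
        window = self.range(start, end)
        return sum(window) / len(window) if window else None


# Made-up data: one observation every 30 seconds, 60 seconds of retention.
series = TinySeries(retention_seconds=60)
for t, v in [(0, 1.0), (30, 2.0), (60, 3.0), (90, 4.0)]:
    series.append(t, v)
```

After the last append the observation at timestamp 0 has been dropped, because it is older than the 60-second retention window.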



Sources of time series

Observations
▶ Environmental, e.g., weather data, CO2

Economy
▶ e.g., stock exchange data

Sensor data
▶ e.g., human sensor data

Log files

…



Types of time series databases

Limited type of payload
▶ E.g. limited to just timestamp + number
▶ → least amount of memory needed

Flexible payload
▶ Allows for richer representation
▶ E.g. timestamp + document

Wide tables
▶ Each row consists of many columns
▶ … often hundreds of columns
▶ → sparse rows


Modelling of Time Series Databases
How to represent the data and how to model


Modelling of Time Series Databases

Approach

Often based on single samples (observations)
▶ Univariate, bivariate, multivariate
⋆ E.g. sensor readouts of multiple sensors (temperature, air pressure)
▶ Example: Measurement consists of
⋆ Timestamp, metric name, value, list of filters
⋆ E.g. 10:32, cpu-usage, 0.87, host=example.com, cpu=01

Flat file
▶ Generic vs. specific
▶ Store the name of the time series with each observation (generic)
⋆ Needed in case of dynamic systems
⋆ e.g. different sensors become available or disappear
▶ Have dedicated time series (specific)
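
The difference between the generic and the specific layout can be sketched as follows. The first observation is the one from the slide; the other two, and the names `Measurement`, `make` and `to_specific`, are assumptions added for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    timestamp: str
    metric: str
    value: float
    filters: tuple = ()  # sorted (key, value) pairs, e.g. (("cpu", "01"), ...)


def make(timestamp, metric, value, **filters):
    return Measurement(timestamp, metric, value, tuple(sorted(filters.items())))


# Generic layout: every observation carries the name of its series.
generic = [
    make("10:32", "cpu-usage", 0.87, host="example.com", cpu="01"),
    make("10:33", "cpu-usage", 0.91, host="example.com", cpu="01"),  # made up
    make("10:32", "cpu-usage", 0.12, host="example.com", cpu="02"),  # made up
]


def to_specific(observations):
    """Specific layout: one dedicated list of (timestamp, value) per series."""
    series = defaultdict(list)
    for m in observations:
        series[(m.metric, m.filters)].append((m.timestamp, m.value))
    return dict(series)


specific = to_specific(generic)
```

In the generic layout a new sensor simply shows up as a new combination of metric name and filters; the specific layout needs a new dedicated series for it.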


Modelling of Time Series Databases

Approach

Windowed storage
▶ Each row represents a time window
▶ Columns for a more fine-grained resolution
⋆ Typically between 100 and 1000 observations per row
▶ Alternatively, multiple observations are stored in a single column
⋆ Using a custom (compressed) format

Special case: temporal and spatial data
▶ Requires specialised look-up methods
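
Windowed storage can be sketched by bucketing observations into rows keyed by the start of their time window, with the offset inside the window acting as the column. The one-hour window and the sample data are assumptions; compression and the spatial case are left out.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumed window size: one row per hour


def to_windowed_rows(observations, window_seconds=WINDOW_SECONDS):
    """Group (timestamp, value) pairs into rows: window start -> {offset: value}."""
    rows = defaultdict(dict)
    for timestamp, value in observations:
        window_start = timestamp - timestamp % window_seconds  # row key
        offset = timestamp - window_start                      # column key
        rows[window_start][offset] = value
    return dict(rows)


# Made-up temperature readings as (seconds since start, value).
observations = [(0, 20.1), (60, 20.3), (3599, 20.2), (3600, 19.8), (7260, 19.5)]
rows = to_windowed_rows(observations)
```

A time range query then only has to touch the rows whose window overlaps the requested range, instead of scanning individual observations.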


Example for Time Series Databases
Practical Aspects of Time Series Databases


Time Series Databases Example

TICK Stack

Collection of tools:
▶ Telegraf: server agent for collecting and reporting metrics (stream or batch processing) to write data into the DB
▶ InfluxDB: the time series database component
▶ Chronograf: graphing and visualisation front-end for exploration
▶ Kapacitor: data processing engine, can process stream and batch data

https://www.influxdata.com/wp-content/themes/influx/images/TICK-Stack.png




InfluxDB Features

Tags
▶ Tags are indexed
▶ Store commonly queried meta-data
▶ Use if “GROUP BY” should be used on the data

Fields
▶ Fields are not indexed
▶ Everything that should not be stored as a string
▶ Use if aggregation functions should be applied to the data (COUNT, MAX, PERCENTILE, CUMSUM)
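
The practical consequence of "tags are indexed, fields are not" can be illustrated in plain Python. This is a conceptual sketch with made-up points and hypothetical helper names; it does not use the InfluxDB API and does not reflect how InfluxDB implements its index internally.

```python
from collections import defaultdict

# Made-up points: tags are string meta-data, fields hold the measured values.
POINTS = [
    {"time": 1, "tags": {"host": "a"}, "fields": {"cpu": 0.50}},
    {"time": 2, "tags": {"host": "a"}, "fields": {"cpu": 0.70}},
    {"time": 1, "tags": {"host": "b"}, "fields": {"cpu": 0.20}},
]


def build_tag_index(points):
    """Map (tag key, tag value) -> positions of the matching points."""
    index = defaultdict(list)
    for position, point in enumerate(points):
        for key, value in point["tags"].items():
            index[(key, value)].append(position)
    return index


def max_by_tag(points, index, tag_key, field_name):
    """GROUP BY tag_key, then MAX(field_name) per group, driven by the index."""
    result = {}
    for (key, value), positions in index.items():
        if key == tag_key:
            result[value] = max(points[p]["fields"][field_name] for p in positions)
    return result


def filter_by_field(points, field_name, threshold):
    """Fields are not indexed here, so filtering on them scans every point."""
    return [p["time"] for p in points if p["fields"][field_name] > threshold]


index = build_tag_index(POINTS)
```

Grouping by `host` only walks the index entries, whereas filtering on the `cpu` field has to look at every point; that is the trade-off behind choosing what becomes a tag and what becomes a field.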



Figure: Screenshot of example data stored in InfluxDB


The End
Next: Map/Reduce