Graph and Timeseries Databases
Databases 2 (VU) (706.711 / 707.030)
Roman Kern
Institute of Interactive Systems and Data Science, Technical University Graz
2018-10-22
Roman Kern (ISDS, TU Graz) Graph/TimeDBs 2018-10-22 1 / 30
Graph Databases
Motivation and Basics of Graph Databases
Introduction - Graph Databases
What is a graph database?
Data storage optimised for graph data structures
▶ i.e., efficient storage and access
Scales gracefully with the amount of data
Additional index for look-ups
Introduction - Graph Databases
Why should graph databases work?
Networks usually have certain properties
▶ Small-world phenomenon
▶ … even in big networks, only a few hops are needed on average to reach even distant nodes
Access to data follows certain patterns
▶ Locality of reference
▶ … operations are focused on certain areas
Introduction - Graph Databases
Why should one not use a graph database?
… if all the data is updated at once
▶ E.g. an operation applied to all nodes
… if the query cannot easily be expressed as a graph traversal operation
▶ E.g. a lot of random access, or aggregate functions
Introduction
Graph database vs. relational database
In principle a graph database can be implemented
▶ … via a relational database
▶ … using joins or consecutive queries
But relational databases are not optimised for such graph models
▶ … i.e., a lot of sparse (semi-empty) rows
Additionally, relational databases are not designed
▶ … for changes in the schema, e.g. dynamic types of relations
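To make "via joins or consecutive queries" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (node, edge) and the sample names are assumptions chosen for illustration, not a schema from the lecture.

```python
import sqlite3

# In-memory relational database holding a tiny graph (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edge (src INTEGER, dst INTEGER, label TEXT);
""")
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(1, 2, "knows"), (2, 3, "knows"), (3, 4, "knows")])


def two_hops(conn, start_id):
    """Names reachable in exactly two hops; every extra hop costs one more join."""
    rows = conn.execute("""
        SELECT n.name
        FROM edge AS e1
        JOIN edge AS e2 ON e2.src = e1.dst
        JOIN node AS n ON n.id = e2.dst
        WHERE e1.src = ?
        ORDER BY n.name
    """, (start_id,)).fetchall()
    return [name for (name,) in rows]


def reachable(conn, start_id):
    """Ids of all nodes reachable from start_id, using one query per visited node."""
    seen, frontier = {start_id}, [start_id]
    while frontier:
        current = frontier.pop()
        rows = conn.execute("SELECT dst FROM edge WHERE src = ?",
                            (current,)).fetchall()
        for (dst,) in rows:
            if dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return sorted(seen - {start_id})
```

The fixed-depth query needs one self-join per hop, and the open-ended traversal needs one round trip per visited node. This is the overhead the slide refers to.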
Introduction
Figure: Comparison of import time for 2 graph databases vs. a relational DB
Introduction - Graph Databases
Main concepts
Many contemporary graph databases are based on property graphs
▶ i.e., each node and edge is associated with a set of key/value pairs
▶ … where edges are directed (and often carry a label)
Often support ACID properties
▶ i.e., each modification takes place within a transaction
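A property graph can be sketched in a few lines of Python. The class and attribute names below (Node, Edge, PropertyGraph, neighbours) are illustrative assumptions and do not mirror the API of any particular product. Transactions are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)  # key/value pairs


@dataclass
class Edge:
    source: int   # edges are directed: source -> target
    target: int
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}     # node_id -> Node
        self.outgoing = {}  # node_id -> list of outgoing Edge objects

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def add_edge(self, source, target, label, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist")
        self.outgoing[source].append(Edge(source, target, label, properties))

    def neighbours(self, node_id, label=None):
        """Targets of outgoing edges, optionally restricted to one label."""
        return [e.target for e in self.outgoing[node_id]
                if label is None or e.label == label]


# Hypothetical example data
graph = PropertyGraph()
graph.add_node(1, name="Alice", kind="person")
graph.add_node(2, name="ACME", kind="company")
graph.add_edge(1, 2, "WORKS_FOR", since=2015)
```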
Introduction - Graph Databases
Query types for graph databases
Lookup of nodes
Traversal of a graph
▶ Start at a node
▶ … continue following edges
▶ … until a stopping criterion has been reached
▶ Breadth-first vs. depth-first
Path finding
▶ Find a path between two nodes (e.g. Dijkstra, A*)
Path matching
▶ Matching patterns in the graph
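Two of these query types, traversal and path finding, can be sketched on a plain adjacency dictionary. This is a minimal illustration, assuming non-negative edge weights and a hop limit as the stopping criterion. The graph and function names are made up for the example.

```python
import heapq
from collections import deque


def bfs(adjacency, start, max_depth):
    """Breadth-first traversal; the stopping criterion is a hop limit.

    Returns a dict mapping each visited node to its distance in hops.
    """
    visited = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if visited[node] == max_depth:
            continue  # do not expand beyond the hop limit
        for neighbour, _weight in adjacency.get(node, []):
            if neighbour not in visited:
                visited[neighbour] = visited[node] + 1
                queue.append(neighbour)
    return visited


def dijkstra(adjacency, start, goal):
    """Return (cost, path) of a cheapest path, or None if goal is unreachable."""
    heap = [(0, start, [start])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in adjacency.get(node, []):
            if neighbour not in settled:
                heapq.heappush(heap, (cost + weight, neighbour, path + [neighbour]))
    return None


# Hypothetical directed, weighted graph: node -> [(neighbour, weight), ...]
adjacency = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
```

Replacing the deque with a stack (pop from the same end that is appended to) would turn the traversal into a depth-first one.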
Modelling of Graph Databases
How to represent the data and how to model
Modelling of Graph Databases
Approach
What should a graph database schema look like?
▶ i.e., how is the data represented as nodes and edges
Many different ways of modelling
▶ Graphs are a very flexible data structure
▶ … capable of capturing many domain models
In many cases a direct mapping is possible
▶ Domain model and graph model
Need to review the model
▶ Validate that the graph is suited for the queries being used
▶ E.g. don’t mix entities with relations
Modelling of Graph Databases
How to model for graph databases
Application
Practical aspects of graph databases
Application of Graph Databases
Main software tools for graph databases
Neo4j
OrientDB
TitanDB
… and many others
Application of Graph Databases
Panama Papers - Intro
Leaked documents from a firm in Panama (2.6 TB of data)
▶ … about offshore activities
Journalists (around the world) were working on analysing the data
▶ … a graph database was the back-end of these activities
▶ Neo4j (plus Apache Solr and Tika)
Application of Graph Databases
Panama Papers - Steps
Populate the database
▶ Analyse the documents
⋆ e.g., entity extraction (detect names)
▶ → entities in the graph (entity types: company, officer, client, address, …)
▶ Extract meta-data of documents
▶ → properties for nodes
Detect relationships
▶ e.g. using the connection of sender/receiver of e-mails
▶ → connections between the nodes
Refinement of graph
▶ Manual work conducted by the journalists
▶ More entity types, e.g. money flow, document types
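A rough sketch of the "detect relationships" step, assuming that entity extraction has already produced sender and receiver names. All records and identifiers below are invented for illustration and are not taken from the leak. The real pipeline is not described on the slides in this level of detail.

```python
from collections import defaultdict

# Hypothetical e-mail meta-data (invented, not real data)
emails = [
    {"sender": "officer_a", "receiver": "client_x", "doc_id": "doc-1"},
    {"sender": "officer_a", "receiver": "client_y", "doc_id": "doc-2"},
    {"sender": "officer_a", "receiver": "client_x", "doc_id": "doc-3"},
]


def build_graph(emails):
    """Turn sender/receiver pairs into nodes (with properties) and directed edges."""
    nodes = {}                 # entity name -> node properties
    edges = defaultdict(list)  # (sender, receiver) -> supporting document ids
    for mail in emails:
        for entity in (mail["sender"], mail["receiver"]):
            # document meta-data becomes a node property
            nodes.setdefault(entity, {"documents": []})["documents"].append(mail["doc_id"])
        edges[(mail["sender"], mail["receiver"])].append(mail["doc_id"])
    return nodes, dict(edges)


nodes, edges = build_graph(emails)
```

The manual refinement step would then add further entity types and correct the automatically derived connections.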
Application of Graph Databases
https://neo4j.com/blog/analyzing-panama-papers-neo4j/
Time Series Databases
Motivation and Basics of Time Series Databases
Introduction Time Series Databases
What is a time series database?
Data storage optimised for temporal data
▶ “Endless” stream of incoming data
▶ … challenge for traditional databases
Often accompanied by
▶ Tools to acquire time series data
▶ Tools to visualise and analyse such data
Introduction Time Series Databases
Typical characteristics of time series databases
Fast write/append operations
Slow update/delete operations
Scales well to huge amounts of data
Retention policy
▶ i.e., to forget old data
Access restrictions
Provide (SQL-like) query languages
▶ Optimised for time range queries
▶ Specialised queries for aggregates
Often rely on other storage mechanisms
▶ e.g., key-value store, wide-column storage
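Several of these characteristics (append-only writes, a retention policy, time range queries and aggregates) fit into a small in-memory sketch. The class TimeSeries and its methods are hypothetical names chosen for this example. Timestamps are plain integers, and out-of-order writes are simply rejected.

```python
import bisect


class TimeSeries:
    def __init__(self, retention):
        self.retention = retention  # how far back (in time units) data is kept
        self.timestamps = []        # kept sorted because writes are append-only
        self.values = []

    def append(self, timestamp, value):
        """Fast path: only appends at the end are allowed."""
        if self.timestamps and timestamp < self.timestamps[-1]:
            raise ValueError("out-of-order write")
        self.timestamps.append(timestamp)
        self.values.append(value)
        self._expire(now=timestamp)

    def _expire(self, now):
        """Retention policy: forget everything older than now - retention."""
        cutoff = bisect.bisect_left(self.timestamps, now - self.retention)
        del self.timestamps[:cutoff]
        del self.values[:cutoff]

    def range(self, start, end):
        """Time range query over [start, end), located via binary search."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_left(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

    def mean(self, start, end):
        """Aggregate over a time range; None if the range is empty."""
        points = self.range(start, end)
        return sum(v for _, v in points) / len(points) if points else None


series = TimeSeries(retention=60)
for t in range(0, 100, 10):
    series.append(t, float(t))
```

Because the timestamps stay sorted, a range query costs two binary searches plus the size of the result. Updating or deleting a point in the middle would require shifting the lists.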
Introduction Time Series Databases
Sources of time series
Observations
▶ Environmental, e.g., weather data, CO2
Economy
▶ e.g., stock exchange data
Sensor data
▶ e.g., human sensor data
Log files
…
Introduction Time Series Databases
Types of time series databases
Limited type of payload
▶ E.g. limited to just timestamp + number
▶ → least amount of memory needed
Flexible payload
▶ Allows for richer representation
▶ E.g. timestamp + document
Wide tables
▶ Each row consists of many columns
▶ … often hundreds of columns
▶ → sparse rows
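The memory argument for the limited payload can be illustrated by encoding the same observation twice. The binary layout (8-byte integer timestamp plus 8-byte float) and the JSON document shape are assumptions for this sketch, not the on-disk format of any specific database.

```python
import json
import struct

# Limited payload: fixed-width record, little-endian int64 timestamp + float64 value
FIXED = struct.Struct("<qd")


def encode_fixed(timestamp, value):
    return FIXED.pack(timestamp, value)


def encode_document(timestamp, document):
    """Flexible payload: timestamp plus an arbitrary JSON document."""
    return json.dumps({"time": timestamp, **document}).encode("utf-8")


# Hypothetical observation
fixed = encode_fixed(1540204320, 0.87)
flexible = encode_document(1540204320, {"cpu-usage": 0.87, "host": "example.com"})
```

The fixed record always takes 16 bytes. The document is larger but can carry fields that differ from one observation to the next.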
Modelling of Time Series Databases
How to represent the data and how to model
Modelling of Time Series Databases
Approach
Often based on single samples (observations)
▶ Univariate, bivariate, multivariate
⋆ E.g. sensor readouts of multiple sensors (temperature, air pressure)
▶ Example: Measurement consists of
⋆ Timestamp, metric name, value, list of filters
⋆ E.g. 10:32, cpu-usage, 0.87, host=example.com, cpu=01
Flat file
▶ Generic vs. specific
▶ Store the name of the time series with each observation (generic)
⋆ Needed in case of dynamic systems
⋆ e.g. different sensors become available or disappear
▶ Have dedicated time series (specific)
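The measurement from the example above, and the difference between the generic and the specific layout, might look as follows in Python. The names Measurement, select and to_specific are hypothetical. The first sample is the one from the slide and the other two are invented to give the functions something to work on.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measurement:
    """Generic layout: the metric name travels with every observation."""
    timestamp: str
    metric: str
    value: float
    filters: dict = field(default_factory=dict)


def select(measurements, metric, **filters):
    """Observations of one metric whose filters match all given key/value pairs."""
    return [m for m in measurements
            if m.metric == metric
            and all(m.filters.get(k) == v for k, v in filters.items())]


def to_specific(measurements):
    """Specific layout: one dedicated series of (timestamp, value) per metric."""
    series = defaultdict(list)
    for m in measurements:
        series[m.metric].append((m.timestamp, m.value))
    return dict(series)


samples = [
    Measurement("10:32", "cpu-usage", 0.87, {"host": "example.com", "cpu": "01"}),
    Measurement("10:32", "cpu-usage", 0.12, {"host": "example.com", "cpu": "02"}),  # invented
    Measurement("10:33", "temperature", 21.5, {"host": "example.com"}),             # invented
]
```

In the generic layout a new sensor needs no schema change, since it simply shows up as a new metric name. The specific layout drops the repeated name but has to know the set of series in advance.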
Modelling of Time Series Databases
Approach
Windowed storage
▶ Each row represents a time window
▶ Columns for a more fine-grained resolution
⋆ Typically between 100 and 1000 observations per row
▶ Alternatively, multiple observations are stored in a single column
⋆ Using a custom (compressed) format
Special case: temporal and spatial data
▶ Requires specialised look-up methods
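Windowed storage can be sketched with a dictionary that plays the role of a wide-column table: the row key is (metric, window start) and the column key is the offset inside the window. The 600-second window and the class name WindowedStore are assumptions. With one sample per second this gives 600 observations per row, which is within the range stated above.

```python
WINDOW = 600  # seconds covered by one row (assumed value)


class WindowedStore:
    def __init__(self, window=WINDOW):
        self.window = window
        self.rows = {}  # (metric, window_start) -> {offset_in_window: value}

    def write(self, metric, timestamp, value):
        window_start = timestamp - timestamp % self.window
        row = self.rows.setdefault((metric, window_start), {})
        row[timestamp - window_start] = value  # column = offset inside the window

    def read(self, metric, start, end):
        """All (timestamp, value) pairs with start <= timestamp < end."""
        result = []
        first = start - start % self.window
        for window_start in range(first, end, self.window):
            row = self.rows.get((metric, window_start), {})
            for offset in sorted(row):
                timestamp = window_start + offset
                if start <= timestamp < end:
                    result.append((timestamp, row[offset]))
        return result


# Hypothetical observations (timestamps in seconds)
store = WindowedStore()
for ts, value in [(0, 0.1), (599, 0.2), (600, 0.3), (1250, 0.4)]:
    store.write("cpu", ts, value)
```

A range query only touches the rows whose window overlaps the requested interval, so the number of row look-ups grows with the length of the interval and not with the number of observations.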
Example for Time Series Databases
Practical Aspects of Time Series Databases
Time Series Databases Example
TICK Stack
Collection of tools:
▶ Telegraf: server agent for collecting and reporting metrics (stream or batch processing) to write data into the DB
▶ InfluxDB: the time series database component
▶ Chronograf: graphing and visualisation front-end for exploration
▶ Kapacitor: data processing engine, can process stream and batch data
https://www.influxdata.com/wp-content/themes/influx/images/TICK-Stack.png
Time Series Databases Example
InfluxDB Features
Tags
▶ Tags are indexed
▶ Store commonly-queried meta data
▶ If “GROUP BY” should be used on the data
Fields
▶ Fields are not indexed
▶ Everything that should not be stored as string
▶ If aggregation functions should be used on the data (COUNT, MAX, PERCENTILE, CUMSUM)
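InfluxDB accepts points in its text-based line protocol, where tags and fields appear as two separate groups: measurement,tag=value field=value timestamp. The helper below is a simplified sketch that builds such a line without any client library. The function names are made up, and escaping of spaces and commas in names and tag values is deliberately omitted.

```python
def format_value(value):
    """Render a field value: booleans, integers (suffix i), floats, quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line: tags (indexed, always strings) first, then fields."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_value(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: host and cpu are tags (used for GROUP BY),
# the measured number is a field (used for aggregation).
line = to_line("cpu_usage",
               {"host": "example.com", "cpu": "01"},
               {"value": 0.87},
               1540204320000000000)
```

Putting host and cpu into tags keeps them indexed for filtering and grouping, while the numeric reading stays a field so that aggregation functions can be applied to it.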
Time Series Databases Example
Figure: Screenshot of example data stored in InfluxDB
The End
Next: Map/Reduce