A Semantic Big Data Companion
Stefano Bortoli – @stefanobortoli – [email protected]
Flavio Pompermaier – @fpompermaier – [email protected]




Page 1: S. Bortoli & F. Pompermaier – A Semantic Big Data Companion

A Semantic Big Data Companion

Stefano Bortoli – @stefanobortoli – [email protected]

Flavio Pompermaier – @fpompermaier – [email protected]

Page 2

The company (briefly)

• Okkam is
  – an SME based in Trento, Italy
  – started as a spin-off of the University of Trento and FBK (2010)
• Okkam's core business is
  – large-scale data integration using semantic technologies and an Entity Name System
• Okkam's operative sectors
  – Services for public administration
  – Services for restaurants (and more)
  – Research projects (FP7, H2020, and local agencies)

Page 3

Who we are

• Stefano Bortoli, PhD
  – works as technical director and researcher at Okkam S.R.L. (Trento, Italy). His research and development interests are in the area of information integration, with a special focus on entity-centric applications exploiting semantic technologies.
• Flavio Pompermaier, MSc
  – works as senior software engineer at Okkam S.R.L. (Trento, Italy). Flavio is a passionate developer working with state-of-the-art technologies, combining semantic and big data technologies.

Page 4

Our contributions

• Early Flink adopters and promoters (since Stratosphere)
• Example of how to use Flink with MongoDB
  – https://github.com/okkam-it/flink-mongodb-test
• 2 pull requests
  – [FLINK-1928] [hbase] Added HBase write example
  – [FLINK-1828] [hadoop] Fixed missing call to configure() for Configurable HadoopOutputFormats
• 8 JIRA tickets
  – OPEN FLINK-2503 Inconsistencies in FileInputFormat hierarchy
  – RESOLVED FLINK-1978 POJO serialization NPE
  – OPEN FLINK-1834 Is mapred.output.dir conf parameter really required?
  – CLOSED FLINK-1828 Impossible to output data to an HBase table
  – OPEN FLINK-1827 Move test classes in test folders and fix scope of test dependencies
  – OPEN FLINK-2800 kryo serialization problem
  – RESOLVED FLINK-2394 HadoopOutFormat OutputCommitter is default to FileOutputCommiter
  – OPEN FLINK-1241 Record processing counter for Dashboard
• Hundreds of email threads and discussions

Page 5

What we do

Page 6

Our toolbox

Page 7

Our Flink Use Cases

• Our objective is to build and manage (very) large entity-centric knowledge bases to serve different purposes and domains

• So far, we have used Apache Flink for:
  – Domain reasoning (Flink + Parquet + Thrift)
  – RDF data lifecycle (Flink + Parquet + Jena/Sesame)
  – RDF data intelligence (Flink + ELKiBi)
  – Duplicate record detection (Flink + HBase + Solr)
  – Entiton record linkage (Flink + MongoDB + Kryo)
  – Telemetry analysis (Flink + MongoDB + Weka)

Page 8

Why we need Flink

Entiton data model

[Diagram: the Entiton data model. A Quad carries a predicate, an object, an object type, and a provenance IRI; quads are grouped under a subject local IRI, a subject global IRI, and an RDF type. The model bridges database records and RDF statements, and can be backed by a triplestore, by NoSQL + indexes, or by an expensive data warehouse.]

Page 9

Entiton using Parquet+Thrift

namespace java it.okkam.flink.entitons.serialization.thrift

struct EntitonQuad {
  1: required string p;                 // pred
  2: required string o;                 // obj
  3: optional string ot;                // obj-type
  4: required string g;                 // sourceIRI
}

struct EntitonAtom {
  1: required string s;                 // local-IRI
  2: optional string oid;               // ens-IRI
  3: required list<string> types;       // rdf-types
  4: required list<EntitonQuad> quads;  // quads
}

struct EntitonMolecule {
  1: required EntitonAtom r;            // root atom
  2: optional list<EntitonAtom> atoms;  // other atoms
}
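The three structs nest naturally into Java objects once compiled with `thrift --gen java`. The sketch below uses hand-written stand-in classes that mirror the IDL above (class and field names match the structs, but the plain-field style and the example IRIs are ours, not the generated Thrift API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for the Thrift-generated classes.
class EntitonQuad {
    String p, o, ot, g; // predicate, object, object type, provenance IRI
    EntitonQuad(String p, String o, String ot, String g) {
        this.p = p; this.o = o; this.ot = ot; this.g = g;
    }
}

class EntitonAtom {
    String s;                                    // subject local IRI
    String oid;                                  // subject ENS IRI (optional)
    List<String> types = new ArrayList<>();      // rdf:type IRIs
    List<EntitonQuad> quads = new ArrayList<>(); // attached quads
}

class EntitonMolecule {
    EntitonAtom r;                               // root atom
    List<EntitonAtom> atoms = new ArrayList<>(); // other atoms

    // Assemble a tiny example molecule (all IRIs below are made up).
    static EntitonMolecule example() {
        EntitonAtom root = new EntitonAtom();
        root.s = "urn:okkam:local:vehicle-42";
        root.oid = "http://okkam.org/ens/id42";
        root.types.add("http://example.org/onto#Vehicle");
        root.quads.add(new EntitonQuad(
            "http://example.org/onto#plate", "AB123CD",
            "http://www.w3.org/2001/XMLSchema#string",
            "http://example.org/source/registry"));
        EntitonMolecule m = new EntitonMolecule();
        m.r = root;
        return m;
    }
}
```

A molecule groups a root atom with its dependent atoms, so a whole entity description travels as one self-contained record.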

Page 10

Hardware-wise

• We compete with expensive data warehouse solutions
  – e.g. Oracle Exadata Database Machines, IBM Netezza, etc.
• Testing on small machines fosters optimization
  – If you don't want to wait, make your code faster!
• Our code is ready to scale, without big investments
• Fancy stuff can be done without millions of euros in HW

Cluster: 8 × Gigabyte Brix (16GB RAM, 256GB SSD, 1TB HDD, Intel i7-4770 @ 3.2GHz) + 1 Gbit switch

Page 11

ENS Maintenance

• Entity Name System (ENS) (FP7 '08–'10)
  – A Web-scale support for minting and reusing persistent entity identifiers for SW or LOD

Page 12

ENS Maintenance

• Duplicate detection of 9.2M entities in 6h45m (using Flink incubator 0.6)
  – Query the Apache Solr global index to perform flexible blocking given a subset of attributes of each entity (names)
  – Distinct/join pairs of candidate duplicates
  – Rich filter implementing the match function
  – Consolidate sets of candidate duplicates by grouping
• Tricks:
  – If HBase does not distribute regions uniformly, do rebalance()
  – Compress byte[] with LZ4 in a custom HBase input format to reduce network traffic
  – Reverse keys to speed up join execution (up to 10%, in some cases)
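The key-reversal trick works because entity IRIs share long common prefixes, so a sort or hash comparison has to scan deep into the string before hitting a distinguishing character; reversing the key moves the distinctive suffix to the front. A minimal sketch (the helper name is ours, not Flink's):

```java
// Keys like "http://okkam.org/ens/id42" share a long common prefix,
// so comparing them byte-by-byte during a sort-merge join wastes work.
// Reversing the key lets comparisons fail fast on the distinctive tail.
public class JoinKeys {
    public static String reverse(String key) {
        return new StringBuilder(key).reverse().toString();
    }

    public static void main(String[] args) {
        String iri = "http://okkam.org/ens/id42";
        String reversed = reverse(iri);   // distinctive part now leads
        System.out.println(reversed);
        System.out.println(reverse(reversed)); // restore after the join
    }
}
```

Reversal is its own inverse, so the original key is recovered with the same helper after the join.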

Page 13

Tax Reasoner

• Pilot project for ACI and Val d’Aosta

Objectives: produce analytics and investigate:
1. Who did not pay vehicle excise duty (Kraftfahrzeugsteuer)?
2. Who did not pay vehicle insurance?
3. Who skipped vehicle inspection?
4. Who did not pay vehicle sales taxes?
5. Who violated exceptions to the above?

Dataset: 15 data sources over 5 years, with 12M records about 950k vehicles and 500k subjects, for a total of 90M facts.

Challenge: consider events (time) and infer implicit information.

Apache Flink jobs:
1. From RDF to Entiton
2. Domain-specific temporal inference (Tax Reasoner)
3. Build Elasticsearch indexes
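The temporal-inference step is, at its core, reasoning over validity intervals, e.g. finding the periods in which a vehicle had no valid insurance. A self-contained sketch of that gap-finding logic (the data model and names are illustrative, not the actual job):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Given the sorted coverage intervals of a vehicle's insurance policies,
// infer the uncovered gaps inside an observation window. Such gaps are the
// implicit facts ("this vehicle was uninsured from X to Y") the reasoner emits.
public class CoverageGaps {
    public static class Interval {
        public final LocalDate from, to;
        public Interval(LocalDate from, LocalDate to) { this.from = from; this.to = to; }
    }

    public static List<Interval> gaps(List<Interval> covered,
                                      LocalDate windowStart, LocalDate windowEnd) {
        List<Interval> out = new ArrayList<>();
        LocalDate cursor = windowStart;
        for (Interval c : covered) {
            if (c.from.isAfter(cursor)) {
                out.add(new Interval(cursor, c.from)); // uncovered stretch
            }
            if (c.to.isAfter(cursor)) {
                cursor = c.to; // advance past this policy's coverage
            }
        }
        if (cursor.isBefore(windowEnd)) {
            out.add(new Interval(cursor, windowEnd)); // trailing gap
        }
        return out;
    }
}
```

In the real pipeline this per-entity logic would run inside a grouped Flink operator, one group per vehicle.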

Page 14

Tax Reasoner

Page 15

Tax Reasoner
Temporal Inference Execution Plan

• 1h ETA with SSD (2h30 on HDD) on a developer machine
• 11M new facts inferred
• It took 1 day to perform the select query for one of the sources!

Page 16

RDF Data Intelligence

• SIREn Solutions Pivot Browser
• Timeline for details about a vehicle
• Business Intelligence analytics with SIREn Solutions KiBi
• Geospatial indicators

Page 17

Using Entiton with MongoDB

• Inspired by the work of Gregg Kellogg (2012)
  – http://www.slideshare.net/gkellogg1/jsonld-and-mongodb
• We updated JAOB (Java Architecture for OWL Binding)
  – Serializes RDF into POJOs and vice versa
  – Also provides a Maven plugin to compile an OWL ontology into POJOs
  – https://github.com/okkam-it/jaob
• Database access layer (using Spring Data):
  – POJO → RDF → JSON-LD + Kryo → MongoDB
  – MongoDB → Kryo → POJO
• Bottom line:
  – We use (framed) JSON-LD to allow (complex) tree queries on an entiton database modeled according to a domain ontology
  – We exploit Kryo deserialization for fast reading
  – We enjoy the Spring Data abstraction to implement data access APIs
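The dual representation can be sketched in plain Java: each stored entiton carries a queryable document (the framed JSON-LD) next to an opaque binary blob that rehydrates the POJO without re-parsing. Kryo is an external library, so standard Java serialization stands in for it below; all names and the toy payload are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Each stored entiton keeps two views of the same object:
//  - a queryable document (the framed JSON-LD held in MongoDB)
//  - an opaque blob for fast reads (Kryo in production; java.io
//    serialization here only to keep the sketch self-contained).
public class EntitonRecord {
    public static byte[] toBlob(Serializable pojo) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(pojo);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T> T fromBlob(byte[] blob) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (T) ois.readObject(); // skips any JSON-LD parsing
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        java.util.HashMap<String, String> entiton = new java.util.HashMap<>();
        entiton.put("@id", "http://okkam.org/ens/id42"); // toy payload
        java.util.HashMap<String, String> back = fromBlob(toBlob(entiton));
        System.out.println(back.get("@id"));
    }
}
```

Queries hit the JSON-LD view; bulk reads deserialize the blob directly, which is what makes the Kryo path fast.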

Page 18

Using Entiton with MongoDB

@Document(collection = Entiton.TABLE)
@CompoundIndexes({
    @CompoundIndex(name = "nestedId", def = "{ 'jsonld.@id': 1 }", unique = true),
    ...
})
public class Entiton<T extends Thing> implements Serializable {

    @Id
    private String id;
    @Version
    private String version;
    private DBObject jsonld;
    private Binary kryo;
    @GeoSpatialIndexed(type = GeoSpatialIndexType.GEO_2DSPHERE)
    private Point point;
    public String javaClass;
    ...
}

Page 19

Entiton (Mongo JSON-LD)

Page 20

Contextual personalization in

Page 21

Recommendation Architecture

Page 22

Telemetry data collection

Page 23

Flink batch processes

Page 24

Contextual Personalization on demand

Page 25

Lessons learned

• Reversing String Tuple ids leads to performance improvements of joins
• When you make joins, ensure distinct dataset keys
• Reuse objects to reduce the impact of garbage collection
• When writing Flink jobs, start with small and debuggable unit tests first, then run them on the cluster on the entire dataset (waiting for big data debugging by Dr. Leich @ TUB)
• Serialization matters: less memory required, less GC, faster data loading, faster execution
• HD speed matters when RAM is not enough; SSD rulez
• Parquet rulez: self-describing data, push-down filters
• Use Gelly consciously; sometimes joins are good enough
• If your code crashes, it is usually your fault: from 0.9, Flink is quite stable for batch job execution (at least)
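The object-reuse lesson in a nutshell: instead of allocating a fresh result object for every record, keep one mutable instance and overwrite its fields. A Flink-free sketch of the pattern (all names are ours; in Flink this is only safe when downstream operators do not hold on to the passed object):

```java
import java.util.List;

// A naive map function allocates one result object per record, which at
// billions of records becomes GC pressure. Reusing a single mutable
// instance removes the per-record allocation entirely.
public class ReuseDemo {
    public static class Enriched {
        public String id;
        public int length;

        public Enriched set(String id) { // overwrite fields in place
            this.id = id;
            this.length = id.length();
            return this;
        }
    }

    public static int processAll(List<String> ids) {
        Enriched reused = new Enriched(); // one instance for the whole stream
        int totalLength = 0;
        for (String id : ids) {
            totalLength += reused.set(id).length; // no allocation per record
        }
        return totalLength;
    }
}
```

The same idea underlies Flink's reusable-record operator APIs: emit the reused object, never cache a reference to it downstream.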

Page 26

Suggested improvements

• Running jobs on a cluster is always tricky because of bad Maven dependency management in Flink
  – Define a POM for the build settings, a parent, a bill-of-materials (BOM / release train), and a common resources folder
  – Jar validation on client submit w.r.t. the deployed Flink dist bundle, giving a warning on possibly conflicting classes
• Enable fast compilation (-Dmaven.test.skip=true) FLINK-1827
• Better monitoring; we're eager to use the new web client!
• Complete Hadoop compatibility
  – counters and custom grouping/sorting functions
• Start thinking about education of professionals
  – e.g. courses and certification

Page 27

Future work

• Benchmark Entiton serialization models on Parquet (Avro vs Thrift vs Protobuf)
• Manage declarative data fusion policies
  – à la LDIF: http://ldif.wbsg.de/
• Define a formal entiton operations algebra (e.g. merge, project, select, filter)
• Try out Cloudera Kudu
  – a novel Hadoop storage engine addressing bulk-loading stability, scan performance, and random access
  – https://github.com/cloudera/kudu

Page 28

Conclusions

• We think we are walking along the “last mile” towards real world enterprise Semantic Applications

• Combining big data and semantics allows us to be flexible, expressive and, thanks to Flink, very scalable with very competitive costs

• Apache Flink gives us the leverage to shuffle data around without much headache

• We proved cool stuff can be done in a simple and efficient way, with the right tools and mindset

• Hopefully, Flink will help us reduce tax evasion in Italy; not bad for a German squirrel, eh?

Page 29

Thanks for your attention

Any Questions (before beer)?