A Semantic Big Data Companion
Stefano Bortoli – @stefanobortoli – [email protected]
Flavio Pompermaier – @fpompermaier – [email protected]




Page 1: S. Bortoli & F. Pompermaier – A Semantic Big Data Companion

A Semantic Big Data Companion

Stefano Bortoli – @stefanobortoli – [email protected]

Flavio Pompermaier – @fpompermaier – [email protected]

Page 2

The company (briefly)

• Okkam is
  – an SME based in Trento, Italy
  – started as a spin-off of the University of Trento and FBK (2010)
• Okkam's core business is
  – large-scale data integration using semantic technologies and an Entity Name System
• Okkam's operative sectors
  – Services for public administration
  – Services for restaurants (and more)
  – Research projects (FP7, H2020, and local agencies)

Page 3

Who we are

• Stefano Bortoli, PhD
  – works as technical director and researcher at Okkam S.R.L. (Trento, Italy). His research and development interests are in the area of information integration, with a special focus on entity-centric applications exploiting semantic technologies.
• Flavio Pompermaier, MSc
  – works as senior software engineer at Okkam S.R.L. (Trento, Italy). Flavio is a passionate developer working with state-of-the-art technologies, combining semantic and big data technologies.

Page 4

Our contributions

• Early Flink adopters and promoters (since Stratosphere)
• Example of how to use Flink with MongoDB
  – https://github.com/okkam-it/flink-mongodb-test
• 2 pull requests
  – [FLINK-1928] [hbase] Added HBase write example
  – [FLINK-1828] [hadoop] Fixed missing call to configure() for Configurable HadoopOutputFormats
• 8 JIRA tickets
  – OPEN FLINK-2503 Inconsistencies in FileInputFormat hierarchy
  – RESOLVED FLINK-1978 POJO serialization NPE
  – OPEN FLINK-1834 Is mapred.output.dir conf parameter really required?
  – CLOSED FLINK-1828 Impossible to output data to an HBase table
  – OPEN FLINK-1827 Move test classes in test folders and fix scope of test dependencies
  – OPEN FLINK-2800 kryo serialization problem
  – RESOLVED FLINK-2394 HadoopOutFormat OutputCommitter is default to FileOutputCommiter
  – OPEN FLINK-1241 Record processing counter for Dashboard
• Hundreds of email threads and discussions

Page 5

What we do

Page 6

Our toolbox

Page 7

Our Flink Use Cases

• Our objective is to build and manage (very) large entity-centric knowledge bases to serve different purposes and domains

• So far, we have used Apache Flink for:
  – Domain reasoning (Flink + Parquet + Thrift)
  – RDF data lifecycle (Flink + Parquet + Jena/Sesame)
  – RDF data intelligence (Flink + ELKiBi)
  – Duplicate record detection (Flink + HBase + Solr)
  – Entiton record linkage (Flink + MongoDB + Kryo)
  – Telemetry analysis (Flink + MongoDB + Weka)

Page 8

Why we need Flink

Entiton data model

[Diagram: the Entiton data model. A Quad carries a predicate, an object, an object type, and a provenance IRI; quads are grouped under a subject local IRI, a subject global IRI, and an RDF type. The model bridges database records and RDF statements, and can be backed by a triplestore, by NoSQL + indexes, or by an expensive data warehouse.]

Page 9

Entiton using Parquet+Thrift

namespace java it.okkam.flink.entitons.serialization.thrift

struct EntitonQuad {
  1: required string p;                 // pred
  2: required string o;                 // obj
  3: optional string ot;                // obj-type
  4: required string g;                 // sourceIRI
}

struct EntitonAtom {
  1: required string s;                 // local-IRI
  2: optional string oid;               // ens-IRI
  3: required list<string> types;       // rdf-types
  4: required list<EntitonQuad> quads;  // quads
}

struct EntitonMolecule {
  1: required EntitonAtom r;            // root atom
  2: optional list<EntitonAtom> atoms;  // other atoms
}
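The three structs nest naturally into Java objects once compiled with `thrift --gen java`. The sketch below uses hand-written stand-in classes that mirror the IDL above (class and field names match the structs, but the plain-field style and the example IRIs are ours, not the generated Thrift API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for the Thrift-generated classes.
class EntitonQuad {
    String p, o, ot, g; // predicate, object, object type, provenance IRI
    EntitonQuad(String p, String o, String ot, String g) {
        this.p = p; this.o = o; this.ot = ot; this.g = g;
    }
}

class EntitonAtom {
    String s;                                    // subject local IRI
    String oid;                                  // subject ENS IRI (optional)
    List<String> types = new ArrayList<>();      // rdf:type IRIs
    List<EntitonQuad> quads = new ArrayList<>(); // attached quads
}

class EntitonMolecule {
    EntitonAtom r;                               // root atom
    List<EntitonAtom> atoms = new ArrayList<>(); // other atoms

    // Assemble a tiny example molecule (all IRIs below are made up).
    static EntitonMolecule example() {
        EntitonAtom root = new EntitonAtom();
        root.s = "urn:okkam:local:vehicle-42";
        root.oid = "http://okkam.org/ens/id42";
        root.types.add("http://example.org/onto#Vehicle");
        root.quads.add(new EntitonQuad(
            "http://example.org/onto#plate", "AB123CD",
            "http://www.w3.org/2001/XMLSchema#string",
            "http://example.org/source/registry"));
        EntitonMolecule m = new EntitonMolecule();
        m.r = root;
        return m;
    }
}
```

A molecule groups a root atom with its dependent atoms, so a whole entity description travels as one self-contained record.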

Page 10

Hardware-wise

• We compete with expensive data warehouse solutions
  – e.g. Oracle Exadata Database Machines, IBM Netezza, etc.
• Testing on small machines fosters optimization
  – If you don't want to wait, make your code faster!
• Our code is ready to scale, without big investments
• Fancy stuff can be done without millions of euros in HW

Cluster: 8 × Gigabyte Brix (16GB RAM, 256GB SSD, 1TB HDD, Intel i7-4770 @ 3.2GHz) + 1 Gbit switch

Page 11

ENS Maintenance

• Entity Name System (ENS) (FP7 '08–'10)
  – A Web-scale support for minting and reusing persistent entity identifiers for SW or LOD

Page 12

ENS Maintenance

• Duplicate detection of 9.2M entities in 6h45m (using Flink incubator 0.6)
  – Query the Apache Solr global index to perform flexible blocking given a subset of attributes of each entity (names)
  – Distinct/join pairs of candidate duplicates
  – Rich filter implementing the match function
  – Consolidate sets of candidate duplicates by grouping
• Tricks:
  – If HBase does not distribute regions uniformly, do rebalance()
  – Compress byte[] with LZ4 in a custom HBase input format to reduce network traffic
  – Reverse keys to speed up join execution (up to 10%, in some cases)
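The key-reversal trick works because entity IRIs share long common prefixes, so a sort or hash comparison has to scan deep into the string before hitting a distinguishing character; reversing the key moves the distinctive suffix to the front. A minimal sketch (the helper name is ours, not Flink's):

```java
// Keys like "http://okkam.org/ens/id42" share a long common prefix,
// so comparing them byte-by-byte during a sort-merge join wastes work.
// Reversing the key lets comparisons fail fast on the distinctive tail.
public class JoinKeys {
    public static String reverse(String key) {
        return new StringBuilder(key).reverse().toString();
    }

    public static void main(String[] args) {
        String iri = "http://okkam.org/ens/id42";
        String reversed = reverse(iri);   // distinctive part now leads
        System.out.println(reversed);
        System.out.println(reverse(reversed)); // restore after the join
    }
}
```

Reversal is its own inverse, so the original key is recovered with the same helper after the join.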

Page 13

Tax Reasoner

• Pilot project for ACI and Val d’Aosta

Objectives: produce analytics and investigate:
1. Who did not pay vehicle excise duty (Kraftfahrzeugsteuer)?
2. Who did not pay vehicle insurance?
3. Who skipped vehicle inspection?
4. Who did not pay vehicle sales taxes?
5. Who violated exceptions to the above?

Dataset: 15 data sources over 5 years, with 12M records about 950k vehicles and 500k subjects, for a total of 90M facts.

Challenge: consider events (time) and infer implicit information.

Apache Flink jobs:
1. From RDF to Entiton
2. Domain-specific temporal inference (Tax Reasoner)
3. Build Elasticsearch indexes
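The temporal-inference step is, at its core, reasoning over validity intervals, e.g. finding the periods in which a vehicle had no valid insurance. A self-contained sketch of that gap-finding logic (the data model and names are illustrative, not the actual job):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Given the sorted coverage intervals of a vehicle's insurance policies,
// infer the uncovered gaps inside an observation window. Such gaps are the
// implicit facts ("this vehicle was uninsured from X to Y") the reasoner emits.
public class CoverageGaps {
    public static class Interval {
        public final LocalDate from, to;
        public Interval(LocalDate from, LocalDate to) { this.from = from; this.to = to; }
    }

    public static List<Interval> gaps(List<Interval> covered,
                                      LocalDate windowStart, LocalDate windowEnd) {
        List<Interval> out = new ArrayList<>();
        LocalDate cursor = windowStart;
        for (Interval c : covered) {
            if (c.from.isAfter(cursor)) {
                out.add(new Interval(cursor, c.from)); // uncovered stretch
            }
            if (c.to.isAfter(cursor)) {
                cursor = c.to; // advance past this policy's coverage
            }
        }
        if (cursor.isBefore(windowEnd)) {
            out.add(new Interval(cursor, windowEnd)); // trailing gap
        }
        return out;
    }
}
```

In the real pipeline this per-entity logic would run inside a grouped Flink operator, one group per vehicle.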

Page 14

Tax Reasoner

Page 15

Tax Reasoner
Temporal Inference Execution Plan

• 1h ETA with SSD (2h30 on HDD) on a developer machine
• 11M new facts inferred
• It took 1 day to perform the select query for one of the sources!

Page 16

RDF Data Intelligence

• SIREn Solutions Pivot Browser
• Timeline for details about a vehicle
• Business Intelligence analytics with SIREn Solutions KiBi
• Geospatial indicators

Page 17

Using Entiton with MongoDB

• Inspired by the work of Gregg Kellogg (2012)
  – http://www.slideshare.net/gkellogg1/jsonld-and-mongodb
• We updated JAOB (Java Architecture for OWL Binding)
  – Serializes RDF into POJOs and vice versa
  – Also provides a Maven plugin to compile an OWL ontology into POJOs
  – https://github.com/okkam-it/jaob
• Database access layer (using Spring Data):
  – POJO → RDF → JSON-LD + Kryo → MongoDB
  – MongoDB → Kryo → POJO
• Bottom line:
  – We use (framed) JSON-LD to allow (complex) tree queries on an entiton database modeled according to a domain ontology
  – We exploit Kryo deserialization for fast reading
  – We enjoy the Spring Data abstraction to implement data access APIs
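The dual representation can be sketched in plain Java: each stored entiton carries a queryable document (the framed JSON-LD) next to an opaque binary blob that rehydrates the POJO without re-parsing. Kryo is an external library, so standard Java serialization stands in for it below; all names and the toy payload are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Each stored entiton keeps two views of the same object:
//  - a queryable document (the framed JSON-LD held in MongoDB)
//  - an opaque blob for fast reads (Kryo in production; java.io
//    serialization here only to keep the sketch self-contained).
public class EntitonRecord {
    public static byte[] toBlob(Serializable pojo) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(pojo);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T> T fromBlob(byte[] blob) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (T) ois.readObject(); // skips any JSON-LD parsing
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        java.util.HashMap<String, String> entiton = new java.util.HashMap<>();
        entiton.put("@id", "http://okkam.org/ens/id42"); // toy payload
        java.util.HashMap<String, String> back = fromBlob(toBlob(entiton));
        System.out.println(back.get("@id"));
    }
}
```

Queries hit the JSON-LD view; bulk reads deserialize the blob directly, which is what makes the Kryo path fast.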

Page 18

Using Entiton with MongoDB

@Document(collection = Entiton.TABLE)
@CompoundIndexes({
    @CompoundIndex(name = "nestedId", def = "{ 'jsonld.@id': 1 }", unique = true),
    ...
})
public class Entiton<T extends Thing> implements Serializable {

    @Id
    private String id;
    @Version
    private String version;
    private DBObject jsonld;
    private Binary kryo;
    @GeoSpatialIndexed(type = GeoSpatialIndexType.GEO_2DSPHERE)
    private Point point;
    public String javaClass;
    ...
}

Page 19

Entiton (Mongo JSON-LD)

Page 20

Contextual personalization in

Page 21

Recommendation Architecture

Page 22

Telemetry data collection

Page 23

Flink batch processes

Page 24

Contextual Personalization on demand

Page 25

Lessons learned

• Reversing String Tuple ids leads to performance improvements of joins
• When you make joins, ensure distinct dataset keys
• Reuse objects to reduce the impact of garbage collection
• When writing Flink jobs, start with small and debuggable unit tests first, then run them on the cluster on the entire dataset (waiting for big data debugging by Dr. Leich @ TUB)
• Serialization matters: less memory required, less GC, faster data loading, faster execution
• HD speed matters when RAM is not enough; SSD rulez
• Parquet rulez: self-describing data, push-down filters
• Use Gelly consciously; sometimes joins are good enough
• If your code crashes, it is usually your fault: from 0.9, Flink is quite stable for batch job execution (at least)
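The object-reuse lesson in a nutshell: instead of allocating a fresh result object for every record, keep one mutable instance and overwrite its fields. A Flink-free sketch of the pattern (all names are ours; in Flink this is only safe when downstream operators do not hold on to the passed object):

```java
import java.util.List;

// A naive map function allocates one result object per record, which at
// billions of records becomes GC pressure. Reusing a single mutable
// instance removes the per-record allocation entirely.
public class ReuseDemo {
    public static class Enriched {
        public String id;
        public int length;

        public Enriched set(String id) { // overwrite fields in place
            this.id = id;
            this.length = id.length();
            return this;
        }
    }

    public static int processAll(List<String> ids) {
        Enriched reused = new Enriched(); // one instance for the whole stream
        int totalLength = 0;
        for (String id : ids) {
            totalLength += reused.set(id).length; // no allocation per record
        }
        return totalLength;
    }
}
```

The same idea underlies Flink's reusable-record operator APIs: emit the reused object, never cache a reference to it downstream.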

Page 26

Suggested improvements

• Running jobs on a cluster is always tricky because of bad Maven dependency management in Flink
  – Define a POM for the build settings, a parent, a bill-of-materials (BOM / release train), and a common resources folder
  – Jar validation on client submit w.r.t. the deployed Flink dist bundle, giving a warning on possibly conflicting classes
• Enable fast compilation (-Dmaven.test.skip=true) FLINK-1827
• Better monitoring; we're eager to use the new web client!
• Complete Hadoop compatibility
  – counters and custom grouping/sorting functions
• Start thinking about education of professionals
  – e.g. courses and certification

Page 27

Future work

• Benchmark Entiton serialization models on Parquet (Avro vs Thrift vs Protobuf)
• Manage declarative data fusion policies
  – à la LDIF: http://ldif.wbsg.de/
• Define a formal entiton operations algebra (e.g. merge, project, select, filter)
• Try out Cloudera Kudu
  – a novel Hadoop storage engine addressing bulk-loading stability, scan performance, and random access
  – https://github.com/cloudera/kudu

Page 28

Conclusions

• We think we are walking along the “last mile” towards real world enterprise Semantic Applications

• Combining big data and semantics allows us to be flexible, expressive and, thanks to Flink, very scalable with very competitive costs

• Apache Flink gives us the leverage to shuffle data around without much headache

• We proved cool stuff can be done in a simple and efficient way, with the right tools and mindset

• Hopefully, Flink will help us reduce tax evasion in Italy; not bad for a German squirrel, eh?

Page 29

Thanks for your attention

Any Questions (before beer)?