Giovanni Tummarello, Ph.D Data Intensive Infrastructure UNIT - DERI.ie CEO SindiceTech Real Time Semantic Warehousing: Sindice.com technology for the enterprise

Real Time Semantic Warehousing: Sindice technology for the enterprise

Page 1: Real Time Semantic Warehousing: Sindice technology for the enterprise

Giovanni Tummarello, Ph.D Data Intensive Infrastructure UNIT - DERI.ie

CEO SindiceTech

Real Time Semantic Warehousing: Sindice.com technology for the enterprise

Page 2: Real Time Semantic Warehousing: Sindice technology for the enterprise

How we started : Sindice.com

80 billion triples, 500,000,000 RDF graphs, 5 TB of data. The Sindice Suite powers Sindice.com, online with 99.9%+ uptime.

Page 3: Real Time Semantic Warehousing: Sindice technology for the enterprise

Semantic Sandboxes on: Sindice.com

Data Sandboxes in Sindice.com – Powered by CloudSpaces

Page 4: Real Time Semantic Warehousing: Sindice technology for the enterprise

And then we met people asking, “Can you do it for us?”

Page 5: Real Time Semantic Warehousing: Sindice technology for the enterprise

Example story (pharmaceutical company)

To stay competitive, pharmaceutical companies need to leverage all the data available from internal sources as well as from the growing number of public HCLS data sources. Because this data varies widely in nature, format, and quality, integration is complex. Traditional data warehousing technology requires extensive upfront design and is handled within a company through the “go via the IT department” approach. This does not meet the needs of data scientists, who are the only ones able to do the complex cross-use-case thinking required. Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:

• The ability to speed up “In silico” scientific workflows (interrelation of diverse large datasets) by orders of magnitude by relying on a data warehousing approach.

• The ability to create large-scale “data maps” or “aggregated views” that let researchers see “trends” and gather high-level insights that would not be possible when data is accessed via single lookups.

• The ability to receive recommendations and suggestions for new data connections based on an ever evolving ecosystem of available experimental datasets.

• The ability to provide their R&D departments with superior tools for investigating internal knowledge: search engines and data browsing tools that give unified views of multiple, evolving, live datasets without leaking specific “queries” to the outside world, which would reveal internal research trends.

• The ability to leverage the ever-increasing body of public, crowd-curated open data.

Page 6: Real Time Semantic Warehousing: Sindice technology for the enterprise

Linked Data clouds for the Enterprise

– Strategic knowledge spaces, where new databases can be added and “leveraged” with an unprecedented ease

– “Pay as you go” integration: explore now, fine-tune later.

– It's Big Data (clusters + clouds) meets RDF and Semantic Technologies

Page 7: Real Time Semantic Warehousing: Sindice technology for the enterprise

Sindice.com

Page 8: Real Time Semantic Warehousing: Sindice technology for the enterprise

Because you need Semantic SandBoxes

Page 9: Real Time Semantic Warehousing: Sindice technology for the enterprise

A Dataspace Template

Semantic Web Data: a typical implementation template.

Dataspaces own:

• Resources

• Services

• Datasets for others to reuse

Page 10: Real Time Semantic Warehousing: Sindice technology for the enterprise

Dataspace Composition

Scalable, cascading semantic “Dataspaces”

• Resources allocated in public/private clouds

• Allow users to get Sindice data and mix or process it for private purposes

Page 11: Real Time Semantic Warehousing: Sindice technology for the enterprise

Cloud powered!

<dataspace id="iphonedataspace">
  <dependencies>
    <dataspace>http://ecommerce01.dataspace.sindice.net/</dataspace>
    <dataspace>http://price01.dataspace.sindice.net/</dataspace>
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention> <!-- see later -->
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>

Page 12: Real Time Semantic Warehousing: Sindice technology for the enterprise

Scale is only 1 dimension

Multiple dimensions of Web data integration

• RDF tool stack flexibility

• Cluster processing scalability

• “Cloud” pipeline dynamicity

Page 13: Real Time Semantic Warehousing: Sindice technology for the enterprise

Full JSON-like search. On Solr.

All operators supported.

Page 14: Real Time Semantic Warehousing: Sindice technology for the enterprise

What is SIREn ?

• Plugin to Solr

• Built for searching and operating on semistructured data and relational data structures

Page 15: Real Time Semantic Warehousing: Sindice technology for the enterprise

SIREn: Semantic IR Engine

• Extension to the enterprise search engine Solr

• Semantic, full-text, incremental updates, distributed search

[Figure labels: Semantic Database, SIREn, Constant time]

Page 16: Real Time Semantic Warehousing: Sindice technology for the enterprise

Limitations of Apache Solr

• Not efficient with highly heterogeneous structured data sources

– Limitation on the number of attributes:

  – Dictionary size explosion

Page 17: Real Time Semantic Warehousing: Sindice technology for the enterprise

Dictionary Size Explosion

Record 1

label Renaud Delbru

name Renaud Delbru

Page 18: Real Time Semantic Warehousing: Sindice technology for the enterprise

Dictionary Size Explosion

Record 1

label Renaud Delbru

name Renaud Delbru

Dictionary

label:renaud

label:delbru

name:renaud

name:delbru

Dictionary construction

• Concatenation of attribute name and term

• N * M complexity (worst case)

• 2 attributes * 2 terms = 4 dictionary entries

• 100K attributes * 1B terms = 100T entries
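The dictionary construction described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not Solr's actual indexing code: the function name `build_dictionary` and the whitespace tokenizer are assumptions.

```python
# Simplified model of a per-field term dictionary: each entry is the
# attribute name concatenated with one term of the value.
# build_dictionary is a hypothetical helper, not a Solr API.
def build_dictionary(records):
    entries = set()
    for record in records:
        for attribute, value in record.items():
            for term in value.lower().split():
                entries.add(f"{attribute}:{term}")
    return entries


records = [{"label": "Renaud Delbru", "name": "Renaud Delbru"}]
dictionary = build_dictionary(records)

for entry in sorted(dictionary):
    print(entry)
print(len(dictionary))  # 4 entries: 2 attributes * 2 terms
```

Every distinct (attribute, term) pair becomes its own entry, so in the worst case the dictionary grows with the product of the two counts.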

Page 19: Real Time Semantic Warehousing: Sindice technology for the enterprise

Limitations of Apache Solr

• Not efficient with highly heterogeneous structured data sources

– Limitation on the number of attributes:

  – Dictionary size explosion

  – Query clause explosion when searching across all attributes
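A rough sketch of why searching across all attributes inflates the query: with one dictionary entry per attribute and term, a keyword has to be expanded into one clause per attribute. The helper `expand_term` is hypothetical; only the `field:term` and `OR` notation follows the standard Lucene/Solr query syntax.

```python
# Expand a single keyword into one clause per known attribute.
# expand_term is an illustrative helper, not part of Solr.
def expand_term(term, attributes):
    return " OR ".join(f"{attribute}:{term}" for attribute in attributes)


attributes = ["label", "name"]
query = expand_term("renaud", attributes)
print(query)  # label:renaud OR name:renaud

# The number of clauses grows linearly with the number of attributes.
many_attributes = [f"attr{i}" for i in range(1000)]
clause_count = len(expand_term("renaud", many_attributes).split(" OR "))
print(clause_count)  # 1000
```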

Page 20: Real Time Semantic Warehousing: Sindice technology for the enterprise

Limitations of Apache Solr

• Not efficient with highly heterogeneous structured data sources

– Limitation on the number of attributes:

  – Dictionary size explosion

  – Query clause explosion when searching across all attributes

• Limited support for structured query

– Multi-valued attributes

Page 21: Real Time Semantic Warehousing: Sindice technology for the enterprise

Multi-valued attributes

Record 1

label man's best friend

pooch

Record 2

label man's worst enemy

friend to no one

• No support in Solr for "all words must match in the same value of a multi-valued field".

• A field value is a bag of words

– No distinction between multiple values

Page 22: Real Time Semantic Warehousing: Sindice technology for the enterprise

Multi-valued attributes

Record 1

label man's best friend pooch

Record 2

label man's worst enemy friend to no one

• No support in Solr for "all words must match in the same value of a multi-valued field".

• A field value is a bag of words

– No distinction between multiple values

• Query example

– label : man's friend

– Solr returns Record 1 & 2 as results
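The difference between the two matching behaviours can be illustrated with a small Python model. It is a sketch under simplifying assumptions (whitespace tokenization, hypothetical function names) and does not reproduce Solr's or SIREn's internals.

```python
# Two records, each with a multi-valued "label" attribute.
records = {
    "Record 1": ["man's best friend", "pooch"],
    "Record 2": ["man's worst enemy", "friend to no one"],
}


def tokens(text):
    return set(text.lower().split())


def matches_bag_of_words(values, query):
    # All values of the field are merged into one bag of words.
    bag = set()
    for value in values:
        bag |= tokens(value)
    return tokens(query) <= bag


def matches_same_value(values, query):
    # All query words must occur within a single value.
    return any(tokens(query) <= tokens(value) for value in values)


query = "man's friend"
bag_hits = [name for name, values in records.items()
            if matches_bag_of_words(values, query)]
value_hits = [name for name, values in records.items()
              if matches_same_value(values, query)]

print(bag_hits)    # ['Record 1', 'Record 2']
print(value_hits)  # ['Record 1']
```

In the bag-of-words model Record 2 matches because "man's" and "friend" come from two different values; requiring both words in the same value keeps only Record 1.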

Page 23: Real Time Semantic Warehousing: Sindice technology for the enterprise

Limitations of Apache Solr

• Not efficient with highly heterogeneous structured data sources

– Limitation on the number of attributes:

  – Dictionary size explosion

  – Query clause explosion when searching across all attributes

• Limited support for structured query

– Multi-valued attributes

– No full-text search on attribute names

Page 24: Real Time Semantic Warehousing: Sindice technology for the enterprise

Full-text search on attribute names

Record 1

rdfs:label Renaud Delbru

Record 2

foaf:name Renaud Delbru

Record 3

sioc:name Renaud Delbru

Record 4

full_name Renaud Delbru

• No support in Solr for “keyword search in attribute names”.

• Query example

– (name OR label) = “Renaud Delbru”

– Solr is unable to find the records without the exact attribute name
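A minimal sketch of what keyword search on attribute names means, assuming attribute names are tokenized on non-alphanumeric characters. The functions `name_tokens`, `search_by_name_keywords` and `search_by_exact_name` are illustrative and are not SIREn's API.

```python
import re

records = {
    "Record 1": {"rdfs:label": "Renaud Delbru"},
    "Record 2": {"foaf:name": "Renaud Delbru"},
    "Record 3": {"sioc:name": "Renaud Delbru"},
    "Record 4": {"full_name": "Renaud Delbru"},
}


def name_tokens(attribute):
    # "foaf:name" -> {"foaf", "name"}, "full_name" -> {"full", "name"}
    return {t for t in re.split(r"[^a-z0-9]+", attribute.lower()) if t}


def search_by_name_keywords(records, keywords, value):
    # Match when any keyword occurs among the tokens of the attribute name.
    wanted = set(keywords)
    return [rid for rid, fields in records.items()
            if any(name_tokens(attr) & wanted and val == value
                   for attr, val in fields.items())]


def search_by_exact_name(records, names, value):
    # Match only when the attribute name is exactly one of the given names.
    return [rid for rid, fields in records.items()
            if any(fields.get(name) == value for name in names)]


keyword_hits = search_by_name_keywords(records, ["name", "label"], "Renaud Delbru")
exact_hits = search_by_exact_name(records, ["name", "label"], "Renaud Delbru")

print(keyword_hits)  # ['Record 1', 'Record 2', 'Record 3', 'Record 4']
print(exact_hits)    # []
```

With exact attribute names none of the four records is found, while keyword matching on the name tokens retrieves all of them.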

Page 25: Real Time Semantic Warehousing: Sindice technology for the enterprise

Limitations of Apache Solr

• Not efficient with highly heterogeneous structured data sources

– Limitation on the number of attributes:

  – Dictionary size explosion

  – Query clause explosion when searching across all attributes

• Limited support for structured query

– Multi-valued attributes

– No full-text search on attribute names

– No 1:N relationship materialisation

Page 26: Real Time Semantic Warehousing: Sindice technology for the enterprise

Relationship materialization

• It's JSON-like indexing and searching

• Materialize the relationships between your entities and others.

Page 27: Real Time Semantic Warehousing: Sindice technology for the enterprise

Some numbers: Siren on Sindice

Data Collection

• 500M web data documents (RDF, RDFa, Microformat, etc.)

• 200K datasets

• 50B triples

Settings

• Cluster of 4 nodes

• 2 nodes for indexing

• 2 nodes for querying

• Replication

Indexing Performance

• Full index construction takes approx. 24 hours

• 436K triples / second

Services

• Keyword and structured queries

• Dataset search

• >> 99% uptime

Page 28: Real Time Semantic Warehousing: Sindice technology for the enterprise

Large-scale RDF “Summaries”

Page 29: Real Time Semantic Warehousing: Sindice technology for the enterprise

Introducing large-scale RDF “Summaries”

We do it for:

• Data exploration

– How to find datasets about movies?

• Assisted SPARQL Query Editor

– What is the data structure?

• Dataset Quality

– How to differentiate relevant from irrelevant datasets?

Page 30: Real Time Semantic Warehousing: Sindice technology for the enterprise

Large Scale RDF summaries

10B relationships

12M relationships

Class Level

Page 32: Real Time Semantic Warehousing: Sindice technology for the enterprise

Relational Faceted Browsing. At the speed of light

Patent Pending

Page 33: Real Time Semantic Warehousing: Sindice technology for the enterprise

SPARQL is awesome. And now your guys can actually use it.

Page 34: Real Time Semantic Warehousing: Sindice technology for the enterprise

Thank you

Sindice.com team April 2012

With the contribution of