Giovanni Tummarello, Ph.D Data Intensive Infrastructure UNIT - DERI.ie
CEO SindiceTech
Real Time Semantic Warehousing: Sindice.com
technology for the enterprise
How we started: Sindice.com
80 billion triples, 500,000,000 RDF graphs, 5 TB of data. The Sindice Suite powers Sindice.com. Online with 99.9%+ uptime.
Semantic Sandboxes on: Sindice.com
Data Sandboxes in Sindice.com – Powered by CloudSpaces
And then we met people asking: “Can you do it for us?”
Example story (pharmaceutical company)
To stay competitive, pharmaceutical companies need to leverage all the data available from internal sources as well as from the increasingly many public HCLS data sources. Because this data is diverse in nature, format, and quality, integration is complex. Traditional data warehousing technology requires a lot of upfront design and is handled within a company through the “go via the IT department” approach. This does not meet the needs of data scientists, who are the only ones who can do the complex cross-use-case thinking required. Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:
• The ability to speed up “In silico” scientific workflows (interrelation of diverse large datasets) by orders of magnitude by relying on a data warehousing approach.
• The ability to create large scale “data maps” or “aggregated views” which would allow researchers to see “trends” and gather high-level insights that would not be possible with data accessed via single lookups.
• The ability to receive recommendations and suggestions for new data connections based on an ever evolving ecosystem of available experimental datasets.
• Superior tools for their R&D departments to investigate internal knowledge: search engines and data browsing tools that provide unified views of multiple, evolving, live datasets without leaking specific “queries” to the outside world, which would reveal internal research trends.
• The ability to leverage the ever increasing body of public, crowd curated open data.
Linked Data clouds for the Enterprise
– Strategic knowledge spaces, where new databases can be added and “leveraged” with an unprecedented ease
– Integration “Pay as you go” : explore now, fine tune later.
– It's Big Data (cluster + clouds) meets RDF and semantic technologies
Sindice.com
Because you need Semantic SandBoxes
A Dataspace Template
Semantic Web Data A typical implementation template.
Dataspaces own:
• Resources
• Services
• Datasets for others to reuse
Dataspace Composition
Scalable cascading semantic “Dataspaces”
• Resources allocated in public/private clouds
• Allow you to get Sindice data and mix it / process it for private purposes
Cloud powered!
<dataspace id="iphonedataspace">
  <dependencies>
    <dataspace>http://ecommerce01.dataspace.sindice.net/</dataspace>
    <dataspace>http://price01.dataspace.sindice.net/</dataspace>
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention> <!-- see later -->
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>
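As a rough illustration of how such a descriptor could be consumed, here is a minimal Python sketch that reads the dependencies, resources and retention settings. The exact schema is not given on the slide: wrapping each dependency URL in its own <dataspace> element, the self-closing resource tags, and the function name read_descriptor are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed rendering of the descriptor shown on the slide.
# Wrapping each dependency URL in <dataspace> is an assumption.
DESCRIPTOR = """
<dataspace id="iphonedataspace">
  <dependencies>
    <dataspace>http://ecommerce01.dataspace.sindice.net/</dataspace>
    <dataspace>http://price01.dataspace.sindice.net/</dataspace>
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention>
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>
"""


def read_descriptor(xml_text):
    """Return the dataspace id, dependencies, resources and retention policy."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        # Upstream dataspaces this dataspace is composed from.
        "dependencies": [d.text.strip()
                         for d in root.findall("./dependencies/dataspace")],
        # One entry per resource element, keyed by its tag name.
        "resources": {r.tag: dict(r.attrib) for r in root.find("resources")},
        # Retention settings as plain strings, e.g. "1D".
        "retention": {r.tag: r.text for r in root.find("retention")},
    }


info = read_descriptor(DESCRIPTOR)
print(info["id"], info["dependencies"])
```

The dependency list is what would make the dataspaces "cascading": a dataspace declares which other dataspaces it builds on, and its own resources are provisioned on top of them.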
Scale is only 1 dimension
Multiple dimensions of Web data integration
• RDF tool stack flexibility
• Cluster processing scalability
• “Cloud” pipeline dynamicity
Full JSON-like search. On Solr.
All operators supported.
What is SIREn?
• Plugin to Solr
• Built for searching and operating on semistructured data and relational data structures
SIREn: Semantic IR Engine
• Extension to the enterprise search engine Solr
• Semantic, full-text, incremental updates, distributed search
Semantic Database
sSIREn
Constant time
Limitations of Apache Solr
• Not efficient with highly heterogeneous structured data sources
  – Limitation on the number of attributes: dictionary size explosion
Dictionary Size Explosion
Record 1
label Renaud Delbru
name Renaud Delbru
Dictionary
label:renaud
label:delbru
name:renaud
name:delbru
Dictionary construction
• Concatenation of attribute name and term
• N * M complexity (worst case)
• 2 attributes * 2 terms = 4 dictionary entries
• 100K attributes * 1B terms = 100T entries
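The dictionary construction described above can be sketched in a few lines of Python. This is only an illustration of the "attribute name + term" concatenation, not Solr's actual indexing code; the function names dictionary_entries and worst_case are made up for this example.

```python
def dictionary_entries(records):
    """Build a term dictionary by concatenating attribute name and term."""
    entries = set()
    for record in records:
        for attribute, value in record.items():
            # Lowercased whitespace tokenization, as a stand-in for an analyzer.
            for term in value.lower().split():
                entries.add(f"{attribute}:{term}")
    return entries


def worst_case(n_attributes, n_terms):
    """Worst case: every term occurs under every attribute (N * M)."""
    return n_attributes * n_terms


record_1 = {"label": "Renaud Delbru", "name": "Renaud Delbru"}
entries = dictionary_entries([record_1])
print(sorted(entries))
# ['label:delbru', 'label:renaud', 'name:delbru', 'name:renaud']
print(worst_case(2, 2))  # 4
```

The same two terms are stored once per attribute they appear under, which is why the dictionary grows with the product of attributes and terms rather than with the number of distinct terms.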
Limitations of Apache Solr
• Not efficient with highly heterogeneous structured data sources
  – Limitation on the number of attributes:
    dictionary size explosion
    query clause explosion when searching across all attributes
• Limited support for structured query
  – Multi-valued attributes
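To make the "query clause explosion" point concrete, here is a small Python sketch of what searching one keyword across all attributes amounts to when each attribute is its own field. The helper expand_query and the attribute list are hypothetical; the query string only imitates the familiar field:term syntax.

```python
def expand_query(keyword, attributes):
    """Expand a keyword into one 'attribute:keyword' clause per attribute."""
    clauses = [f"{attribute}:{keyword}" for attribute in attributes]
    return " OR ".join(clauses)


# Three attributes give three clauses ...
small = expand_query("renaud", ["label", "name", "title"])
print(small)  # label:renaud OR name:renaud OR title:renaud

# ... and the clause count grows linearly with the number of attributes.
many_attributes = [f"attr{i}" for i in range(100_000)]
large = expand_query("renaud", many_attributes)
print(large.count(" OR ") + 1)  # 100000
```

With heterogeneous web data the attribute list is not known in advance and can be very long, so every "search everywhere" query would turn into a disjunction of that size.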
Multi-valued attributes
Record 1
label man's best friend
pooch
Record 2
label man's worst enemy
friend to no one
• No support in Solr for "all words must match in the same value of a multi-valued field".
• A field value is a bag of words
  – No distinction between multiple values
Multi-valued attributes
Record 1
label man's best friend pooch
Record 2
label man's worst enemy friend to no one
• Query example
  – label : man's friend
  – Solr returns Records 1 and 2 as results
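The difference between the two behaviours can be sketched as follows. This is a toy model in Python, not Solr code: bag_of_words_match merges all values of the field before matching, while same_value_match requires all query words to occur within a single value. Both function names are illustrative.

```python
def tokens(text):
    """Lowercased whitespace tokens of a value or query."""
    return set(text.lower().split())


records = {
    "Record 1": {"label": ["man's best friend", "pooch"]},
    "Record 2": {"label": ["man's worst enemy", "friend to no one"]},
}


def bag_of_words_match(values, query):
    """All values are merged into one bag of words before matching."""
    bag = set()
    for value in values:
        bag |= tokens(value)
    return tokens(query) <= bag


def same_value_match(values, query):
    """All query words must occur within the same value."""
    wanted = tokens(query)
    return any(wanted <= tokens(value) for value in values)


query = "man's friend"
bag_hits = [name for name, rec in records.items()
            if bag_of_words_match(rec["label"], query)]
value_hits = [name for name, rec in records.items()
              if same_value_match(rec["label"], query)]
print(bag_hits)    # ['Record 1', 'Record 2']
print(value_hits)  # ['Record 1']
```

Record 2 matches in the bag-of-words model only because "man's" and "friend" come from two different values of the same field.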
Limitations of Apache Solr
• Not efficient with highly heterogeneous structured data sources
  – Limitation on the number of attributes:
    dictionary size explosion
    query clause explosion when searching across all attributes
• Limited support for structured query
  – Multi-valued attributes
  – No full-text search on attribute names
Full-text search on attribute names
Record 1
rdfs:label Renaud Delbru
Record 2
foaf:name Renaud Delbru
Record 3
sioc:name Renaud Delbru
Record 4
full_name Renaud Delbru
• No support in Solr for “keyword search in attribute names”.
• Query example
  – (name OR label) = “Renaud Delbru”
  – Solr is unable to find the records without the exact attribute name
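One way to picture keyword search on attribute names is to tokenize the names themselves, so that rdfs:label, foaf:name and full_name all become searchable by the words they contain. The Python sketch below is an assumption about how this could work in principle, not a description of SIREn's internals; name_tokens and search are hypothetical helpers.

```python
import re


def name_tokens(attribute):
    """Split an attribute name such as 'foaf:name' or 'full_name' into words."""
    return {t for t in re.split(r"[^a-z0-9]+", attribute.lower()) if t}


records = {
    "Record 1": {"rdfs:label": "Renaud Delbru"},
    "Record 2": {"foaf:name": "Renaud Delbru"},
    "Record 3": {"sioc:name": "Renaud Delbru"},
    "Record 4": {"full_name": "Renaud Delbru"},
}


def search(records, name_keywords, value):
    """Records with an attribute whose name contains any keyword and whose value matches."""
    wanted = {k.lower() for k in name_keywords}
    hits = []
    for record_id, record in records.items():
        for attribute, attribute_value in record.items():
            if name_tokens(attribute) & wanted and attribute_value == value:
                hits.append(record_id)
                break
    return hits


# Exact attribute names 'name' or 'label' match nothing in this data ...
exact_hits = [rid for rid, rec in records.items()
              if "name" in rec or "label" in rec]
# ... while keyword search on attribute names finds all four records.
keyword_hits = search(records, ["name", "label"], "Renaud Delbru")
print(exact_hits)    # []
print(keyword_hits)  # ['Record 1', 'Record 2', 'Record 3', 'Record 4']
```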
Limitations of Apache Solr
• Not efficient with highly heterogeneous structured data sources
  – Limitation on the number of attributes:
    dictionary size explosion
    query clause explosion when searching across all attributes
• Limited support for structured query
  – Multi-valued attributes
  – No full-text search on attribute names
  – No 1:N relationship materialization
Relationship materialization
• It's JSON-like indexing and searching
• Materialize the relationships between your entities and others.
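A simple way to think about 1:N relationship materialization is nesting the related records inside the parent document before indexing, so the whole structure can be searched as one JSON-like object. The sketch below is a generic Python illustration under that assumption; the product/offer data, the field names and the materialize helper are all hypothetical and do not come from the slides.

```python
import json


def materialize(parents, children, foreign_key, field):
    """Nest each parent's child records under `field` (a 1:N relationship)."""
    by_parent = {}
    for child in children:
        # Drop the foreign key: the nesting itself now expresses the link.
        nested = {k: v for k, v in child.items() if k != foreign_key}
        by_parent.setdefault(child[foreign_key], []).append(nested)
    return [dict(parent, **{field: by_parent.get(parent["id"], [])})
            for parent in parents]


# Hypothetical data: one product with two offers.
products = [{"id": "p1", "name": "phone"}]
offers = [
    {"product": "p1", "seller": "shop-a", "price": 499},
    {"product": "p1", "seller": "shop-b", "price": 529},
]

documents = materialize(products, offers, foreign_key="product", field="offers")
print(json.dumps(documents[0], indent=2))
```

Once the offers live inside the product document, a query such as "product named phone with an offer below 500" can be answered from a single document without a join at query time.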
Some numbers: SIREn on Sindice
Data Collection
• 500M web data documents (RDF, RDFa, Microformat, etc.)
• 200K datasets
• 50B triples
Settings
• Cluster of 4 nodes
  – 2 nodes for indexing
  – 2 nodes for querying
• Replication
Indexing Performance
• Full index construction takes approx. 24 hours
• 436K triples / second
Services
• Keyword and structured queries
• Dataset search
• >> 99% uptime
Large scale RDF “Summaries”
Introducing large scale RDF “Summaries”
We do it for:
• Data exploration
  – How to find datasets about movies?
• Assisted SPARQL Query Editor
  – What is the data structure?
• Dataset Quality
  – How to differentiate relevant from irrelevant datasets?
Large Scale RDF summaries
10B relationships
12M relationships
Class Level
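One plausible reading of a class-level summary is that relationships between individual entities are collapsed into relationships between their classes, with a count per class-level edge. The Python sketch below illustrates that idea on toy data; it is an assumption about the general technique, not the Sindice implementation, and the function class_level_summary and the movie data are hypothetical.

```python
from collections import Counter


def class_level_summary(types, relationships):
    """Collapse entity-level (s, p, o) relationships into class-level edges."""
    summary = Counter()
    for subject, predicate, obj in relationships:
        # An entity may have several classes; count one edge per class pair.
        for subject_class in types.get(subject, ()):
            for object_class in types.get(obj, ()):
                summary[(subject_class, predicate, object_class)] += 1
    return summary


# Hypothetical toy data: entity -> classes, plus entity-level relationships.
types = {
    "m1": ["Movie"], "m2": ["Movie"],
    "p1": ["Person"], "p2": ["Person"],
}
relationships = [
    ("m1", "director", "p1"),
    ("m2", "director", "p2"),
    ("m1", "actor", "p2"),
]

summary = class_level_summary(types, relationships)
for edge, count in sorted(summary.items()):
    print(edge, count)
# ('Movie', 'actor', 'Person') 1
# ('Movie', 'director', 'Person') 2
```

Three entity-level relationships shrink to two class-level edges here; a summary of this kind is small enough to answer "what is the data structure?" for a query editor or to look for datasets about movies.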
Sindice Analytics Widget Demo
• http://test01.sindice.net:9001/sindice-stats-webapp/
• http://test01.sindice.net/szydan/dataset-view/dataset/default/www.bbc.co.uk
Relational Faceted Browsing. At speed of light
Patent Pending
SPARQL is awesome. And now your guys can actually use it.
Thank you
Sindice.com team April 2012
With the contribution of