Upload
odelia
View
48
Download
1
Embed Size (px)
DESCRIPTION
SW-Store: a vertically partitioned DBMS for Semantic Web data Management. Surabhi Mithal Nipun Garg. Introduction to semantic web : An example. A simplified bookstore data (dataset “A”). Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial /. - PowerPoint PPT Presentation
Citation preview
SW-STORE: A VERTICALLY PARTITIONED DBMS FOR SEMANTIC WEB DATA MANAGEMENT
Surabhi Mithal Nipun Garg
INTRODUCTION TO SEMANTIC WEB : AN EXAMPLE
ISBN Author Title Publisher Year0006511409X id_xyz The Glass Palace id_qpr 2000
ID Name Homepageid_xyz Ghosh, Amitav http://www.amitavghosh.com
ID Publisher’s name Cityid_qpr Harper Collins London
Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/
A simplified bookstore data (dataset “A”)
EXAMPLE CONT : GRAPH REPRESENATION
http://…isbn/000651409X
Ghosh, Amitav
http://www.amitavghosh.com
The Glass Palace
2000
London
Harper Collins
a:title
a:year
a:city
a:p_name
a:name a:homepage
a:authora:publisher
ANOTHER BOOKSTORE DATA (DATASET “F”)
A B C D
1 ID Titre Traducteur
Original
2 ISBN 2020286682
Le Palais des Miroirs
$A12$ ISBN 0-00-6511409-X
3
4
5
6 ID Auteur7 ISBN 0-00-
6511409-X$A11$
8
9
10 Nom11 Ghosh, Amitav12 Besse,
Christianne
EXAMPLE CONT : GRAPH REPRESENATION
http://…isbn/000651409X
Ghosh, Amitav
Besse, Christianne
Le palais des miroirs
f:original
f:nom
f:traducteur
f:auteur f:titre
http://…isbn/2020386682
f:nom
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
http://…isbn/000651409X
Ghosh, Amitav
Besse, Christianne
Le palais des miroirs
f:origina
l
f:nom
f:traducteur
f:auteur f:titre
http://…isbn/2020386682
f:nom
http://…isbn/000651409X
Ghosh, Amitavhttp://www.amitavghosh.com
The Glass Palace
2000
London
Harper Collins
a:title
a:year
a:city
a:p_name
a:name a:homepag
e
a:author
a:publish
er
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
http://…isbn/000651409X
Ghosh, Amitav
Besse, Christianne
Le palais des miroirs
f:origin
al
f:nom
f:traducteur
f:auteur f:titre
http://…isbn/2020386682
f:nom
http://…isbn/000651409X
Ghosh, Amitavhttp://www.amitavghosh.com
The Glass Palace
2000
London
Harper Collins
a:title
a:year
a:city
a:p_name
a:name a:homepag
e
a:author
a:publish
erSAME URI
DATA INTEGRATION ACROSS THE TWO DATASETS :SEMANTIC WEB
a:title
Ghosh, Amitav
Besse, Christianne
Le palais des miroirs
f:original
f:nom
f:traducteur
f:auteur
f:titre
http://…isbn/2020386682
f:nom
Ghosh, Amitavhttp://www.amitavghosh.com
The Glass Palace
2000
London
Harper Collins
a:year
a:city
a:p_name
a:name a:homepag
e
a:author
a:publish
er
http://…isbn/000651409X
User of data “F” can now ask queries like:“give me the title of the original”
DATA INTEGRATION ACROSS THE TWO DATASETS :SEMANTIC WEB
Besse, Christianne
Le palais des miroirsf:origin
al
f:nom
f:traducteur
f:auteur
f:titre
http://…isbn/2020386682
f:nom
Ghosh, Amitavhttp://www.amitavghosh.com
The Glass Palace
2000
London
Harper Collins
a:titlea:year
a:city
a:p_name
a:name a:homepag
e
a:author
a:publish
er
http://…isbn/000651409X
http://…foaf/Person
r:typer:type
Even richer queries can be answered -“give me the home page of the original’s ‘auteur’”
PROBLEM STATEMENT Semantic web concept has issues related to scalability and
performance. The current storage procedures for RDF do not perform well
due to the nature and size of the data. RDF data is in the form of triples <subject,property,object>.
These triples are then stored in a relational database. This table becomes really large and when queries are executed there are a lot of self joins which lead to poor performance.
Traditional row oriented databases do not perform well with this kind of data and there is a need for a new mechanism to store data.
MAJOR CONTRIBUTIONS
The analysis of current mechanisms storing RDF data in databases.
Introduction of a new concept of vertically partitioning RDF data to improve query performance and a mechanism to use a column-oriented database with the vertical partitioning approach to improve performance and increase simplicity.
The performance evaluation of the new and existing techniques with a real world and appropriate example and a good explanation of the results are provided. The results show that the new technique outperforms the existing mechanism by a significant magnitude.
A new column oriented database SW-store is proposed which is based on storing vertically partitioned RDF data in an effective manner. The basic structure of this database is explained with some examples.
KEY CONCEPTS – PROPERTY TABLES
Property Clustered Tables A data clustering algorithm is used to find out the related
properties. The limitation with this type of property table is that a
property can occur at most only in one property table.
Property Class Tables This approach creates clusters based on subject’s type
property and one property can occur in multiple property tables.
NULLs in data. Multivalued attributes.
SAMPLE DATABASE
Source: - SW-Store: a vertically partitioned DBMS for Semantic Web data management
KEY CONCEPTS: VERTICAL PARTITIONING The authors propose a storage mechanism which involves vertical
partitioning of data and further storing this vertically partitioned data into a column oriented database.
A two column table for each unique property in the RDF data is created where the first column contains subjects and the second column contains the object values for that property. The advantages of this approach are the following: Effective handling of Multivalued attributes. Elimination of NULLs Clustering of properties The number of unions is less.
Column oriented
no wastage of bandwidth as projections on data happen before it is pulled into main memory
record header is stored in separate columns thus reducing the tuple width and letting us choose different compression techniques for each column.
KEY CONCEPTS: SW-STORE The authors propose SW-store which is a column oriented
DBMS optimized for storing RDF triples after partitioning them vertically. The Key concepts in this DBMS include:-
Storage system: - Some properties can be stored as a single column table having one to one mapping with the table containing subjects.
Query engine: - Lack of tree structure in query plan for column oriented databases can lead to problems where there can be mismatch of rate of consumption of data. This is overcome by ensuring that the graph is still rooted with a single node with no parents and the parents request the data at the same rate.
Overflow tables and Query translation Materialized joins: - Subject object joins can be removed using
materialized paths. An example is storing multiple properties as one property to prevent multiple joins for frequent queries.
ASSUMPTIONS MADE BY THE AUTHOR
Postgres is assumed to be the best available choice for a row oriented RDBMS because of effective handling of NULLs.
Authors assume that queries that do not restrict on property values are very rare for RDF applications and have not evaluated their technique for these queries.
Authors assume that there will be a moderate amount of Insert/Updates on RDF store which will keep the compression and decompression of the data to a minimum.
Authors have assumed that both attributes of two column tables in vertically partitioned RDF store are fixed length using dictionary encoding for strings but have ignored the overheads involved for this.
VALIDATION METHODOLOGY The research group uses the dataset taken from the publicly
available Barton Libraries dataset provided by the Simile Project at MIT (http://simile.mit.edu/rdf-test-data/barton).
The set of queries is based on a browsing session of Long well, a UI built by Simile group for querying the library dataset. These queries are then executed on the all the existing mechanisms and the new techniques proposed namely:- Triple data store (subject, property, object table with no
improvements on Postgres). Property tables ( on Postgres) Vertically partitioned data in a row oriented store (Postgres). Vertically partitioned data in a column oriented store (C- Store).
The Results are captured and analysed.
VALIDATION METHODOLOGY Strengths :
Real world data and effective coverage of the different query scenarios. Comparison of all the techniques which exist and the proposed technique in an
effective medium which is not biased towards any particular approach. The inclusion of special/practical scenarios and revaluating the results is not
very hard using this methodology.
Weaknesses :- Avoiding queries involving unrestricted property problem which are
particularly prevalent for vertical partitioned scenarios. The methodology uses clustering technique for property table’s, accuracy of
which is assumed very high by the authors. With the different setting i.e. use of a different underlying databases the
performance may differ.
Reasons for choosing this methodology The paper describes the storage and effective retrieval of RDF data and the
most logical way to test database performance is to see the response time of complex real world queries.
An unstructured real world data set is already available publically with an effective UI to query it. This helped the authors design the execution flow simulating a real world flow.
EXAMPLE QUERIES AND EXECUTION ANALYSIS
Q1. Counts the different types of data in the RDF store. This requires a search for the objects and counts of those objects with property Type.
Analysis:The vertically partitioned table and the column-store aggregate the object values for the Type table. Because the property table solution has the same schema as the vertically partitioned table for this query, the query plan is the same.
Q2. The user selects Type: Text from the previous panel. Longwell must then
display a list of other defined properties for resources of Type: Text. It must also calculate the frequency of these properties. For example, the Language property is defined 1,028,826 times for resources that are of Type: Text.
Analysis :On a triple-store, this query requires a selection on property=Type an
object=Text, followed by a self-join on subject to find what other properties are defined for these subjects. The final step is an aggregation over the properties of the newly joined triples table.
In the property table solution, the selection predicate Type=Text is applied, and then the counts of the non-NULL values for each of the 28 columns is written to a temporary table. The counts are then selected out of the temporary table and unioned together to produce the correct results schema.
The vertically partitioned store and column-store select the subjects for which the Type table has object value Text, and store these in a temporary table, t. They then union the results of joining each property’s table with t and count all elements of the resulting joins.
RESULTS
From the results, it is clear that column-stores present some clear advantages in storing RDF data.
A new RDF database, called SW-Store has been proposed which will be faster in executing RDF based queries
IMPROVEMENTS Authors have not paid much attention to the RDF concept from
spatial perspective – Schema design- Queries are fired on vertically partitioned tables
as well as overflow tables. Owing to the heaviness of spatial data, there should be some spatial indexing like R* TREE or GRID to make these queries faster.
Restrictive nature - Spatial queries are not restricted to only specific “properties” which is an important assumption on their part.
E.g. Landmarks Tables should be partitioned in a better way rather than just
handling one property per table!e.g. Taxonomies can be helpful grouping similar properties
together.
Thank you !