SW-Store: a vertically partitioned DBMS for Semantic Web data Management

SW-STORE: A VERTICALLY PARTITIONED DBMS FOR SEMANTIC WEB DATA MANAGEMENT

Surabhi Mithal Nipun Garg

INTRODUCTION TO SEMANTIC WEB : AN EXAMPLE

ISBN Author Title Publisher Year0006511409X id_xyz The Glass Palace id_qpr 2000

ID Name Homepageid_xyz Ghosh, Amitav http://www.amitavghosh.com

ID Publisher’s name Cityid_qpr Harper Collins London

Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/

A simplified bookstore data (dataset “A”)

http://www.w3.org/People/Ivan/CorePresentations/SWTutorial

http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/

EXAMPLE CONT : GRAPH REPRESENATION

http://…isbn/000651409X

Ghosh, Amitav

http://www.amitavghosh.com

The Glass Palace

2000

London

Harper Collins

a:title

a:year

a:city

a:p_name

a:name a:homepage

a:authora:publisher

ANOTHER BOOKSTORE DATA (DATASET “F”)

A B C D

1 ID Titre Traducteur

Original

2 ISBN 2020286682

Le Palais des Miroirs

$A12$ ISBN 0-00-6511409-X

3

4

5

6 ID Auteur7 ISBN 0-00-

6511409-X$A11$

8

9

10 Nom11 Ghosh, Amitav12 Besse,

Christianne

EXAMPLE CONT : GRAPH REPRESENATION

http://…isbn/000651409X

Ghosh, Amitav

Besse, Christianne

Le palais des miroirs

f:original

f:nom

f:traducteur

f:auteur f:titre

http://…isbn/2020386682

f:nom

DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB

http://…isbn/000651409X

Ghosh, Amitav

Besse, Christianne


f:origina

l

f:nom

f:traducteur

f:auteur f:titre

http://…isbn/2020386682

f:nom

http://…isbn/000651409X

Ghosh, Amitavhttp://www.amitavghosh.com

The Glass Palace

2000

London

Harper Collins

a:title

a:year

a:city

a:p_name

a:name a:homepag

e

a:author

a:publish

er

DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB

http://…isbn/000651409X

Ghosh, Amitav

Besse, Christianne


f:origin

al

f:nom

f:traducteur

f:auteur f:titre

http://…isbn/2020386682

f:nom

http://…isbn/000651409X


The Glass Palace

2000

London

Harper Collins

a:title

a:year

a:city

a:p_name

a:name a:homepag

e

a:author

a:publish

erSAME URI

DATA INTEGRATION ACROSS THE TWO DATASETS :SEMANTIC WEB

a:title

Ghosh, Amitav

Besse, Christianne


f:original

f:nom

f:traducteur

f:auteur

f:titre

http://…isbn/2020386682

f:nom


The Glass Palace

2000

London

Harper Collins

a:year

a:city

a:p_name

a:name a:homepag

e

a:author

a:publish

er

http://…isbn/000651409X

User of data “F” can now ask queries like:“give me the title of the original”

DATA INTEGRATION ACROSS THE TWO DATASETS :SEMANTIC WEB

Besse, Christianne

Le palais des miroirsf:origin

al

f:nom

f:traducteur

f:auteur

f:titre

http://…isbn/2020386682

f:nom


The Glass Palace

2000

London

Harper Collins

a:titlea:year

a:city

a:p_name

a:name a:homepag

e

a:author

a:publish

er

http://…isbn/000651409X

http://…foaf/Person

r:typer:type

Even richer queries can be answered -“give me the home page of the original’s ‘auteur’”

PROBLEM STATEMENT Semantic web concept has issues related to scalability and

performance. The current storage procedures for RDF do not perform well

due to the nature and size of the data. RDF data is in the form of triples <subject,property,object>.

These triples are then stored in a relational database. This table becomes really large and when queries are executed there are a lot of self joins which lead to poor performance.

Traditional row oriented databases do not perform well with this kind of data and there is a need for a new mechanism to store data.

MAJOR CONTRIBUTIONS

The analysis of current mechanisms storing RDF data in databases.

Introduction of a new concept of vertically partitioning RDF data to improve query performance and a mechanism to use a column-oriented database with the vertical partitioning approach to improve performance and increase simplicity.

The performance evaluation of the new and existing techniques with a real world and appropriate example and a good explanation of the results are provided. The results show that the new technique outperforms the existing mechanism by a significant magnitude.

A new column oriented database SW-store is proposed which is based on storing vertically partitioned RDF data in an effective manner. The basic structure of this database is explained with some examples.

KEY CONCEPTS – PROPERTY TABLES

Property Clustered Tables A data clustering algorithm is used to find out the related

properties. The limitation with this type of property table is that a

property can occur at most only in one property table.

Property Class Tables This approach creates clusters based on subject’s type

property and one property can occur in multiple property tables.

NULLs in data. Multivalued attributes.

SAMPLE DATABASE

Source: - SW-Store: a vertically partitioned DBMS for Semantic Web data management

KEY CONCEPTS: VERTICAL PARTITIONING The authors propose a storage mechanism which involves vertical

partitioning of data and further storing this vertically partitioned data into a column oriented database.

A two column table for each unique property in the RDF data is created where the first column contains subjects and the second column contains the object values for that property. The advantages of this approach are the following: Effective handling of Multivalued attributes. Elimination of NULLs Clustering of properties The number of unions is less.

Column oriented

no wastage of bandwidth as projections on data happen before it is pulled into main memory

record header is stored in separate columns thus reducing the tuple width and letting us choose different compression techniques for each column.

KEY CONCEPTS: SW-STORE The authors propose SW-store which is a column oriented

DBMS optimized for storing RDF triples after partitioning them vertically. The Key concepts in this DBMS include:-

Storage system: - Some properties can be stored as a single column table having one to one mapping with the table containing subjects.

Query engine: - Lack of tree structure in query plan for column oriented databases can lead to problems where there can be mismatch of rate of consumption of data. This is overcome by ensuring that the graph is still rooted with a single node with no parents and the parents request the data at the same rate.

Overflow tables and Query translation Materialized joins: - Subject object joins can be removed using

materialized paths. An example is storing multiple properties as one property to prevent multiple joins for frequent queries.

ASSUMPTIONS MADE BY THE AUTHOR

Postgres is assumed to be the best available choice for a row oriented RDBMS because of effective handling of NULLs.

Authors assume that queries that do not restrict on property values are very rare for RDF applications and have not evaluated their technique for these queries.

Authors assume that there will be a moderate amount of Insert/Updates on RDF store which will keep the compression and decompression of the data to a minimum.

Authors have assumed that both attributes of two column tables in vertically partitioned RDF store are fixed length using dictionary encoding for strings but have ignored the overheads involved for this.

VALIDATION METHODOLOGY The research group uses the dataset taken from the publicly

available Barton Libraries dataset provided by the Simile Project at MIT (http://simile.mit.edu/rdf-test-data/barton).

The set of queries is based on a browsing session of Long well, a UI built by Simile group for querying the library dataset. These queries are then executed on the all the existing mechanisms and the new techniques proposed namely:- Triple data store (subject, property, object table with no

improvements on Postgres). Property tables ( on Postgres) Vertically partitioned data in a row oriented store (Postgres). Vertically partitioned data in a column oriented store (C- Store).

The Results are captured and analysed.

VALIDATION METHODOLOGY Strengths :

Real world data and effective coverage of the different query scenarios. Comparison of all the techniques which exist and the proposed technique in an

effective medium which is not biased towards any particular approach. The inclusion of special/practical scenarios and revaluating the results is not

very hard using this methodology.

Weaknesses :- Avoiding queries involving unrestricted property problem which are

particularly prevalent for vertical partitioned scenarios. The methodology uses clustering technique for property table’s, accuracy of

which is assumed very high by the authors. With the different setting i.e. use of a different underlying databases the

performance may differ.

Reasons for choosing this methodology The paper describes the storage and effective retrieval of RDF data and the

most logical way to test database performance is to see the response time of complex real world queries.

An unstructured real world data set is already available publically with an effective UI to query it. This helped the authors design the execution flow simulating a real world flow.

EXAMPLE QUERIES AND EXECUTION ANALYSIS

Q1. Counts the different types of data in the RDF store. This requires a search for the objects and counts of those objects with property Type.

Analysis:The vertically partitioned table and the column-store aggregate the object values for the Type table. Because the property table solution has the same schema as the vertically partitioned table for this query, the query plan is the same.

Q2. The user selects Type: Text from the previous panel. Longwell must then

display a list of other defined properties for resources of Type: Text. It must also calculate the frequency of these properties. For example, the Language property is defined 1,028,826 times for resources that are of Type: Text.

Analysis :On a triple-store, this query requires a selection on property=Type an

object=Text, followed by a self-join on subject to find what other properties are defined for these subjects. The final step is an aggregation over the properties of the newly joined triples table.

In the property table solution, the selection predicate Type=Text is applied, and then the counts of the non-NULL values for each of the 28 columns is written to a temporary table. The counts are then selected out of the temporary table and unioned together to produce the correct results schema.

The vertically partitioned store and column-store select the subjects for which the Type table has object value Text, and store these in a temporary table, t. They then union the results of joining each property’s table with t and count all elements of the resulting joins.

RESULTS

From the results, it is clear that column-stores present some clear advantages in storing RDF data.

A new RDF database, called SW-Store has been proposed which will be faster in executing RDF based queries

IMPROVEMENTS Authors have not paid much attention to the RDF concept from

spatial perspective – Schema design- Queries are fired on vertically partitioned tables

as well as overflow tables. Owing to the heaviness of spatial data, there should be some spatial indexing like R* TREE or GRID to make these queries faster.

Restrictive nature - Spatial queries are not restricted to only specific “properties” which is an important assumption on their part.

E.g. Landmarks Tables should be partitioned in a better way rather than just

handling one property per table!e.g. Taxonomies can be helpful grouping similar properties

together.

Thank you !

Documents

SW-Store: a vertically partitioned DBMS for Semantic Web data Management