KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
Linked Data Query Processing Strategies
Günter Ladwig, Thanh Tran
International Semantic Web Conference 2010, Shanghai
November 11th, 2010
Contents
Introduction
Challenges
Contributions
Linked Data Query Processing Strategies
Stream-based Query Processing
Corrective Source Ranking
Evaluation
Conclusion
ISWC 2010, Shanghai, China
What is Linked Data?
Linked Data Principles
Use URIs to identify things
Use HTTP URIs that allow dereferencing
Dereferencing a URI provides information about the thing in a standard format (RDF)
Include links to other, related URIs
Linked Data Query Processing
Evaluate queries directly over Linked Data
Dereference Linked Data URIs during query processing
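A minimal sketch of what dereferencing a URI involves at the HTTP level, assuming content negotiation through an Accept header. The media types, the function name and the `#me` fragment are illustrative assumptions, and the request is only built, not sent:

```python
from urllib.parse import urldefrag
from urllib.request import Request

# Assumed RDF media types; a real client may negotiate others.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"

def build_dereference_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET request for a Linked Data URI."""
    # A fragment is never sent to the server, so the source to retrieve
    # is the URI without its fragment.
    source_url = urldefrag(uri).url
    return Request(source_url, headers={"Accept": RDF_ACCEPT})

# Hypothetical URI with a fragment, based on the example namespace of the slides.
request = build_dereference_request("http://sw.org/person/AB#me")
print(request.full_url)
print(request.get_header("Accept"))
```

Every such request is a network round trip, which is why the number of dereferenced sources dominates query processing cost.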
Challenges
Volume of Source Collection
Each URI is a potential data source
Dynamics of Source Collection
Sources may change rapidly over time
Sources might only be discovered at run-time
Heterogeneity of Sources, Source Descriptions and Access Methods
Sources vary in size
Descriptions of sources vary in completeness
Access methods: URI lookup, SPARQL endpoints, local cache, ...
Contributions
Discussion of Linked Data Query Processing strategies
Mixed strategy, combining local indexes and run-time discovery
Stream-based Query Processing
Data can arrive at any time and in any order
Suited to deal with network latency
Corrective Source Ranking
Deals with different types of source descriptions
Ranking is refined at run-time
LINKED DATA QUERY PROCESSING STRATEGIES
Top-down Query Evaluation
Local index, assumed to be complete
Selection and ranking of sources
No run-time discovery
Fast, only relevant sources are retrieved
Not up-to-date, index size may become very large
Steps: select and rank sources, retrieve sources, join data
Example query:
SELECT ?paper ?author WHERE {
  ?paper swrc:author ?author .
  ?paper swc:isPartOf ?proc .
  ?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
}
Probe the local source index:
Source URI                 Score
http://sw.org/person/AB    0.87
...                        ...
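Probing the index can be sketched as follows. This is only an illustration: the index layout (cardinalities per predicate and source), the scoring by normalised cardinality sum, and all URIs other than those on the slide are assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical local source index: for each query predicate, the sources that
# contain matching triples and the number of matches (cardinality).
SOURCE_INDEX = {
    "swrc:author": {"http://sw.org/person/AB": 12, "http://sw.org/paper/1": 3},
    "swc:isPartOf": {"http://sw.org/paper/1": 1, "http://sw.org/proc/eswc/2010": 40},
    "swc:relatedToEvent": {"http://sw.org/proc/eswc/2010": 1},
}

def select_and_rank(predicates, index):
    """Return (source, score) pairs, best first, using only the local index."""
    totals = defaultdict(int)
    for predicate in predicates:
        for source, cardinality in index.get(predicate, {}).items():
            totals[source] += cardinality
    best = max(totals.values(), default=1)
    ranked = [(source, count / best) for source, count in totals.items()]
    # Highest score first; ties are broken by URI to keep the order stable.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

ranking = select_and_rank(
    ["swrc:author", "swc:isPartOf", "swc:relatedToEvent"], SOURCE_INDEX
)
for source, score in ranking:
    print(f"{source}\t{score:.2f}")
```

Sources that are missing from the index never appear in the ranking, which is the completeness assumption the top-down strategy depends on.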
Bottom-up Query Evaluation
Sources are discovered at run-time through links
Answers can be incomplete as links might not be discoverable
Slower, as unnecessary sources are retrieved
Always up-to-date
Example query:
SELECT ?paper ?author WHERE {
  ?paper swrc:author ?author .
  ?paper swc:isPartOf ?proc .
  ?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
}
Retrieve source:
<http://sw.org/proc/eswc/2010> swc:relatedToEvent <http://sw.org/eswc/2010> .
...
Discover new sources:
swc:paper1 swc:isPartOf <http://sw.org/proc/eswc/2010> .
...
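The retrieve-and-discover loop can be sketched as a breadth-first traversal. This is an assumption-laden illustration: the in-memory `WEB` dictionary stands in for HTTP lookups, and the paper URI is hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": source URI -> triples returned when dereferenced.
WEB = {
    "http://sw.org/eswc/2010": [
        ("http://sw.org/proc/eswc/2010", "swc:relatedToEvent", "http://sw.org/eswc/2010"),
    ],
    "http://sw.org/proc/eswc/2010": [
        ("http://sw.org/paper/1", "swc:isPartOf", "http://sw.org/proc/eswc/2010"),
    ],
    "http://sw.org/paper/1": [
        ("http://sw.org/paper/1", "swrc:author", "http://sw.org/person/AB"),
    ],
}

def traverse(seed_uris, query_predicates, dereference):
    """Retrieve sources breadth-first, following links found in matching triples."""
    queue = deque(seed_uris)
    seen = set(seed_uris)
    retrieved, triples = [], []
    while queue:
        source = queue.popleft()
        retrieved.append(source)
        for s, p, o in dereference(source):
            if p not in query_predicates:
                continue  # triple does not match any query pattern
            triples.append((s, p, o))
            for uri in (s, o):
                if uri.startswith("http") and uri not in seen:
                    seen.add(uri)  # newly discovered source
                    queue.append(uri)
    return retrieved, triples

order, data = traverse(
    ["http://sw.org/eswc/2010"],
    {"swrc:author", "swc:isPartOf", "swc:relatedToEvent"},
    lambda uri: WEB.get(uri, []),
)
print(order)
```

The last source in `order` returns no triples at all, a small instance of the unnecessary retrievals that make the bottom-up strategy slower.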
Mixed Strategy
Combination of top-down and bottom-up strategies
Partial local index of sources, not assumed to be complete
New sources are discovered at run-time
Addresses volume and dynamics of Linked Data
Corrective Source Ranking
Deal with heterogeneous source descriptions
Stream-based Query Processing
Deal with unpredictable nature of Linked Data access
STREAM-BASED QUERY PROCESSING
Stream-based Query Processing
Network latency
Do not block!
Evaluation driven by incoming data
Compile-time
Construct query plan
Probe local index for sources
Run-time
Rank sources
Retrieve sources
Push data into query plan
Discover new sources
[Figure: system architecture. Query plan: two Join operators over the triple patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n), producing Results. Source retrieval: Source Retriever 1, Source Retriever 2, ... push data into the query plan. Source Ranker: keeps the ranked sources (Source 1, score 1.0; Source 2, score 0.7; ...), is fed by the local source index, by samples and by discovered sources, and selects the next source to retrieve.]
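The run-time steps can be sketched as a loop over a priority queue of sources. This is a simplified, sequential illustration: the architecture above uses several retrievers so that no single lookup blocks evaluation, and the default score for discovered sources, the function names and the example URIs are all assumptions.

```python
import heapq

def run(indexed_scores, dereference, push, link_score=0.1):
    """Retrieve sources in score order; newly discovered sources join the queue."""
    # heapq is a min-heap, so scores are negated to pop the best source first.
    heap = [(-score, uri) for uri, score in indexed_scores.items()]
    heapq.heapify(heap)
    known = set(indexed_scores)
    order = []
    while heap:
        _, source = heapq.heappop(heap)
        order.append(source)
        for triple in dereference(source):
            push(triple)  # data flows into the query plan as it arrives
            for uri in (triple[0], triple[2]):
                if uri.startswith("http") and uri not in known:
                    known.add(uri)
                    heapq.heappush(heap, (-link_score, uri))
    return order

# Hypothetical sources: two are indexed, one is only reachable through a link.
web = {
    "http://a.example/1": [("http://a.example/1", "knows", "http://b.example/2")],
    "http://c.example/3": [],
}
pushed = []
order = run(
    {"http://a.example/1": 1.0, "http://c.example/3": 0.7},
    lambda uri: web.get(uri, []),
    pushed.append,
)
print(order)
```

The discovered source is retrieved last because it only carries the assumed default score until better information about it becomes available.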
Push-based Symmetric Hash Join
Operation
Maintains a hash table for each input
An arriving tuple is inserted into the hash table of its own input, then the other hash table is probed for join combinations
Push-based
Tuples are pushed into operators from the leaves to the root of the query plan
Execution is driven by incoming tuples rather than by requests for results
Results reported as soon as input tuples arrive
Tuples can arrive on all inputs in any order
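Example: t7 (key b) is pushed on the left input
Left input hash table (before):
Key   Tuples
a     t1, t3
b     t2
Right input hash table:
Key   Tuples
b     t4, t5
c     t6
Insert: t7 is added to the left hash table
Left input hash table (after):
Key   Tuples
a     t1, t3
b     t2, t7
Probe: key b in the right hash table matches t4 and t5
Push output: t7t4, t7t5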
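The example above can be reproduced with a short sketch of a push-based symmetric hash join. The class and method names are illustrative, and tuples are reduced to identifiers with a single join key.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Push-based symmetric hash join on a single join key."""

    def __init__(self, emit):
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.emit = emit  # callback that receives each joined pair

    def push(self, side, key, tuple_id):
        other = "right" if side == "left" else "left"
        # Insert into the hash table of the input the tuple arrived on ...
        self.tables[side][key].append(tuple_id)
        # ... then probe the other table for join partners.
        for match in self.tables[other].get(key, []):
            pair = (tuple_id, match) if side == "left" else (match, tuple_id)
            self.emit(pair)

results = []
join = SymmetricHashJoin(results.append)
for key, t in [("a", "t1"), ("b", "t2"), ("a", "t3")]:
    join.push("left", key, t)
for key, t in [("b", "t4"), ("b", "t5"), ("c", "t6")]:
    join.push("right", key, t)

before = len(results)
join.push("left", "b", "t7")
print(results[before:])  # [('t7', 't4'), ('t7', 't5')]
```

Because both inputs are hashed, the operator never waits for one side to finish, so tuples may arrive on either input in any order.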
CORRECTIVE SOURCE RANKING
Corrective Source Ranking
Prefer more relevant sources
Relevancy of a source is based on
Current query
Any available intermediate results
Overall optimization goal
Define a set of source features and derive concrete source metrics
Not all metrics are available for all sources (heterogeneity)
Refine previously computed metrics using newly discovered information
Source Features and Metrics
Source is more relevant if it contains data that contributes to answers of the query
Triple Pattern Cardinality
Join Pattern Cardinality
Cardinalities stored in local index
Some patterns have high cardinality for all or many sources (e.g. )
These patterns do not discriminate sources
Source Features and Metrics
Adopt TF-IDF concept to obtain weights for triple patterns
Importance positively correlates with how often bindings to a pattern occur in a source (i.e. cardinality)
Importance negatively correlates with how often its bindings occur in all sources of the source collection S
Triple Frequency – Inverse Source Frequency (TF-ISF)
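The slide text does not spell out the formula, so the following is an assumed, direct TF-IDF analogue rather than the authors' exact definition: the cardinality of a pattern in a source, multiplied by the logarithm of |S| over the number of sources containing the pattern. The pattern strings and source names are hypothetical.

```python
import math

def tf_isf(pattern, source, cardinalities):
    """cardinalities: {source: {pattern: number of matching triples}}."""
    tf = cardinalities[source].get(pattern, 0)
    containing = sum(
        1 for cards in cardinalities.values() if cards.get(pattern, 0) > 0
    )
    if tf == 0 or containing == 0:
        return 0.0
    # Assumed weighting: plain log(|S| / number of sources with the pattern).
    return tf * math.log(len(cardinalities) / containing)

# Hypothetical collection S of four sources.
CARDS = {
    "s1": {"?x rdf:type ?y": 50, "?x swrc:author ?y": 5},
    "s2": {"?x rdf:type ?y": 80},
    "s3": {"?x rdf:type ?y": 20},
    "s4": {"?x rdf:type ?y": 10, "?x swrc:author ?y": 1},
}
print(tf_isf("?x rdf:type ?y", "s1", CARDS))
print(tf_isf("?x swrc:author ?y", "s1", CARDS))
```

Under this assumed weighting, a pattern that occurs in every source gets weight 0 regardless of its cardinality, while a pattern found in only half of the sources keeps a positive weight.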
Source Features and Metrics - Links
Source linked from many other sources is more relevant
Relevance is higher when these links match query predicates
Links are only discovered at run-time
Metric Correction and Refinement
During query processing new information becomes available: intermediate join results, links
Refine and correct previously computed metrics
Important in the case of non-discriminative patterns
Instantiate triple pattern of a join with samples of intermediate results to obtain better join size estimates
Example: intermediate results in the SHJ operator are sampled, and triple pattern cardinality lookups are performed for the samples
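A sketch of sample-based join size estimation under simple assumptions: the bindings held by the join operator are sampled uniformly, each sampled binding is used to instantiate the triple pattern, and the average cardinality is scaled to all bindings. The function name, the lookup callback and the `ex:` identifiers are hypothetical.

```python
import random

def estimate_join_size(bindings, lookup_cardinality, sample_size, rng=random):
    """Estimate how many triples of one source join with the bindings seen so far."""
    if not bindings:
        return 0.0
    sample = rng.sample(bindings, min(sample_size, len(bindings)))
    # Instantiate the triple pattern with each sampled binding, look up its
    # cardinality in the source, then scale the average to all bindings.
    matched = sum(lookup_cardinality(binding) for binding in sample)
    return matched / len(sample) * len(bindings)

# Hypothetical bindings for ?x held in the hash table of a join operator, and
# hypothetical cardinalities of knows(?x, ?y) in one source once ?x is bound.
bindings = ["ex:p1", "ex:p2", "ex:p3", "ex:p4"]
source_cardinality = {"ex:p1": 3, "ex:p3": 1}

def lookup(binding):
    return source_cardinality.get(binding, 0)

exact = estimate_join_size(bindings, lookup, sample_size=4)
rough = estimate_join_size(bindings, lookup, sample_size=2, rng=random.Random(0))
print(exact, rough)
```

Sampling all four bindings gives the true join size of 4.0, while a sample of two can be off in either direction, which is the accuracy versus cost trade-off behind the sample size.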
Ranking at Run-time
Optimization goal: early result reporting
Indexed sources: triple and join pattern cardinality, TF-ISF, weighted links, sampled join size estimates
Discovered sources: weighted links
Ranking has to be refined at run-time
Parameters influencing behavior and cost of ranking process
Invalid Score Threshold: ranking is performed when the number of sources with invalid scores passes a threshold
Sample Size: larger samples for join size estimation give better estimates, but are also more costly
Resampling Threshold: cache join size estimates and perform sampling only when the hash table of join operator grows past a given threshold
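How the two thresholds could gate the ranking work is sketched below. The class and its methods are hypothetical, and reading "grows past a given threshold" as growth since the last sampling is an assumption.

```python
class RankingPolicy:
    """Decides when to re-rank sources and when to re-sample join sizes."""

    def __init__(self, invalid_score_threshold, resampling_threshold):
        self.invalid_score_threshold = invalid_score_threshold
        self.resampling_threshold = resampling_threshold
        self.invalid_scores = 0
        self.table_size_at_last_sample = 0

    def source_invalidated(self):
        # Called when new information makes the score of a source outdated.
        self.invalid_scores += 1

    def should_rerank(self):
        return self.invalid_scores > self.invalid_score_threshold

    def reranked(self):
        self.invalid_scores = 0

    def should_resample(self, hash_table_size):
        # Assumption: re-sample once the table grew by more than the threshold
        # since the last sampling; otherwise reuse the cached estimate.
        growth = hash_table_size - self.table_size_at_last_sample
        if growth > self.resampling_threshold:
            self.table_size_at_last_sample = hash_table_size
            return True
        return False

policy = RankingPolicy(invalid_score_threshold=2, resampling_threshold=100)
decisions = []
for _ in range(3):
    policy.source_invalidated()
    decisions.append(policy.should_rerank())
print(decisions)  # [False, False, True]
print(policy.should_resample(50), policy.should_resample(150))  # False True
```

Higher thresholds mean fewer ranking and sampling passes at the price of working with staler scores.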
EVALUATION
Evaluation
Systems: top-down (TD), bottom-up (BU), mixed (MI)
8 queries over various datasets (DBpedia, Geonames, NYT, Freebase, ...)
To make the approaches comparable, sources were restricted to those discoverable by the BU approach
~6200 sources, containing ~500k triples
Sources hosted on local proxy server with artificial delay of 2 seconds
25% of sources were randomly chosen to construct index for MI
Results
Detailed results for two queries:
                  Query 1                       Query 6
                  BU        MI        TD        BU        MI        TD
25% Results       24810.5   10300.0   11038.0   8222.5    4743.5    5545.0
50% Results       43464.5   40782.0   15787.0   10961.5   7650.5    5634.0
Total             84066.5   86895.5   44323.5   24086.0   20711.0   16469.0
Src. Selection    0.0       853.0     1444.5    0.0       1331.0    1863.5
Ranking           25.5      2404.0    411.5     23.5      292.5     335.0
#Sources          622       612       154       236       92        49
Overall early result reporting
25% results: MI 8.7s, BU 15.1s
50% results: MI 12.8s, BU 22.0s
Improvement of ~42%
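The ~42% figure is consistent with the overall MI and BU times reported on this slide, assuming that improvement means the relative reduction in time of MI compared to BU:

```python
def improvement(baseline_s, improved_s):
    """Relative reduction in time with respect to the baseline."""
    return (baseline_s - improved_s) / baseline_s

at_25 = improvement(15.1, 8.7)    # time to 25% of results, BU vs. MI
at_50 = improvement(22.0, 12.8)   # time to 50% of results, BU vs. MI
print(f"{at_25:.1%} {at_50:.1%}")  # 42.4% 41.8%
```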
Result Arrival Times
Ranking Heuristics
Conclusion
Mixed strategy for Linked Data Query Processing
Partial knowledge available beforehand, combined with source discovery at run-time
Corrective Source Ranking
Metrics for source relevancy
Refinement of ranking at run-time
Stream-based Query Processing
Early results reported on average 42% faster
Future work
Adapt query plan to changing properties of incoming data
Query local and remote data