"Although the cloud of Linked Open Data has been growing continuously for several years, little is known about the particular features of linked data usage. Motivating why it is important to understand the usage of Linked Data, we describe typical linked data usage scenarios and contrast the so derived requirement with conventional server access analysis. Then, we report on usage patterns found through an in-depth analysis of access logs of four popular LOD datasets. Eventually, based on the usage patterns we found in the analysis, we propose metrics for assessing Linked Data usage from the human and the machine perspective, taking into account different agent types and resource representations." Slides for a presentation at WebScience 2010. The paper is available for download at http://journal.webscience.org/302/.
13/03/2008 FAST kick-off, Madrid, 2008 Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
WebScience 2010, Raleigh, NC, USA, 26/04/2010
Learning from Linked Open Data Usage: Patterns & Metrics
Knud Möller, Michael Hausenblas, Richard Cyganiak, Gunnar Grimnes, Siegfried Handschuh
Copyright 2010 Knud Möller. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-sa/3.0/
Monday 26 April 2010
What is Linked (Open) Data? (in <1 minute)
Conventional “Eye-ball” Web: interlinked documents; mainly people / Web browsers
Web of Linked Data: interlinked items of data (URIs, RDF); mainly machine agents
Linked Open Data cloud (the set of interlinked, Semantic Web datasets)
February 2008
July 2009
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Question: How is Linked Data being Used?
• plenty of research on conventional Web usage
• what about usage of linked data?
Why?
• how healthy is the Web of linked data?
• who is using the data and how? Is it useful? Are there trends?
• providers: improve hosting
• ... just curiosity!
webometrics?
Approach
• particular sites:
– a URI for each data item ➙ a request for each data item (resource)
– content negotiation best practices
– redirection (HTTP 303)
plain resource URI: http://data.semanticweb.org/conference/www/2009
RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
HTML document URI: http://data.semanticweb.org/conference/www/2009/html
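The redirection pattern on this slide can be sketched in a few lines of Python. This is only an illustration, not the data.semanticweb.org implementation: the function name `negotiate` and the list of RDF media types are assumptions, and proper Accept-header parsing (q-values, wildcards) is left out.

```python
from typing import Tuple

# Assumption: these media types count as "RDF"; a real site may accept others.
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")

def negotiate(resource_path: str, accept_header: str) -> Tuple[int, str]:
    """Return (status code, Location path) for a request to a plain resource URI.

    The plain resource URI never returns a document itself; the client is
    redirected with HTTP 303 to either the RDF or the HTML document URI.
    """
    accept = accept_header.lower()
    wants_rdf = any(media_type in accept for media_type in RDF_MEDIA_TYPES)
    suffix = "/rdf" if wants_rdf else "/html"
    return 303, resource_path.rstrip("/") + suffix

print(negotiate("/conference/www/2009", "application/rdf+xml"))
print(negotiate("/conference/www/2009", "text/html"))
```

Each lookup of a resource therefore tends to leave two entries in the server log: the 303 on the plain resource URI and the 200 on the document URI.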
Approach (ctd.)
• server log files
– common log format (CLF), combined log format

80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"

Fields, in order: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent

• RDF requests vs. “semantic” requests
• 90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; [email protected])"
• 90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 45358 "-" "rdflib-2.4.0 (http://rdflib.net/; [email protected])"
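A log line in the combined log format can be split into its fields with a regular expression. The sketch below is one possible way to do this, not the tooling used for the analysis; the names `COMBINED` and `parse_line` are made up for illustration, and the pattern assumes the request string contains no double quotes.

```python
import re
from datetime import datetime

# Combined log format: IP, identity, user, [date], "request",
# status, size, "referrer", "user agent".
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    return record

line = ('80.219.211.147 - - [23/May/2009:09:52:03 +0100] '
        '"GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" '
        '200 64674 "-" "ARC Reader (http://arc.semsol.org/)"')
record = parse_line(line)
print(record["ip"], record["status"], record["agent"])
```

Lines that do not match, such as those from the modified RKBExplorer log format, come back as `None` and can be counted or inspected separately.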
Source Data
80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"
Fields, in order: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent
Figure 1: The combined log format
Dataset      # triples    # days  total # hits          # plain hits          # RDF hits          # HTML hits           SPARQL
Dog Food     79,175       597     8,427,967 (14,117)    1,923,945 (3,223)     259,031 (434)       1,647,205 (2,759)     879,932 (1,471)
DBpedia      109,750,000  118     87,203,310 (739,011)  22,821,475 (193,402)  7,008,310 (59,392)  22,999,237 (194,909)  20,972,630 (177,734)
DBTune       74,209,000   61      7,467,125 (122,412)   1,952,185 (32,003)    1,135,509 (18,615)  677,904 (11,113)      3,055,493 (50,090)
RKBExplorer  91,501,684   29      529,938 (18,274)      —                     —                   —                     9,327 (322)
Table 1: Overview of four LOD datasets
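The figures in parentheses in Table 1 are consistent with per-day averages, i.e. the hit count divided by the number of days covered by the logs. The short check below reproduces the values for the total-hits column. The helper name `per_day` is an assumption introduced here, and the input numbers are copied from the table.

```python
# (total hits, days covered), copied from Table 1.
DATASETS = {
    "Dog Food": (8_427_967, 597),
    "DBpedia": (87_203_310, 118),
    "DBTune": (7_467_125, 61),
    "RKBExplorer": (529_938, 29),
}

def per_day(hits: int, days: int) -> int:
    """Average hits per day, rounded to the nearest integer."""
    return round(hits / days)

for name, (hits, days) in DATASETS.items():
    print(f"{name}: {per_day(hits, days):,} hits per day")
```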
queries are served. For our evaluation, we had access to log files in two periods: from 24/05/2009–21/06/2009 and from 27/09/2009–29/10/2009, i.e., roughly two months.
3.2.4 RKBExplorer
RKBExplorer (footnote 6) [11] is another meta-dataset currently comprising 44 sub-datasets covering various topics and sources within the domain of academic research, as well as a Web application that allows users to access and browse its content in an integrated fashion. Both RDF and HTML documents about the resources in all datasets are available. Apart from serving linked data, the site also features a module that provides co-reference resolution functionality [10]. For our evaluation, we had access to log files in the period from 24/05/2009–21/06/2009, i.e., roughly one month. However, since the log files were partially broken (no referrer IPs were recorded), and because their structure was slightly modified in comparison to the conventional log file format, we were only able to make use of the dataset in some of our experiments.
3.3 A New Breed of Agents
Since we expect usage of linked data to be different from conventional Web usage, we can also expect to find new kinds of agents. In this section we define what we consider to be “semantically aware” agents, which are explicitly targeted at the Web of linked data.
3.3.1 Detecting Semanticity
By classifying an agent as “semantic”, we imply that it is capable of processing structured, semantic data, i.e., RDF. Whether or not an agent has this capability can only be determined indirectly from the log files, based on some heuristics. Making the assumption that any agent which explicitly requests semantic data from a server also knows how to process it, we will classify such agents as “semantic”. In detail, we use the following two heuristics:
• SPARQL requests: if an agent sends a request containing a SPARQL query, we assume that it is capable of handling the query result, i.e., either a set of bindings (in the case of a SELECT query), potentially containing URIs of RDF resources, or an RDF graph (in the case of a CONSTRUCT or DESCRIBE query).
• RDF requests: if an agent directly requests RDF from a server, we assume that it knows how to process data in this format. Directly here means that the agent specified an RDF syntax such as rdf/xml as an acceptable response in the header of its request. Merely requesting the URI of an RDF representation does not suffice to indicate semanticity, as this could simply mean that the agent followed a link to this representation.

Footnote 6: http://www.rkbexplorer.com
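The two heuristics can be written down as a small classifier. The sketch below is a simplified reading of them, not the authors' code: it assumes that the Accept header is available (which, as the text notes, ordinary log files usually do not record), that SPARQL queries arrive in a `query` URL parameter as in the log example, and that the listed media types stand for "an RDF syntax". All function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: media types treated as RDF syntaxes.
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")

def is_sparql_request(request_target: str) -> bool:
    """Heuristic 1: the requested URI carries a SPARQL query."""
    params = parse_qs(urlsplit(request_target).query)
    return "query" in params

def is_rdf_request(accept_header) -> bool:
    """Heuristic 2: the agent explicitly asked for RDF in its Accept header."""
    if not accept_header:
        return False
    accept = accept_header.lower()
    return any(media_type in accept for media_type in RDF_MEDIA_TYPES)

def is_semantic(request_target: str, accept_header=None) -> bool:
    return is_sparql_request(request_target) or is_rdf_request(accept_header)

print(is_semantic("/sparql?query=SELECT+%2A+WHERE+%7B%3Fs+%3Fp+%3Fo%7D"))
print(is_semantic("/conference/www/2009", "application/rdf+xml"))
# Requesting the RDF document URI alone is not enough:
print(is_semantic("/conference/www/2009/rdf"))
```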
plain resource URI: http://data.semanticweb.org/conference/www/2009
RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
HTML document URI: http://data.semanticweb.org/conference/www/2009/html
Figure 2: Plain resource, RDF and HTML representations
Detecting SPARQL requests is straightforward, since the requested URI will contain the actual SPARQL query. However, log files of Web servers do not normally record the header for each request (footnote 7), which makes it less straightforward to apply the second heuristic. Nevertheless, there is an indirect way to apply it in some cases, based on the

Footnote 7: Web servers can be configured to also log information such as request headers. In fact, this has been done by the administrators of RKBExplorer, which makes it easy to detect semantic agents in this site’s log files.
DBTune: Plain 45%, HTML 39.9%, RDF 14.9%, Semantic 4.2%
DBpedia: Plain 47.7%, HTML 46.5%, RDF 5.8%, Semantic 2.8%
Dog Food: Plain 41.0%, HTML 51.1%, RDF 7.8%, Semantic 2.5%
Agents: Ordinary Traffic
[Chart: hits per agent for SW Dog Food (http://data.semanticweb.org), 21/07/2008 - 20/06/2009; x-axis: agents, y-axis: hits. Labelled agents: Bot (497833), Yahoo! Slurp (159238 & 133766), msnbot (118928), SindiceFetcher (19211), multicrawler (12325), rdfbot/1.0 (7342), ARC Reader (6808)]
ordinary traffic: the usual suspects
Agents: How “Semantic” are they?
[Chart: semantic hits / total hits per agent, for agents with >100 semantic hits; y-axis: 0 to 1. Agents shown: attributor/1.13.2, triplr, sindicebot, rdflib-2.4.2, Ripple, OL_Virtuoso_RDF_crawler, Morph_Converter_Service, Falconsbot, SpeedySlug_SW_Crawler, yacybot, hclsreport-crawler, MJ12bot, PycURL, heritrix/1.14.3, SindiceFetcher, heritrix/pom.version, heritrix/2.0.2, multicrawler, SindiceBot, ia_archiver, Zitgist-APlusPlus-Agent, rdflib-2.4.1, Mp3Bot, curl, Zend_Http_Client, Speedy_Spider, nxcrawler, marbles-Java, rdflib-2.4.0, (unknown), ARC_Reader, MLBot, Mozilla, Jakarta_HttpClient, Wget, libwww-perl, MSIE, Firefox, Python-urllib, sindice_ontology_fetcher, Googlebot]
semantic traffic: new kinds of agents
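The quantity plotted on this slide, semantic hits divided by total hits per agent, can be computed as follows. This is a sketch under stated assumptions: the input is taken to be a sequence of (agent, is_semantic) pairs produced by some earlier classification step, the function name `semanticity` is made up, and the example data is invented purely to exercise the code.

```python
from collections import Counter

def semanticity(hits, min_semantic_hits=100):
    """Map each agent to semantic hits / total hits.

    Only agents with more than `min_semantic_hits` semantic hits are
    reported, mirroring the ">100 semantic hits" cut-off on the slide.
    """
    total = Counter()
    semantic = Counter()
    for agent, is_semantic in hits:
        total[agent] += 1
        if is_semantic:
            semantic[agent] += 1
    return {
        agent: semantic[agent] / total[agent]
        for agent in total
        if semantic[agent] > min_semantic_hits
    }

# Invented example data, not taken from the logs.
example = (
    [("rdflib-2.4.0", True)] * 150 + [("rdflib-2.4.0", False)] * 50
    + [("Googlebot", True)] * 101 + [("Googlebot", False)] * 899
    + [("curl", True)] * 5
)
ratios = semanticity(example)
print(ratios)
```

A ratio near 1 suggests an agent that almost exclusively asks for semantic data, while a large crawler can pass the cut-off and still have a ratio close to 0.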
[Chart: Dog Food Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; x-axis: 2008-07-01 to 2010-05-01; y-axis: 0 to 6000 hits]
Is Demand for LOD increasing?
no increase for semantic requests
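The charts are labelled with a smoothing factor of 0.05, but the slides do not say which smoothing method was used. Purely as an assumption, the sketch below shows simple exponential smoothing, where each smoothed value is `alpha` times the current observation plus `1 - alpha` times the previous smoothed value. The function name `exponential_smoothing` and the sample series are hypothetical.

```python
def exponential_smoothing(values, alpha=0.05):
    """Simple exponential smoothing; the first value seeds the series.

    Assumption: "smoothing factor" on the slides is read as alpha here.
    """
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Invented daily hit counts with a one-day spike.
daily_hits = [0, 100, 0, 0]
print(exponential_smoothing(daily_hits))
```

With a factor as small as 0.05, a single-day spike is strongly damped, so a visible rise in a smoothed curve points to a sustained change rather than one busy day.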
[Chart: DBpedia Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 300000 hits]
Is Demand for LOD increasing? (ctd.)
no increase for semantic requests
[Chart: Demand for Events (smoothing factor 0.05); series: iswc2008, www2009, eswc2009, iswc2009; x-axis: 2008-07-01 to 2010-05-01; y-axis: 0 to 700]
Do Real-world Events have an Impact on LOD Usage?
possible impact
[Chart: Irish Lisbon Treaty Referendum (smoothing factor 0.05); x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 9. Series:
http://dbpedia.org/resource/Republic_of_Ireland
http://dbpedia.org/resource/European_Union
http://dbpedia.org/resource/Treaty_of_Lisbon]
Do Real-world Events have an Impact on LOD Usage?
possible impact
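To look for an effect of a real-world event, one can count the requests per day for the resources related to that event and compare the counts around the event date. The sketch below is a minimal illustration with invented records; the function name `daily_hits` and the input shape (a date plus a request path per hit) are assumptions, and only the resource paths are taken from the slide.

```python
from collections import Counter
from datetime import date

def daily_hits(records, tracked_paths):
    """Count hits per (day, path) for the tracked resource paths only."""
    counts = Counter()
    for day, path in records:
        if path in tracked_paths:
            counts[(day, path)] += 1
    return counts

TRACKED = {
    "/resource/Treaty_of_Lisbon",
    "/resource/European_Union",
    "/resource/Republic_of_Ireland",
}

# Invented records for illustration only.
records = [
    (date(2009, 10, 1), "/resource/Treaty_of_Lisbon"),
    (date(2009, 10, 2), "/resource/Treaty_of_Lisbon"),
    (date(2009, 10, 2), "/resource/Treaty_of_Lisbon"),
    (date(2009, 10, 2), "/resource/Berlin"),
]
counts = daily_hits(records, TRACKED)
print(counts[(date(2009, 10, 2), "/resource/Treaty_of_Lisbon")])
```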
[Chart: Michael Jackson Memorial Service (smoothing factor 0.05); x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 4.5. Series:
http://dbpedia.org/resource/Staples_Center
http://dbpedia.org/resource/Michael_Jackson_memorial_service
http://dbpedia.org/resource/Michael_Jackson]
Do Real-world Events have an Impact on LOD Usage?
possible impact
Conclusion (of sorts)
• Generic approach for analysing usage of LOD sites (but see below), based on server log files
• Metric for semanticity of agents
• Did not notice a rising demand in LOD
• However: real-world events do seem to have an effect on LOD usage
• Restrictions:
– does not work well with embedded metadata (e.g., RDFa-based sites)
– does not take into account usage through meta sites (indexes, search engines, ...)