LENA: Browsing Linked Open Data Along Knowledge-Aspects

LENATR: Browsing Linked Open Data Along Knowledge-Aspects

Thomas Franz, Jorg Koch, Renata Dividino, Steffen StaabISWeb - Information Systems and Semantic Web

University of Koblenz-Landau{franz,jmkoch,dividino,staab}@uni-koblenz.de

Abstract

Browsing linked open data (LOD) is a promising, yet, oftenunsatisfactory experience today. User-support for the identi-fication of relevant information within the fast-growing cloudof LOD is limited. This paper presents LENATR, a browserfor LOD that highlights relevant information with respect todifferent knowledge aspects hidden in linked data. Its inter-pretation of faceted navigation facilitates the sense-makingand browsing of LOD, solving many of the shortcomings ex-perienced in LOD browsing today.

IntroductionOn the Web of linked open data (LOD) (Bizer, Heath, andBerners-Lee 2009), information is represented uniformly bythe resource description framework (RDF) (Antoniou andHarmelen 2008) and interlinked by semantic relations. In-stead of application-centric task support, data-centric tasksupport can be implemented upon this infrastructure, i.e. sin-gle browsing tools can support the exploration of informa-tion originally provided by a variety of different informationsystems. Several user studies have revealed that this strategyof manual search by browsing through available informa-tion is the preferred strategy of humans (Teevan et al. 2004;Blanc-Brude and Scapin 2007; Boardman and Sasse 2004;Ravasio, Schar, and Krueger 2004).

Consequently, the idea of browsing LOD has been takenup in several research works and resulted in the develop-ment of several browser applications for LOD such as theresearch prototypes Tabulator (Berners-Lee et al. 2007),LENA (Koch, Franz, and Staab 2008), and Marbles1.

Exploring interlinked data with these browsers, however,is difficult. None of them supports users in deciding whichpaths to follow during their exploration of the available data.Users are required to investigate displayed information man-ually, virtually scanning presented information from top tobottom, in order to spot links that are most relevant, mean-ingful, and interesting. While technologies like the Fresneldisplay vocabulary (Pietriga et al. 2006) have been devel-oped to encode statically for a specific information domainand a known vocabulary, which properties and links are rel-evant, a generic approach that can be applied dynamically

Copyright c© 2010, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

1http://marbles.sourceforge.net/

is missing today. However, on the web of linked open data,it cannot be assumed to know the vocabularies utilized todescribe existing data in advance. The problem of selectingrelevant links becomes even harder as users cannot easily in-vestigate the information available behind a link, e.g. inves-tigate the information type, its quantity, and its relevance.

In this paper, we propose LENATR, a linked data browserthat analyses LOD dynamically in order to provide a moreprofound browsing experience for arbitrary data available2.LENATR features facets that describe the data in the vicin-ity of a currently browsed resource. Each facet correspondsto a particular knowledge-aspect and lists relevant resourcesfor this aspect. Resources shown within facets are in thevicinity of the currently viewed resource, i.e. not necessar-ily connected by a single link-hop, so that users benefit fromthe analysis of data that is ahead of their next browsing steps.The analysis employed by LENATR is TripleRank (Franz etal. 2009), a method for ranking RDF data.

In this paper, we illustrate LOD browsing by a scenariofollowed by a discussion of its shortcomings. We thenpresent the LOD browser LENATR and show how it solvesthe shortcomings identified before. We continue with anoverview of the architecture of LENATR and the integra-tion of the TripleRank method, before we give a detaileddescription of TripleRank. We summarize the related workdone in the field of LOD browsers and finish the paper witha conclusion and outlook of ongoing and future work.

Scenario

In the web of linked data, resources are retrieved by follow-ing data level links into various other data sources. Existinglinked data browsers such as Tabulator, Lena, Marbles, Fen-fire3, razorbase4, and VisiNav5 enable users to visualize theavailable information and navigate along its in/outcominglinks.

In our scenario, the user Claudia is browsing for informa-tion about the music band The Beatles. Much informationcan be found about this band on the web of LOD, e.g. the re-

2http://west.uni-koblenz.de/Research/systeme/lena

3http://fenfire.org/4http://www.razorbase.com/5http://visinav.deri.org/

46

Figure 1: Browsing DBPedia: Which properties and which information is relevant here? The figure shows screenshots ofthe Marbles browser on the left, the Zitgist browser in the center, and the Openlink browser on the right, rendering a lot ofinformation corresponding to The Beatles resource.

source http://dbpedia.org/resource/The_Beatlesis linked to music labels that have published albums of theband, it lists people whom they have worked with, countriesthey have visited, the birthday of each of the band members,birth town and so on.

Figure 1 shows how data about the music band is pre-sented by different LOD browsers. The resource title isdisplayed at the top associated with corresponding propertynames and property values. A property value again describeseither a dereferenceable resource or a literal. The user canselect these property value resources in order to visualizethe associated data. For instance, if Claudia would like toget more information about the current members of the bandthe Beatles, she looks up the property currentMembers andfollows one of the links associated to its resource propertyvalues.

Shortcomings

Browsing data is about adopting an exploratory search ap-proach over a dataspace. Usually, the user is trying to findsome information which could not be formulated as a pre-cise query or be completely retrieved by a search engine. Bybrowsing available information, the user learns about and in-vestigates the dataspace by navigating along in/outcominglinks. Exploratory search in LOD, however, can be very

complex for users. The magnitude of information availabletoday, i.e. at the advent of LOD, makes the identification andquantification of relevant links difficult.

Furthermore, LOD browsing is realized by following datalinks corresponding to a one-step-at-a-time approach. Theuser is able to select one available resource at a time; fromthere he goes on to the next navigation step. This approach,however, limits exploring the sophisticated interrelations ofthe web of LOD.

At last, even though the linked data principles are beingadopted by an increasing number of data providers over thelast years, the portion of links across data sets is often low,i.e. the available information is poor linked or inadequate.In such cases, the user is not able to explore a qualitativenumber of links and consequently he does not learn muchmore about the information he was seeking by browsing thanhe would when using other retrieval methods.

Back to our scenario, Claudia is seeking for informationabout the band The Beatles because she would like to buyone of the band’s albums. For Claudia it is important thatthe album has many songs that she has already heard. SinceClaudia usually cannot remember a song’s name, she learnsabout the band and albums by reading their discography, byknowing its members, music style, and so on.

Due to the huge amount of information (e.g. the resource

47

The Beatles is associated with more than 1000 others re-sources) Claudia cannot identify which of the links couldhelp her to be more effective for getting the overview of theband.

Additionally, Claudia is irritated by the fact that thereexist ambiguous and redundant links in the data. For in-stance, three properties end with label in the description ofthe band. As Claudia finds out, the properties ending withlabel have very different meanings. Some are used to labelthe resource The Beatles with a string, others link to musiclabels, i.e. publishers of songs by The Beatles. Claudia hasspotted several of such similar ending properties, e.g. db-pedia.org/ontology/writer and dbpedia.org/property/writer.However, unlike for properties ending with label, the linksending with writer seem to have identical meanings as theyall seem to link to people who have written songs of TheBeatles.

Furthermore, Claudia experiences links leading to state-ments with few further outcoming links, such as http://www.w3.org/2002/07/owl#Class, or even to dead ends,thus stopping further navigation and requiring her to go backand forth several times. The following list enumerates theshortcomings experienced by Claudia:

1. The magnitude of information makes the identificationand quantification of relevant links difficult.

2. The one-step browsing approach implemented by currentbrowsers limits the exploration of the sophisticated inter-relations in the web of linked data.

3. Redundant and ambiguous links reduce the effectivenessand efficiency of browsing.

4. Dead-ends in LOD interrupt the browsing process as theyrequire users to browse backwards manually in order tofind further relevant information.

These shortcomings of browsing LOD are related to the factthat none of the current LOD browsers provide sophisticatedfunctionalities to support and guide users to explore inter-relations in the web of LOD, such as, to present a list ofrelevant resources for a resource, which are not necessarilyconnected by a single link-hop, or to present the relevantresources organized by different aspects.

LENATR

LENATR is a linked data browser that features faceted navi-gation (cf. Fig. 2) to overcome the shortcomings listed in thesection above. A facet is a named group of resources whereresources of a group have a similar – not necessarily exactlythe same – relation to the resource currently displayed byLENATR. For instance, Figure 2 shows the facets label,associatedActs, currentMembers and so on for the resourceThe Beatles.

By moving the mouse over the name of a facet, the pred-icates contributing to a facet are shown as tooltip. Figure 2shows such a tooltip for the facet associatedActs. It lists thecomplete URIs of the contributing properties and the scoreused for ranking them. In this example the top-three pred-icates are associatedActs, associatedBand, associatedMusi-calArtist, all with a score of more than 0.4. These descrip-

tions of facets enable users to investigate on the knowledge-aspect expressed by a facet.

The facets themselves are ordered according to their rel-evance with respect to the resource currently viewed inLENATR. The relevance of a facet is expressed by theweight that is displayed below the name of each facet, e.g.Figure 2 highlights the weight of the facet currentMembers.

Each facet lists several resources. The list is ordered sothat the most relevant resources are at the top of the list. Thisordering is based on scores that can also be investigated byusers. The tooltip for resources of a facet shows the com-plete URI of the resource and its score. Figure 2 shows sucha tooltip for the resource Paul McCartney that is the top re-source for the facet currentMembers.

The facets provided by LENATR build upon rankings forresources and properties in the vicinity of the resource cur-rently displayed by LENATR. These rankings are computedat runtime whenever a new resource is browsed. As rank-ing method, we integrated the TripleRank approach. Its out-puts are exploited by the user interface features presented.To highlight how these features of LENATR help to over-come the shortcomings in LOD browsing identified above,we confront LENATR with each shortcoming in the follow-ing.

Shortcoming 1: The magnitude of information makesthe identification and quantification of relevant links dif-ficult. As illustrated in Figure 1, LOD browsing can easilylead to information overload. LENATR implements facetednavigation as a counter measure to information overload.Facets assist users in browsing LOD by guiding them alongmost promising knowledge aspects, giving them a meansto explore the correlations between resources and availableLOD. They enable to identify relevant resources and proper-ties more easily than by scanning manually through all of theavailable information and thus mitigate the negative effect offeeling overloaded.

Shortcoming 2: The one-step approach limits exploringthe sophisticated interrelations of the web of data. Bycombining LOD browsing and faceted navigation, a useris not just able to browse LOD along RDF links based onthe one-step navigation approach but also to browse alongfacets. This combination enables users to go beyond explicitlinks by exploring along facets. As explained before, a facetcorresponds to a set of similar types of links and thus repre-sents an implicit link. Moreover, resources shown within thefacets offered by LENATR are not necessarily connected bya single link-hop. They are suggested based on the TripleR-ank analysis that considers resources in the vicinity of the re-source currently viewed. Accordingly, selecting a resourceof a facet in LENATR can correspond to multiple browsingsteps.

Shortcoming 3: Redundant and ambiguous links. Thescenario has revealed the existence of ambiguous and redun-dant links as a cause for dissatisfaction and irritation. Facetsprovided by LENATR summarize multiple types of links,e.g. as shown by Figure 2. They also provide a simple meansto distinguish among groups of links. Thus, they effectively

48

��

��

�� !��

��!��

��!��

��!��

Figure 2: Screenshot of LENATR. Relevances of facets, facet properties, and facet resources are marked by yellow rectangles.

hide redundancy and ambiguity of the underlying data andthus mitigate irritations caused by it.

Shortcoming 4: Dead-ends in LOD interrupt the brows-ing process. The web of LOD connects well defineddatasets to a huge cloud of interlinked data. However, manyresources have only few outgoing links and can be consid-ered as dead ends that interrupt the browsing process. Byfaceted LOD browsing as implemented by LENATR, usersare always enabled to navigate to resources that are not di-rectly connected by a single-link-hop. Consequently, thebrowsing process does not end because no links can be fol-lowed anymore.

Architecture

LENATR runs as Java servlet application accessible fromcommon web browsers. It utilizes an implementation of theTripleRank method (Franz et al. 2009) in order to supportLOD faceted browsing. In the following, we describe thework-process of LENATR and review its architecture (seeFig. 3).

LENATR implements the Model-View-Controller (MVC)pattern. The controller manages incoming requests from thebrowser, queries data from the model, and serves the viewwith data for the final rendering output. The model allowsfor requesting LOD, e.g. by means of SPARQL queries. It

Figure 3: Architecture of LENATR.

serves a triple cache of local repositories to reduce networkoverhead. The view consists of XHTML templates used togenerate the final rendering output with help of the providedcontroller information. Since LENATR relies on the Fres-nel display vocabulary for rendering requested datasets, thecontroller utilizes the JFresnel Java library in order to applyFresnel (Pietriga et al. 2006) definitions.

49

Upon incoming HTTP requests from web browsers, therequested RDF resource is rendered according to the stan-dard LENA rendering process (Koch, Franz, and Staab2008). In parallel, the TripleRank components are invokedand requested by means of an AJAX call. Thus, the user canimmediately inspect the resource, while assisting facets areprovided as soon as the TripleRank processing is complete.The three TripleRank components (crawler, pre-processor,analyser) are processed consecutively, whereas the resultingfacets are injected into the user interface by means of AJAX.

TripleRank

The TripleRank implementation is separated into three corecomponents; namely, the crawler, pre-processor and anal-ysis component (cf. Fig. 3). The crawler component per-forms a breadth-first search. It emanates from the requestedresource and recursively aggregates related LOD up to a cer-tain depth, link-, and statement limit. Technically, RDF datais requested from LOD data providers on the semantic webvia SPARQL endpoints. The output of the crawling-processis a RDF graph (e.g. as shown by Figure 4(a)), which istransformed into a tensor representation (cf. Figure 4(b)),that is input for the pre-processor.

The pre-processor component is responsible for i) reduc-ing the amount of the crawled data and for ii) increasingits quality. E.g. properties linking a majority of resourcesare discarded, since they only marginally contribute to theoverall information. Secondly, statements are strengthenedaccording to their property frequency, whereas less frequentproperties are weighted stronger, since their significance ap-pears to be more eminent (for details see (Franz et al. 2009)).

The TripleRank analysis computes a list of weightedfacets for a given RDF graph. Each facet is described bymultiple weighted RDF properties that contribute to it. Asillustrated by Figure 2, the user interface of LENATR en-ables to view the contributing properties and their rank-ing scores. Additionally, TripleRank also returns for eachfacet, a ranked list of RDF resources that are displayedby LENATR to ease the browsing experience as discussedabove. Table 4(c) illustrates the results produced by TripleR-ank for the RDF graph shown in Figure 4(a).

Tensor Model

By means of the crawler and pre-processing components, anRDF graph is created that is transformed to a tensor rep-resentation. Figure 4(b) illustrates the tensor representa-tion resulting from the transformation of the sample graphshown in Fig. 4(a). The first adjacency matrix T (:, :, 1)6

models linkage by the property loves. An entry > 0 cor-responds to the existence of a link by this property, emptyentries are considered as zeroes. The second matrix T(:, :, 2) models links by the property hates. For instance,the graph expresses that Alex hates Bob, which correspondsto T (1, 2, 2) = 1.

6Throughout this paper we use the common Matlab-notation foraddressing entries in tensors and vectors

PARAFAC Analysis

As analysis step, a decomposition of the tensor is performedto detect hidden dependencies in the data. The TripleRankimplementation applies the PARAFAC method for this pur-pose. It is regarded as a higher-order equivalent to a matrixdecomposition like the singular value decomposition (SVD).The PARAFAC tensor decomposition has the advantage ofrobustness and computational efficiency. These advantagesare due to its uniqueness property with respect to the scalingand permutation of produced component matrices (Harsh-man and Lundy 1994). The analysis step transforms the in-put tensor into a Kruskal tensor that corresponds to authorityand hub scores (Kleinberg 1999) for particular latent aspects(topics) of the analyzed data.

Formally, a tensor T ∈ Rk×l×m is decomposed by n-Rank-PARAFAC into component matrices U1 ∈ Rk×n,U2 ∈ Rl×n, U3 ∈ Rm×n and n principal factors (pf )λi in descending order. Via these, T can be written as aKruskal tensor by T≈ ∑n

k=1 λk · Uk1 ◦ Uk

2 ◦ Uk3 where λk

denotes the kth principal factor, Uki the kth column of Ui

and ◦ the outer product (Kolda and Bader 2009). Ui yieldsthe ratio of the ith dimension to the principal factors. So,similar to SVD, PARAFAC derives hidden dependencies re-lated to the principal factors and expresses the dimensionsof the tensor by relations to the principal factors. Depend-ing on the number of principal factors, the PARAFAC de-composition can be loss-free. For a third-mode-tensor T∈ Rk×l×m a weak upper bound for this rank is known:rank(T) ≤ min{kl, lm, km} (Kolda and Bader 2009).While there is no proper way for estimating the optimalnumber of principal factors for an appropriate decomposi-tion, several indicators like residue analysis and core con-sistency analysis exist for its estimation (Andersson and Bro2000).

The output of a PARAFAC decomposition can be inter-preted as authority and hub scores plus additional scores forthe relevance of link types, i.e. RDF properties. Table 4(c)shows the scores of RDF properties and authority scores forresources. At the time of this writing, hub scores are notutilized by LENATR.

Related Work

A few research works also deal with the challenge of fa-cilitating users with more effective, efficient, and satisfyingbrowsing support for linked open data. The Fresnel frame-work (Pietriga et al. 2006) consists of a vocabulary to ex-press how certain information shall be presented, what theordering for displayed information is and which informationto leave out. Interpreters are provided that select, filter, andpresent data according to such specifications. While this ap-proach requires to encode statically how information is dis-played, with LENATR we persue an approach to identify thebest links and resources dynamically.

Unlike LENATR, Parallax (Huynh and Karger 2009) is abrowser for RDF data that enables users to formulate joinconditions and type restrictions in an intuitive manner thatdoes not require users to be acquainted with a query lan-guage. The Tabulator browser (Berners-Lee et al. 2007)

50

(a) Input: RDF Graph (b) Transformation: TensorRepresentation

Score Property Score ResourceFacet 1

1.00 hates 0.71 Bob- - 0.70 Chris

Facet 21.00 loves 0.70 Don

0.001 hates 0.70 Alex- - 0.10 Elly

(c) Analysis: Results from PARAFAC

Figure 4: Illustration of the TripleRank Process

also implements support for the formulation of queries onLOD. Queries are triggered on available LOD by selectingpredicates and thereby creating a query pattern. The resultsof those queries can be visualized with respect to differentinformation aspects, e.g. a map view that indicates spatialrelations. Sig.ma (Tummarello et al. 2009) is a search en-gine for the Semantic Web that offers versatile utilizationsof search results including their selection and filtering (e.g.by data source), export to several formats, and different dis-play options. Unlike LENATR, sig.ma does not provide aranking for the data retrieved. The Marbles browser also im-plements a means to ease the exploration and sense-makingof LOD by indicating the provenance of resources. Themetaphor of marbles is employed to distinguish between dif-ferent sources of information. The Openlink Dataexplorer7

supports the filtering and ordering of information. Further-more, various domain-specific views are provided, e.g. fordata described by the FOAF vocabulary.

Conclusion and Outlook

This paper has presented a faceted navigation approach thatintegrates a novel ranking method with linked data prin-ciples. Its implementation by LENATR enables users tobrowse along hidden and diverse knowledge aspects and hasbeen addressed in several ways: First, a user interface forfaceted navigation has been presented. Second, the exploita-tion of a novel method for relevance ranking of linked opendata, namely TripleRank has been illustrated. Third, the in-tegration of this method into an architecture that facilitatesthe user interface has been described.

We are currently investigating on several optimizations ofthe components of LENATR, focussing particularly on thecomponents for crawling and caching the LOD used for theanalysis by TripleRank. We also aim for an extensive eval-uation that considers implicit feedback gathered by users ofLENATR.Acknowledgements This research was supported by theEuropean Commission under contract IST-FP6-026978, X-Media. The expressed content is the view of the authors butnot necessarily the view of the X-Media project.

7http://demo.openlinksw.com/ode/

ReferencesAndersson, C. A., and Bro, R. 2000. The n-way toolbox for mat-lab. Chemometrics and Intelligent Laboratory Systems 52(1):1 –4.Antoniou, G., and Harmelen, F. v. 2008. A Semantic Web Primer,2nd Edition (Cooperative Information Systems). The MIT Press.Berners-Lee, T.; Hollenbach, J.; Lu, K.; Presbrey, J.;Prud’hommeaux, E.; and Schraefel, M. 2007. Tabulator redux:Writing into the semantic web. Technical report.Bizer, C.; Heath, T.; and Berners-Lee, T. 2009. Linked data- the story so far. International Journal on Semantic Web andInformation Systems.Blanc-Brude, T., and Scapin, D. L. 2007. What do people recallabout their documents?: implications for desktop search tools. InIntelligent User Interfaces, 102–111.Boardman, R., and Sasse, M. A. 2004. ”stuff goes into the com-puter and doesn’t come out”: a cross-tool study of personal infor-mation management. In CHI, 583–590.Franz, T.; Schultz, A.; Sizov, S.; and Staab, S. 2009. Triplerank:Ranking semantic web data by tensor decomposition. In Interna-tional Semantic Web Conference.Harshman, R. A., and Lundy, M. E. 1994. Parafac: Parallel factoranalysis. Computational Statistics & Data Analysis 18(1):39–72.Huynh, D., and Karger, D. 2009. Parallax and companion: Set-based browsing for the data web. In WWW Conference.Kleinberg, J. M. 1999. Authoritative sources in a hyperlinkedenvironment. J. ACM 46(5):604–632.Koch, J.; Franz, T.; and Staab, S. 2008. Lena - browsing rdf datamore complex than foaf. In International Semantic Web Confer-ence, Demo Session.Kolda, T. G., and Bader, B. W. 2009. Tensor decompositions andapplications. SIAM Review 51(3).Pietriga, E.; Bizer, C.; Karger, D.; and Lee, R. 2006. Fresnel: Abrowser-independent presentation vocabulary for rdf. In Interna-tional Semantic Web Conference, 158–171.Ravasio, P.; Schar, S. G.; and Krueger, H. 2004. In pur-suit of desktop evolution: User problems and practices withmodern desktop systems. ACM Trans. Comput.-Hum. Interact.11(2):156–180.Teevan, J.; Alvarado, C.; Ackerman, M. S.; and Karger, D. R.2004. The perfect search engine is not enough: a study of orien-teering behavior in directed search. In CHI, 415–422.Tummarello, G.; Cyganiak, R.; Catasta, M.; Danielczyk, S.; andDecker, S. 2009. Sig.ma. In Semantic Web Challenge at ISWC.

51

Documents

LENA: Browsing Linked Open Data Along Knowledge-Aspects