4
LODVader: An Interface to LOD Visualization, Analytics and DiscovERy in Real-time Ciro Baron Neto Kay Müller Martin Brümmer Dimitris Kontokostas Sebastian Hellmann Leipzig University, AKSW, http://aksw.org Leipzig (Germany) (cbaron|kay.mueller|bruemmer|kontokostas|hellmann)@informatik.uni-leipzig.de ABSTRACT The Linked Open Data (LOD) cloud is in danger of becom- ing a black box. Simple questions such as ”What kind of datasets are in the LOD cloud?”, ”In what way(s) are these datasets connected?” – albeit frequently asked – are at the moment still difficult to answer due to the lack of proper tooling support. The infrequent update of the static LOD cloud diagram adds to the current dilemma, since there is neither reliable nor timely-updated information to perform an interactive search, analysis or in particular visualization in order to gain insight into the current state of Linked Open Data. In this paper, we propose a new hybrid system which combines LOD Visualisation, Analytics and Discov- ERy (LODVader) to aid in answering the above questions. LODVader is equipped with (1) a multi-layer LOD cloud vi- sualization component comprising datasets, subsets and vo- cabularies, (2) dataset analysis components that extend the state of the art with new similarity measures and efficient link extracting techniques and (3) a fast search index that is an entry point for dataset discovery. At its core, LODVader employs a timely-updated index using a complex cluster of Bloom filters as a fast search index with low memory foot- print. This BF cluster is able to efficiently perform analysis on link and dataset similarities based on stored predicate and object information, which – once inverted – can be em- ployed to discover invalid links by displaying the Dark LOD Cloud. By combining all these features, we allow for an up- to-date, multi-dimensional LOD cloud analysis, which – to the best of our knowledge – was not possible before. Keywords Linked Open Data, Linksets, Bloom filter, RDF diagram Copyright is held by the author/owner(s). WWW’16 Companion, April 11–15, 2016, Montréal, Québec, Canada. ACM 978-1-4503-4144-8/16/04. http://dx.doi.org/10.1145/2872518.2890545 . 1. INTRODUCTION The amount of datasets published on the Web – partic- ularly on the Linked Open Data (LOD) cloud – has grown significantly in the last couple of years. 1 This increase in available linked data creates the following interconnected research challenges: Exploration - find, download and index data using scalable algorithms and data structures for efficient re- trieval. Analysis - develop metrics to measure similarity, qual- ity and search rankings of datasets in order to structure the web of data. Visualisation - allow end users to discover, access and evaluate datasets. Furthermore, a crucial problem which encases our research questions is that searching for datasets in terms of used vocabularies, tuples, or available resources requires solu- tions that most part of the time, imply downloading large amounts of datasets. Due to the growing number of avail- able datasets this use-case will become more common for many linked data applications. 2. LODVADER ARCHITECTURE In order to address the problems pointed out, we created a framework that is light and extensible. The framework creates a central index for Linked Data datasets, allowing to perform basic queries, link discovery, analytical data re- trieval and visualization. The framework is called LOD- Vader, and holds the implementation of the proposed fea- tures. Figure 1 describes the framework architecture and consists of the following components: Manager, RDF Stream- ing Module, Plugins module, Index Engine, MongoDB 2 and the REST API module. The proposed approach was imple- mented in Java and uses the Apache Jena[3] library to parse and write RDF data. MongoDB is used as a back end stor- age service for storing non-RDF data such as analytics data and distribution namespaces. Bloom filters (BF)[1] contain the indexes of the datasets and are stored using GridFS 3 . Internally, the API uses Java Spring MVC 4 , which handle with the HTTP requests, and case there are some invalid request (e.g. missing parameters), the Spring MVC will re- turn an customized error message. To communicate with the 1 http://lod-cloud.net/#history 2 https://www.mongodb.org/ 3 https://docs.mongodb.org/v3.0/core/gridfs/ 4 https://spring.io/guides/gs/rest-service/ 163

LODVader: An Interface to LOD Visualization, Analytics and

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

LODVader: An Interface to LOD Visualization, Analyticsand DiscovERy in Real-time

Ciro Baron Neto Kay Müller Martin BrümmerDimitris Kontokostas Sebastian Hellmann

Leipzig University, AKSW, http://aksw.orgLeipzig (Germany)

(cbaron|kay.mueller|bruemmer|kontokostas|hellmann)@informatik.uni-leipzig.de

ABSTRACTThe Linked Open Data (LOD) cloud is in danger of becom-ing a black box. Simple questions such as ”What kind ofdatasets are in the LOD cloud?”, ”In what way(s) are thesedatasets connected?” – albeit frequently asked – are at themoment still difficult to answer due to the lack of propertooling support. The infrequent update of the static LODcloud diagram adds to the current dilemma, since there isneither reliable nor timely-updated information to performan interactive search, analysis or in particular visualizationin order to gain insight into the current state of LinkedOpen Data. In this paper, we propose a new hybrid systemwhich combines LOD Visualisation, Analytics and Discov-ERy (LODVader) to aid in answering the above questions.LODVader is equipped with (1) a multi-layer LOD cloud vi-sualization component comprising datasets, subsets and vo-cabularies, (2) dataset analysis components that extend thestate of the art with new similarity measures and efficientlink extracting techniques and (3) a fast search index that isan entry point for dataset discovery. At its core, LODVaderemploys a timely-updated index using a complex cluster ofBloom filters as a fast search index with low memory foot-print. This BF cluster is able to efficiently perform analysison link and dataset similarities based on stored predicateand object information, which – once inverted – can be em-ployed to discover invalid links by displaying the Dark LODCloud. By combining all these features, we allow for an up-to-date, multi-dimensional LOD cloud analysis, which – tothe best of our knowledge – was not possible before.

KeywordsLinked Open Data, Linksets, Bloom filter, RDF diagram

Copyright is held by the author/owner(s).WWW’16 Companion, April 11–15, 2016, Montréal, Québec, Canada.ACM 978-1-4503-4144-8/16/04.http://dx.doi.org/10.1145/2872518.2890545 .

1. INTRODUCTIONThe amount of datasets published on the Web – partic-

ularly on the Linked Open Data (LOD) cloud – has grownsignificantly in the last couple of years.1 This increase inavailable linked data creates the following interconnectedresearch challenges:

• Exploration - find, download and index data usingscalable algorithms and data structures for efficient re-trieval.

• Analysis - develop metrics to measure similarity, qual-ity and search rankings of datasets in order to structurethe web of data.

• Visualisation - allow end users to discover, access andevaluate datasets.

Furthermore, a crucial problem which encases our researchquestions is that searching for datasets in terms of usedvocabularies, tuples, or available resources requires solu-tions that most part of the time, imply downloading largeamounts of datasets. Due to the growing number of avail-able datasets this use-case will become more common formany linked data applications.

2. LODVADER ARCHITECTUREIn order to address the problems pointed out, we created

a framework that is light and extensible. The frameworkcreates a central index for Linked Data datasets, allowingto perform basic queries, link discovery, analytical data re-trieval and visualization. The framework is called LOD-Vader, and holds the implementation of the proposed fea-tures. Figure 1 describes the framework architecture andconsists of the following components: Manager, RDF Stream-ing Module, Plugins module, Index Engine, MongoDB2 andthe REST API module. The proposed approach was imple-mented in Java and uses the Apache Jena[3] library to parseand write RDF data. MongoDB is used as a back end stor-age service for storing non-RDF data such as analytics dataand distribution namespaces. Bloom filters (BF)[1] containthe indexes of the datasets and are stored using GridFS3.Internally, the API uses Java Spring MVC4, which handlewith the HTTP requests, and case there are some invalidrequest (e.g. missing parameters), the Spring MVC will re-turn an customized error message. To communicate with the

1http://lod-cloud.net/#history2https://www.mongodb.org/3https://docs.mongodb.org/v3.0/core/gridfs/4https://spring.io/guides/gs/rest-service/

163

Figure 1: LODVader Architecture

REST API, we created a front end using NodeJS5 which al-lows the visualization of the diagrams using the Data DrivenDocuments6 JavaScript library. One can try out LODVaderfollowing this link http://lodvader.aksw.org. The imple-mentation of the REST API and the front end are opensource and are available on GitHub. 7

The Manager is the central controller of the service andis responsible to serve the API calls and coordinate the pro-cessing components. Once a user submits a dataset for pro-cessing (using a description file), the Manager will fetch forURL of distributions (usually described by dcat:downloadURL

or void:dataDump) and will dispatch the call to the RDFstreaming module responsible for stream and parse the distri-butions. Distributions normally are dump files or SPARQLendpoint. The output of this phase is an array of triples< s, p, o > that is dispatched to the Index engine where theindexes are created. The Manager is also connected to a setof plugins. Plugins are modules with well defined functionsthat allow our framework to be extended for different usecases. As an example, the Analytics back end Plugin is re-sponsible for determine the top N links between two datasetsand the Search Plugin queries Bloom Filters (BF) in orderto find subjects, predicates and objects. The plugins are allconnected to the REST API, making possible an integrationwith a front end or even different frameworks.

A particular plugin that allows users to visualize the linkeddatasets is the Visualization plugin. This plugin providesJSON objects which characterize the current condition ofthe links between the distributions. Hence, using the frontend provides a new visualization of the LOD-Cloud diagram(which can be seen in Figure 2) using multiple layers ofdata, i.e. distributions and subsets are within datasets andare represented by different edges of the graph. In additionto the classical view, where vertexes represents the amountof links between datasets, the LODVader framework allowsdifferent dataset visualizations which can help the user toperform analysis operations on the imported LOD cloud.For example it is possible to filter datasets based on theirsimilarities (using Jaccard coefficient based on rdf:type,owl:Classes and general predicates), datasets and ontolo-gies. Furthermore, subsets and distributions of a dataset areshown and grouped as clusters of bubbles.

5https://nodejs.org/en/6http://d3js.org/7https://github.com/AKSW/lodvader

Figure 2: LODVader LOD cloud

Figure 3: LODVader Dark LOD cloud

The main difference between LODVader and lod-cloud8

is that instead of assuming that every distribution is in thesame pay-level domain or sub-domain, LODVader preciselycompare every object and subject for each source datasetand target dataset.

Moreover, we propose the novel concept of the Dark LODdiagram which can be seen in Figure 3. The Dark LODdiagram, visualizes links between objects of a source distri-bution which are not described as subjects in a target dis-tribution.

Besides to be able to read and write RDF data (RDF isgenerated on the fly by Apache JENA), LODVader doesn’tuse a triplestore. Instead, LODVader uses a MongoDB asa database to store and fetch relevant data. MongoDB hasshown fast and scalable enough to be used with BF as acentral index for linked datasets.

3. LODVADER USAGE (DEMO PRESENTA-TION)

The following subsections describe the usage of the fourcurrent features of the LODVader Visualisation plugin andManager. The presentation in the Demo Track will go throughall of them showing the framework performance. For thefirst three examples, we demonstrate how to use the RESTAPI (although all of them can be done using the font end),using the base URL ”http://api.lodvader.aksw.org/”, and inthe fourth example we will demonstrate the usage of theLOD diagram.

3.1 Adding a dataset to the frameworkIn order to visualise a dataset via LODVader, the user

can either create an entry at indexed repositories, such ashttp://datahub.io or http://lov.okfn.org or she should

8http://lod-cloud.net/

164

Figure 4: Add a dataset manually

provide a description file which contains metadata describ-ing the datasets to be streamed as shown in Figure 4. LOD-Vader is capable of consuming RDF descriptions of datasetsaccording to DCAT 9, VoID10 and DataID [2] fetching forURL described by dcat:downloadURL or void:dataDump.

The request parameters to the API are two: descrip-tionFileURL which should contain the URL of the descrip-tion file and format witch contains the serialization format(available options are: ttl, nt, rdfxml or jsonld) of thedescription file. The REST API returns an array of JSONobjects which contains a list of fetched distributions. EachJSON object describes the status of the streaming process,error messages and amount of triples read.

3.2 Finding data within a datasetLODVader uses Boom filters to store indexes of subject,

predicate and object. Thus, it’s possible to query whether adataset or a vocabulary contains a particular resource. Theparameters used are:

• searchSubject: Parameter used to find a particularsubject resource in a dataset

• searchProperty: Parameter used to find a particularproperty resource in a dataset

• searchObject: Parameter used to find a particularobject in a dataset. Note that literals are not allowedhere.

• searchVocabulary: Boolean value that defines whetherto filter only vocabularies or datasets. Omitting it willsearch both.

As an example, the URL11 would return an array of JSONobjects of all datasets which contains the DBpedia subjectHawaii.

3.3 Retrieving VoID:linksetWe will also show in the presentation that LODVader also

allows to retrieve RDF data in the void:Linkset format.Essentially two parameters are used: source and targetdataset. Both parameters should have a value if the userwants to compare two specific datasets, otherwise only thesource is mandatory. The described RDF includes the URIof the source dataset, target dataset, number of triples andthe provenance using the using the Prov-O ontology12. TheREST response should be a RDF represented using turtleformat as shown in Figure 5.

9http://www.w3.org/TR/vocab-dcat/10http://www.w3.org/TR/void/11http://api.lodvader.aksw.org/distribution/search?searchSubject=http://dbpedia.org/resource/Hawaii

12http://www.w3.org/TR/prov-o/

Figure 6: Parameters for LOD Cloud diagram generation

3.4 Customized LOD Cloud graphNext, we are going to show the visualization module (in-

cluded in the front end GUI) and capability of browsingand creating customized LOD Cloud. The API allows toretrieve a JSON object where links between datasets arerepresented by edges and datasets are represented by ver-texes. This structure is useful when a customized diagramshould be rendered. The parameters used are:

• dataset: The array of datasets that LODVader shouldcrawl. The returned JSON object will contain the rep-resentation of the datasets consisting of the array plusthe datasets which contains indegree and outdegreelinks.

• linkType: Describes the type of the retrieved links.Should be showLinks, showSimilarities or show-DarkLOD. Case the similarity option is chosen, twoparameters (from and to) are optional and representthe similarity range (from 0 to 1).

Along with demonstrating the usage of the parameters,we will show the creation of a graph similar to the Figure 2,and demonstrate the usage of filters (e.g. how to filter vo-cabularies from the diagram) and the behavior of the graphwhen a dataset contains multiple subsets or distributions.The last feature to be presented in the visualization mod-ule is the Dark LOD Diagram and its features. We willshow and explain how we create a cloud which contains onlydatasets with invalid links to debug the LOD Cloud. Fig-ure 6 shows the above-mentioned options in the LODVaderinterface. Note that we also index vocabularies and treata triple with rdf:type as a link between a dataset and itsschema in LODVader.

4. EVALUATIONIn order to evaluate the performance of the LODVader,

we first compared indexing data using BF with HashMapSearch (HS) and Binary Search Tree (BST) w.r.t. memoryusage for each structure to index distributions. Secondly, wecompared LODVader with OpenLink Virtuoso13 w.r.t. timeto load and index triples, time to make searches, amountof data on the storage. We are aware that LODVader isnot a triplestore, and do not store sufficient data do makea complete SPARQL query. However, making complexes

13http://virtuoso.openlinksw.com/

165

1 <http ://api.lodvader.aksw.org/distribution/compare/rdf/? retrieveDataset&source=http :// dataset_example.org >2 a void:Linkset ;3 void:objectsTarget <http :// dataset_example.org#dataset > ;4 void:subjectsTarget <http :// dbpedia.org/dataid.ttl#article -categories_en > ;5 void:triples "15" ;6 prov:wasGeneratedBy <api.lodvader.aksw.org/distribution/compare/rdf/> .

Listing 1: void:linkset example

2 4 6 8

0

200

400

600

Structure size (millions of subjects)

Mem

ory

(MB

) HS

BST

BF

Figure 7: Memory usage per indexed resource

Parameter Virtuoso LODVader PerformanceLoad time 24:02:36h 06:32:01h 3.67x fasterDisk usage 84,28Gb 4,03Gb 95.2% less pace

Table 1: Triple load time and disk usage of LODVader com-pared to Virtuoso. Virtuoso outperforms LODVader

queries is out of the scope of our approach. All experimentswere made using a Intel(R) Core(TM) i7-5600U @ 2.6GHz,16GB DDR3 and SSD drive. The results (Figure 7) showmain advantages of using BF w.r.t. the memory usage whilevarying the number of resources for each structure. Thedifference from HS and BST to BF is notable. Storing 8million resources HS, and BST use over 0.5 GB of RAMmemory. Considering that a regular dataset can easily havemore than this number of triples, the usage of HS and BSTis unfeasible. It is important to stress that Figure 7 showsthe memory usage for loading only one structure. However,usually a dataset is compared with not only one, but mul-tiple datasets, and memory efficiency is fundamental whenmultiple BF are loaded at the same time. BF fulfills itsfunction using less then 34MB of memory, performing onaverage 12 times better than HS and 10 times better thanBST.

Apart from measuring the precision of BFs, we used Open-Link Virtuoso to compare the performance in other aspects.We are aware that LODVader is not a triplestore, and donot store sufficient data do make a SPARQL query. Parsingcomplexes queries is out of the scope of our approach. More-over, considering the huge gain in the matter of performanceof making simples searches, and storage space, LODVader isappropriate to index and search large amount of RDF data.

To compare the efficiency of LODVader with OpenLinkVirtuoso, we loaded 491 datasets with a total of 1,092,182,333triples. There is a slightly difference between the numberof loaded triples on LODVader and Virtuoso. LODVaderloaded 1,092,182,333 triples and Virtuoso 1,034,808,229. Thisdifference is due the fact that LODVader stores triples re-gardless whether they are repeated or not, dealing with alltriples as unique. On the other hand, triplestores will not

store the same triple twice, considering that it’s a graphstructure. The problem is emphasized since each frameworkdeals differently with erroneous data. For example, a badRDF structure might be accepted by a particular frame-work, but completely ignored by others. Hence, when wedeal with large amount of triples, it’s difficult to have theexact number of resources in different frameworks.

Table 1 shows the results w.r.t for loading and indexingtriples and the total space used in the hard drive. Open-Link Virtuoso loaded triples in 24:02:23, while LODVader06:32:01, making the execution 3.67 times faster. The amountof data stored was 84,28Gb for OpenLink Virtuoso and 4,03Gbfor LODVader.

5. ACKNOWLEDGEMENTSThis paper’s research activities were funded by grants

from the FP7 & H2020 EU projects ALIGNED (GA-644055),LIDER (GA-610782), FREME (GA-644771), Smart DataWeb (GA-01MD15010B) and CAPES foundation (Ministryof Education of Brazil) for the given scholarship (13204/13-0).

6. REFERENCES[1] B. H. Bloom. Space/Time Trade-offs in Hash Coding

with Allowable Errors. Commun. ACM, 13(7):422–426,July 1970.

[2] M. Brummer, C. Baron, I. Ermilov, M. Freudenberg,D. Kontokostas, and S. Hellmann. DataID: TowardsSemantically Rich Metadata for Complex Datasets. InProceedings of the 10th International Conference onSemantic Systems, SEM ’14, pages 84–91. ACM, 2014.

[3] M. Grobe. RDF, Jena, SparQL and the ’SemanticWeb’. In Proceedings of the 37th Annual ACMSIGUCCS Fall Conference, SIGUCCS ’09, pages131–138, New York, NY, USA, 2009. ACM.

166