Collaborative Project
LOD2 – Creating Knowledge out of Interlinked Data

Deliverable 5.1.4
LOD2 GeoBench v2.0 Evaluation

Dissemination Level: Public
Due Date of Deliverable: Month 36, 31/08/2013
Actual Submission Date: Month 36, 31/08/2013
Work Package: WP5 – Linked Data Browsing, Visualization and Authoring Interfaces
Task: T5.1
Type: Report
Approval Status: Approved
Version: 1.0
Number of Pages: 49
Filename: LOD2_D5_1_4_GEO_Benchmark_Evaluation.pdf

Abstract: This report describes the evaluation of the LOD2 Geo Benchmark, developed to ensure that RDF storage engines provide the proper level of functionality and performance to facilitate the needs of Linked Data Browsing, Visualization and Authoring Interfaces.

The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Project co-funded by the European Commission within the Seventh Framework Programme (2007–2013)
Project Number: 257943  Start Date of Project: 01/09/2010  Duration: 48 months



History

Version  Date        Reason                            Revised by
0.2      07/08/2013  Initial Draft (incomplete)        Peter Boncz
0.9      25/08/2013  Complete Draft                    Peter Boncz
1.0      29/08/2013  Complete version after comments   Duc Minh Pham
1.1      30/08/2013  Minor edits, correction of typos  Peter Boncz

Author List

Organisation  Name         Contact Information
CWI           Peter Boncz  [email protected]


Executive Summary

This report gives an account of evaluating the LOD2 GeoBench, as previously developed in D5.1.2, on a number of different RDF database systems. The LOD2 GeoBench aims to test the functionality and performance of RDF stores used in Linked Data Browsing, Visualization and Authoring Interfaces.

This benchmark is not intended as a purely scientific deliverable; it is rather focused on addressing practical challenges in the Geo Browsing components, as developed by the University of Leipzig (browser.linkedgeodata.org). In particular, it highlights performance problems encountered when laying out linked objects on a map, which may have highly different zoom levels. The performance challenge is making sure that performance always remains interactive, irrespective of the zoom level or facet selections.

This report coincides with the open-source release of v2.0 of the LOD2 GeoBench. The evaluation presented here goes beyond the one at the initial specification in D5.1.2, which was run on just one system (an alpha pre-release version of Virtuoso 7). Here we add benchmarking on multiple systems, on large data sizes (scale factor 100), and using cluster hardware instead of just a single machine.

The overall message coming out of these experiments is that to create high-performance (interactive) geospatial faceted browsing interfaces, specific pre-computation and indexing effort is needed (this is embodied by the "quad" implementation). This means that, on the one hand, application designers need to think about their data access strategy. On the other hand, more hooks for physical tuning are needed in RDF database systems to make this possible.

Tool             Purpose                  Address
SPARQL endpoint  Execute SPARQL queries   http://lod.openlinksw.com/sparql
Web Service API  REST Interface           http://lod.openlinksw.com/fct/service
Facet Browser    Text search and lookups  http://lod.openlinksw.com/fct


Abbreviations and Acronyms

Acronym  Explanation
LOD      Linked Open Data
GeoJSON  Geographic JavaScript Object Notation
GFM      General Feature Model (as defined in ISO 19109)
GML      Geography Markup Language
KML      Keyhole Markup Language
OWL      Web Ontology Language
RCC      Region Connection Calculus
RDF      Resource Description Framework
RDFS     RDF Schema
RIF      Rule Interchange Format
SPARQL   SPARQL Protocol and RDF Query Language
WKT      Well Known Text (as defined by Simple Features or ISO 19125)
W3C      World Wide Web Consortium (http://www.w3.org/)
XML      eXtensible Markup Language
OGC      Open Geospatial Consortium
LGD      Linked Geodata Browser (http://browser.linkedgeodata.org)
OSM      OpenStreetMap
LGB      LOD2 GeoBench (defined in this document)


Table of Contents

1. Introduction
   1.1 Outline
2. Benchmark
   2.1 Goals
   2.2 Dataset
      2.2.2 Query Workload
      2.2.3 Benchmark Metrics
      2.2.4 Benchmark Programs
   2.3 Benchmark Implementations
      2.3.1 Basic Implementation
      2.3.2 RTree and RTree++ Implementations
      2.3.3 Quad Implementation
3. Evaluation
   3.1 Hardware Platform
   3.2 RDF Database Systems Tested
   3.3 Loading Results
   3.4 Overall Benchmark Results
   3.5 Detailed Query Performance Results
   3.6 Full Query Performance Results
4. Conclusions
5. Appendix: Configuration Details


1. Introduction

Geographic information management is a generally well-understood task in data management. Relational database systems technologically support geographical data, sometimes by incorporating multi-dimensional indexing structures like the RTree, or using simple uni-dimensional BTrees (in conjunction with a space-filling curve). In RDF data management, many RDF stores support spatial data management, providing functions to test geospatial predicates, sometimes technologically supported by data structures such as the RTree. These system-specific extensions are being replaced by general adoption of the GeoSPARQL standard proposed by the Open Geospatial Consortium. As such, application development and deployment where the data involves geography should be supportable with RDF database systems. This activity in LOD2 puts that to the test.

In the past deliverable D5.1.2, a new database and application benchmark for faceted geographic querying was introduced, called the LOD2 GeoBench (v1.0). The underlying goal for creating this benchmark is to focus on improving the user experience for the Geospatial Browser developed by AKSW in the context of the LOD2 project (browser.linkedgeodata.org), both by influencing the design of the application and by measuring and improving the raw power of geographical query execution in RDF database systems.

In this deliverable we report on a series of experiments running the LOD2 GeoBench on four different systems: OWLIM 5.3, OpenLink Virtuoso V6 (open source), OpenLink Virtuoso V7 (open source) and OpenLink Virtuoso V7 Cluster Edition. The hardware platform used was the SCILENS database compute cluster at CWI. This hand-built cluster consists of three different layers of nodes, of which we used the highest "bricks" layer, built out of 16 large servers (16 cores, 256GB RAM). This same platform was used to create the record-breaking runs with 150 billion triples on the BSBM Explore and Business Intelligence benchmarks (see deliverable D2.1.4 and [1]).

1.1 Outline

In Section 2, we describe the LOD2 GeoBench benchmark in its v2.0 version, released in open source in conjunction with this deliverable. The benchmark can currently be implemented by RDF database systems in four different ways (basic, rtree, rtree++ and quad), which we describe in detail.

In Section 3, we provide and discuss the results of running the benchmark at scale factors 1, 10 and 100 on the platforms described above. When using the "quad" implementation, which provides imprecise answers, RDF database systems turn out to be capable of sustaining tens of concurrent client requests simultaneously on a single machine. Considering that real users of the Geospatial Browser would use significant think time in between queries, this means that a single machine could support hundreds of concurrent users. If precise answers are required, these experiments show that RDF-based geographical support ("rtree++") provides high performance on queries that are moderately to strongly zoomed in, while queries on large geographical areas (zoomed out) would still have low performance; it is evident that this problem cannot be eliminated inside RDF database systems, and only application redesign can overcome it. In all, the experimental results show clear improvements over the situation 18 months ago, as documented in D5.1.2.

In Section 4 we make some forward-looking statements and recommendations, both for application design in geographical faceted browsing and on the side of RDF database technology. In short, application designers should think ahead and create additional (indexing) data structures, in order to ensure interactive performance at all times. Such physical database design is very common in relational database systems, but almost completely undeveloped in RDF database systems. For their part, RDF systems should expose more features to enable such additional (indexing) opportunities.

[1] http://lod2.eu/BlogPost/1584-big-data-rdf-store-benchmarking-experiences.html


2. Benchmark

2.1 Goals

The LOD2 GeoBench is an RDF database/application benchmark for faceted geographical querying. In particular, its queries use a combination of geographical selection and grouping and counting by facets. Such faceted querying in its mainstream use (outside RDF, e.g. using relational technology) is known to be a hard problem: grouping and counting by the facet requires a lot of computational effort if many facet instances qualify the selection, yet due to the infinite number of possible selection predicates it is hard to prepare the system for this. Thus, queries involving millions of instances must really group and count millions of tuples (or triples), and making this part of an interactive system that should render a result screen within 0.2 seconds is a challenge. Also, faceted browsing servers on the web may be used by many clients simultaneously. As such, the database system answering the queries should be capable of providing this interactive experience to many users at the same time.

The goal of the LOD2 GeoBench result metric (queries per second per $) is to highlight the performance and architecture problems faced by the Linked Geodata Browser application (browser.linkedgeodata.org), which is being developed at the University of Leipzig as part of the LOD2 project. Specifically, it is intended to stimulate (i) technical progress in RDF database technology, improving both the query execution and query optimization support for geographical queries in SPARQL backends, and (ii) thinking about a possible redesign of RDF-based applications like the Linked Geodata Browser. This suggestion for redesign points toward an opportunity to redesign physical RDF databases, where for specific access patterns and queries the application architect and DBA could decide to pre-create certain indexes and materialized views (note that this is phrased in relational database terms; in practice this could take the form of additional synthetic triples).

The LOD2 GeoBench was developed as deliverable D5.1.2 in the LOD2 project, 18 months earlier. Coinciding with this report, we have released version v2.0 of the benchmark, whose software and documentation is available in open source:

http://svn.aksw.org/lod2/LOD2-GeoBench

We therefore continue with a recap of the benchmark design and description, including a description of what has changed in v2.0.

2.2 Dataset

The dataset used by LOD2 GeoBench is the RDF-ized OpenStreetMap (OSM) dataset provided by the AKSW group at the University of Leipzig. The bulk of this dataset consists of 6M points (Relevant Nodes) and 3.8M polygons (Relevant Ways).

http://downloads.linkedgeodata.org/releases/2011-04-06/

10M dataset statistics:

Dataset              ASCII size  #triples  #points  #polygons
Ontology             1.2MB       8K
Relevant Nodes       10GB        66M       6M
Relevant Ways        10GB        65M       60M      3.8M
DBpedia Interlinks   14MB        101K
GeoNames Interlinks  60MB        487K

We call this core dataset the SF1 dataset. It contains roughly 10M geographic objects. The number of triples (130M) is significantly higher, and the uncompressed size in bytes is 20GB. For practical benchmarking, we opt for synthetic scaling of this core dataset. Not only does this not depend on the availability of additional resources at AKSW, or on the question whether creating larger subsets of OSM in RDF makes sense, but it also makes sure the geographical characteristics of the data remain equal at all benchmark scales. This makes it easier to interpret benchmark results at different scales.

The benchmark therefore scales this core of real data to any cardinal factor x*SF by copying all triples in all datasets x times, appending the string "_y" (for all y: 0<y<x) to all URIs starting with http://linkedgeodata.org/. This means we get many more facets in the Ontology, and every facet is duplicated x times in the dataset, belonging to new copies of the instances. This kind of scaling is highly similar to the one proposed in the DBpedia benchmark, and mimics what would happen if more properties of OpenStreetMap were included in the http://linkedgeodata.org/ dump.

The v1.0 version of the LOD2 GeoBench would just make the y copies of the same data instance, with different subject URIs, replicating the data. The geographic feature (point, polygon, polyline) would just be the same among the copies. This replication strategy backfires in systems that only create RTree geographical search accelerator structures on the unique set of literals, Virtuoso being such an example. That is, because the geographic features were copied and remained equal, the unique set of geographic literals would not grow, and hence the size of the RTree would not grow.

The v2.0 version of the LOD2 GeoBench, now released, changes the scaling procedure to shift each replicated geographical feature by a tiny random (lat,long) delta (encompassing a few meters). This way, all geographical features are unique, yet the set of such features is still realistic in its size and position distribution. This was the main reason to start with a "real" core dataset in the first place, since it is very hard to create synthetic, randomly generated geographical data that "makes sense" and conforms to real-world distributions.
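The v2.0 scaling step can be sketched as follows. This is a minimal illustration, not the actual generator shipped with the benchmark; the function name, the line-based N-Triples handling and the jitter magnitude (roughly a few meters in degrees) are our assumptions.

```python
import random
import re

LGD = "http://linkedgeodata.org/"

def scale_line(line, y, jitter=0.00005, rng=random):
    """Rewrite one N-Triples line for replica number y: every URI in the
    linkedgeodata.org namespace gets an _y suffix, and every lat/long
    literal is shifted by a tiny random delta so the copied geometry
    becomes unique (the v2.0 change over plain v1.0 replication)."""
    # Suffix all URIs that start with the linkedgeodata namespace.
    line = re.sub(r"<(" + re.escape(LGD) + r"[^>]*)>",
                  lambda m: "<%s_%d>" % (m.group(1), y), line)
    # Jitter quoted decimal literals on lat/long triples only.
    if "wgs84_pos#lat" in line or "wgs84_pos#long" in line:
        line = re.sub(r'"(-?\d+\.\d+)"',
                      lambda m: '"%f"' % (float(m.group(1))
                                          + rng.uniform(-jitter, jitter)),
                      line)
    return line
```

Running this over each core dataset file once per replica y would produce the x scaled output files; copy generation is not counted as database load time.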

Since April 2011, there have been new releases of the core dataset, in April and August 2013, which contain roughly the same data, but actualized from OpenStreetMap and with the raw triple data files split by data facet category (they used to be together). However, in the LOD2 GeoBench v2.0 we have not moved to this new core dataset. The rationale has been to keep the v1.0 and v2.0 of LOD2 GeoBench as compatible as possible. Having (only slightly) more triples and having them actualized from OpenStreetMap is of limited value for our purposes here. It is, however, possible that a future version of this benchmark will start using new LinkedGeoData dataset releases, if alone for the reason that the benchmark specification relies on the data release being online and downloadable.

2.2.1.1 Bulk Load

The benchmark starts by creating a new database, starting up the database server, and loading the full dataset into the database system (including possibly added triples in the data preparation step).

The full disclosure of a LOD2 GeoBench result consists of:

1. the elapsed time until all bulk-loading has finished;
2. the size in megabytes of the resulting database files on disk;
3. all relevant DBMS configuration files;
4. scripts containing all commands used for bulk-loading.

2.2.2 Query Workload

The LOD2 GeoBench workload mimics a browsing user in a query run. A query run, based on a random seed, deterministically picks 10 center points and executes 12 steps, each step consisting of two queries: the Facet Count Query (FCQ) and an Instance Retrieval Query (IRQ) or an Instance Aggregation Query (IAQ). Thus the workload in total consists of 240 queries. The sequence of 12 steps is as follows:

1. display map at zoom level 0 at a center point (FCQ1+IAQ1)
2. zoom to level 1 at the same center point (FCQ2+IAQ2)
3. zoom to level 2 at the same center point (FCQ3+IAQ3)
4. zoom to level 3 at the same center point (FCQ4+IAQ4)
5. zoom to level 4 at the same center point (FCQ5+IAQ5)
6. pan 1/8 width east at zoom level 4 (FCQ6+IAQ6)
7. zoom to level 5 at the same center (FCQ7+IRQ1)
8. pan 1/4 height north at zoom level 5 (FCQ8+IRQ2)
9. zoom to level 6 (FCQ9+IRQ3)
10. pan 1/2 width west at zoom level 6 (FCQ10+IRQ4)
11. zoom to level 7 (FCQ11+IRQ5)
12. pan one height south at zoom level 7 (FCQ12+IRQ6)
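The 12-step sequence above can be sketched as a generator of query windows. This is an illustration of the zoom/pan geometry only, using the 9/2^Z by 4.5/2^Z window sizes defined later in Section 2.2.2.4; the function name is ours, not taken from the benchmark code.

```python
def query_run_steps(lat, lon):
    """Yield (step, center_lat, center_lon, zoom) for the 12-step
    browsing sequence: zoom in at a fixed center, with four pan moves
    expressed as fractions of the current window width/height."""
    zooms = [0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7]
    # 0-based step index -> (pan in heights north, pan in widths east)
    pans = {5: (0, 1 / 8), 7: (1 / 4, 0), 9: (0, -1 / 2), 11: (-1, 0)}
    steps = []
    for i, z in enumerate(zooms):
        w, h = 9.0 / 2 ** z, 4.5 / 2 ** z   # window size at zoom z
        if i in pans:
            dlat, dlon = pans[i]
            lat += dlat * h
            lon += dlon * w
        steps.append((i + 1, lat, lon, z))
    return steps
```

A side effect of these particular pan fractions is that, in this sketch, the four pans cancel out and step 12 is centered on the original point again.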

The power query workload executes a query run directly after data load. It is immediately followed by the throughput workload. In the power workload, the queries in the query run are executed purely one after another. In the throughput workload, multiple query runs (generated with different parameters) run concurrently on the system. The typical concurrency levels to test are 2, 4, 8 and 16.

2.2.2.1 Facet Count Query (FCQ)

The Linked Geodata Browser displays an overview with the count per facet of the objects in the visible window. This is an aggregation query that counts all occurrences for each facet in the query window, be it a currently selected (active) facet or not. The query parameters here are the query center point (LATITUDE, LONGITUDE) and the window HEIGHT and WIDTH in degrees.

2.2.2.2 Instance Retrieval Query (IRQ)

The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets. This is a pure selection query (rectangular geographic window and facets); there is no grouping or aggregation involved. In addition to the parameters LATITUDE, LONGITUDE, HEIGHT and WIDTH, this query hence also receives four URI parameters FACET1, FACET2, FACET3, FACET4 identifying the facets of interest.


Figure 1: The Linked Geodata Browser mishandling situations with too many results: queries get disabled (info windows) and certain parts of the screen exhibit information overflow.

2.2.2.3 Instance Aggregation Query (IAQ)

At the lower zoom levels, when a very large area fits in the window, the sheer number of results can cause performance and usability problems. For instance, imagine visualizing all street lights in all of Germany as markers on a map on a computer screen. This would mean that millions of lamppost icons need to be placed on the screen, which does not even have enough pixels for that. The resulting drawing is bound to be judged as convoluted by average users. Further, even arriving at such a drawn map is a performance challenge, since the query returns many results, which need to be processed (and, depending on the architecture of the application, might also need to be sent to the client, e.g. a web browser).

The Instance Aggregation Query deals with the problem of too many instances by summarizing the instances geographically. This query is used in the LOD2 GeoBench instead of the Instance Retrieval Query at the lowest zoom levels (the first six steps). For this purpose, it divides the map into 40x20 conceptual square tiles, and allows just one marker per active facet inside one tile. It does count how many instances fall in a tile, and it selects the most relevant marker in a tile for display (in the benchmark, we do not really choose the most relevant marker, but choose the one with the largest subject URI, i.e. a random one), together with a count of occurrences.

Note that the Instance Aggregation Query delivers something that could alternatively be shown as a "heat map", rather than as summary markers with an occurrence count inside them, as suggested.
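The 40x20 tile summarization can be sketched as follows; a minimal illustration under the assumption that the instances have already been restricted to the query window, with a function name and row layout of our own choosing.

```python
def summarize(instances, lat0, lon0, width, height, nx=40, ny=20):
    """Collapse (subject_uri, facet, lat, lon) rows to one marker per
    (facet, tile): the instance with the largest subject URI is kept
    as the (pseudo-randomly chosen) marker, plus an occurrence count."""
    tiles = {}
    for uri, facet, lat, lon in instances:
        # Map the coordinate to a conceptual tile inside the window.
        tx = min(nx - 1, max(0, int((lon - (lon0 - width / 2)) / width * nx)))
        ty = min(ny - 1, max(0, int((lat - (lat0 - height / 2)) / height * ny)))
        cnt, best = tiles.get((facet, tx, ty), (0, ""))
        tiles[(facet, tx, ty)] = (cnt + 1, max(best, uri))
    return tiles
```

Replacing the per-tile marker with the raw per-tile counts would give the "heat map" alternative mentioned above.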

2.2.2.4 Query Parameters

Center Point. The randomly generated queries in the LOD2 GeoBench workload use bounding boxes centered near (not exactly in the center, but a random distance off) a randomly chosen major city in Europe.


The cities we chose from are {Paris, Essen, Madrid, Milan, Barcelona, Berlin, Athens, Birmingham, Rome, Düsseldorf, Cologne, Katowice, Hamburg, Naples, Warsaw, Frankfurt, Munich, Brussels, Lisbon, Vienna, Manchester, Budapest, Amsterdam, Leeds, Stuttgart, Liverpool, Stockholm, Bucharest, Rotterdam, Copenhagen, Prague, Lyon, Zürich, Turin, Newcastle, Sheffield, Southampton, Nottingham, Marseille, Dublin}. These center points were chosen because OSM provides a very high level of detail for these areas, such that even at zoom level 7 there will be a lot of data per selection window.

Width and Height. The zoom level Z corresponds to a longitude width of 9/2^Z degrees and a latitude height of 4.5/2^Z degrees. Note that the lowest zoom level Z=0 selects 9 degrees longitude and 4.5 degrees latitude, which roughly corresponds to an area like Germany minus Bavaria. At zoom level 7, the window is down to 0.07 by 0.03 degrees, a small downtown area.
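As a small worked example of this formula (an illustration only):

```python
def window(zoom):
    """Query window (longitude width, latitude height) in degrees at a
    given zoom level: 9/2^Z by 4.5/2^Z, halving in each dimension per
    zoom step."""
    return 9.0 / 2 ** zoom, 4.5 / 2 ** zoom
```

At zoom 0 this yields (9.0, 4.5) degrees; at zoom 7 it yields 9/128 = 0.0703125 by 4.5/128 = 0.03515625 degrees, matching the "0.07 by 0.03" figure in the text.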

Facets. The map drawing query only visualizes geographic instances for 4 randomly chosen facets from four restricted sets (one facet http://linkedgeodata.org/ontology/FACET from each):

1. Place, Parking, Village (1M)
2. School, PlaceOfWorship, Leisure (700K)
3. Peak, Restaurant, Tourism (360K)
4. Sport, PostBox, Supermarket (200K)

These facet categories were chosen by analyzing the frequency of the various facets in OSM. Concretely, the above facets are chosen from the facets that have the highest frequency of occurrence. They were chosen (i) to make the queries challenging when zoomed out, as they will select many instances, and (ii) to guarantee that at the highest zoom level a nonzero number of instances is still in the window.

Further, from the set of very frequent facets (which is larger than the above), we selected groups of facets that have quite similar frequencies and put them in the above four groups. That is, there are roughly 1 million places, parkings and villages, and 200,000 sport, post box and supermarket features. Each query in the LOD2 GeoBench workload picks one from each category, e.g. (Parking, School, Tourism, Sport). That way, the queries always have a highly similar frequency characteristic. This in turn helps to create more stable performance runs among the results of running the same query with different parameter bindings (this is something that e.g. BSBM does not do, making it very hard to understand how well or badly a system behaves on a certain query, as this may vary enormously with the chosen parameter).

At scale factor x*SF (with x>1), these facets are suffixed with a random "_y", with y: 0<=y<x. Recall that the LOD2 GeoBench, when scaling the dataset to a larger size, not only creates copies of all geographic features with a different subject URI, but also uses different property URIs, i.e. suffixed with _y. As mentioned, the facets used are relatively frequent facets; their frequency in the core dataset is indicated in parentheses. At zoom level 0 we expect roughly 70K instances in total belonging to any of the four selected facets; the expected number decreases at each zoom level, to just a hundred at zoom level 7. Note that as we are focusing on high-density areas (European city centers), the number of instances in a 4x smaller sub-window (zoom-in) is in fact less than 4x smaller.

2.2.3 Benchmark Metrics

2.2.3.1 Page Per Second

The basic result metric is PagePerSec, based on the average time to render a page of the LinkedGeoData Browser, which is the sum of the facet count query and the instance (aggregation) query; this is reported in the inverse, hence PagePerSec. From a benchmark run, which executes each step 10 times, we derive an overall PagePerSec score at that step by averaging the 10 results (query latency in seconds). For multi-stream runs, we add the PagePerSec metric results for each stream to get a combined PagePerSec result.


2.2.3.2 Page Per Second Per $1000 (PagePerSec/K$)

To take into account the cost of the hardware used in various implementations, we divide the PagePerSec metric by the monetary cost of the hardware and software used: PagePerSec/K$. If the RDF system is a commercial software product, the price for software must be the dollar list price (no discounts). The price quoted for hardware must be the publicly available end-user price of the hardware at an online merchant at the date the benchmark was run.

2.2.3.3 Low Zoom, High Zoom and Total Score

We expect database systems to perform quite differently at low zoom levels when compared to high zoom levels. For this reason, two different sub-metrics are reported: the LowZoom Score is derived from steps 1-6, and the HighZoom Score from steps 6-12. We use the geometric mean as the method to combine the PagePerSec scores from the various steps, because this rewards relative improvements at any step equally in the overall score, even if the individual scores at the various steps are quite diverse. Similarly, the LOD2 GeoBench Total Score (LGB-TS) is the geometric mean of the LowZoom Score (LGB-LS) and the HighZoom Score (LGB-HS).
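The score combination can be illustrated as follows. Note this is a sketch: the text derives the HighZoom Score from step 6 onward, while this sketch assumes a non-overlapping 6/6 split of the 12 steps.

```python
import math

def geomean(xs):
    """Geometric mean: a relative improvement at any step moves the
    combined score by the same factor, regardless of absolute values."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def lgb_scores(page_per_sec):
    """page_per_sec: the 12 per-step PagePerSec values of a run.
    Returns (LGB-LS, LGB-HS, LGB-TS)."""
    low = geomean(page_per_sec[:6])    # steps 1-6
    high = geomean(page_per_sec[6:])   # steps 7-12 (assumed split)
    return low, high, geomean([low, high])
```

For example, a run scoring 2 PagePerSec on every low-zoom step and 8 on every high-zoom step would get LGB-LS = 2, LGB-HS = 8 and LGB-TS = 4.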

2.2.4 Benchmark Programs

Data Generator. The benchmark comes with a data generator (geoscale.sh) that reads one input file and produces x output files (0<=y<x) with _y suffixes in the URIs. It should be used on all core dataset files. These files can then be imported into the RDF database system. Generating the copies of the core dataset files should not be included in database load time.

Query Generator. The benchmark comes with a query generator (geoqgen.c) that, given a run number and a scale factor (SF), generates 240 textual queries. The run number is:

- 0 for a warmup run. It generates one subdirectory 01/ with one stream of 240 queries.
- 1 for the power run, which tests how the system behaves when it handles one user at a time. It generates one subdirectory 01/ with one stream of 240 queries.
- 2, 4, 8, 16 for the throughput runs, which test how the system behaves when it handles multiple users at a time. It generates multiple subdirectories 01/, .., xx/, each with one stream of 240 queries.

The queries generated are different between the 10 runs inside a stream, and between multiple streams. Therefore, the selectivities (result set sizes) of different queries from the same template also differ. The currently selected random seed numbers used to generate the queries have been chosen such that the overall size of intermediate results is similar, though (within 10% of each other in terms of the sum of result sizes in a query stream).

2.3 Benchmark Implementations

There are different ways an application can be designed (specifically, extra "indexing" triples could be pre-generated), and different ways in which queries to the RDF system could be formulated.

The LOD2 GeoBench v2.0 currently supports four different implementations: basic, rtree, rtree++ and quad tiles.

2.3.1 Basic Implementation

Each step in the LOD2 GeoBench workload consists of two queries. The Facet Count Query counts the number of facet instances in the rectangular query window. The basic strategy is not to assume any geographical support in the RDF backend and to perform the selection on the (lat,long) values, which leads to the following SPARQL 1.1 text:

select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
           <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o ;
           <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f .
        filter (?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
                ?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2) }
group by ?f
order by desc(?cnt)
limit 50

Typically, RDF stores will evaluate this query by using range scans on the POS or OPS index for the latitude and longitude predicates respectively, and intersect the resulting triple streams from these on subject. This means that if (say) the selectivity of the query is 1/10 of the full latitude range and 1/10 of the full longitude range, and hence (say) 1/100 of the total database, the intermediate result before the intersection is in the range of 1/10 of the dataset. Hence, it is 10x larger than strictly necessary. Still, this approach is simple and portable (it will work on any SPARQL 1.1 backend).

The second query in each step is the Instance Retrieval Query, or the Instance Aggregation Query. We start with the Instance Retrieval Query. This query retrieves all the facet instances inside (or overlapping with) the query window, for four chosen facets.

The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets, so there are four different FACET parameters: FACET1, FACET2, FACET3, FACET4:

select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
  { #union-start
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
    filter (?f = <http://linkedgeodata.org/ontology/Village>)
  } #union-end
  union
  { #union-start
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
    filter (?f = <http://linkedgeodata.org/ontology/Leisure>)
  } #union-end
  union
  { #union-start
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
    filter (?f = <http://linkedgeodata.org/ontology/Tourism>)
  } #union-end
  union
  { #union-start
    ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
    filter (?f = <http://linkedgeodata.org/ontology/Supermarket>)
  } #union-end
  .
  ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
     <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
  filter (?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
          ?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2)
} #where-end

Arguably, the selection on any of the four facets could also be done in a single filter with disjunctive expressions; however, it is believed that the current syntax and the one with disjunctive expressions would usually lead to the same physical query plan anyway. It should be noted that, if desired, such an alternative, yet equivalent, query syntax would be permissible in a LOD2 GeoBench result.


2.3.2 RTree and RTree++ Implementations

If an RDF database system supports efficient evaluation of geographical predicates (e.g. by creating an RTree index in advance), this is very relevant for the LOD2 GeoBench. We allow reasonable query variants: for instance, if the RDF database system being tested has specific geographic support, this can be used.

For instance, Virtuoso v6 provides RTree-based indexing that allows testing spatial intersection within a radius. It is possible to draw a circle around the query window and use the radius of this circle and the center point of the window in this syntax. This was the first RTree syntax variant implemented by LOD2 GeoBench (in v1.0) and therefore carries the name "rtree":

select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
           <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o;
           <http://www.w3.org/2003/01/geo/wgs84_pos#geometry> ?p;
           <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f.
        filter (bif:st_intersects(bif:st_geomfromtext('POINT(LONGITUDE LATITUDE)',2000), ?p, RADIUS) &&
                ?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
                ?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2) }
group by ?f
order by desc(?cnt)
limit 50

We now also allow query variants that exploit the geographic capabilities of other RDF database systems. For instance, both OWLIM 5.3 and Virtuoso v7 support the predicate:

bif:st_intersects(bif:st_geomfromtext("BOX(lat1 lon1, lat2 lon2)"), ?p)

This allows direct translation of the LOD2 GeoBench window queries into a geographical predicate. Note that the previous query for Virtuoso v6 would combine a query with a radius (circle query) with a subsequent (lat,lon) filter. In LOD2 GeoBench, this direct BOX comparison, supported from v2.0 on, is denoted "rtree++".

We omit detailed descriptions of the Instance Retrieval and Instance Aggregation Queries for the rtree and rtree++ variants, as these are natural variants of the basic queries, with as only difference the use of the appropriate geographical filters mentioned above.

2.3.3 Quad Implementation

An application like the Linked Geodata Browser, with strong interactivity demands, challenges not only RDF database technology, but also the application design itself. Taking the analogy of Google Maps, one can be assured that rather than querying a single data collection for all zoom settings, the result screens are rendered from a (pre-generated) separate dataset for each different zoom level. Even though Google Maps likely does not rely on relational database technology, this approach would be like having different tables store the geographical data of the various zoom levels. The advantage is that these tables can be designed such that when the zoom window is very large (low zoom level), irrelevant data that would be too big to show is pruned, or frequency counts are summarized (e.g. keep the number of lampposts in Germany per zipcode, rather than all individual lampposts). This way, the lower zoom levels operate on much less data, allowing the application to always exhibit interactive performance.

The quad approach, described here, is formally not a valid implementation of the LOD2 GeoBench, as it will provide slightly incorrect query answers, but it has the potential to achieve much better performance, with only a minor quality reduction in the query answers provided. Its performance can be measured with the LOD2 GeoBench.

The main idea is to create additional indexing triples that (i) accelerate geospatial data access at multiple zoom resolutions, even on systems that do not provide specific geospatial support, and (ii) precompute certain subquery results in order to accelerate query answering, for all three types of queries (facet count, instance retrieval, and instance aggregation).

QuadTiles. The geospatial acceleration comes from partitioning the 2D space according to QuadTiles, which is a Z-ordering of the (LONGITUDE, LATITUDE) space into 32-bit numbers, where LONGITUDE and LATITUDE get discretized from their normal double-precision ranges [-180,180] resp. [-90,90] to the short integer range [0,65535]. The pictures below, from the OpenStreetMap wiki, illustrate this:

Hence, a single number identifies a rectangle in the two-dimensional space. In fact, we can create such numbers at any even bit granularity: the leftmost image shows the four rectangles identified by 2-bit quadtile numbers: A=00, B=01, C=10, D=11.
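For illustration, a QuadTile number can be computed by discretizing the two coordinates and bit-interleaving them (Morton/Z-order encoding). This is a sketch, not the benchmark's own code; in particular, placing the longitude bits in the odd positions is our assumption:

```python
def part1by1(n):
    # spread the 16 bits of n over the even bit positions of a 32-bit word
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def quadtile(lon, lat):
    # discretize [-180,180] and [-90,90] to 16-bit integers ...
    x = int((lon + 180.0) / 360.0 * 65535)
    y = int((lat + 90.0) / 180.0 * 65535)
    # ... and interleave the bits into a single 32-bit Z-order number
    return (part1by1(x) << 1) | part1by1(y)
```

Locality is the point of the scheme: nearby points share a long common bit prefix, and truncating low-order bits of such a number yields the enclosing tile at a coarser granularity.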

QuadTile annotations can be exploited by adding extra RDF triples that annotate a subject that has a geography with those rectangles it overlaps with (one QuadTile triple for each). Each such annotation for a geographical subject adds one triple with a property, e.g. http://linkedgeodata.org/intersects/quadtile, and a value which is the integer QuadTile number. It is to be remarked that this works fine for points, but large polygons might need many triples if their surface is large. In the OSM core dataset, this does not seem to be an issue, though.

FacetTiles & TileFacets. Further elaborating on this idea, we can also use wider integers, with the lower 32 bits being the previously described QuadTile number and the higher bits storing a facet identifier. In our current implementation, we use 52-bit integers consisting of a 20-bit facet number and a 32-bit QuadTile number. The facet identifier is a 20-bit number identifying the facet URI as denoted by the http://www.w3.org/1999/02/22-rdf-syntax-ns#type property. In fact, we restrict ourselves to the 1024 most frequent facets (more than 10 instances worldwide), for which 10 bits are needed. The higher 10 bits of the 20-bit facet number are used for dataset scaling.

So we have a 52-bit approach, with a 32-bit QuadTile number in the minor bits and a 20-bit facet integer in the major bits. We baptize these combinations of QuadTile and facet numbers "FacetTiles". In case of an equi-selection on FACET, such as found in the Instance Queries, the number ranges will have the major bits (the facet part) the same in the Min and Max values of all ranges and only vary in the lower bits (the QuadTile part). Hence, in such situations FacetTiles share with QuadTiles all their nice geospatial locality properties.
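The FacetTile (and the reversed TileFacet) composition can be sketched as plain bit arithmetic; the helper names are ours, and the exact layout (20-bit facet number, 32-bit QuadTile) is assumed from the description above:

```python
FACET_BITS, TILE_BITS = 20, 32

def facettile(facet_id, qt):
    # facet number in the major bits: all tiles of one facet form one contiguous range
    assert 0 <= facet_id < (1 << FACET_BITS) and 0 <= qt < (1 << TILE_BITS)
    return (facet_id << TILE_BITS) | qt

def tilefacet(facet_id, qt):
    # QuadTile in the major bits: one tile's facet counts form a contiguous range instead
    assert 0 <= facet_id < (1 << FACET_BITS) and 0 <= qt < (1 << TILE_BITS)
    return (qt << FACET_BITS) | facet_id
```

With an equi-selection on one facet f, a QuadTile range [lo, hi] maps to the single contiguous FacetTile range [facettile(f, lo), facettile(f, hi)], which is exactly the locality argument made above.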

The Facet Count Query, however, must count the instances of all facets that fall into the query window. Using FacetTile numbers, this would lead to no range restriction at all. Therefore, for such queries we would be more interested in having the facet numbers in the minor bits and the QuadTile numbers in the major bits. Let us call such a numbering scheme "TileFacets".

We can add FacetTile annotations to all geospatial objects (one for each RDF subtype they have). Such triples consist of the subject, a http://linkedgeodata.org/intersects/facettile property, and the 64-bit integer literal value. These triples can be leveraged by the Instance Retrieval Query of the LOD2 GeoBench, which asks for all instances of four selected FACETs that fall in a certain query window.


It is relatively easy to map a geospatial query window onto a (series of) range restrictions on the QuadTile numbers. This usually gives a limited number of conjunctive ranges, but still it is often a good idea to use the SPARQL 1.1 subquery feature and enclose in the SPARQL query a subquery that simply has a single range consisting of the MIN and MAX values of the multiple ranges we are after. This idea to present a basic query with only one selection range is a workaround for weaknesses in SPARQL query optimizers, which would otherwise not recognize the opportunity to use the POS index on the http://linkedgeodata.org/intersects/facettile property. Similarly, given that we query for four FACETs, it may work best to use the above query model to retrieve all data for one facet, and write a query that unions four such sub-queries.
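One naive way to obtain such ranges is to enumerate all tiles at a chosen granularity that overlap the window and coalesce consecutive tile numbers. This sketch is ours (a production query generator would decompose the ranges without enumerating individual tiles), but it shows both the conjunctive ranges and the single enclosing MIN/MAX range used by the subquery trick:

```python
def morton(x, y, bits):
    # interleave the low `bits` bits of x and y (x in the odd positions)
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

def window_ranges(lon_min, lon_max, lat_min, lat_max, bits):
    # discretize the window corners to tile coordinates at this granularity
    n = (1 << bits) - 1
    x0 = int((lon_min + 180.0) / 360.0 * n); x1 = int((lon_max + 180.0) / 360.0 * n)
    y0 = int((lat_min + 90.0) / 180.0 * n);  y1 = int((lat_max + 90.0) / 180.0 * n)
    # enumerate every overlapping tile, then coalesce consecutive numbers into ranges
    tiles = sorted(morton(x, y, bits)
                   for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    ranges = [[tiles[0], tiles[0]]]
    for t in tiles[1:]:
        if t == ranges[-1][1] + 1:
            ranges[-1][1] = t
        else:
            ranges.append([t, t])
    return ranges
```

The MIN/MAX pair fed to the enclosing subquery is then simply (ranges[0][0], ranges[-1][1]).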

Note that in principle, given that the Instance Retrieval Query is used on the high zoom levels only, where result sets are not very large, this will lead to four local index lookups in the POS index. This may work better than a normal RTree would, because the RTree contains all instances of all facets, not only the four facets of interest. This means that an RTree selection query will only find a low percentage of the data in the leaf nodes it visits to be relevant for the query. One would need a kind of partitioned RTree (partitioned on facet) to get the same kind of locality as FacetTiles. An example Instance Retrieval Query is below, shortened by having it only query two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:

select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
  { #union-start
    { #subquery-start
      select ?s <http://linkedgeodata.org/ontology/Village> as ?f
      where
      { #where-start
        filter((?g >= 2554700922112 && ?g <= 2554700922367)
            || (?g >= 2554700922624 && ?g <= 2554700922879)
            || (?g >= 2554700944640 && ?g <= 2554700944895)
            || (?g >= 2554700945152 && ?g <= 2554700945407)
            || (?g >= 2554700965888 && ?g <= 2554700967167)
            || (?g >= 2554700967424 && ?g <= 2554700967679)
            || (?g >= 2554700988416 && ?g <= 2554700989695)
            || (?g >= 2554700989952 && ?g <= 2554700990207)) .
        { #subquery-start
          select ?s ?g
          where
          { #where-start
            ?s <http://linkedgeodata.org/intersects/facettile> ?g .
            filter (?g >= 2554700922112 && ?g <= 2554700990207)
          } #where-end
        } #subquery-end
      } #where-end
    } #subquery-end
  } #union-end
  union
  { #union-start
    { #subquery-start
      select ?s <http://linkedgeodata.org/ontology/Supermarket> as ?f
      where
      { #where-start
        filter((?g >= 2808103992576 && ?g <= 2808103992831)
            || (?g >= 2808103993088 && ?g <= 2808103993343)
            || (?g >= 2808104015104 && ?g <= 2808104015359)
            || (?g >= 2808104015616 && ?g <= 2808104015871)
            || (?g >= 2808104036352 && ?g <= 2808104037631)
            || (?g >= 2808104037888 && ?g <= 2808104038143)
            || (?g >= 2808104058880 && ?g <= 2808104060159)
            || (?g >= 2808104060416 && ?g <= 2808104060671)) .
        { #subquery-start
          select ?s ?g
          where
          { #where-start
            ?s <http://linkedgeodata.org/intersects/facettile> ?g .
            filter (?g >= 2808103992576 && ?g <= 2808104060671)
          } #where-end
        } #subquery-end
      } #where-end
    } #subquery-end
  } #union-end
  .
  ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
     <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
  filter(?a >= 45.6938 && ?a <= 45.8344 && ?o >= 4.77089 && ?o <= 5.05214)
} #where-end

The Facet Count Query, as said, does not have locality on facet, so it can better exploit a TileFacet numbering than a FacetTile numbering. We could thus also add TileFacet annotations to all instances, for the tiles they intersect with. This speeds up the query, certainly on systems without built-in geospatial support (RTrees), as the geographical predicate can now be translated into a range restriction that works well on a POS index. Furthermore, we could pre-aggregate the retrieved tuples on the facet number (lower bits) before even joining them to other triples.

However, especially at the lower zoom levels, where areas the size of Germany fall in the visible window, such queries will have to aggregate hundreds of thousands of triples, even at the smallest SF=1; and linearly more at higher scale factors. Aggregating this much data, even if delivered fast by a POS index, is still heavy CPU work that can take several seconds at least, and will make this query non-interactive at higher scale factors.

Therefore we do not add TileFacet annotations to instances, but use pre-computation for the Facet Count Query. We do this at various resolutions in the range of 12-26 bits, because the lowest zoom level selects roughly 1/40² of the data (40 is roughly 2⁶, so corresponding to 6 bits per dimension, i.e. 12 bits), and the deepest zoom level is 7 steps deeper, so at 26 bits. Hence, we propose TileFacet count pre-computation at 7 granularities: 12, 14, 16, 18, 20, 22 and 24 bits.

It is now a matter of determining a proper bit granularity for evaluating a query, depending on the zoom level. A good heuristic is to use the lowest granularity level at which at least one tile is fully enclosed by the query window (and if no such level exists, use the highest bit granularity); and then translate the window selection predicate into a series of range predicates on TileFacets, like before.
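This granularity heuristic can be sketched as follows; the "at least one fully enclosed tile" test is implemented here via the sufficient condition that the window spans at least two tile widths and heights, which is our own simplification:

```python
LEVELS = (12, 14, 16, 18, 20, 22, 24)  # pre-computed TileFacet granularities

def pick_granularity(window_w_deg, window_h_deg, levels=LEVELS):
    for bits in levels:                    # coarsest level first
        per_dim = 1 << (bits // 2)         # tiles per dimension at this level
        tile_w, tile_h = 360.0 / per_dim, 180.0 / per_dim
        # if the window spans >= 2 tiles in both dimensions, at least one
        # tile is guaranteed to be fully enclosed, wherever the window lies
        if window_w_deg >= 2 * tile_w and window_h_deg >= 2 * tile_h:
            return bits
    return levels[-1]                      # fall back to the finest granularity
```

For example, a world-sized window is served from the 12-bit pre-computation, while a city-sized window falls back to the finest level.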

The extra triples we keep hold the pre-computed counts at the various resolutions, for each TileFacet rectangle at that resolution (e.g. 16 bits). For all facet instances, we generate three triples with a subject in the form of http://linkedgeodata.org/facetcount/0000XXXXXX and as:

- property http://linkedgeodata.org/facetcount/tilefacet16, with as value its TileFacet number, with the QuadTile number part truncated (to 16 bits in this case). This thus represents a certain rectangle in the 2D space.

- property http://linkedgeodata.org/facetcount/count, with as value the number of occurrences of a facet. Note that we only need to generate http://linkedgeodata.org/facetcount triples for facets that have a non-zero count in a certain rectangle. As such, the amount of these pre-computed triples is always significantly lower than the amount of TileFacet annotations we added before.


- property http://linkedgeodata.org/facetcount/facet, which stores the facet URI (i.e. the http://www.w3.org/1999/02/22-rdf-syntax-ns#type value). It could be derived from the tilefacet16 number, but having this as a triple simplifies application development.

The Facet Count Query can now be formulated by selecting all tiles at some granularity that overlap with the query window, and summing up these pre-computed counts. Here is an example:

select ?f as ?facet xsd:integer(sum(?c * 0.512)) as ?cnt
where
{ #where-start
  ?s <http://linkedgeodata.org/facetcount/count> ?c ;
     <http://linkedgeodata.org/facetcount/facet> ?f .
  filter((?g >= 3471158208888832 && ?g <= 3472257719468032)
      || (?g >= 3473357232144384 && ?g <= 3474456742723584)
      || (?g >= 3659174697238528 && ?g <= 3660549085724672)
      || (?g >= 3660823964680192 && ?g <= 3661098841538560)
      || (?g >= 3661373720494080 && ?g <= 3662748108980224)
      || (?g >= 3663022987935744 && ?g <= 3663297864794112)) .
  { #subquery-start
    select ?s ?g
    where
    { #where-start
      ?s <http://linkedgeodata.org/facetcount/tilefacet14> ?g .
      filter (?g >= 3471158208888832 && ?g <= 3663297864794112)
    } #where-end
  } #subquery-end
} #where-end
group by ?f
order by desc(?cnt)
limit 50

The downside of this approach is that the facet counts provided will be an overestimation of the real facet counts, since the tiles from which the precomputed counts originate may (will) extend beyond the visible window. Users may tolerate such inaccuracies; especially for the lower counts, though, it might be annoying. One could envision a system in which, when a user wants the real count for a non-frequent facet, the exact value is computed (with a separate query exploiting the FacetTile annotations, as in the previous section).

The current query generator tries to correct for overestimation by normalizing the precomputed result to the size of the query box, dividing by the size of the box actually used for answering the query (which is equal or larger). In the above example, this leads to the 0.512 constant in the first line, as only slightly over half of the re-used precomputed area is inside the query box.
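The correction constant is just an area ratio. In the sketch below, the box dimensions are invented so that the ratio comes out at 0.512; they are not the actual query parameters:

```python
def correction_factor(query_box, answer_box):
    # (width, height) of the user's window vs. the tile-aligned box that was
    # actually used to answer it; the latter is equal or larger, so the
    # factor is <= 1 and scales the pre-computed counts down
    (qw, qh), (aw, ah) = query_box, answer_box
    return (qw * qh) / (aw * ah)

factor = correction_factor((1.0, 1.0), (1.25, 1.5625))
print(round(factor, 3))  # -> 0.512
```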

The problem of large query windows at low zoom levels also occurs in the Instance Aggregation Query. Recall that this query tackles the information overload problem of way too many markers by combining markers that are near each other into a single marker, which visualizes a count of how many instances fall under it. Similar to pre-computing counts per tile, we observe that this aggregation per facet per tile can also be pre-computed. Note that here we again need markers for only a few facets, so using the FacetTile numbers works best. Since the Instance Aggregation Query is only used at the lower zoom levels, we can just index this at granularities of 12, 14, 16 and 18 bits. Thus, for each tile at all granularities (e.g. 16 bits) in which a facet occurs at least once, we generate an artificial new subject http://linkedgeodata.org/facetmap/0000YYYYYYYY with the following properties:

- property http://linkedgeodata.org/facetmap/facettile16, with as value its FacetTile number (identifying the rectangle in which the clustered marker lies).

- properties http://linkedgeodata.org/facetmap/latitude and http://linkedgeodata.org/facetmap/longitude, holding the position of the marker.


- property http://linkedgeodata.org/facetmap/count, with as value the number of occurrences of a facet in that cell of the 16x8 grid inside the tile. Again we only add such pre-computed triples if a facet occurs in a certain cell, so the amount of generated http://linkedgeodata.org/facetmap/ triples is significantly lower than the amount of TileFacet annotations we added before.

An implementation of the Instance Aggregation Query exploiting these pre-computed triples first chooses an appropriate bit granularity for the zoom level. Then all such tiles that overlap with the query window are fetched; next, the real (latitude, longitude) values of the example markers in them are fetched and filtered again with the query window. This map is then presented. Because we use pre-aggregated data, just like in the case of the Facet Count Query, the problem of having to aggregate hundreds of thousands of instances is guaranteed to be avoided, as any tile maximally contains 128 points, and we access only a few tiles.
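The final client-side step (filtering the pre-computed markers with the exact window and re-aggregating them on the desired grid) can be sketched as follows. Keeping the first-seen position as the representative marker is our simplification of what the actual queries do with max(concat(...)):

```python
def reaggregate(markers, lat0, lat1, lon0, lon1, nx=40, ny=20):
    """markers: (lat, lon, count) tuples taken from the pre-computed tiles
    overlapping the window. Returns one combined marker per grid cell."""
    cells = {}
    for lat, lon, cnt in markers:
        if not (lat0 <= lat <= lat1 and lon0 <= lon <= lon1):
            continue                      # marker outside the exact window
        x = min(int(nx * (lon - lon0) / (lon1 - lon0)), nx - 1)
        y = min(int(ny * (lat - lat0) / (lat1 - lat0)), ny - 1)
        pos, c = cells.get((x, y), ((lat, lon), 0))
        cells[(x, y)] = (pos, c + cnt)    # keep one position, sum the counts
    return cells
```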

An example Instance Aggregation Query is below, shortened by having it only query two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:

select ?f as ?facet ?latlon ?cnt
where
{ #where-start
  { #subquery-start
    select ?f ?x ?y max(concat(xsd:string(?a)," ",xsd:string(?o))) as ?latlon count(*) as ?cnt
    where
    { #where-start
      { #subquery-start
        select ?f ?a ?o xsd:integer(20*(?a - 43.5141)/4.5) as ?y
                        xsd:integer(40*(?o - 0.3412)/9) as ?x
        where
        { #where-start
          { #union-start
            ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f .
            filter (?f = <http://linkedgeodata.org/ontology/Village>)
          } #union-end
          union
          { #union-start
            ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f .
            filter (?f = <http://linkedgeodata.org/ontology/Supermarket>)
          } #union-end
          .
          ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
             <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
          filter(?a >= 43.5141 && ?a <= 48.0141 && ?o >= 0.3412 && ?o <= 9.3412)
        } #where-end
      } #subquery-end
    } #where-end
    group by ?f ?x ?y
    order by ?f ?x ?y
  } #subquery-end
} #where-end

Since the query window will not perfectly align with QuadTile boundaries at the resolution used, and since for the marker combination in the pre-computed tiles we use fewer cells (16x8, because a query will be answered from multiple tiles), the cluster combination will give different results than the official LOD2 GeoBench Instance Aggregation Query, even if we later re-aggregate markers on the desired 40x20 grid. For the user experience, the effect of this is likely to be of minor importance.


3. Evaluation

3.1 Hardware Platform

In order to run the LOD2 GeoBench at scale, we used a cluster of compute nodes, in particular the SCILENS cluster installed at CWI.

SCILENS is a new kind of hardware cluster that has been designed from the ground up to serve large-scale data management. The machines in the SCILENS cluster are organized in three different levels, called 'pebbles', 'rocks', and 'bricks'. Each level decreases in number of nodes, but the individual machines used in the level increase in computational and disk resources (and price tag). The SCILENS cluster uses cheap consumer hardware, optimized to pack as much power in as little space as possible, making use of consumer home-theater mini-PC cases ('Shuttleboxes'), connected by a high-performance Infiniband network.

Due to the negative performance impact of network traffic during SPARQL query processing on large clusters (where joins tend to be 'communicating' joins in which all machines need to exchange data), and since network usage volume increases super-linearly with more nodes, it is generally better for RDF stores to work with fewer nodes with more (RAM) resources than with many nodes with few resources. Thus, we chose as our experimental platform the 'bricks' layer of SCILENS, which consists of sixteen 256GB RAM machines, each with 16 cores running at 2.4GHz (dual-socket Intel servers, worth $8K each). The cluster runs Fedora Linux. The price tag of the eight machines involved in the experiments, inclusive of the Infiniband network infrastructure, is roughly $100K.

The SCILENS cluster contains many more I/O resources per CPU core than is usual in compute clusters. The relation between CPU power and I/O resources is captured by the Amdahl number: the amount of I/O bytes per core CPU cycle the system can deliver. In the case of the SCILENS cluster this number is close to 1.0, whereas typical clusters at supercomputing facilities, such as LISA at SARA, only get to 0.2 (1 byte per 5 cycles). We do confess that, while all this I/O power is interesting, in the workloads presented so far most data is RAM resident. One reason was that the high-performance multi-SSD I/O subsystem of the bricks layer was not yet operational at the time of testing. This provides ground for a follow-up experiment using this fast I/O layer. We expect this to accelerate the load phase, and also to allow addressing even larger datasets efficiently on the same hardware.
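The Amdahl number mentioned here is simply delivered I/O bandwidth divided by the aggregate CPU clock rate; a small sketch with illustrative numbers (not measured figures):

```python
def amdahl_number(io_bytes_per_sec, cores, clock_hz):
    # bytes of I/O the system can deliver per CPU core cycle
    return io_bytes_per_sec / (cores * clock_hz)

# a 16-core 2.4GHz node needs ~38.4 GB/s of I/O to reach an Amdahl number of 1.0
print(amdahl_number(38.4e9, 16, 2.4e9))  # -> 1.0
```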

3.2 RDF Database Systems Tested

For our evaluation of the LOD2 GeoBench, we used four different software configurations:

- OWLIM-SE v5.3: we used the non-cluster version of Ontotext's OWLIM, which efficiently supports geographical querying, as it stores geographic features in an RTree. OWLIM 5.3 with the geographic extension is proprietary software, but we had no hard information on the cost of

Figure 2: The 'rocks' and 'pebbles' layers of the SCILENS cluster are hand-built from 384 Shuttleboxes, packing CPU and ample disk resources in little space.


OWLIM at the time of preparation of this document, so we omitted the per-$ scores for this system.

- Virtuoso V6 open source is still the most widely used RDF store around (V7 open source binary builds have only been distributed since August 2013). This OpenLink product has specific support for geographical predicates, albeit somewhat limited. As discussed, direct BOX (rectangular window) selections on latitude and longitude are not possible, so we use the RADIUS pre-filtering approach.

- Virtuoso V7 open source: this major new release has been strongly influenced by the LOD2 project, wherein CWI advised OpenLink on the introduction of numerous architectural enhancements. Specifically, V7 introduces columnar storage for RDF triples as well as vectorized execution, patterned after CWI research database system prototypes. Virtuoso V7 was released in 2013 and generally offers significant storage savings, reduced memory usage, and improved computational performance over V6.

- Virtuoso V7 Cluster Edition: this major new release of the (non open-source) cluster edition has been documented in D2.1.6 and introduces a new vectorized cluster-based execution paradigm that allows parallelizing any (complex) SPARQL query over a cluster of compute nodes. As a result, it can handle complex SPARQL queries, such as the Business Intelligence workload of BSBM, but also the LOD2 GeoBench, much more efficiently (or at all) than the cluster version of V6 ever did. The monetary cost at the time of writing of the enterprise version for department servers is $25K.

3.3 Loading Results

Before loading, we first generated the extra triples for the "quad" approach². This roughly doubles the data size. These triples are then bulk-loaded into the systems. In the case of OWLIM, after bulk loading, the RTree geographical index needs to be created. The time needed for this is included in the table below (and is always a small part of the total load time).

METHOD      SF1 (224M triples)    SF10 (2.24G triples)    SF100 (22.4G triples)
            time        size      time         size       time         size
owlim5.3    5257 sec    24GB      102075 sec   185GB      -            -
virtuoso6   12900 sec   23GB      -            -          -            -
virtuoso7   780 sec     12GB      5520 sec     108GB      -            -
v7cluster   -           -         2280 sec     156GB      18840 sec    1.1TB

Loading Virtuoso6 was done using a single loading process, since parallel loading would consistently hang the system. We mentioned already that loading the scaled data sizes also hit an error message in the RTree loading code (a bug), even in single-threaded mode, which prevented us from testing Virtuoso V6 on the larger data sizes.

Virtuoso7 used the native parallel loading procedure, run with 14 loading processes in parallel. Loading Virtuoso V7 Cluster Edition was done by running 2 processes per node, giving in total 32 loading processes (2 x 2 nodes/machine x 8 machines).

² Alternatively, we could have created separate databases for the experiments, one with the extra generated triples for quad, and one without (to run basic, rtree and rtree++). This was not done for manageability purposes, as the performance effects are deemed to be minor.


3.4 Overall Benchmark Results

We now present the overall benchmark scores of the experiments. The main pure result metric of the LOD2 GeoBench (regardless of cost) is PagePerSec. In order to measure the system under load, i.e. with all 8 cores busy, we derive this SCORE from the workload under 8 concurrent query streams; this is known as the "throughput metric"³. The score is computed as the geometric mean over all 12 steps. We also split it into the geometric mean over the queries at low zoom (zoom levels 0-4, steps 1-6) and high zoom (zoom levels >4, steps 7-12). These are called LSCORE and HSCORE.
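Given twelve per-step PagePerSec numbers, the three scores are plain geometric means; a sketch (the split at step 6 follows the definition above):

```python
import math

def geo_mean(xs):
    # geometric mean, computed in log space for numerical stability
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geobench_scores(steps):
    assert len(steps) == 12               # one PagePerSec value per step
    return (geo_mean(steps),              # SCORE : all 12 steps
            geo_mean(steps[:6]),          # LSCORE: low-zoom steps 1-6
            geo_mean(steps[6:]))          # HSCORE: high-zoom steps 7-12
```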

Figure 3: LOD2 GeoBench Scores at SF1 with one server (8 cores, 2.4GHz, 256GB RAM) - 8 query streams

The results at SF1 have the approximate implementation "quad" in front, especially on Virtuoso V7; however, the improved RTree support "rtree++" comes quite close. Notably, high-zoom quad queries worked better in V6, so there seems to be either an optimizer performance regression, or an undesired effect of the vectorized columnar execution in V7. Because the low-zoom (compute-intensive) quad queries are much faster on V7, its overall score is higher.

Another interesting comparison is the impact of the LOD2 R&D activities of the past few years, at least on this benchmark, for Virtuoso versus its strongest competitor, OWLIM. Whereas OWLIM generally was equivalent to or faster than Virtuoso V6 (compare owlim rtree++ with V6 rtree), the rtree-based score of Virtuoso improved by a factor 7, creating a significant performance advantage. Besides improvements to the RTree functionality, this is very likely caused by the columnar vectorized execution model that Virtuoso adopted in V7, inspired by CWI research in that area.

Moving to SF10, though throughput drops by a factor 3, we see the relative advantage of the quad approach improve dramatically:

Figure 4: LOD2 GeoBench Scores at SF10 with one server (8 cores, 2.4GHz, 256GB RAM) - 8 query streams

³ For the "Power" metric that tests all queries in isolation, see tables 3.6.1, 3.6.6 and the left part of 3.6.11.

[Figure 3 bar chart: SCORE, LSCORE and HSCORE for the basic, quad, and rtree/rtree++ variants on owlim5.3, virtuoso6 and virtuoso7; y-axis 0-100 PagePerSec]

[Figure 4 bar chart: SCORE, LSCORE and HSCORE for the basic, quad and rtree++ variants on owlim5.3 and virtuoso7; y-axis 0-30 PagePerSec]


Figure 5: LOD2 GeoBench Scores at SF10 with 8 servers (8 cores, 2.4GHz, 256GB RAM) - 8 query streams

Whereas the previous experiments used a single server, we now move to results obtained with 8 servers. One strategy that is applicable to any technology in a read-only workload like this is to replicate the database on multiple servers, and divide the queries among them. We keep using just 8 query streams, such that each server gets a single stream. In order to use all cores, the systems must now parallelize the individual queries to make use of the CPU resources. This explains the lack of linear scale-up in all systems. Note that replication still is a very powerful technique in heavy read-only workloads: we can safely expect that when using 8 replicated servers with 64 concurrent query streams, the results will be 8-fold those in Figure 4 (for example, 8*virtuoso7 would then reach a throughput score of 128 instead of just 12).

We also tested the "true" cluster database system provided by OpenLink, i.e. Virtuoso V7 Cluster Edition. Data here is not replicated on all servers, but partitioned among them at load time. This causes all queries to be parallelized, which explains the superior scores (with quad overall being the best) obtained on this system under light load. Having in mind a theoretical peak usage of at least 128, the platform utilization of v7cluster, at 15, can still be significantly optimized.

Overall, the absolute performance of the peak throughput drops from 50 PagePerSec at SF1 to 15 PagePerSec at SF10. This could partly be explained by the loss of data locality at SF10, but it could also indicate some query optimization problems. Namely, in the Virtuoso V7 quad implementation, the plans do not have a heavy computational load (as this has been precomputed), and in principle the complexity of all queries should be logarithmic in the data size. In this sense, a drop by a factor 3 may indicate that the query optimizer does not find the optimal plans yet.

Figure 6: LOD2 GeoBench score per cost ("bang for the buck") on 8 servers, SF10, 8 query streams

Finally, we computed the PagePerSec/K$ score for all benchmarked products, both single-server and 8-node cluster setups, excluding OWLIM 5.3 (for which we lack pricing information). It turns out that

[Figure 5 bar chart: SCORE, LSCORE and HSCORE for the basic, quad and rtree/rtree++ variants on 8*owlim5.3, 8*virtuoso7 and v7cluster/8; y-axis 0-25 PagePerSec]

[Figure 6 bar chart: SCORE/$ for the basic, quad, rtree and rtree++ variants on virtuoso7, 8*virtuoso7 and v7cluster/8; y-axis 0-2.5]


from a monetary point of view, V7 quad is the best deal; whereas for the clustered setup we have the $25K software (plus $100K hardware) Virtuoso V7 Cluster Edition beating a setup of 8 x $8K hardware nodes with the open-source version replicated.

Figure 7: LOD2 GeoBench Scores at SF100 with 8 servers (8-core, 2.4GHz, 256GB RAM), 8 query streams

The last overall benchmark results are the scores at SF100. Here, the trends continue, though the performance of the different query variants is stable and the results lie closer to each other.

In future experiments, the cluster experiments (or maybe all experiments) should also be performed under a high query load with many more than 8 streams, to ensure that all cores are fully busy under peak load.



3.5 Detailed Query Performance Results

We ran the LOD2 GeoBench at scale factors (SF) 1, 10 and 100 (130M, 1.3G and 13G triples), using 1, 2, 4 and 8 concurrent query streams. Not all systems were tested with all parameters:

- we tested the non-clustered systems only on SF1 and SF10; and Virtuoso V7 Cluster Edition was not tested at SF1, only at SF10 and SF100.

- a result of the different data scaling method in the LOD2 GeoBench v2.0 (vs. v1.0) is that Virtuoso V6 can no longer load the scale factor 10 dataset. It hits a bug, and given that V7 is out, this bug may never get fixed. Therefore Virtuoso V6 was only tested at scale factor 1.

Figure 8: Page-per-second performance for each of the 12 steps (SF1, #8/1)

SF1 #8/1: At SF1 on a single server with 8 concurrent query streams (which should at least keep the 8 cores busy), the results show that for Virtuoso V7 the highest performance is achieved with the new RTree functionality (rtree++); however, the performance improves linearly with higher zoom levels and is quite poor at the lower zoom levels. At the lower zoom levels, the approximate “quad” approach is much better. Interestingly, V6 achieves better performance than V7.

For OWLIM, the quad performance is bad because it gets bad query plans (due to the disjunctive filters and unions). The RTree support in OWLIM is quite good (rtree++), usually exceeding the RTree support of Virtuoso V6 (but not V7).



Figure 9: Page-per-second performance for each of the 12 steps (SF10, #8/1 and #8/8)

SF10 #8/1: At SF10 on a single server with 8 query streams, the advantage of the quad approach in Virtuoso increases considerably. Unlike at SF1, at SF10 the V7 quad performance considerably exceeds that of V7 rtree++ and of V6 quad, the latter because of improvements made in the query optimizer that favour the Instance Retrieval Query. At SF10, the rtree++ performance is no longer very good. This can be explained by the fact that the RTree stores instances belonging to all facets, not just the four facets required by the Instance Retrieval Query. At the larger scale factor, the increased data size causes I/O to start playing a role. In this experiment at SF10, Virtuoso V6 could not be tested due to the data loading bug mentioned earlier.

In this experiment we also see Virtuoso V7 Cluster Edition results on eight identical machines. It is striking that the basic variant performs very close to rtree. If we further compare basic between the cluster (8 machines) and a single machine, scalability for the low zoom levels is near-linear (factor 8). These queries perform a lot of work, which gets parallelized. However, the gains at the high zoom levels, which access less data, are more limited. It is an open question why rtree does not provide much benefit in a cluster setting.

SF10 #8/8: We now move to experiments at SF10 with 8 query streams on 8 servers. In the following we compare the Virtuoso V7 Cluster Edition approach with simple replication. The former, following a “true” cluster approach, partitions all data across all servers, hence each server stores 1/8th of the data, and queries get spread out over all servers (parallelized). Simple replication, in contrast,



loads the same data independently on eight different machines, and then executes the 8-stream benchmark test by running the single-stream test independently on all 8 machines. As such, the results of this experiment are roughly 8-fold higher than the single-stream single-server test. Note that the hardware is more than 8 times more expensive ($8K vs $100K, due to the cost of the InfiniBand switch, one InfiniBand network card per server, and cabling). This added price difference makes the replication strategy less attractive in the PagesPerSec/$ scores, presented later. The replication experiments are marked with a star in the legend.

Figure 10: Page-per-second performance for each of the 12 steps in the benchmark (SF10, #8/8)

Replication: This experiment shows that replicated OWLIM (owlim5.3*) competes at the higher zoom levels with its RTree support. Replicated Virtuoso V7 (virtuoso7*) with the quad approach scores high, though it is vulnerable in the Instance Retrieval Query at the lower zoom levels where it is used (steps 7-9) and when there are many query results. The overall winner in terms of performance is Cluster Edition with quads (the green dashes), thanks to more reliable performance at steps 7-9, even though it loses out to replication at steps 10-12.

SF100 #8/8: When we scale the dataset to factor 100 we only have results for Virtuoso V7 Cluster Edition on 8 server nodes. At the low zoom levels, the quad approach races ahead, with basic and rtree



behaving identically. At higher zoom levels, all approaches improve gradually, but rtree clearly beats basic, and quad clearly beats rtree.

Figure 11: Page-per-second performance for each of the 12 steps (SF100, #8/8)

In the experiments so far, we have shown the performance per step; however, please recall that each step is the combination of two queries. For steps 1-6, it is a Facet Count Query with an Instance Aggregation Query, and for steps 7-12 it is a Facet Count Query followed by an Instance Retrieval Query. Also, steps 6, 8, 10 and 12 just pan (to a partially overlapping area at the same zoom level), whereas the other steps zoom in. In the figures above this is visible in steps 6, 8, 10 and 12 scoring above the two trend lines that one can construct in the step 1-6 and step 7-12 segments.
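The step structure just described can be captured in a few lines. This is a sketch: the function names are ours, and only the pairing of queries per step and the pan/zoom pattern follow the text.

```python
def queries_for_step(step: int) -> tuple[str, str]:
    """Each step runs a Facet Count Query, followed by an Instance
    Aggregation Query (steps 1-6) or an Instance Retrieval Query (7-12)."""
    second = "InstanceAggregationQuery" if step <= 6 else "InstanceRetrievalQuery"
    return ("FacetCountQuery", second)

def movement_for_step(step: int) -> str:
    """Steps 6, 8, 10 and 12 pan to a partially overlapping area at the
    same zoom level; all other steps zoom in."""
    return "pan" if step in (6, 8, 10, 12) else "zoom-in"

stream = [(s, movement_for_step(s), *queries_for_step(s)) for s in range(1, 13)]
assert len(stream) * 2 == 24  # 24 queries per stream, two per step
```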

Analysis of Individual Query Performance. However, it is also interesting to look at the individual queries. Each query stream consists of 24 queries, two per step: first the Facet Count Query, then the Instance Aggregation or Retrieval Query.



The figures for SF1 #8/1 show on the right the Queries-Per-Second achieved by the Facet Count Query, and on the left by the Instance Aggregation Query (steps 1-6) and the Instance Retrieval Query (steps 7-12). If we examine the scale, at each step the Instance queries are the bottleneck. In fact, on Virtuoso 7 the Facet Count Query does not need the quad approximation, as rtree++ is among the best. Here we confirm that the performance dip at steps 7-9 is caused by the Instance Retrieval Query. The reason for this is the large number of instances at these zoom levels. As such, these results point to the fact that in the benchmark the switch-over from Instance Aggregation to Instance Retrieval had better be made at a deeper zoom level.

[Figure: Queries-per-second per step at SF1 #8/1; left panel: Instance Aggregation Query (steps 1-6) and Instance Retrieval Query (steps 7-12); right panel: Facet Count Query]


At SF10 #8/1 (above) the cost balance between the Instance Queries (left graph) and the Facet Count Queries (right graph) shifts, as they become more comparable. Still, the bottleneck is in the first three Instance Retrieval Queries (steps 7-9, left). It is remarkable in these results that for the Instance Retrieval Queries, OWLIM 5.3 does a good job at steps 8-12 (left), in fact beating the Virtuoso 7 rtree++ approach.

At SF100 #8/8 the Facet Count Queries (right graph) and the Instance Queries (left graph) have roughly the same cost. Here, the quad approach really wins in the Facet Count Query (right). The Instance Queries (left) generally have lower performance, especially at steps 7-12 (i.e., the Instance Retrieval Query).

[Figures: Queries-per-second per step at SF10 #8/1 and SF100 #8/8; left panels: Instance Aggregation/Retrieval Queries; right panels: Facet Count Query]


3.6 Full Query Performance Results

In these results, the red numbers mark the slowest runs, whereas the green numbers mark the fastest runs.

3.6.1 Scale factor 1, 1 query stream, 1 server

SF1 #1/1  owlim5.3  owlim5.3  owlim5.3  virtuoso6  virtuoso6  virtuoso6  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0346  0.1176  0.0832  0.0905  4.7641  0.0387  0.1358  4.4662  0.2301  0.6190 

STEP02  0.0556  0.1458  0.2586  0.1603  5.3937  0.1134  0.3253  3.7481  0.5640  1.4253 

STEP03  0.0780  0.2968  0.7925  0.2619  6.9348  0.2772  1.0380  4.6468  1.8761  3.0703 

STEP04  0.1237  0.2347  1.6066  0.3784  7.9176  0.6577  1.2910  5.3821  2.1349  5.0100 

STEP05  0.2079  0.9257  2.1772  0.5676  15.3678  1.0667  2.0250  6.4850  2.3651  6.8446 

STEP06  0.2028  0.9109  2.2862  0.5628  14.6413  1.0730  2.4520  8.9126  2.5886  6.7294 

STEP07  0.3517  0.7462  3.2216  0.8517  1.1614  1.5647  2.6040  2.5953  2.5819  8.8261 

STEP08  0.3610  0.5692  3.3288  0.8486  1.4243  1.5382  2.8520  5.0403  2.9967  9.0171 

STEP09  0.7034  0.8040  6.2853  1.4900  2.3596  3.5868  4.2123  6.2972  4.4762  17.2414 

STEP10  0.6758  0.7642  5.1177  1.4007  2.1092  2.4515  4.4111  7.8003  3.8051  14.0647 

STEP11  1.3709  0.9497  9.5602  4.0144  5.0735  6.2383  6.4892  9.1996  6.6711  26.5252 

STEP12  1.2764  2.9002  6.6181  2.2123  4.5558  4.1493  6.2422  9.6711  5.2966  18.0832 

LSCORE  0.0961  0.3168  0.7176  0.2780  8.1604  0.3119  0.8156  5.1935  1.2128  2.9228 

LSCORE/$  0.0347  1.0200  0.0389  0.1019  0.6491  0.1516  0.3653 

HSCORE  0.6876  0.9466  5.2829  1.5409  2.3974  2.8593  4.2104  6.2021  4.0841  14.4740 

HSCORE/$  0.1926  0.2996  0.3574  0.5263  0.7752  0.5105  1.8093 

SCORE  0.2571  0.5476  1.9471  0.6545  4.4231  0.9444  1.8532  5.6755  2.2256  6.5044 

SCORE/$  0.0818  0.5528  0.1180  0.2316  0.7094  0.2782  0.8130 

3.6.2 Scale factor 1, 2 query streams, 1 server

SF1 #2/1  owlim5.3  owlim5.3  owlim5.3  virtuoso6  virtuoso6  virtuoso6  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0438  0.2046  0.1377  0.1450  8.7857  0.0592  0.2899  9.3350  0.5868  1.0153 

STEP02  0.0760  0.3324  0.3922  0.2556  11.8142  0.1581  0.9248  10.0023  1.5272  2.1669 

STEP03  0.1010  0.4646  1.0689  0.4358  12.5180  0.4040  2.7962  6.2473  4.2260  4.5340 

STEP04  0.1579  0.7876  2.2714  0.6660  18.5405  0.8659  2.7117  9.9469  4.0895  8.4835 

STEP05  0.2746  0.8481  4.6420  0.9965  25.4895  1.7024  3.4837  13.1304  4.8692  13.6153 

STEP06  0.2704  0.9686  4.4050  0.9913  26.2031  1.7180  3.6731  19.0339  5.8937  13.4819 

STEP07  0.4914  0.9496  6.5400  1.6060  2.0462  3.3723  5.5321  6.6725  5.8411  19.9092 

STEP08  0.4976  0.9409  6.0975  1.6073  2.3136  3.3690  6.0500  10.2762  6.7234  19.6660 

STEP09  1.0367  2.5329  12.4141  2.9654  5.3145  8.2270  9.8168  12.8568  11.5399  39.1348 

STEP10  0.8869  2.6699  10.3294  3.0043  4.2300  5.9599  9.5017  16.0707  10.4757  29.2999 

STEP11  1.8653  5.0793  20.1726  6.3119  8.6424  13.6262  13.5619  17.2091  17.0043  55.0651 

STEP12  1.7027  3.8357  14.5142  5.3378  7.5906  9.8195  13.5811  17.8367  12.7472  38.4264 

LSCORE  0.1258  0.5230  1.17876  0.4690  15.8713  0.4610  1.7210  10.6290  2.7614  4.9919 

LSCORE/$  0.0586  1.9839  0.0576  0.2151  1.3286  0.3451  0.6239 

HSCORE  0.9454  2.2132  10.6857  3.0293  4.2201  6.4824  9.1109  12.7630  10.0386  31.3102 

HSCORE/$  0.3786  0.5275  0.8103  1.1388  1.5953  1.2548  3.9137 

SCORE  0.3449  1.0759  3.5490  1.1920  8.1840  1.7287  3.9598  11.6472  5.2651  12.5020 

SCORE/$  0.1490  1.0230  0.2160  0.4949  1.4559  0.6581  1.5627 


3.6.3 Scale factor 1, 4 query streams, 1 server

SF1 #4/1  owlim5.3  owlim5.3  owlim5.3  virtuoso6  virtuoso6  virtuoso6  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0703  0.3726  0.2693  0.2875  18.7083  0.1195  0.5566  21.3604  1.5452  2.0748 

STEP02  0.1121  0.4790  0.8264  0.5028  25.3611  0.3937  1.5829  18.1676  2.8779  5.3924 

STEP03  0.1498  1.00556  2.3242  0.7441  31.2361  0.8582  4.6161  13.0845  8.6183  9.5042 

STEP04  0.22611  1.4740  4.7571  1.1240  46.2466  1.9203  5.0335  21.9932  8.2786  16.5111 

STEP05  0.3634  2.7782  7.6465  1.6884  74.7055  3.2683  6.9151  30.5185  9.1625  23.0400 

STEP06  0.3694  2.6684  7.9984  1.6681  74.7607  3.2947  6.6436  45.7031  10.3364  24.2530 

STEP07  0.7162  2.7846  12.2841  3.1894  4.4121  5.7132  11.4062  16.1533  10.6061  33.4092 

STEP08  0.6863  2.8749  12.0616  2.9065  4.4726  5.7027  12.4466  23.5253  10.956  33.4019 

STEP09  1.4399  4.4919  26.8988  6.7532  8.3481  13.7416  19.6487  28.2182  19.1360  65.3226 

STEP10  1.3783  5.8125  19.5487  6.6083  8.09174  10.0076  19.2529  36.7007  17.3637  52.3030 

STEP11  2.8112  12.0010  43.3150  14.7862  18.6510  27.0136  29.8168  40.1119  26.7478  96.3516 

STEP12  2.4259  9.2902  27.8078  13.0194  15.15110  16.2686  30.6850  43.2767  23.4443  60.7765 

LSCORE  0.1817  1.1187  2.3055  0.8357  39.2294  0.9703  3.1287  23.1667  5.5719  9.6245 

LSCORE/$  0.1044  4.9036  0.1212  0.3910  2.8958  0.6964  1.2030 

HSCORE  1.3712  5.3409  21.2913  6.5543  8.4910  11.1847  19.1156  29.6368  16.9895  52.9802 

HSCORE/$  0.8192  1.0613  1.3980  2.3894  3.7046  2.1236  6.6225 

SCORE  0.4992  2.4444  7.0063  2.34052  18.2510  3.2944  7.7335  26.2028  9.7295  22.5812 

SCORE/$  0.2925  2.2813  0.4118  0.9666  3.2753  1.2162  2.8226 

3.6.4 Scale factor 1, 8 query streams, 1 server

SF1 #8/1  owlim5.3  owlim5.3  owlim5.3  virtuoso6  virtuoso6  virtuoso6  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  quad  basic  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.5448  0.0563  0.2743  0.4113  36.2784  0.1811  0.9454  40.7706  2.7136  3.4060 

STEP02  0.7765  0.0892  0.9029  0.7124  46.3338  0.5223  2.9656  36.3412  7.7649  6.6037 

STEP03  1.4639  0.1222  2.3402  1.1828  58.2323  1.3596  8.3323  31.9119  13.4467  13.7075 

STEP04  2.5405  0.1910  5.1597  1.8460  90.0222  2.9270  9.1631  47.4474  13.7959  25.0132 

STEP05  3.4629  0.3169  8.3595  2.7192  132.4990  5.2606  12.2425  59.0893  13.7280  38.6629 

STEP06  4.6291  0.3096  8.8595  2.7852  139.9900  5.3247  13.1893  85.2281  15.7198  38.5826 

STEP07  4.5387  0.5878  14.1144  4.9718  7.7641  10.0048  23.0038  33.8289  20.6775  55.1534 

STEP08  4.6977  0.6517  15.3833  4.7890  7.9362  10.0792  23.8945  45.0062  21.1386  59.0653 

STEP09  7.6525  1.3616  38.3478  10.2151  13.7468  24.4637  40.1600  55.6296  37.3605  105.0690 

STEP10  9.7112  1.2118  25.6037  9.2489  13.3397  17.5706  37.1036  72.2938  31.9566  83.1055 

STEP11  19.0270  2.2852  66.4700  21.7008  30.4268  45.2330  54.0770  77.3047  51.9836  156.4850 

STEP12  13.2889  2.0098  34.6726  18.9416  25.7544  29.1453  54.146  79.2604  45.7914  105.1910 

LSCORE  1.7121  0.1503  2.4589  1.3007  73.8155  1.4807  5.7035  47.2966  9.7115  15.0085 

LSCORE/$    0.1625  9.2269  0.1850  0.7129  5.9120  1.2139  1.8760 

HSCORE  8.5786  1.1943  27.7408  9.8613  15.3680  19.6025  36.5335  57.7652  32.7411  87.9627 

HSCORE/$    1.2326  1.7960  2.4503  4.5666  7.22066  4.0926  10.9953 

SCORE  3.8325  0.4238  8.2590  3.5814  32.5666  5.3876  14.4350  52.2695  17.8317  36.3344 

SCORE/$    0.4476  4.07083  0.6734  1.8043  6.5336  2.2289  4.5418 


3.6.5 Scale factor 1, 8 query streams, 8 replicated servers

SF1 #8/8  owlim5.3  owlim5.3  owlim5.3  virtuoso6  virtuoso6  virtuoso6  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  rtree  quad  basic  rtree  quad  rtree++ 

STEP01  0.2772  0.9410  0.6660  0.7247  0.3102  38.1134  1.0867  1.8415  35.7302  4.9520 

STEP02  0.4450  1.1669  2.0693  1.2831  0.9078  43.1499  2.6029  4.5126  29.9850  11.4025 

STEP03  0.6244  2.3751  6.3406  2.0958  2.2179  55.4785  8.3064  15.0094  37.1740  24.5625 

STEP04  0.9902  1.8779  12.8534  3.0279  5.2617  63.3413  10.3306  17.0794  35.0570  40.0802 

STEP05  1.6636  7.4060  17.4178  4.5413  8.5342  114.942  16.2042  18.9214  51.8806  54.7570 

STEP06  1.6230  7.2879  18.2899  4.5029  8.5846  117.1300  19.622  20.7093  71.3013  53.8358 

STEP07  2.8137  5.9697  25.7732  6.8143  12.5176  9.2915  20.833  20.6558  20.7630  70.6090 

STEP08  2.8886  4.5539  26.6311  6.7888  12.3058  11.3944  22.8180  23.9736  40.3226  72.1370 

STEP09  5.6274  6.4365  50.2829  11.9207  28.6944  18.8768  33.6984  35.8102  50.3778  137.9310 

STEP10  5.4065  6.1138  40.9417  11.2061  19.6126  16.8741  35.2890  30.4414  62.4025  112.5180 

STEP11  10.9679  7.5980  76.4818  32.1156  49.9064  40.5886  51.9143  53.3689  73.5970  212.2020 

STEP12  10.2119  23.2018  52.9450  17.6991  33.1950  36.4465  49.9376  42.3729  77.3694  144.6660 

LSCORE  0.7686  2.5324  5.7364  2.2223  2.4934  65.2294  6.5201  9.6946  41.5142  23.3633 

LSCORE/$  0.0317  0.0356  0.9318  0.0931  0.1384  0.5930  0.333762 

HSCORE  5.4967  7.5666  42.2283  12.31  22.8554  19.1639  33.6555  32.6461  49.5762  115.7030 

HSCORE/$  0.1759  0.3265  0.2737  0.4807  0.4663  0.7082  1.6529 

SCORE  2.0554  5.3774  15.5641  5.2318  7.5490  35.3561  14.8135  17.7902  45.3665  51.9923 

SCORE/$  0.0747  0.1078  0.5050  0.2116  0.2541  0.6480  0.74274 

3.6.6 Scale factor 10, 1 query stream, 1 server (8 servers for v7cluster)

SF10 #1/1  owlim5.3  owlim5.3  owlim5.3  v7cluster  v7cluster  v7cluster  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0050  0.0198  0.0104  0.4749  3.8095  0.5610  0.1166  0.8274  0.0947  0.1757 

STEP02  0.0109  0.0207  0.0323  0.7364  3.2862  0.6941  0.1037  1.6655  0.0854  0.4154 

STEP03  0.0208  0.0456  0.1077  0.9268  4.5829  0.6060  0.1109  1.7379  0.1068  0.6811 

STEP04  0.0346  0.0272  0.2104  1.0624  4.4822  0.9128  0.0942  1.1002  0.0928  0.7865 

STEP05  0.0500  0.1210  0.3210  1.3428  5.8858  1.1353  0.1786  2.5316  0.1802  1.0473 

STEP06  0.0492  0.1136  0.3207  1.8733  6.9979  1.4039  0.2710  3.9169  0.2731  1.0511 

STEP07  0.0667  0.0961  0.4799  1.0271  1.0001  0.7162  0.2425  0.2147  0.2204  0.9768 

STEP08  0.0683  0.0788  0.4983  1.1344  1.7382  0.7238  0.3339  0.4377  0.2771  1.0534 

STEP09  0.1121  0.0931  1.4551  1.2110  1.6526  0.9791  0.4087  0.6935  0.3241  2.1173 

STEP10  0.1049  0.0941  0.9657  1.2301  1.8821  0.8685  0.7598  3.3512  0.4157  1.5506 

STEP11  0.1920  0.1191  3.1938  1.3126  1.9704  1.4876  0.9669  3.6563  0.7057  3.6496 

STEP12  0.1678  0.6825  1.4376  1.3282  1.9142  1.0537  0.9318  5.3705  0.6004  1.9149 

LSCORE  0.0215  0.0438  0.0963  0.9763  4.6834  0.8368  0.1353  1.7222  0.1258  0.5921 

LSCORE/$  0.0088  0.0426  0.0076  0.0169  0.2152  0.0157  0.0740 

HSCORE  0.1096  0.1325  1.0749  1.2026  1.6526  0.9403  0.5321  1.2746  0.3895  1.6934 

HSCORE/$  0.0171  0.0236  0.0134  0.0665  0.1593  0.0487  0.2116 

SCORE  0.0485  0.0762  0.3217  1.0836  2.7820  0.8870  0.2684  1.4816  0.2214  1.0014 

SCORE/$  0.0098  0.0253  0.008  0.0335  0.1852  0.0276  0.1251 


3.6.7 Scale factor 10, 2 query streams, 1 server (8 servers for v7cluster)

SF10 #2/1  owlim5.3  owlim5.3  owlim5.3  v7cluster  v7cluster  v7cluster  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0065  0.0339  0.0100  0.8428  6.9037  1.0140  0.1989  2.5406  0.1479  0.2817 

STEP02  0.0140  0.0475  0.0196  1.3476  8.2174  1.2511  0.2304  4.9220  0.1974  0.6734 

STEP03  0.0280  0.0672  0.0912  2.1892  6.2783  1.2983  0.2715  3.9435  0.2591  1.1448 

STEP04  0.0451  0.1049  0.1901  2.4971  9.3195  1.7669  0.1964  5.1139  0.1986  1.6744 

STEP05  0.0689  0.0953  0.3302  2.7946  11.242  1.8100  0.2883  6.3940  0.2929  2.1664 

STEP06  0.0691  0.1240  0.4815  3.5152  14.435  2.0111  0.5444  5.4518  0.5457  2.2474 

STEP07  0.0995  0.1169  0.7798  2.0170  2.1313  1.4603  0.5686  0.5905  0.4803  1.7302 

STEP08  0.1002  0.1180  0.4186  2.3943  3.7048  1.2957  0.7306  1.1359  0.5880  1.6374 

STEP09  0.1738  0.3818  0.6771  2.3188  3.2702  2.1333  1.0197  1.6843  0.6839  3.7533 

STEP10  0.1573  0.4045  0.7272  2.3477  3.6686  1.9933  1.3206  7.5586  0.8095  2.5963 

STEP11  0.2782  1.3700  6.5831  2.5421  3.4931  2.2806  1.6689  7.5430  1.5672  6.9030 

STEP12  0.2377  0.6441  3.3194  2.6716  3.8395  2.1771  1.6871  8.9356  1.1288  3.7002 

LSCORE  0.0286  0.0716  0.0905  1.9835  9.0125  1.4817  0.2697  4.5402  0.2494  1.0999 

LSCORE/$  0.0180  0.0818  0.0134  0.0337  0.5675  0.0311  0.1374 

HSCORE  0.1621  0.3515  1.2328  2.3721  3.2894  1.8485  1.0786  2.8829  0.8073  2.9821 

HSCORE/$  0.0215  0.0299  0.0168  0.1348  0.3603  0.1009  0.3727 

SCORE  0.0682  0.1587  0.3341  2.1691  5.4448  1.6550  0.5394  3.6179  0.4488  1.8111 

SCORE/$  0.0197  0.0495  0.0150  0.0674  0.4522  0.0561  0.2263 

3.6.8 Scale factor 10, 4 query streams, 1 server (8 servers for v7cluster)

SF10 #4/1  owlim5.3  owlim5.3  owlim5.3  v7cluster  v7cluster  v7cluster  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0105  0.0652  0.0242  1.6013  14.0383  1.8670  0.3959  5.8348  0.2644  0.5549 

STEP02  0.0212  0.0702  0.0820  2.4171  14.2772  2.1800  0.3698  6.9096  0.3257  1.4373 

STEP03  0.0427  0.1441  0.2278  3.4893  10.7229  2.6489  0.4208  10.2357  0.3958  2.2133 

STEP04  0.0657  0.1946  0.4703  4.4887  17.2670  3.6181  0.3787  13.6636  0.3721  2.7233 

STEP05  0.0992  0.3381  0.7094  5.2234  18.9309  3.2839  0.6232  22.4468  0.6240  3.5838 

STEP06  0.1000  0.3374  0.8064  6.6113  29.4153  3.7726  1.0406  27.9520  1.1231  3.4744 

STEP07  0.1538  0.3841  1.2721  3.8237  4.4956  2.2509  1.2260  1.7904  0.9262  2.6010 

STEP08  0.1479  0.4572  1.1595  4.0627  6.5959  2.5554  1.7328  4.2651  1.2789  2.4533 

STEP09  0.2576  0.6944  3.4924  4.7477  5.8628  3.0736  1.6984  4.9390  1.2811  5.3958 

STEP10  0.2424  0.9163  2.7187  4.8525  6.6822  3.0457  2.2505  16.0512  1.4027  3.9986 

STEP11  0.4361  2.2408  9.6816  4.4008  6.3772  4.1216  3.1623  18.0522  3.1385  10.445 

STEP12  0.4076  1.7863  3.3537  4.5142  7.0365  3.7629  2.7759  23.5145  1.9080  4.7708 

LSCORE  0.0429  0.1565  0.2228  3.5748  16.5469  2.8002  0.4975  12.3316  0.4553  1.9773 

LSCORE/$  0.0325  0.1505  0.0254  0.0621  1.5414  0.0569  0.2471 

HSCORE  0.2515  0.8745  2.7719  5.3825  6.1075  3.0673  2.0357  7.9669  1.5281  5.3566 

HSCORE/$  0.0398  0.0555  0.0279  0.2544  0.9958  0.1910  0.54457 

SCORE  0.1039  0.3700  0.7859  3.9581  10.0529  2.9307  1.0063  9.9118  0.8341  2.9350 

SCORE/$  0.0360  0.0914  0.0266  0.1257  1.2389  0.1042  0.3668 


3.6.9 Scale factor 10, 8 query streams, 1 server (8 servers for v7cluster)

SF10 #8/1  owlim5.3  owlim5.3  owlim5.3  v7cluster  v7cluster  v7cluster  virtuoso7  virtuoso7  virtuoso7  virtuoso7 

METHOD  basic  quad  rtree++  basic  quad  rtree  basic  quad  rtree  rtree++ 

STEP01  0.0111  0.1058  0.0179  2.4963  17.8207  3.3792  0.6484  12.6956  0.4822  0.7251 

STEP02  0.0244  0.1041  0.0554  3.6211  16.9718  4.4317  0.6449  17.6573  0.5667  1.7568 

STEP03  0.0491  0.1936  0.1601  5.1298  17.7603  5.1870  0.7271  22.5014  0.7020  2.9094 

STEP04  0.0782  0.3317  0.3211  6.6643  23.7782  6.1971  0.6475  28.0677  0.6225  3.8872 

STEP05  0.1172  0.4156  0.5781  7.6602  27.0735  6.3392  0.9747  35.9309  0.9837  5.1091 

STEP06  0.1174  0.5800  0.5869  11.6547  42.7366  7.4973  1.9773  47.5651  2.2304  4.8659 

STEP07  0.1784  0.5611  1.1142  6.3838  8.0113  3.9240  2.4544  3.6566  1.5296  2.5534 

STEP08  0.1741  0.5952  1.1449  6.9567  11.1574  4.1500  2.8321  6.4509  1.9406  2.6415 

STEP09  0.3091  1.0202  3.6585  6.7299  9.2481  5.5324  3.3316  8.8511  2.2520  5.4293 

STEP10  0.2730  1.3726  2.5702  7.6770  10.8405  5.2976  3.9193  29.6419  2.4232  4.1596 

STEP11  0.5210  3.3685  9.4047  6.3636  10.1749  6.7232  5.1006  31.5107  4.5335  9.8784 

STEP12  0.4747  2.6020  3.7997  7.1536  10.7785  6.4210  4.5376  40.4639  3.4996  5.3200 

LSCORE  0.0493  0.2356  0.1609  5.4932  22.9646  5.3245  0.8509  24.9306  0.8000  2.6639 

LSCORE/$  0.0500  0.2089  0.0484  0.1063  3.1163  0.1000  0.3329 

HSCORE  0.2943  1.2650  2.7448  6.8573  9.9619  5.2325  3.5769  14.0949  2.5205  4.4699 

HSCORE/$  0.0623  0.0906  0.0476  0.4471  1.7618  0.3150  0.5587 

SCORE  0.1205  0.5460  0.6647  6.1375  15.1252  5.2783  1.7446  18.7455  1.4200  3.4507 

SCORE/$  0.0830  0.0558  0.1376  0.0480  0.2180  2.3431  0.17751  0.4313 

3.6.10 Scale factor 10, 8 query streams, 8 replicated servers

SF10 #8/8  owlim5.3*  owlim5.3*  owlim5.3*  virtuoso7*  virtuoso7*  virtuoso7*  virtuoso7* 

METHOD  basic  quad  rtree++  basic  rtree  quad  rtree++ 

STEP01  0.0406  0.1590  0.0839  0.9335  0.7577  6.6192  1.4058 

STEP02  0.0871  0.1657  0.2590  0.8303  0.6837  13.3240  3.3235 

STEP03  0.1666  0.3649  0.8623  0.8872  0.8547  13.9034  5.4495 

STEP04  0.2774  0.2179  1.6838  0.7539  0.7429  8.8018  6.2927 

STEP05  0.4000  0.9684  2.5683  1.4292  1.4420  20.2530  8.3787 

STEP06  0.3942  0.9093  2.5656  2.1685  2.1855  31.3357  8.4095 

STEP07  0.5339  0.7689  3.8395  1.9402  1.7636  1.7182  7.8147 

STEP08  0.5470  0.6305  3.9868  2.6718  2.2173  3.5018  8.4272 

STEP09  0.8974  0.7452  11.6414  3.2697  2.5930  5.5482  16.9384 

STEP10  0.8394  0.7531  7.7257  6.0785  3.3257  26.8097  12.4050 

STEP11  1.5367  0.9532  25.5510  7.7354  5.6457  29.2505  29.1971 

STEP12  1.3430  5.4607  11.5009  7.4550  4.8039  42.9646  15.3198 

LSCORE  0.1720  0.3504  0.7699  1.0822  1.0060  13.7665  4.7334 

LSCORE/$  0.0154  0.0143  0.1966  0.0676 

HSCORE  0.8767  1.0597  8.5926  4.2533  3.1142  10.1884  13.5361 

HSCORE/$  0.0607  0.0444  0.1455  0.19337 

SCORE  0.3884  0.6093  2.5720  2.1455  1.7700  11.8431  8.0045 

SCORE/$  0.0306  0.0252  0.1691  0.1143 


3.6.11 Scale factor 100, 1 and 2 query streams, 8 partitioned servers

SF100 #1/1  v7cluster  v7cluster  v7cluster  SF100 #2/1  v7cluster  v7cluster  v7cluster 

METHOD  basic  quad  rtree  METHOD  basic  quad  rtree 

STEP01  0.1058  0.4640  0.0821  STEP01  0.1683  0.6229  0.1695 

STEP02  0.2303  0.4118  0.3703  STEP02  0.3586  0.8682  0.5513 

STEP03  0.3943  0.9395  0.5208  STEP03  0.5829  0.5552  1.0560 

STEP04  0.5587  1.0636  0.4454  STEP04  0.8644  2.5406  1.0788 

STEP05  0.7482  2.1896  0.6861  STEP05  1.1468  4.5236  1.5198 

STEP06  0.9933  2.9455  0.7742  STEP06  1.4029  6.4307  1.5022 

STEP07  0.8034  0.5584  0.2992  STEP07  1.1142  0.6679  0.6052 

STEP08  0.8792  1.0936  0.3157  STEP08  1.4387  2.1125  0.6102 

STEP09  0.9900  1.0012  0.4326  STEP09  1.6271  1.7392  0.6401 

STEP10  0.9835  1.1958  0.2905  STEP10  1.8525  2.0997  0.5107 

STEP11  1.0821  1.2656  0.5427  STEP11  1.7895  2.1142  1.0567 

STEP12  1.1686  1.3336  0.5325  STEP12  1.7769  2.2917  0.8551 

LSCORE  0.3984  1.0353  0.3943  LSCORE  0.6049  1.6760  0.7901 

LSCORE/$  0.0036  0.0094  0.0035  LSCORE/$  0.0055  0.0152  0.0071 

HSCORE  0.9770  1.0356  0.3885  HSCORE  1.5764  1.7092  0.6914 

HSCORE/$  0.0088  0.0094  0.0035  HSCORE/$  0.0143  0.0155  0.0062 

SCORE  0.6239  1.0355  0.3914  SCORE  0.9765  1.6925  0.7391 

SCORE/$  0.0056  0.0094  0.0035  SCORE/$  0.0088  0.0154  0.0067 

3.6.12 Scale factor 100, 4 and 8 query streams, 8 partitioned servers

SF100 #4/1  v7cluster  v7cluster  v7cluster  SF100 #8/1  v7cluster  v7cluster  v7cluster 

METHOD  basic  quad  rtree  METHOD  basic  quad  rtree 

STEP01  0.2992  1.3469  0.3275  STEP01  0.2387  1.6907  0.3302 

STEP02  0.5750  0.9592  0.7332  STEP02  0.5327  1.5570  0.5674 

STEP03  0.9540  2.5549  1.4347  STEP03  0.7490  2.6810  1.2959 

STEP04  1.1772  4.1456  1.4000  STEP04  1.1318  4.1801  0.9172 

STEP05  1.2345  6.470  1.4869  STEP05  1.3073  6.1331  1.3168 

STEP06  1.5035  12.354  2.0413  STEP06  1.7921  9.9537  1.6515 

STEP07  1.8901  1.8784  1.3785  STEP07  1.6881  2.1458  0.7626 

STEP08  2.3643  3.2899  1.3904  STEP08  2.1298  3.0693  0.9078 

STEP09  1.9110  2.4358  0.9731  STEP09  1.8109  2.6453  0.9261 

STEP10  2.5792  2.7515  1.0189  STEP10  2.2441  3.2538  0.8878 

STEP11  2.2809  2.8921  1.0824  STEP11  1.7612  3.0709  0.9556 

STEP12  3.2223  3.2312  1.1578  STEP12  1.7843  3.4796  1.0949 

LSCORE  0.8429  3.2084  1.0656  LSCORE  0.7951  3.4863  0.8862 

LSCORE/$  0.0076  0.0291  0.0097  LSCORE/$  0.0072  0.0317  0.0080 

HSCORE  2.3338  2.6985  1.1555  HSCORE  1.8918  2.9076  0.9173 

HSCORE/$  0.0212  0.0245  0.0105  HSCORE/$  0.0172  0.0264  0.0083 

SCORE  1.4026  2.9424  1.1096  SCORE  1.2265  3.1838  0.9016 

SCORE/$  0.0127  0.0267  0.0101  SCORE/$  0.0111  0.0289  0.0082 


4. Conclusions

In this report, we have described an evaluation of the LOD2 GeoBench on a variety of system configurations. We now draw conclusions on the following issues:

The benchmark itself. The LOD2 GeoBench is a challenging benchmark; specifically, the Instance Aggregation and Instance Retrieval Queries pose an intense workload on the system. We see that exact implementations (i.e. basic, rtree and rtree++, but not quad) have a hard time scaling the Instance Aggregation Query well at the higher zoom levels. We also see that the Instance Retrieval Query causes a dip in performance at the first zoom levels where it is used (7-9), because such retrieval queries yield many instances and access many data pages in the database subsystem. On the one hand, this tells us that the benchmark is interesting: publishing about this benchmark will put emphasis on finding better solutions to e.g. the Instance Retrieval Query, for instance by pushing the envelope in query optimization. Further, the inherent problems at the lower zoom levels may help RDF server vendors to provide better hooks for indexing and pre-computation. As ideas for a v3.0 of the benchmark, we should consider moving the switchover point from the Instance Aggregation Query to the Instance Retrieval Query to a deeper zoom level; this would be a natural reaction in a real-life application to ensure dependable latencies across queries. Further, in the future we need to test on larger data, and with many more concurrent query streams. Finally, a better analysis of the performance stability of the results is needed. Because we are working on real data, the cardinalities of the selections are not fully predictable and can vary considerably, potentially introducing noise into the benchmark scores. This could be addressed by making the query generator even more intelligent in generating query patterns, for instance by generating properly balanced, evenly distributed parameter bindings.
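The balanced-binding idea above can be sketched as stratified sampling over candidate parameter values: bucket the candidates by their (estimated) result cardinality and draw a query stream's bindings evenly across the buckets, so no stream is dominated by unusually large or small selections. This is only a minimal illustration; the cardinality estimates, bucket count, and tile names are hypothetical, not the benchmark's actual generator.

```python
import random


def balanced_bindings(candidates, cardinality, n_buckets=4, n_bindings=8, seed=42):
    """Pick parameter bindings spread evenly across cardinality buckets.

    candidates:  possible parameter values (e.g. map tiles)
    cardinality: dict value -> estimated result size for that value
    """
    ordered = sorted(candidates, key=lambda v: cardinality[v])
    size = max(1, len(ordered) // n_buckets)
    buckets = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    rng = random.Random(seed)
    picks = []
    i = 0
    while len(picks) < n_bindings:
        picks.append(rng.choice(buckets[i % len(buckets)]))  # round-robin over buckets
        i += 1
    return picks


# Hypothetical tiles whose selection sizes range from tiny to huge.
card = {f"tile{i}": 10 ** (i % 4) for i in range(16)}
bindings = balanced_bindings(list(card), card)
print(bindings)
```

With four buckets and eight bindings, every cardinality magnitude is represented twice, which stabilises per-stream workload even though individual picks remain random.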

The state of RDF database technology. The three rightmost result groups in Figure 3 are an example of the achievements in the LOD2 project, where academic research performed by CWI on columnar and vectorized query execution has measurably improved the performance of the OpenLink Virtuoso product from V6 to V7, in this case by a factor of 7, creating a competitive advantage. In general, the LOD2 GeoBench shows geographical index technology to be quite effective at the high zoom levels. The query plans do show some unexpected results, with certain quad Virtuoso V7 query plans becoming slower than in V6, which is likely down to query optimizer issues. Query optimization remains one of the biggest challenges in SPARQL query execution; in the LOD2 GeoBench this shows in faults in properly handling the disjunctive queries (the four FACET selections) and the complex quad expressions.

Even though the thinking in the RDF community may be that RDF database support is closing in on the industry readiness of relational technology, the LOD2 GeoBench shows some very significant conceptual holes. For instance, relational technology offers important physical design concepts, such as materialized views and clustered indexes, explicitly created for certain predicates. These concepts cannot be expressed in the RDF world. In a multi-resolution map situation, a relational DBA or database designer would likely develop multiple tables at multiple resolutions, and create separate (RTree) indexes for them; such tables are in effect materialized views that store precomputed expressions (like facet counts at a certain granularity). This means that a query at a low zoom level would access only the materialized view relevant for that level, which at a low zoom level could have pruned most of the detailed data (the individual lamp posts). Accessing that materialized view through its RTree index will be efficient. If all materialized views for the different resolutions were unified into one big data structure, the information for the different zoom levels would end up intermingled in the same disk blocks of the RTree, such that most of the data scanned would be irrelevant (because it belongs to a different resolution). This unifying of all data in one big bucket is exactly what the RDF model does. What is needed are mechanisms to create materialized views (maybe by constructing derived data in a special kind of triple graph) and to allow certain indexes (such as RTree) to be built separately for such a triple graph. That way the RTree will only contain relevant information. Currently, RDF database technology does not offer such database design concepts.
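The pruning effect described above can be illustrated with a toy model: keep one aggregated "view" per resolution versus one unified structure holding all resolutions intermingled in spatial order, and compare how many disk blocks a coarse-resolution scan has to touch. The 4:1 aggregation rule, block size, and sizes below are hypothetical; a real system would of course use RTrees rather than sorted lists.

```python
# Toy model of multi-resolution storage: N detailed features (level 0),
# each coarser level a 4:1 aggregation of the level below it.
detail = list(range(100_000))
views = {0: detail}
for level in (1, 2, 3):
    views[level] = views[level - 1][::4]   # keep 1 entry in 4

BLOCK = 64  # rows per disk block


def blocks_touched(entries, level):
    """Blocks that must be read to fetch every row of `level`."""
    blocks = (entries[i:i + BLOCK] for i in range(0, len(entries), BLOCK))
    return sum(1 for b in blocks if any(lvl == level for lvl, _ in b))


# Separate structure per resolution: level-3 rows stored by themselves.
separate = [(3, x) for x in views[3]]
# One unified structure: all levels intermingled in spatial (x) order.
unified = sorted(((lvl, x) for lvl, rows in views.items() for x in rows),
                 key=lambda e: e[1])

sep = blocks_touched(separate, 3)
uni = blocks_touched(unified, 3)
print(sep, uni)   # the unified layout reads far more blocks
```

Because the coarse rows are spread thinly through the unified structure, nearly every one of them lands in its own block of mostly irrelevant detail rows, while the separate view packs them densely.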


RDF geographical browsing application design: faceted browsing on large datasets needs pre-computation. There is no way a Google Maps experience can be created straight from the raw base data (triples) in a dataset. The quad approach described and benchmarked here specifically transforms the application database needs in such a way that precomputation of expressions becomes possible. In this case, the quad approach precomputes facet instance counts for all tiles, at multiple different granularities. Queries then use these precomputed counts to avoid having to go to the base data. It cannot be stressed enough that without precomputation, queries at a high zoom level would never perform well, nor would they ever produce nice-looking results (just millions of lamp posts that cannot be sensibly drawn on a map). There is also little hope that such precomputation and indexing could be arranged fully automatically. This means that application designers need to take the database design issue very seriously.
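The precomputation idea can be sketched as follows: bucket each instance into a tile per zoom level, aggregate facet counts per (zoom, tile, facet), and answer a map query from those counts alone. The tile-keying scheme, facet names, and coordinates below are illustrative assumptions, not the benchmark's actual quad encoding.

```python
from collections import Counter


def tile(lon, lat, zoom):
    """Bucket a coordinate into an integer tile at a given zoom level
    (a simplified slippy-map-style scheme; 2**zoom tiles per axis)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((90.0 - lat) / 180.0 * n)
    return min(x, n - 1), min(y, n - 1)


def precompute(instances, zooms):
    """Build counts[(zoom, tile, facet)] -> number of instances."""
    counts = Counter()
    for lon, lat, facet in instances:
        for z in zooms:
            counts[(z, tile(lon, lat, z), facet)] += 1
    return counts


def facet_counts(counts, zoom, tiles):
    """Answer an aggregation query from the precomputed counts only,
    never touching the base instances."""
    result = Counter()
    for (z, t, facet), c in counts.items():
        if z == zoom and t in tiles:
            result[facet] += c
    return dict(result)


# Hypothetical instances: (lon, lat, facet).
data = [(4.9, 52.4, "lamp_post")] * 1000 + [(4.9, 52.4, "shop")] * 10
counts = precompute(data, zooms=(2, 4, 6))
view = facet_counts(counts, zoom=4, tiles={tile(4.9, 52.4, 4)})
print(view)
```

The query touches a handful of count entries instead of the thousand-odd base instances, which is exactly the trade the quad approach makes at load time.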

The latest version of the LOD2 Geographical Browser adds significant new features that make the association between geographical information and RDF data flexible to specify. The older version, of which a screenshot is shown in Figure 1, just assumed that the geographical literal (point, polyline, polygon) would be a direct property of a facet instance subject. It is, however, also possible to associate facet instances with geographical literals over long(er) join paths. The consequence of such longer join paths is that geographical queries will experience less locality from the RTree join path, but more importantly, an interface in which such join paths could be varied flexibly at run-time would make it much more difficult to generate materialized views (such as our pre-generated quad triples). Creating a browsing interface that flexibly allows users to specify these associations, yet renders result pages in interactive time on very large-scale data, is extremely challenging (if not impossible). Another issue is whether ordinary users, accessing (RDF) data via graphical interfaces, are looking for the flexibility to associate join paths through a complex data model that is likely unknown to them. It seems more probable that if relevant complex join paths between instances and their geography exist, it would be the task of an application designer to identify them. In such a case, the materialized view functionality called for above in RDF database systems would come in handy to pre-materialize these geographies as direct properties and accelerate them in separate RTree indexes.
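Pre-materializing such a join path can be pictured as a one-off CONSTRUCT-style pass over the data: follow the path from instance to intermediate node to geographical literal, and emit a direct instance-to-geometry triple on which a spatial index could then be built. The predicate names and toy triples here are hypothetical.

```python
def materialize_geo(triples, path=("located_at", "geo"), direct="geo_direct"):
    """Collapse a two-hop join path (s -p1-> o, o -p2-> literal) into
    direct (s, direct, literal) triples. Predicate names are illustrative."""
    p1, p2 = path
    hop = {}                         # intermediate node -> its geo literal
    for s, p, o in triples:
        if p == p2:
            hop[s] = o
    return [(s, direct, hop[o]) for s, p, o in triples
            if p == p1 and o in hop]


data = [
    ("shop1", "located_at", "node7"),
    ("node7", "geo", "POINT(4.9 52.4)"),
    ("shop2", "located_at", "node9"),   # node9 has no geo literal
]
result = materialize_geo(data)
print(result)
```

Instances whose path does not resolve to a literal simply produce no derived triple; the derived triples would live in their own graph so a separate RTree could cover only them.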


5. Appendix: Configuration Details

5.1 Software

Virtuoso 6: Version 06.04.3132-pthreads for Linux, as of May 14 2012

Virtuoso 7 (for both the single-server and the cluster version): we used a development version of OpenLink Virtuoso Universal Server, Version 07.00.3203-pthreads for Linux, as of Aug 18 2013

Owlim: Owlim-SE Version 5.3.6156; Tomcat Version 7.0.30

5.2 Hardware

We used the CWI Scilens (www.scilens.org) cluster for the benchmark experiment. This cluster is designed for high I/O bandwidth, and consists of multiple layers of machines. In order to get large amounts of RAM, we used only the "bricks" layer, which contains its most powerful machines. The machines were connected by a Mellanox MCX353A-QCBT ConnectX3 VPI HCA card (QDR IB 40Gb/s and 10GigE) through an InfiniScale IV QDR InfiniBand switch (Mellanox MIS5025Q). Each machine has the following specification.

Hardware: (8 machines)
- Processors: 2x Intel(R) Xeon(R) CPU E5-2650, 2.00GHz (8 cores & hyperthreading), Sandy Bridge architecture
- Memory: 256 GB
- Hard Disks: 3x 1.8TB (7,200 rpm) SATA in RAID 0 (180 MB/s sequential throughput)

Software:
- Operating System: Linux version 3.3.4-3.fc16.x86_64
- Filesystem: ext4
- Java Version and JVM: Version 1.6.0_31, 64-Bit Server VM (build 20.6-b01)

The total cost of this configuration was EUR 70,000 when acquired in 2012.

5.3 Configuration files

Virtuoso 6 & Virtuoso 7 & V7 cluster

Each database has a virtuoso.ini file as its configuration file. For the cluster version, in addition to the virtuoso.ini file, there are three other configuration files in each node: cluster.ini, virtuoso.global.ini, clusterglobal.ini.

- The virtuoso.ini file reads:

[Database]

DatabaseFile = virtuoso.db

TransactionFile = virtuoso.trx

ErrorLogFile = virtuoso.log

ErrorLogLevel = 7

FileExtend = 200

Striping = 0


Syslog = 0

;

; Server parameters

;

TempStorage = TempDatabase

[Parameters]

ServerPort = 1113

ServerThreads = 100

AsyncQueueMaxThreads = 50

ThreadsPerQuery = 32

CheckpointInterval = 120

NumberOfBuffers = 6000000

MaxDirtyBuffers = 450000

MaxCheckpointRemap = 2500000

DefaultIsolation = 2

MaxMemPoolSize = 40000000

StopCompilerWhenXOverRunTime = 1

AdjustVectorSize = 1

IndexTreeMaps = 64

FDsPerFile = 4

UnremapQuota = 0

CaseMode = 2

AllowOSCalls = 1

SafeExecutables = ../../bin/isql

Debug = 0

SQLOptimizer = 1

CallstackOnException = 0

PlDebug = 0

DirsAllowed = /,., ../../vad,../../dataset

MaxVectorSize = 500000

[HTTPServer]

ServerPort = 8892

ServerThreads = 30

ServerRoot = .

FTPServerPort = 10565

FTPServerAnonymousLogin = 1

FTPServerTimeout = 1200

[AutoRepair]

BadParentLinks = 0

BadDTP = 0

[Client]


SQL_QUERY_TIMEOUT = 0

SQL_TXN_TIMEOUT = 0

SQL_PREFETCH_ROWS = 100

SQL_PREFETCH_BYTES = 16000

[VDB]

ArrayOptimization = 0

NumArrayParameters = 10

[TempDatabase]

DatabaseFile = virtuoso.tdb

TransactionFile = virtuoso.ttr

FileExtend = 200

[Replication]

ServerName = virt6565

ServerEnable = 1

QueueMax = 50000

[URIQA]

DefaultHost = localhost.localdomain:13565

LocalHostNames = localhost:13565, master:13565, 10.1.1.1:13565

LocalHostMasks = master_.iv.dev.null:13565, master_:13565

Note that in each cluster node, the server port is different.

- The virtuoso.global.ini in node 1 (the master node) reads:

[Parameters]
MaxQueryMem = 30G
MaxVectorSize = 1000000
Affinity = 1-7 16-23
ListenerAffinity = 0

[Flags]
enable_subscore = 0
dfg_empty_more_pause_msec = 100
dfg_max_empty_mores = 100000
qp_thread_min_usec = 100
cl_dfg_batch_bytes = 100000000
enable_high_card_part = 1
enable_vec_reuse = 1
mp_local_rc_sz = 0
dbf_explain_level = 3
enable_feed_other_dfg = 1
enable_cll_nb_read = 1
dbf_no_sample_timeout = 1

Note that most of the parameters in the virtuoso.global.ini are the same for every node, except "Affinity", which is 9-15 24-31 for the nodes at even index (e.g., node 2, node 4, etc.).
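The per-node Affinity setting follows a simple alternation, so it can be generated rather than hand-edited. A small sketch, assuming the two masks alternate by node index exactly as the note describes:

```python
def affinity(node_index):
    """Return the Affinity mask for a cluster node: odd-indexed nodes
    use "1-7 16-23" (as in the master node's file above), even-indexed
    nodes use "9-15 24-31"."""
    return "1-7 16-23" if node_index % 2 == 1 else "9-15 24-31"


for n in range(1, 5):
    print(n, affinity(n))
```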

- The cluster.ini in node 1 (the master node) reads:


[Cluster]

Threads = 200

Master = Host1

ThisHost = Host1

ReqBatchSize = 10000

BatchesPerRPC = 4

BatchBufferBytes = 20000

LocalOnly = 2

MaxKeepAlivesMissed = 3000

[ELASTIC]

Slices = 16

Segment1 = 1024, cl1/cl1.db = q1

Note that only the "ThisHost" parameter is changed for the other nodes; "Slices = 16" appears only in the master node.

- The clusterglobal.ini reads:

[Cluster]

Threads = 200

Master = Host1

ReqBatchSize = 10000

BatchesPerRPC = 4

BatchBufferBytes = 20000

LocalOnly = 2

MaxKeepAlivesMissed = 2000

Host1 = 192.168.64.203:22201

Host2 = 192.168.64.203:22202

Host3 = 192.168.64.204:22203

Host4 = 192.168.64.204:22204

Host5 = 192.168.64.209:22205

Host6 = 192.168.64.209:22206

Host7 = 192.168.64.207:22207

Host8 = 192.168.64.207:22208

Host9 = 192.168.64.205:22209

Host10 = 192.168.64.205:22210

Host11 = 192.168.64.211:22211

Host12 = 192.168.64.211:22212

Host13 = 192.168.64.212:22213

Host14 = 192.168.64.212:22214

Host15 = 192.168.64.213:22215

Host16 = 192.168.64.213:22216


Owlim
- The getting-started application was used for bulk-loading. The owlim.ttl file in the getting-started application reads:

# Sesame configuration template for a owlim repository

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

@prefix rep: <http://www.openrdf.org/config/repository#>.

@prefix sr: <http://www.openrdf.org/config/repository/sail#>.

@prefix sail: <http://www.openrdf.org/config/sail#>.

@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;

rep:repositoryID "owlim" ;

rdfs:label "OWLIM Getting Started" ;

rep:repositoryImpl [

rep:repositoryType "openrdf:SailRepository" ;

sr:sailImpl [

sail:sailType "owlim:Sail" ;

owlim:owlim-license "OWLIM_SE_01092013_128cores.license" ;

owlim:entity-index-size "500000000" ;

owlim:repository-type "file-repository" ;

owlim:ruleset "empty" ;

owlim:storage-folder "owlim-storage" ;

owlim:transaction-mode "fast" ;

# OWLIM-SE parameters

owlim:cache-memory "120G" ;

# OWLIM-Lite parameters

owlim:noPersist "false" ;

]

].

- The example.sh script in the getting-started application reads:

foo=`pwd`

cd ..

. ./setvars.sh

cd $foo

#$JAVA_HOME/bin/java -Xmx512m -cp "bin:$CP_TESTS" GettingStarted $*

$JAVA_HOME/bin/java -d64 -Xmx200G -Xms160G -Dcache-memory=100G -Ddisable-plugins=rdfpriming -cp "bin:$CP_TESTS" GettingStarted context= $*


- Owlim databases have a Sesame template file in ~/.aduna/openrdf-sesame-console/templates/. The Sesame template file reads:

#

# Sesame configuration template for an OWLIM-SE repository

#

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

@prefix rep: <http://www.openrdf.org/config/repository#>.

@prefix sr: <http://www.openrdf.org/config/repository/sail#>.

@prefix sail: <http://www.openrdf.org/config/sail#>.

@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;

rep:repositoryID "olgeo10" ;

rdfs:label "OWLIM Geo 10" ;

rep:repositoryImpl [

rep:repositoryType "openrdf:SailRepository" ;

sr:sailImpl [

sail:sailType "owlim:Sail" ;

owlim:owlim-license "OWLIM_SE_01092013_128cores.license" ;

owlim:base-URL "{%Base URL|http://example.org/owlim#%}" ;

owlim:defaultNS "{%Default namespaces for imports(';' delimited)%}" ;

owlim:entity-index-size "{%Entity index size|200000%}" ;

owlim:entity-id-size "{%Entity ID bit-size|32%}" ;

owlim:imports "{%Imported RDF files(';' delimited)%}" ;

owlim:repository-type "{%Repository type|file-repository%}" ;

owlim:ruleset "{%Rule-set|owl-horst-optimized%}" ;

owlim:storage-folder "{%Storage folder|storage%}" ;

owlim:enable-context-index "{%Use context index|false%}" ;

owlim:cache-memory "50G" ;

owlim:tuple-index-memory "{%Main index memory|80m%}" ;

owlim:enablePredicateList "{%Use predicate indices|false%}" ;

owlim:predicate-memory "{%Predicate index memory|0%}" ;

owlim:fts-memory "{%Full-text search memory|0%}" ;

owlim:ftsIndexPolicy "{%Full-text search indexing policy|never%}" ;

owlim:ftsLiteralsOnly "{%Full-text search literals only|true%}" ;

owlim:in-memory-literal-properties "{%Cache literal language tags|false%}" ;

owlim:enable-literal-index "{%Enable literal index|true%}" ;

owlim:index-compression-ratio "{%Index compression ratio|-1%}" ;


owlim:check-for-inconsistencies "{%Check for inconsistencies|false%}" ;

owlim:disable-sameAs "{%Disable OWL sameAs optimisation|false%}" ;

owlim:enable-optimization "{%Enable query optimisation|true%}" ;

owlim:transaction-mode "{%Transaction mode|safe%}" ;

owlim:transaction-isolation "{%Transaction isolation|true%}" ;

owlim:query-timeout "{%Query time-out (seconds)|0%}" ;

owlim:query-limit-results "{%Limit query results|0%}" ;

owlim:throw-QueryEvaluationException-on-timeout "{%Throw exception on query time-out|false%}" ;

owlim:useShutdownHooks "{%Enable shutdown hooks|true%}" ;

owlim:read-only "{%Read-only|false%}" ;

]

].

5.4 Bulk Load

Virtuoso 6

- Bulk-loading was run with only a single loading process. Bulk-loading for scale 1 with Virtuoso 6 takes 3h 35m.

11:41:47 PL LOG: Loader started

13:43:31 Checkpoint started

13:45:27 Checkpoint finished, log reused

15:16:20 PL LOG: No more files to load. Loader has finished

Virtuoso 7
- Bulk-loading was run with 14 loading processes in parallel. For example, bulk-loading for scale 10 with Virtuoso 7 takes 1h and 32 minutes.

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

01:38:19 PL LOG: Loader started

02:30:24 PL LOG: No more files to load. Loader has finished,

02:31:02 PL LOG: No more files to load. Loader has finished,

02:31:58 PL LOG: No more files to load. Loader has finished,

02:32:29 PL LOG: No more files to load. Loader has finished,


02:33:47 PL LOG: No more files to load. Loader has finished,

02:36:08 PL LOG: No more files to load. Loader has finished,

02:39:10 PL LOG: No more files to load. Loader has finished,

02:40:21 PL LOG: No more files to load. Loader has finished,

02:40:21 PL LOG: No more files to load. Loader has finished,

02:45:59 PL LOG: No more files to load. Loader has finished,

02:46:06 PL LOG: No more files to load. Loader has finished,

02:47:06 PL LOG: No more files to load. Loader has finished,

02:47:39 PL LOG: No more files to load. Loader has finished,

03:10:47 PL LOG: No more files to load. Loader has finished,

V7 cluster
- Bulk-loading was run with 2 loading processes in each node (thus, 32 loading processes in all 16 nodes). For example, bulk-loading for scale 100 with the V7 cluster takes 5h and 11 minutes in the master node.

17:13:50 PL LOG: Loader started

17:16:07 PL LOG: Loader started

22:24:00 PL LOG: No more files to load. Loader has finished

22:24:02 PL LOG: No more files to load. Loader has finished

Owlim
- Bulk-loading was run by using the getting-started application. The dataset is copied to the preload directory and then loaded into the owlim repository by running the script example.sh. For example, bulk-loading for scale 1 with Owlim takes 1h 25m.

18:34:40 ===== Load Files (from the 'preload' parameter) ==========

18:34:40 Loading files from: /scratch/duc/lod2/GeoBench/owlim/owlim-se-5.3.6156/getting-started/./preload

Loading FacetCount12.nt 373566 statements

Loading FacetCount14.nt . 731376 statements

Loading FacetCount16.nt .. 1380933 statements

Loading FacetCount18.nt ..... 2563677 statements

Loading FacetCount20.nt ......... 4681485 statements

Loading FacetCount22.nt ................ 8342163 statements

Loading FacetCount24.nt ............................ 14383152 statements

Loading FacetMap12.nt ......... 4627860 statements

Loading FacetMap14.nt ................ 8364460 statements

Loading FacetMap16.nt ............................. 14697532 statements

Loading FacetMap18.nt ................................................. 24647112 statements

Loading FacetTile.nt ........................................................ 28258360 statements

Loading LGD-Dump-Ontology.nt 8721 statements

Loading refined_LGD-Dump-RelevantNodes.sorted.nt .......................................................................................................................................... 69495409 statements

Loading refined_LGD-Dump-RelevantWays.sorted.nt ........................................................................................................................................ 68475661 statements

19:59:54 TOTAL: 251031467 statements loaded


5.4.1 Sizing

Virtuoso 6 & Virtuoso 7

The database size is computed by measuring the size of virtuoso.* in each database directory. For example, the database of scale 10 is measured:

[duc@bricks05 10gindex]$ ls -al -h virtuoso.*

-rw-r--r-- 1 duc ins1 108G Aug 21 12:03 virtuoso.db

-rwxrwxr-x 1 duc ins1 1.8K Aug 6 16:58 virtuoso.ini

-rw-r--r-- 1 duc ins1 14K Aug 21 12:03 virtuoso.log

-rw-r--r-- 1 duc ins1 0 Aug 1 01:34 virtuoso.pxa

-rw-r--r-- 1 duc ins1 14M Aug 21 02:54 virtuoso.tdb

-rw-r--r-- 1 duc ins1 0 Aug 21 12:03 virtuoso.trx

V7 Cluster

The database size is computed by summing the sizes of the database directories in each node. For example, the database of scale 100 is measured:

du -s -h /scratch/duc/lod2/cg100/*/cl*/

Database size at the node bricks03

88G /scratch/duc/lod2/cg100/01/cl1/

81G /scratch/duc/lod2/cg100/02/cl2/

Database size at the node bricks04

74G /scratch/duc/lod2/cg100/03/cl3/

73G /scratch/duc/lod2/cg100/04/cl4/

Database size at the node bricks09

71G /scratch/duc/lod2/cg100/05/cl5/

71G /scratch/duc/lod2/cg100/06/cl6/

Database size at the node bricks07

69G /scratch/duc/lod2/cg100/07/cl7/

93G /scratch/duc/lod2/cg100/08/cl8/

Database size at the node bricks05

67G /scratch/duc/lod2/cg100/09/cl9/

69G /scratch/duc/lod2/cg100/10/cl10/

Database size at the node bricks11

66G /scratch/duc/lod2/cg100/11/cl11/

68G /scratch/duc/lod2/cg100/12/cl12/

Database size at the node bricks12

69G /scratch/duc/lod2/cg100/13/cl13/

76G /scratch/duc/lod2/cg100/14/cl14/

Database size at the node bricks13

69G /scratch/duc/lod2/cg100/15/cl15/

72G /scratch/duc/lod2/cg100/16/cl16/

Owlim

The database size is computed by measuring the size of the created repository. For example, the database size of scale 1 is measured:

[duc@bricks13 data]$ du -s -h openrdf-sesame/repositories/olgeo1

24G openrdf-sesame/repositories/olgeo1


5.4.2 Bulk Load Script

Virtuoso 6 and Virtuoso 7

The bulk loading script for Virtuoso is applied on an empty database. First, register_load_files.sql is run to register the list of files to load. Then the loading process is run using the script rdfload.sh. For Virtuoso 7, 14 "rdf_loader_run()" calls were executed.

isql 1113 dba dba < register_load_files.sql

./rdfload.sh

[duc@bricks13 10gindex]$ cat register_load_files.sql

ld_dir ('/scratch/duc/lod2/GeoBench/datasets/10geoindex/', '%.gz', 'http://GeoBench.org');

[duc@bricks13 10gindex]$ cat rdfload.sh

echo "Start loading "

date

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

isql 1113 dba dba exec="rdf_loader_run();" &

wait

isql 1113 dba dba exec="checkpoint;"

echo "end loading"

date

V7 Cluster

The dataset files are equally divided over the machines. In each machine, register_load_files_GEO.sql is used to register the list of files to load on that machine.

ssh bricks03 "isql 1113 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks04 "isql 12203 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks09 "isql 12205 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks07 "isql 12207 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks05 "isql 12209 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks11 "isql 12211 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks12 "isql 12213 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

ssh bricks13 "isql 12215 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"

[duc@bricks05 /]$ cat /scratch/duc/lod2/cg100/register_load_files_GEO.sql

ld_dir ('/scratch/duc/lod2/cg100/datasetg100_5', '%.gz', 'http://linkedgeodata.org');

Then the master node starts the loading process in all the nodes.


cl_exec (' rdf_ld_srv ()' ) &

cl_exec (' rdf_ld_srv ()' ) &

Owlim

The bulk loading is done by calling example.sh in the getting-started application.

cd /scratch/duc/lod2/GeoBench/owlim/owlim-se-5.3.6156/getting-started

./example.sh