Collaborative Project
LOD2 – Creating Knowledge out of Interlinked Data

Deliverable 5.1.4
LOD2 GeoBench v2.0 Evaluation

Dissemination Level: Public
Due Date of Deliverable: Month 36, 31/08/2013
Actual Submission Date: Month 36, 31/08/2013
Work Package: WP5 – Linked Data Browsing, Visualization and Authoring Interfaces
Task: T5.1
Type: Report
Approval Status: Approved
Version: 1.0
Number of Pages: 49
Filename: LOD2_D5_1_4_GEO_Benchmark_Evaluation.pdf

Abstract: This report describes the evaluation of the LOD2 GeoBenchmark, developed to ensure that RDF storage engines provide the proper level of functionality and performance to facilitate the needs of Linked Data Browsing, Visualization and Authoring Interfaces.

The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Project co-funded by the European Commission within the Seventh Framework Programme (2007–2013)
Project Number: 257943   Start Date of Project: 01/09/2010   Duration: 48 months
D5.1.4 – v1.0
History

Version  Date        Reason                            Revised by
0.2      07/08/2013  Initial Draft (incomplete)        Peter Boncz
0.9      25/08/2013  Complete Draft                    Peter Boncz
1.0      29/08/2013  Complete version after comments   Duc Minh Pham
1.1      30/08/2013  Minor edits, correction of typos  Peter Boncz
Author List

Organisation  Name         Contact Information
CWI           Peter Boncz  [email protected]
Executive Summary

This report gives an account of evaluating the LOD2 GeoBench – as previously developed in D5.1.2 – on a number of different RDF database systems. The LOD2 GeoBench aims to test the functionality and performance of RDF stores used in Linked Data Browsing, Visualization and Authoring Interfaces.

This benchmark is not intended as a purely scientific deliverable; it is rather focused on addressing practical challenges in the Geo Browsing components, as developed by the University of Leipzig (browser.linkedgeodata.org). In particular, it highlights performance problems encountered when laying out linked objects on a map, which may have highly different zoom levels. The performance challenge is making sure that performance always remains interactive, irrespective of the zoom level or facet selections.

This report coincides with the open-source release of v2.0 of the LOD2 GeoBench. The evaluation presented here goes beyond the one at the initial specification in D5.1.2, which was run on just one system (an alpha pre-release version of Virtuoso 7). Here we add benchmarking on multiple systems, on large data sizes (scale factor 100) and using cluster hardware, instead of just a single machine.

The overall message coming out of these experiments is that to create high-performance (interactive) geospatial faceted browsing interfaces, specific pre-computation and indexing effort is needed (this is embodied by the "quad" implementation). This means that, on the one hand, application designers need to think about their data access strategy. On the other hand, more hooks for physical tuning are needed in RDF database systems to make this possible.
Tool             Purpose                  Address
SPARQL endpoint  Execute SPARQL queries   http://lod.openlinksw.com/sparql
Web Service API  REST Interface           http://lod.openlinksw.com/fct/service
Facet Browser    Text Search and lookups  http://lod.openlinksw.com/fct
Abbreviations and Acronyms

Acronym   Explanation
LOD       Linked Open Data
GeoJSON   Geographic JavaScript Object Notation
GFM       General Feature Model (as defined in ISO 19109)
GML       Geography Markup Language
KML       Keyhole Markup Language
OWL       Web Ontology Language
RCC       Region Connection Calculus
RDF       Resource Description Framework
RDFS      RDF Schema
RIF       Rule Interchange Format
SPARQL    SPARQL Protocol and RDF Query Language
WKT       Well Known Text (as defined by Simple Features or ISO 19125)
W3C       World Wide Web Consortium (http://www.w3.org/)
XML       eXtensible Markup Language
OGC       Open Geospatial Consortium
LGD       Linked Geodata Browser (http://browser.linkedgeodata.org)
OSM       OpenStreetMap
LGB       LOD2 GeoBench (defined in this document)
Table of Contents

1. Introduction
   1.1 Outline
2. Benchmark
   2.1 Goals
   2.2 Dataset
   2.2.2 Query Workload
   2.2.3 Benchmark Metrics
   2.2.4 Benchmark Programs
   2.3 Benchmark Implementations
   2.3.1 Basic Implementation
   2.3.2 RTree and RTree++ Implementations
   2.3.3 Quad Implementation
3. Evaluation
   3.1 Hardware Platform
   3.2 RDF Database Systems Tested
   3.3 Loading Results
   3.4 Overall Benchmark Results
   3.5 Detailed Query Performance Results
   3.6 Full Query Performance Results
4. Conclusions
5. Appendix: Configuration Details
1. Introduction

Geographic information management is a generally well-understood task in data management. Relational database systems technologically support geographical data, sometimes by incorporating multi-dimensional indexing structures like the RTree, or by using simple uni-dimensional BTrees (in conjunction with a space-filling curve). In RDF data management, many RDF stores support spatial data management, providing functions to test geospatial predicates, sometimes technologically supported by data structures such as the RTree. These system-specific extensions are being replaced by general adoption of the GeoSPARQL standard proposed by the Open Geospatial Consortium. As such, application development and deployment where the data involves geography should be supportable with RDF database systems. This activity in LOD2 puts that to the test.

In the past deliverable D5.1.2, a new database and application benchmark for faceted geographic querying was introduced, called the LOD2 GeoBench (v1.0). The underlying goal for creating this benchmark is to improve the user experience for the Geospatial Browser developed by AKSW in the context of the LOD2 project (browser.linkedgeodata.org), both by influencing the design of the application and by measuring and improving the raw power of geographical query execution in RDF database systems.

In this deliverable we report on a series of experiments running the LOD2 GeoBench on four different systems: OWLIM 5.3, OpenLink Virtuoso V6 (open source), OpenLink Virtuoso V7 (open source) and OpenLink Virtuoso V7 Cluster Edition. The hardware platform used was the SCILENS database compute cluster at CWI. This hand-built cluster consists of three different layers of nodes, of which we used the highest "bricks" layer, built out of 16 large servers (16 cores, 256GB RAM). This same platform was used to create the record-breaking runs with 150 billion triples on the BSBM Explore and Business Intelligence benchmarks (see deliverable D2.1.4 and footnote 1).
1.1 Outline

In Section 2, we describe the LOD2 GeoBench benchmark in its v2.0 version, released in open source in conjunction with this deliverable. The benchmark can currently be implemented by RDF database systems in four different ways (basic, rtree, rtree++ and quad), which we describe in detail.

In Section 3, we provide and discuss the results of running the benchmark at scale factors 1, 10 and 100 on the platforms described above. When using the "quad" implementation, which provides imprecise answers, RDF database systems turn out to be capable of sustaining tens of concurrent client requests simultaneously on a single machine. Considering that real users of the Geospatial Browser would use significant think time in between queries, this means that a single machine could support hundreds of concurrent users. If precise answers are required, these experiments show that RDF-based geographical support ("rtree++") provides high performance on queries that are moderately to strongly zoomed in, while queries on large geographical areas (zoomed out) would still have low performance – though it is evident that this problem cannot be eliminated inside RDF database systems; only application redesign can overcome it. In all, the experimental results show clear improvements over the situation 18 months ago, as documented in D5.1.2.

In Section 4 we make some forward-looking statements and recommendations, both for application design in geographical faceted browsing, as well as on the side of RDF database technology. In short, application designers should think ahead and create additional (indexing) data structures, in order to ensure interactive performance at all times. Such physical database design is very common in relational database systems, but almost completely undeveloped in RDF database systems. On their part, RDF systems should expose more features to enable such additional (indexing) opportunities.
1 http://lod2.eu/BlogPost/1584-big-data-rdf-store-benchmarking-experiences.html
2. Benchmark

2.1 Goals

The LOD2 GeoBench is an RDF database/application benchmark for faceted geographical querying. In particular, its queries use a combination of geographical selection and grouping and counting by facets. Such faceted querying in its mainstream use (outside RDF, e.g. using relational technology) is known to be a hard problem. The problem is that grouping and counting by the facet requires a lot of computational effort if there are many facet instances qualifying the selection, yet due to the infinite number of possible selection predicates it is hard to prepare the system for this. Thus, queries involving millions of instances must really group and count millions of tuples (or triples), and making this part of an interactive system that should render a result screen within 0.2 seconds is a challenge. Also, faceted browsing servers on the web may be used by many clients simultaneously. As such, the database system answering the queries should be capable of providing this interactive experience to many users at the same time.

The goal of the LOD2 GeoBench result metric (queries per second per $) is to highlight the performance and architecture problems faced by the Linked Geodata Browser application (browser.linkedgeodata.org), which is being developed at the University of Leipzig as part of the LOD2 project. Specifically, it is intended (i) to stimulate technical progress in RDF database technology, improving both the query execution and query optimization support for geographical queries in SPARQL backends, and (ii) to stimulate thinking about a possible redesign of RDF-based applications like the Linked Geodata Browser. This suggestion for redesign points toward an opportunity for physical RDF database design, where for specific access patterns and queries, the application architect and DBA could decide to pre-create certain indexes and materialized views (note that this is phrased in relational database terms; in practice this could take the form of additional synthetic triples).

The LOD2 GeoBench was developed as deliverable D5.1.2 in the LOD2 project, 18 months earlier. Coinciding with this report, we have released version v2.0 of the benchmark, whose software and documentation is available in open source:

http://svn.aksw.org/lod2/LOD2-GeoBench

We therefore continue with a recap of the benchmark design and description, including a description of what has changed in v2.0.
2.2 Dataset

The dataset used by LOD2 GeoBench is the RDF-ized OpenStreetMap (OSM) dataset provided by the AKSW group at the University of Leipzig. The bulk of this dataset consists of 6M points (Relevant Nodes) and 3.8M polygons (Relevant Ways).

http://downloads.linkedgeodata.org/releases/2011-04-06/

10M dataset statistics  ASCII size  #triples  #points  #polygons
Ontology                1.2MB       8K
Relevant Nodes          10GB        66M       6M
Relevant Ways           10GB        65M       60M      3.8M
DBpedia Interlinks      14MB        101K
GeoNames Interlinks     60MB        487K

We call this core dataset the SF1 dataset. It contains roughly 10M geographic objects. The number of triples (130M) is significantly higher, and the uncompressed size in bytes is 20GB. For practical benchmarking, we opt for synthetic scaling of this core dataset. Not only does this not depend on the availability of additional resources at AKSW, or on the question whether creating larger subsets of OSM in RDF makes sense, but it also makes sure the geographical characteristics of the data remain equal at all benchmark scales. This makes it easier to interpret benchmark results at different scales.
The benchmark therefore scales this core of real data to any cardinal factor x*SF by copying all triples in all datasets x times, appending the string "_y" (for all y: 0<y<x) to all URIs starting with http://linkedgeodata.org/. This means we get many more facets in the Ontology, and every facet is duplicated x times in the dataset, belonging to new copies of the instances. This kind of scaling is highly similar to the one proposed in the DBpedia benchmark, and mimics what would happen if more properties of OpenStreetMap were included in the http://linkedgeodata.org/ dump.
The v1.0 version of the LOD2 GeoBench would just make the y copies of the same data instance with different subject URIs, replicating the data. The geographic feature (point, polygon, polyline) would be identical among the copies. This replication strategy backfires in systems that only create RTree geographical search accelerator structures on the unique set of literals – Virtuoso being such an example. That is, because the geographic features were copied and remained equal, the unique set of geographic literals would not grow, and hence the size of the RTree would not grow.
The v2.0 version of the LOD2 GeoBench, now released, changes the scaling procedure to shift each replicated geographical feature by a tiny random (lat,long) delta (encompassing a few meters). This way, all geographical features are unique, yet the set of such features is still realistic in its size and position distribution. This was the main reason to start with a "real" core dataset in the first place, since it is very hard to create synthetic randomly generated geographical data that "makes sense" and conforms to real-world distributions.
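The v2.0 scaling procedure can be sketched as follows. This is an illustrative Python rendering only, not the actual data generator (which is the `geoscale.sh` tool described later); the function name, the N-Triples line-based approach, and the jitter magnitude of roughly 3e-5 degrees (a few meters) are assumptions for the sake of the example.

```python
import random
import re

def scale_triple(line, y, jitter_deg=3e-5):
    """Produce copy y of one N-Triples line, following the v2.0 scheme:
    suffix linkedgeodata.org URIs with _y and shift coordinate literals
    by a tiny random delta so every replicated feature becomes unique."""
    # Append "_y" to every URI starting with http://linkedgeodata.org/
    line = re.sub(r'<(http://linkedgeodata\.org/[^>]*)>',
                  lambda m: '<%s_%d>' % (m.group(1), y), line)
    # Shift lat/long literals by a random delta of a few meters
    def shift(m):
        return '"%f"' % (float(m.group(1)) + random.uniform(-jitter_deg, jitter_deg))
    if 'wgs84_pos#lat' in line or 'wgs84_pos#long' in line:
        line = re.sub(r'"(-?\d+\.\d+)"', shift, line)
    return line
```

Running this for all y in 1..x-1 over all core dataset files yields the scaled dataset, while keeping the positional distribution of the real OSM data intact.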
Since April 2011, there have been new releases of the core dataset in April and August 2013, which contain roughly the same data, but actualized from OpenStreetMap, and split in the raw triple data files by data facet category (they used to be together). However, in the LOD2 GeoBench v2.0 we have not moved to this new core dataset. The rationale has been to keep v1.0 and v2.0 of LOD2 GeoBench as compatible as possible. Having (only slightly) more triples and having them actualized from OpenStreetMap is of limited value for our purposes here. It is, however, possible that a future version of this benchmark will start using new LinkedGeoData dataset releases, if alone for the reason that the benchmark specification relies on the data release being online and downloadable.
2.2.1.1 Bulk Load

The benchmark starts by creating a new database, starting up the database server, and loading the full dataset into the database system (including possibly added triples in the data preparation step).

The full disclosure of a LOD2 GeoBench result consists of:

1. the elapsed time until all bulk-loading has finished;
2. the size in megabytes of the resulting database files on disk;
3. all relevant DBMS configuration files;
4. scripts containing all commands used for bulk-loading.
2.2.2 Query Workload

The LOD2 GeoBench workload mimics a browsing user in a query run. A query run, based on a random seed, deterministically picks 10 center points, and executes 12 steps, each step consisting of two queries: the Facet Count Query (FCQ) and an Instance Retrieval Query (IRQ) or an Instance Aggregation Query (IAQ). Thus the workload in total consists of 240 queries. The sequence of 12 steps is as follows:

1. display map at zoom level 0 at a center point (FCQ1 + IAQ1)
2. zoom to level 1 at the same center point (FCQ2 + IAQ2)
3. zoom to level 2 at the same center point (FCQ3 + IAQ3)
4. zoom to level 3 at the same center point (FCQ4 + IAQ4)
5. zoom to level 4 at the same center point (FCQ5 + IAQ5)
6. pan 1/8 width east at zoom level 4 (FCQ6 + IAQ6)
7. zoom to level 5 at the same center (FCQ7 + IRQ1)
8. pan 1/4 height north at zoom level 5 (FCQ8 + IRQ2)
9. zoom to level 6 (FCQ9 + IRQ3)
10. pan 1/2 width west at zoom level 6 (FCQ10 + IRQ4)
11. zoom to level 7 (FCQ11 + IRQ5)
12. pan one height south at zoom level 7 (FCQ12 + IRQ6)
The power workload executes a query run directly after data load. It is immediately followed by the throughput workload. In the power workload, the queries in the query run are executed purely one after the other. In the throughput workload, multiple query runs (generated with different parameters) run concurrently on the system. The typical concurrency levels to test are 2, 4, 8 and 16.
2.2.2.1 Facet Count Query (FCQ)

The Linked Geodata Browser displays an overview with the count per facet of the objects in the visible window. This is an aggregation query that counts all occurrences of each facet in the query window, be it a currently selected (active) facet or not. The query parameters here are the query center point (LATITUDE, LONGITUDE) and the window HEIGHT and WIDTH in degrees.
2.2.2.2 Instance Retrieval Query (IRQ)

The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets. This is a pure selection query (rectangular geographic window and facets); there is no grouping or aggregation involved. In addition to the parameters LATITUDE, LONGITUDE, HEIGHT and WIDTH, this query hence also receives four URI parameters FACET1, FACET2, FACET3, FACET4 identifying the facets of interest.
Figure 1: The Linked Geodata Browser mis-handling situations with too many results: queries get disabled (info windows) and certain parts of the screen exhibit information overflow.
2.2.2.3 Instance Aggregation Query (IAQ)

At the lower zoom levels, when a very large area fits in the window, the sheer amount of results can cause performance and usability problems. For instance, try to imagine visualizing all street lights in all of Germany as markers on a map on a computer screen. This would mean that millions of lamppost icons need to be placed on the screen, which does not even have enough pixels for that. The resulting drawing is bound to be judged as convoluted by average users. Further, even to arrive at such a drawn map is a performance challenge, since the query returns many results, which need to be processed (and, depending on the architecture of the application, might also need to be sent to the client, e.g. a web browser).

The Instance Aggregation Query deals with the problem of too many instances by summarizing the instances geographically. This query is used in the LOD2 GeoBench instead of the Instance Retrieval Query on the first four zoom levels (the first six steps). For this purpose, it divides the map into 40x20 conceptual square tiles, and allows just one marker per active facet inside one tile. It counts how many instances fall in a tile, and it displays the most relevant marker in a tile (in the benchmark, we do not really choose the most relevant marker, but choose the one with the largest subject URI – i.e. a random one) and a count of occurrences.

Note that the Instance Aggregation Query delivers something that could alternatively be shown as a "heat map", rather than the summary markers with an occurrence count inside them, as suggested.
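The per-tile summarization described above can be sketched in a few lines of Python. This is only an illustration of the IAQ semantics (in the benchmark the aggregation happens inside the database via SPARQL); the function name and the in-memory instance tuples are assumptions for the example.

```python
from collections import defaultdict

def aggregate_instances(instances, lat, lon, height, width, rows=20, cols=40):
    """Summarize (uri, facet, lat, lon) instances into per-tile markers, as
    the IAQ does: the window is split into 40x20 tiles; per (tile, facet)
    we keep a count and one marker, the lexically largest subject URI."""
    tiles = defaultdict(lambda: (0, ''))   # (row, col, facet) -> (count, marker)
    lat0, lon0 = lat - height / 2, lon - width / 2
    for uri, facet, a, o in instances:
        if not (lat0 <= a <= lat0 + height and lon0 <= o <= lon0 + width):
            continue                        # instance outside the query window
        row = min(rows - 1, int((a - lat0) / height * rows))
        col = min(cols - 1, int((o - lon0) / width * cols))
        cnt, marker = tiles[(row, col, facet)]
        tiles[(row, col, facet)] = (cnt + 1, max(marker, uri))
    return dict(tiles)
```

The result contains at most 40x20 markers per facet regardless of how many instances qualify, which is what keeps the low-zoom steps renderable.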
2.2.2.4 Query Parameters

Center Point. The randomly generated queries in the LOD2 GeoBench workload use bounding boxes centered near (not exactly in the center – a random distance off) a randomly chosen major city in Europe.
The cities we choose from are {Paris, Essen, Madrid, Milan, Barcelona, Berlin, Athens, Birmingham, Rome, Düsseldorf, Cologne, Katowice, Hamburg, Naples, Warsaw, Frankfurt, Munich, Brussels, Lisbon, Vienna, Manchester, Budapest, Amsterdam, Leeds, Stuttgart, Liverpool, Stockholm, Bucharest, Rotterdam, Copenhagen, Prague, Lyon, Zürich, Turin, Newcastle, Sheffield, Southampton, Nottingham, Marseille, Dublin}. These center points were chosen because OSM provides a very high level of detail for these areas, such that even at zoom level 7 there will be a lot of data per selection window.
Width and Height. The zoom level Z at scale factor SF corresponds to a longitude width of 9/2^Z degrees and a latitude height of 4.5/2^Z degrees. Note that the lowest zoom level = 0 selects 9 degrees longitude and 4.5 degrees latitude, which roughly corresponds to an area like Germany minus Bavaria. At zoom level 7, the window is down to 0.07 by 0.03 degrees, a small downtown area.
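The window-size formula above is trivially computable; a small sketch (the function name is ours, not from the benchmark code):

```python
def window(zoom):
    """Query-window size in degrees at zoom level Z: a longitude width of
    9/2^Z and a latitude height of 4.5/2^Z, per the benchmark definition."""
    return 9.0 / 2 ** zoom, 4.5 / 2 ** zoom

# Print the window dimensions for all eight zoom levels used in the workload
for z in range(8):
    w, h = window(z)
    print('zoom %d: %.4f x %.4f degrees' % (z, w, h))
```

At zoom level 7 this gives 9/128 = 0.0703 by 4.5/128 = 0.0352 degrees, matching the "0.07 by 0.03 degrees" figure quoted in the text.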
Facets. The map drawing query only visualizes geographic instances for 4 randomly chosen facets from four restricted sets (one facet http://linkedgeodata.org/ontology/FACET from each):

1. Place, Parking, Village (1M)
2. School, PlaceOfWorship, Leisure (700K)
3. Peak, Restaurant, Tourism (360K)
4. Sport, PostBox, Supermarket (200K)
These facet categories were chosen by analyzing the frequency of the various facets in OSM. Concretely, the above facets are chosen from the facets that have the highest frequency of occurrence. These were chosen in order (i) to make the queries challenging when zoomed out, as they will select many instances, and (ii) to guarantee that at the highest zoom level a nonzero number of instances is still in the window.
Further, from the set of very frequent facets (which is larger than the above), we selected groups of facets that have quite similar frequencies and put them in the above four groups. That is, there are roughly 1 million places, parkings and villages, and 200,000 sport, postbox and supermarket features. Each query in the LOD2 GeoBench workload picks one from each category, e.g. (Parking, School, Tourism, Sport). That way, the queries always have a highly similar frequency characteristic. This in turn helps to create more stable performance results among the runs of the same query with different parameter bindings (this is something that e.g. BSBM does not do, making it very hard to understand how good or bad a system behaves on a certain query – as this may vary enormously with the chosen parameters).
At scale factor x*SF (with x>1), these facets are suffixed with a random "_y", with y: 0<=y<x. Recall that the LOD2 GeoBench, when scaling the dataset to a larger size, not only creates copies of all geographic features with a different subject URI, but also uses different property URIs, i.e. suffixed with _y. As mentioned, the facets used are relatively frequent facets; their frequency in the core dataset is indicated in parentheses. At zoom level 0 we expect roughly 70K instances in total belonging to any of the four selected facets; the expected amount decreases at each zoom level, to just a hundred at zoom level 7. Note that as we are focusing on high-density areas (European city centers), the number of instances in a 4x smaller sub-window (zoom-in) is in fact less than 4x smaller.
2.2.3 Benchmark Metrics

2.2.3.1 Page Per Second

The basic result metric is PagePerSec, based on the average time to render a page of the LinkedGeoData Browser, which is the sum of the latencies of the facet count query and the instance (aggregation) query; this is reported in the inverse, hence PagePerSec. From a benchmark run, which executes each step 10 times, we derive an overall PagePerSec score at that step by averaging the 10 results (query latency in seconds). For multi-stream runs, we add the PagePerSec metric results for each stream to get a combined PagePerSec result.
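The metric derivation above can be sketched directly; the function names are ours, introduced only for illustration:

```python
def page_per_sec(fcq_latencies, iq_latencies):
    """PagePerSec at one step: page time = FCQ latency plus instance(-aggregation)
    query latency, averaged over the 10 executions of the step, then inverted."""
    page_times = [f + i for f, i in zip(fcq_latencies, iq_latencies)]
    return len(page_times) / sum(page_times)

def combined_page_per_sec(per_stream_scores):
    """Multi-stream runs simply add up the per-stream PagePerSec results."""
    return sum(per_stream_scores)
```

For example, if both queries of a step consistently take 0.1 seconds, the page time is 0.2 seconds and the step scores 5 pages per second.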
2.2.3.2 Page Per Second Per $1000 (PagePerSec/K$)

To take into account the cost of the hardware used in various implementations, we divide the PagePerSec metric by the monetary cost (in thousands of dollars) of the hardware and software used: PagePerSec/K$. If the RDF system is a commercial software product, the price for software must be the dollar list price (no discounts). The price quoted for hardware must be the publicly available end-user price of the hardware at an online merchant at the date the benchmark was run.
2.2.3.3 Low Zoom, High Zoom and Total Score

We expect database systems to perform quite differently at low zoom levels when compared to high zoom levels. For this reason, two different sub-metrics are reported, where the LowZoomScore is derived from steps 1-6 and the HighZoomScore from steps 7-12. We use the geometric mean as the method to combine the PagePerSec scores from the various steps, because this rewards relative improvements at any step equally in the overall score, even if the individual scores at the various steps are quite diverse. Similarly, the LOD2 GeoBench Total Score (LGB-TS) is the geometric mean of the LowZoomScore (LGB-LS) and HighZoomScore (LGB-HS).
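The score combination can be sketched as follows. The function names are ours, and the exact step split (1-6 low, 7-12 high) is our reading of the specification:

```python
import math

def geomean(xs):
    """Geometric mean: a relative improvement at any step counts equally."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def lgb_scores(step_scores):
    """Combine the 12 per-step PagePerSec scores into LGB-LS (steps 1-6),
    LGB-HS (steps 7-12) and LGB-TS (geometric mean of the two)."""
    assert len(step_scores) == 12
    low = geomean(step_scores[:6])    # LGB-LS
    high = geomean(step_scores[6:])   # LGB-HS
    return low, high, geomean([low, high])
```

Note how the geometric mean behaves: a system scoring 4 PagePerSec on every low-zoom step and 1 on every high-zoom step gets a total score of 2, so halving the latency of either half of the workload improves the total by the same factor.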
2.2.4 Benchmark Programs

Data Generator. The benchmark comes with a data generator (geoscale.sh) that reads one input file and produces x output files (0<=y<x) with _y suffixes in the URIs. It should be used on all core dataset files. These files can then be imported into the RDF database system. Generating the copies of the core dataset files should not be included in database load time.
Query Generator. The benchmark comes with a query generator (geoqgen.c) that, given a run number and a scale factor (SF), generates 240 textual queries. The run number is:

- 0 for a warmup run. It generates one subdirectory 01/ with one stream of 240 queries.
- 1 for the power run, which tests how the system behaves when it handles one user at a time. It generates one subdirectory 01/ with one stream of 240 queries.
- 2, 4, 8, 16 for the throughput runs, which test how the system behaves when it handles multiple users at a time. They generate multiple subdirectories 01/, .. xx/, each with one stream of 240 queries.
The queries generated are different between the 10 runs inside a stream, and between multiple streams. Therefore, the selectivities (result set sizes) of different queries from the same template also differ. The currently selected random seed numbers used to generate the queries have been chosen such that the overall size of intermediate results is similar, though (within 10% of each other in terms of the sum of result sizes in a query stream).
2.3 Benchmark Implementations

There are different ways an application can be designed (specifically, extra "indexing" triples could be pre-generated), and different ways in which queries to an RDF system could be formulated.

The LOD2 GeoBench v2.0 currently supports four different implementations: basic, rtree, rtree++ and quad (tiles).
2.3.1 Basic Implementation

Each step in the LOD2 GeoBench workload consists of two queries. The Facet Count Query counts the number of facet instances in the rectangular query window. The basic strategy is not to assume any geographical support in the RDF backend and to perform the selection on the (lat,long) values, which leads to the following SPARQL 1.1 text:
select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o;
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f.
filter (?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2) }
group by ?f
order by desc(?cnt)
limit 50
Typically, RDF stores will evaluate this query using range scans on the POS or OPS index for respectively the latitude and longitude predicates, and intersect the resulting triple streams on subject. This means that if (say) the selectivity of the query is 1/10 of the full latitude range and 1/10 of the full longitude range, and hence (say) 1/100 of the total database, the intermediate result before the intersection is in the range of 1/10 of the dataset. Hence, it is 10x larger than strictly necessary. Still, this approach is simple and portable (it will work on any SPARQL 1.1 backend).
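The intersection step mentioned above can be illustrated with a simple merge-intersection over two sorted subject streams. This is only a conceptual sketch of what the store does internally after the two range scans, not actual database code:

```python
def intersect_sorted(lat_subjects, lon_subjects):
    """Merge-intersect two sorted subject-id streams, as an RDF store might
    do with the results of POS/OPS range scans on the latitude and longitude
    predicates. Both inputs must be sorted; output is their intersection."""
    out, i, j = [], 0, 0
    while i < len(lat_subjects) and j < len(lon_subjects):
        a, b = lat_subjects[i], lon_subjects[j]
        if a == b:
            out.append(a)           # subject qualifies on both coordinates
            i += 1
            j += 1
        elif a < b:
            i += 1                  # advance the stream that is behind
        else:
            j += 1
    return out
```

Both input streams must be fully scanned even though the output is much smaller, which is exactly the 10x-larger-than-necessary intermediate result the text describes.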
The second query in each step is the Instance Retrieval Query or the Instance Aggregation Query. We start with the Instance Retrieval Query. This query retrieves all the facet instances inside (or overlapping with) the query window, for four chosen facets.

The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets, so there are four different FACET parameters, FACET1, FACET2, FACET3, FACET4:
sparql select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
filter (?f = <http://linkedgeodata.org/ontology/Village>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
filter (?f = <http://linkedgeodata.org/ontology/Leisure>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
?f filter (?f = <http://linkedgeodata.org/ontology/Tourism>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
?f filter (?f = <http://linkedgeodata.org/ontology/Supermarket>)
} #union-end
.
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
filter (?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2)
Arguably, the selection on any of the four facets could also be done in a single filter with a disjunctive expression – however, it is believed that the current syntax and the one with disjunctive expressions would usually lead to the same physical query plan anyway. It should be noted that, if desired, such an alternative yet equivalent query syntax would be permissible in a LOD2 GeoBench result.
2.3.2 RTree and RTree++ Implementations

If an RDF database system supports efficient evaluation of geographical predicates (e.g. by creating an RTree index in advance), this is very relevant for the LOD2 GeoBench. We allow reasonable query variants; for instance, if the RDF database system being tested has specific geographic support, this can be used.

For instance, Virtuoso v6 provides RTree-based indexing that allows testing spatial intersection within a radius. It is possible to draw a circle around the query window and use the radius of this circle and the center point of the window in this syntax. This was the first RTree syntax variant implemented by LOD2 GeoBench (in v1.0) and therefore carries the name "rtree":
select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o;
<http://www.w3.org/2003/01/geo/wgs84_pos#geometry> ?p;
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f.
filter (bif:st_intersects(bif:st_geomfromtext('POINT(LONGITUDE LATITUDE)',2000), ?p, RADIUS) &&
?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2) }
group by ?f
order by desc(?cnt)
limit 50
We now also allow query variants that exploit the geographic capabilities of other RDF database systems. For instance, both OWLIM 5.3 and Virtuoso v7 support the predicate:

bif:st_intersects(bif:st_geomfromtext("BOX(lat1 lon1, lat2 lon2)"), ?p)

This allows direct translation of the LOD2 GeoBench window queries into a geographical predicate. Note that the previous query for Virtuoso v6 would combine a query with a radius (circle query) with a subsequent (lat, lon) filter. In LOD2 GeoBench, this direct BOX comparison, supported from v2.0 on, is denoted "rtree++".
We omit detailed descriptions of the Instance Retrieval Query and Instance Aggregation Query for the rtree and rtree++ variants, as these are natural variants of the basic queries, with as only difference the use of the appropriate geographical filters mentioned above.
2.3.3 Quad Implementation

An application like the Linked Geodata Browser, with strong interactivity demands, challenges not only RDF database technology, but also the application design itself. Taking the analogy of Google Maps, one can be assured that rather than querying a single data collection for all zoom settings, the result screens are rendered from a (pre-generated) separate dataset for each different zoom level. Even though Google Maps likely does not rely on relational database technology, this approach would be like having different tables store the geographical data of the various zoom levels. The advantage is that these tables can be designed such that when the zoom window is very large (low zoom level), irrelevant data that would be too big to show is pruned, or frequency counts are summarized (e.g. keep the number of lampposts in Germany for each zipcode, rather than all individual lampposts). This way, the lower zoom levels operate on much less data, allowing the application to always exhibit interactive performance.
The quad approach, described here, is formally not a valid implementation of the LOD2 GeoBench, as it provides slightly incorrect query answers, but it has the potential to achieve much better performance, with only a minor quality reduction in the query answers provided. Its performance can still be measured with the LOD2 GeoBench.
The main idea is to create additional indexing triples that (i) accelerate geospatial data access at multiple zoom resolutions, even on systems that do not provide specific geospatial support, and (ii) precompute certain subquery results in order to accelerate query evaluation, for all three types of queries (facet count, instance, and instance aggregation).
QuadTiles. The geospatial acceleration comes from partitioning the 2D space according to QuadTiles, which is a Z-ordering of the (LONGITUDE, LATITUDE) space into 32-bit numbers, where LONGITUDE and LATITUDE are discretized from their normal double-precision ranges [-180,180] resp. [-90,90] to the 16-bit integer range [0,65536). The pictures below, from the OpenStreetMap wiki, illustrate this:
Hence, a single number identifies a rectangle in the two-dimensional space. In fact, we can create such numbers at any even bit granularity: the leftmost image shows the four rectangles identified by 2-bit quadtile numbers: A=00, B=01, C=10, D=11.
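The discretize-and-interleave step can be sketched as follows (an illustrative Morton encoding in Python; the exact bit order and quadrant labeling used by the OSM QuadTiles scheme are assumptions here):

```python
def quadtile(lon: float, lat: float, bits: int = 32) -> int:
    """Discretize (lon, lat) to 16-bit cell coordinates and interleave the
    bits (Z-order / Morton code) into a single QuadTile number."""
    x = int((lon + 180.0) / 360.0 * 65536) & 0xFFFF  # longitude -> [0, 65536)
    y = int((lat + 90.0) / 180.0 * 65536) & 0xFFFF   # latitude  -> [0, 65536)
    tile = 0
    for i in range(16):
        tile |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        tile |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return tile >> (32 - bits)                   # keep only the top 'bits' bits
```

With bits=2, the function returns one of the four 2-bit quadrant numbers; with the full 32 bits, a 65536x65536 grid cell.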
QuadTile annotations can be exploited by adding extra RDF triples that annotate a subject that has a geography with those rectangles it overlaps (one QuadTile triple for each). Each such annotation for a geographical subject adds one triple with a property, e.g. http://linkedgeodata.org/intersects/quadtile, and a value which is the integer QuadTile number. It is to be remarked that this works fine for points, but large polygons might need many triples if their surface is large. In the OSM core dataset, this does not seem to be an issue, though.
FacetTiles & TileFacets. Further elaborating on this idea, we can also use 64-bit integers, with the lower 32 bits being the previously described QuadTile number, while the higher bits are used to store a facet identifier. In our current implementation, we use 52-bit integers consisting of a 20-bit facet number and a 32-bit QuadTile number. This facet identifier is a 20-bit number identifying the facet URI as denoted by the http://www.w3.org/1999/02/22-rdf-syntax-ns#type property. In fact, we restrict ourselves to the 1024 most frequent facets (more than 10 instances worldwide), for which 10 bits are needed. The higher 10 bits of the 20-bit facet number are used for dataset scaling.
So we have a 52-bit approach with a 32-bit QuadTile number in the minor bits and a 20-bit facet integer in the major bits. We baptize these combinations of QuadTile and facet numbers "FacetTiles". In case of an equi-selection on FACET, such as found in the Instance Queries, the number range will have the same major bits (facet part) in the Min and Max values of all ranges and only vary in the lower bits (QuadTiles). Hence, in such situations FacetTiles share all the nice geospatial locality aspects of QuadTiles.
The Facet Count Query, however, must count the instances of all facets that fall into the query window. This would lead to no range restriction at all using FacetTile numbers. Therefore, for such queries we are more interested in having the facet numbers in the minor bits and the QuadTile numbers in the major bits. Let us call such a numbering scheme "TileFacets".
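Both layouts amount to simple bit packing, which can be sketched as follows (illustrative Python; the constants follow the field widths in the text, the function names are ours):

```python
TILE_BITS = 32   # width of the QuadTile number
FACET_BITS = 20  # 10-bit facet id + 10 bits used for dataset scaling

def facettile(facet_id: int, tile: int) -> int:
    """FacetTile: facet number in the major bits, QuadTile in the minor bits.
    An equi-selection on one facet yields contiguous number ranges, keeping
    the geospatial locality of the QuadTile part."""
    return (facet_id << TILE_BITS) | tile

def tilefacet(facet_id: int, tile: int) -> int:
    """TileFacet: QuadTile in the major bits, facet number in the minor bits,
    so a window (tile range) selection covers all facets inside it."""
    return (tile << FACET_BITS) | facet_id
```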
We can add FacetTile annotations to all geospatial objects (one for each RDF subtype they have). Such triples consist of the subject, a http://linkedgeodata.org/intersects/facettile property, and the 64-bit integer literal value. These triples can be leveraged by the Instance Query of the LOD2 GeoBench. This query asks for all instances of four selected FACETs that fall in a certain query window.
It is relatively easy to map a geospatial query window into a (series of) range restrictions on the QuadTile numbers. This usually gives a limited number of conjunctive ranges, but it is still often a good idea to use the SPARQL 1.1 subquery feature and enclose in the SPARQL query a subquery that simply has a single range consisting of the MIN and MAX values of the multiple ranges we are after. This idea to present a basic query with only one selection range is a workaround for weaknesses in SPARQL query optimizers, which would otherwise not recognize the opportunity to use the POS index on the http://linkedgeodata.org/intersects/facettile property. Similarly, given that we query for four FACETs, it may work best to use the above query model to retrieve all data for one facet, and write a query that unions four such subqueries.
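A brute-force sketch of this translation for a small window of cells is below (illustrative Python, not the generator actually used): it merges consecutive Z-order numbers into conjunctive ranges, and also returns the enclosing MIN/MAX pair that the subquery workaround puts in the single-range filter.

```python
def morton(x: int, y: int, bits_per_dim: int) -> int:
    """Interleave x (even positions) and y (odd positions) bits."""
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def window_to_ranges(x0, y0, x1, y1, bits_per_dim):
    """Map the cell window [x0..x1] x [y0..y1] onto Z-order ranges by merging
    consecutive tile numbers; also return the enclosing (MIN, MAX) pair used
    as the single range restriction in the SPARQL 1.1 subquery."""
    tiles = sorted(morton(x, y, bits_per_dim)
                   for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    ranges, lo = [], tiles[0]
    for a, b in zip(tiles, tiles[1:]):
        if b != a + 1:          # gap in Z-order: close the current range
            ranges.append((lo, a))
            lo = b
    ranges.append((lo, tiles[-1]))
    return ranges, (tiles[0], tiles[-1])
```

Real generators avoid enumerating every cell, but the output shape (a few conjunctive ranges plus one enclosing range) is the same.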
Note that in principle, given that the Instance Query is used only at the high zoom levels, where result sets are not very large, this will lead to four local index lookups in the POS index. This may work better than a normal RTree would, because the RTree contains all instances of all facets, not only the four facets of interest. This means that an RTree selection query will find only a low percentage of the data in the leaf nodes it visits to be relevant for the query. One would need a kind of partitioned RTree (partitioned on facet) to get the same kind of locality as FacetTiles. An example Instance Retrieval Query is below, shortened by having it query only two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:
select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
{ #union-start
{ #subquery-start
select ?s <http://linkedgeodata.org/ontology/Village> as ?f
where
{ #where-start
filter((?g >= 2554700922112 && ?g <= 2554700922367)
|| (?g >= 2554700922624 && ?g <= 2554700922879)
|| (?g >= 2554700944640 && ?g <= 2554700944895)
|| (?g >= 2554700945152 && ?g <= 2554700945407)
|| (?g >= 2554700965888 && ?g <= 2554700967167)
|| (?g >= 2554700967424 && ?g <= 2554700967679)
|| (?g >= 2554700988416 && ?g <= 2554700989695)
|| (?g >= 2554700989952 && ?g <= 2554700990207)) .
{ #subquery-start
select ?s ?g
where
{ #where-start
?s <http://linkedgeodata.org/intersects/facettile> ?g .
filter (?g >= 2554700922112 && ?g <= 2554700990207)
} #where-end
} #subquery-end
} #where-end
} #subquery-end
} #union-end
union
{ #union-start
{ #subquery-start
select ?s <http://linkedgeodata.org/ontology/Supermarket> as ?f
where
{ #where-start
filter((?g >= 2808103992576 && ?g <= 2808103992831)
|| (?g >= 2808103993088 && ?g <= 2808103993343)
|| (?g >= 2808104015104 && ?g <= 2808104015359)
|| (?g >= 2808104015616 && ?g <= 2808104015871)
|| (?g >= 2808104036352 && ?g <= 2808104037631)
|| (?g >= 2808104037888 && ?g <= 2808104038143)
|| (?g >= 2808104058880 && ?g <= 2808104060159)
|| (?g >= 2808104060416 && ?g <= 2808104060671)) .
{ #subquery-start
select ?s ?g
where
{ #where-start
?s <http://linkedgeodata.org/intersects/facettile> ?g .
filter (?g >= 2808103992576 && ?g <= 2808104060671)
} #where-end
} #subquery-end
} #where-end
} #subquery-end
} #union-end
.
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
filter(?a >= 45.6938 && ?a <= 45.8344 && ?o >= 4.77089 && ?o <= 5.05214)
} #where-end
The Facet Count Query, as said, does not have locality on facet, so it can better exploit the TileFacet numbering than a FacetTile numbering. We could thus also add TileFacet annotations to all instances they intersect with. This speeds up the query, certainly on systems without built-in geospatial support (RTrees), as the geographical predicate can now be translated into a range restriction that works well on a POS index. Furthermore, we could pre-aggregate the retrieved tuples on the facet number (lower bits) before even joining them to other triples.
However, especially at the lower zoom levels, where areas the size of Germany fall in the visible window, such queries have to aggregate hundreds of thousands of triples, even at the smallest SF=1, and linearly more at higher scale factors. Aggregating this much data, even if delivered fast by a POS index, is still heavy CPU work that can take several seconds at least, and which makes this query non-interactive at higher scale factors.
Therefore we do not add TileFacet annotations to instances, but use pre-computation for the Facet Count Query. We do this at various resolutions in the range of 12-26 bits, because the lowest zoom level selects 1/40^2 of the data (40 being roughly 2^6, so corresponding to 6 bits for both dimensions, i.e. 12 bits), whereas the deepest zoom level is 7 steps deeper, so at 26 bits. Hence, we propose TileFacet count pre-computation at 7 granularities: 12, 14, 16, 18, 20, 22 and 24 bits.
It is now a matter of determining a proper bit granularity for evaluating a query, depending on the zoom level. A good heuristic is to use the lowest granularity level at which at least one tile is fully enclosed by the query window (and if no such level exists, use the highest bit granularity); and then translate the window selection predicate into a series of range predicates on TileFacets, like before.
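Such a heuristic could be sketched as follows (illustrative Python; the "two tile-widths" criterion is our simplification of the enclosure test described above, since a window at least two tiles wide and two tiles high always fully encloses one tile, however it is aligned):

```python
PRECOMPUTED_LEVELS = (12, 14, 16, 18, 20, 22, 24)  # total bits, both dimensions

def pick_granularity(width_deg: float, height_deg: float,
                     levels=PRECOMPUTED_LEVELS) -> int:
    """Pick the coarsest pre-computed level whose tiles the query window is
    guaranteed to fully enclose; fall back to the finest level otherwise."""
    for bits in levels:
        k = bits // 2                     # bits per dimension
        tile_w = 360.0 / (1 << k)         # tile width in degrees of longitude
        tile_h = 180.0 / (1 << k)         # tile height in degrees of latitude
        if width_deg >= 2 * tile_w and height_deg >= 2 * tile_h:
            return bits
    return levels[-1]
```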
The extra triples we keep hold the pre-computed counts at the various resolutions, for each rectangle and each TileFacet at that resolution (e.g. 16 bits). For all facet instances, we generate two triples with a subject in the form of http://linkedgeodata.org/facetcount/0000XXXXXX and as:

- property http://linkedgeodata.org/facetcount/tilefacet16, with as value its TileFacet number, with the QuadTile number part truncated to 16 bits in this case. This thus represents a certain rectangle in the 2D space.

- property http://linkedgeodata.org/facetcount/count, with as value the number of occurrences of a facet. Note that we only need to generate http://linkedgeodata.org/facetcount triples for facets that have a non-zero count in a certain rectangle. As such, the number of these pre-computed triples is always significantly lower than the number of TileFacet annotations we added before.
- property http://linkedgeodata.org/facetcount/facet, which stores the facet URI (i.e. the http://www.w3.org/1999/02/22-rdf-syntax-ns#type value). It could be derived from the tilefacet16 number, but having this as a triple simplifies application development.
The Facet Count Query can now be formulated by selecting all tiles at some granularity that overlap with the query window, and summing up their pre-computed counts. Here is an example:
select ?f as ?facet xsd:integer(sum(?c * 0.512)) as ?cnt
where
{ #where-start
?s <http://linkedgeodata.org/facetcount/count> ?c ;
<http://linkedgeodata.org/facetcount/facet> ?f .
filter((?g >= 3471158208888832 && ?g <= 3472257719468032)
|| (?g >= 3473357232144384 && ?g <= 3474456742723584)
|| (?g >= 3659174697238528 && ?g <= 3660549085724672)
|| (?g >= 3660823964680192 && ?g <= 3661098841538560)
|| (?g >= 3661373720494080 && ?g <= 3662748108980224)
|| (?g >= 3663022987935744 && ?g <= 3663297864794112)) .
{ #subquery-start
select ?s ?g
where
{ #where-start
?s <http://linkedgeodata.org/facetcount/tilefacet14> ?g .
filter (?g >= 3471158208888832 && ?g <= 3663297864794112)
} #where-end
} #subquery-end
} #where-end
group by ?f
order by desc(?cnt)
limit 50
The downside of this approach is that the facet counts provided are an overestimation of the real facet counts, since the tiles from which the precomputed counts originate may (and will) extend beyond the visible window. However, users may tolerate such inaccuracies; though especially for the lower counts, it might be annoying. One could envision a system where, when a user wants the real count for a non-frequent facet, the exact value is computed (with a separate query exploiting the FacetTile annotations, as in the previous section).
The current query generator tries to correct for overestimation by normalizing the precomputed result to the size of the query box, dividing by the size of the box used for answering the query (which is equal or larger). In the above example, this leads to the 0.512 constant in the first line, as only slightly over half of the re-used precomputed result area is inside the query box.
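The correction is a plain area ratio; a minimal sketch (illustrative Python; the 51.2/100 figures below are made up purely to reproduce the 0.512 constant of the example):

```python
def count_correction(window_area: float, tiles_area: float) -> float:
    """Factor applied to pre-computed counts: the fetched tiles cover at
    least the query window, so counts are scaled by the fraction of the
    tiles' total area that lies inside the window (always <= 1)."""
    return window_area / tiles_area
```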
The problem of large query windows at low zoom levels also occurs in the Instance Aggregation Query. Recall that this query tackles the information overload problem of way too many markers by combining markers that are near to each other into a single marker, and visualizes a count of how many instances fall under it. Similar to pre-computing counts per tile, we observe that this aggregation per facet per tile can also be pre-computed. Note that here we again need markers for only a few facets, so using the FacetTile numbers works best. Since the Instance Aggregation Query is only used at the lower zoom levels, we can just index this at granularities 12, 14, 16 and 18 bits. Thus, for each tile at all granularities (e.g. 16 bits) in which a facet occurs at least once, we generate an artificial new subject http://linkedgeodata.org/facetmap/0000YYYYYYYY in three triples with as:

- property http://linkedgeodata.org/facetmap/facettile16, with as value its FacetTile number (identifying a rectangle in which the clustered marker lies).

- properties http://linkedgeodata.org/facetmap/latitude and http://linkedgeodata.org/facetmap/longitude, holding the position of the marker.
- property http://linkedgeodata.org/facetmap/count, with as value the number of occurrences of a facet in that 16x8 cell inside the tile. Again we only add such pre-computed triples if a facet occurs in a certain cell, so the number of generated http://linkedgeodata.org/facetmap/ triples is significantly lower than the number of TileFacet annotations we added before.

An implementation of the Instance Aggregation Query exploiting these pre-computed triples first chooses an appropriate bit granularity for the zoom level. Then all such tiles that overlap with the query window are fetched; next, the real (latitude, longitude) values of the example markers in them are fetched and filtered again with the query window. This map is then presented. The problem was having to aggregate hundreds of thousands of instances; because we use pre-aggregated data, just like in the case of the Facet Count Query, this pre-computation is guaranteed to avoid that, as any tile maximally contains 128 points, and we access only a few tiles.

An example Instance Aggregation Query is below, shortened by having it query only two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:
select ?f as ?facet ?latlon ?cnt
where
{ #where-start
{ #subquery-start
select ?f ?x ?y max(concat(xsd:string(?a)," ",xsd:string(?o))) as ?latlon count(*) as ?cnt
where
{ #where-start
{ #subquery-start
select ?f ?a ?o xsd:integer(20*(?a - 43.5141)/4.5) as ?y
xsd:integer(40*(?o - 0.3412)/9) as ?x
where
{ #where-start
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f .
filter (?f = <http://linkedgeodata.org/ontology/Village>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
?f . filter (?f = <http://linkedgeodata.org/ontology/Supermarket>)
} #union-end
.
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
filter(?a >= 43.5141 && ?a <= 48.0141 && ?o >= 0.3412 && ?o <= 9.3412)
} #where-end
} #subquery-end
} #where-end
group by ?f ?x ?y
order by ?f ?x ?y
} #subquery-end
} #where-end
Since the query window will not perfectly align with QuadTile boundaries at the resolution used, and for the marker combination in the pre-computed tiles we use fewer cells (16x8, because a query will be answered from multiple cells), the cluster combination will give different results than the official LOD2 GeoBench Instance Aggregation Query, even if we later re-aggregate markers on the desired 40x20 grid. For the user experience, the effect of this is likely to be of minor importance.
3. Evaluation

3.1 Hardware Platform

In order to run the LOD2 GeoBench at scale, we used a cluster of compute nodes; in particular, the SCILENS cluster installed at CWI.
SCILENS is a new kind of hardware cluster that has been designed from the ground up to serve large-scale data management. The machines in the SCILENS cluster are organized in three different levels, called 'pebbles', 'rocks', and 'bricks'. Each level decreases in number of nodes, but the individual machines used in the level increase in computational and disk resources (and price tag). The SCILENS cluster uses cheap consumer hardware, optimized to pack as much power in as little space as possible, making use of consumer home-theater mini-PC cases ('Shuttlebox'), connected by a high-performance Infiniband network.
Due to the negative performance impact of network traffic during SPARQL query processing on large clusters (where joins tend to be 'communicating' joins in which all machines need to exchange data), and where network usage volume increases super-linearly with more nodes, it is generally better in RDF stores to work with fewer nodes with more (RAM) resources than with many nodes with few resources. Thus, we chose as our experimental platform the 'bricks' layer of SCILENS, which consists of sixteen 256GB RAM machines, each with 16 cores running at 2.4GHz (dual-socket Intel servers, worth $8K). The cluster runs Fedora Linux. The price tag of the eight machines involved in the experiments, including the Infiniband network infrastructure, is roughly $100K.
The SCILENS cluster contains many more I/O resources per CPU core than usual in compute clusters. The relation between CPU power and I/O resources is captured by the Amdahl number: the amount of I/O bytes per CPU core cycle the system can deliver. In the case of the SCILENS cluster this number is close to 1.0, whereas typical clusters at supercomputing facilities, such as LISA at SARA, only get to 0.2 (1 byte per 5 cycles). We do confess that, while all this I/O power is interesting, in the workloads presented so far most data is RAM resident. One reason was that the high-performance multi-SSD I/O subsystem of the bricks layer was not yet operational at the time of testing. This provides ground for a follow-up experiment using this fast I/O layer. We expect this to accelerate the load phase, and also to allow addressing even larger datasets efficiently on the same hardware.
3.2 RDF Database Systems Tested

For our evaluation of the LOD2 GeoBench, we used four different software configurations:
- OWLIM-SE v5.3: we used the non-cluster version of Ontotext's OWLIM, which efficiently supports geographical querying, as it stores geographic features in an RTree. OWLIM 5.3 with the geographic extension is proprietary software, but we had no hard information on the cost of OWLIM at the time of preparation of this document, so we omitted the scores per $ for this system.

Figure 2: The 'rocks' and 'pebbles' layers of the SCILENS cluster are hand-built from 384 Shuttleboxes, packing CPU and ample disk resources in little space.
- Virtuoso V6 open source is still the most widely used RDF store around (V7 open source binary builds have only been distributed since August 2013). This OpenLink product has specific support for geographical predicates, albeit somewhat limited. As discussed, direct BOX (rectangular window) selections on latitude, longitude are not possible, so we use the RADIUS pre-filtering approach.
- Virtuoso V7 open source: this major new release has been strongly influenced by the LOD2 project, wherein CWI advised OpenLink on the introduction of numerous architectural enhancements. Specifically, V7 introduces columnar storage for RDF triples as well as vectorized execution, patterned after CWI research database system prototypes. Virtuoso V7 was released in 2013 and generally offers significant storage savings, reduced memory usage, and improved computational performance over V6.
- Virtuoso V7 Cluster Edition: this major new release of the (non open-source) cluster edition has been documented in D2.1.6 and introduces a new vectorized cluster-based execution paradigm that allows parallelizing any (complex) SPARQL query over a cluster of compute nodes. As a result, it can handle complex SPARQL queries, such as the Business Intelligence workload of BSBM, but also the LOD2 GeoBench, much more efficiently (or at all) than the cluster version of V6 ever did. The monetary cost at the time of writing of the enterprise version for department servers is $25K.
3.3 Loading Results

Before loading, we first generated the extra triples for the "quad" approach². This roughly doubles the data size. These triples are then bulk-loaded into the systems. In case of OWLIM, after bulk loading, the RTree geographical index needs to be created. The time needed for this is included in the table below (and is always a small part of the total load time).
METHOD      SF1 (224M triples)    SF10 (2.24G triples)    SF100 (22.4G triples)
            time       size       time        size        time        size
owlim5.3    5257sec    24GB       102075sec   185GB       -           -
virtuoso6   12900sec   23GB       -           -           -           -
virtuoso7   780sec     12GB       5520sec     108GB       -           -
v7cluster   -          -          2280sec     156GB       18840sec    1.1TB
Loading Virtuoso 6 was done using a single loading process, since parallel loading would consistently hang the system. We already mentioned that loading the scaled data sizes also hit an error message in the RTree loading code (a bug), even in single-threaded mode, which prevented us from testing Virtuoso V6 on the larger data sizes.
Virtuoso 7 used the native parallel loading procedure, run with 14 loading processes in parallel. Loading Virtuoso V7 Cluster Edition was done by running 2 processes per node, giving in total 32 loading processes (2 x 2 nodes/machine x 8 machines).
² Alternatively, we could have created separate databases for the experiments: one with the extra generated triples for quad, and one without (to run basic, rtree and rtree++). This was not done for manageability reasons, as the performance effects are deemed to be minor.
3.4 Overall Benchmark Results

We now present the overall benchmark scores of the experiments. The LOD2 GeoBench main pure result metric (regardless of cost) is PagesPerSec. In order to measure the system under load, i.e. with all 8 cores busy, we derive this SCORE from the workload under 8 concurrent query streams; this is known as the "throughput metric"³. The score is computed as the geometric mean over all 12 steps. We also split it out into the geometric mean over the queries at low zoom (zoom levels 0-4, steps 1-6) and high zoom (zoom levels >4, steps 7-12). These are called LSCORE and HSCORE.
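The score computation can be sketched as follows (illustrative Python; the input is one pages-per-second value per benchmark step):

```python
import math

def geomean(values):
    """Geometric mean of positive per-step results."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def bench_scores(pages_per_sec):
    """(SCORE, LSCORE, HSCORE): geometric means over all 12 steps, the
    low-zoom steps 1-6, and the high-zoom steps 7-12 respectively."""
    assert len(pages_per_sec) == 12
    return (geomean(pages_per_sec),
            geomean(pages_per_sec[:6]),
            geomean(pages_per_sec[6:]))
```

The geometric mean is a natural choice here: it rewards systems that are consistently fast across steps rather than extremely fast on a few.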
Figure 3: LOD2 GeoBench scores at SF1 with one server (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
The results at SF1 have the approximate implementation "quad" in front, especially in Virtuoso V7; however, the improved RTree support "rtree++" comes quite close. Notably, the high-zoom quad queries worked better in V6, so there seems to be either an optimizer performance regression, or an undesired effect of the vectorized columnar execution in V7. Because the low-zoom (compute-intensive) quad queries are much faster on V7, its overall score is higher.

Another interesting comparison is the impact of the LOD2 R&D activities of the past few years, at least in this benchmark, for Virtuoso versus its strongest competitor, OWLIM. Whereas OWLIM generally was equivalent to or faster than Virtuoso V6 (compare owlim rtree++ with V6 rtree), the rtree-based score in Virtuoso improved by a factor 7, creating a significant performance advantage. Besides improvements to the RTree functionality, this is very likely caused by the columnar vectorized execution model that Virtuoso adopted in V7, inspired by CWI research in that area.

Moving to SF10, though throughput drops by a factor 3, we see the relative advantage of the quad approach improve dramatically:

Figure 4: LOD2 GeoBench scores at SF10 with one server (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
³ For the "Power" metric that tests all queries in isolation, see tables 3.6.1, 3.6.6 and the left part of 3.6.11.
Figure 5: LOD2 GeoBench scores at SF10 with 8 servers (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
Whereas the previous experiments used a single server, we now move to results obtained with 8 servers. One strategy that is applicable to any technology in a read-only workload like this is to replicate the database on multiple servers, and divide the queries among them. We keep using just 8 query streams, such that each server gets a single stream. In order to use all cores, the systems must now parallelize the individual queries to make use of the CPU resources. This explains the lack of linear scale-up in all systems. Note that replication still is a very powerful technique in heavy read-only workloads: we can safely expect that when using 8 replicated servers with 64 concurrent query streams, the results will be 8-fold those in Figure 4 (for example, 8*virtuoso7 would then reach a throughput score of 128 instead of just 12).
We also tested the "true" cluster database system provided by OpenLink, i.e. Virtuoso V7 Cluster Edition. Data here is not replicated on all servers, but partitioned among them at load time. This causes all queries to be parallelized, which explains the superior scores (with quad overall being the best) obtained on this system under light load. Having in mind a theoretical peak usage of at least 128, the platform utilization of v7cluster at 15 can still be significantly optimized.
Overall, the absolute performance of the peak throughput drops from 50 PagesPerSec at SF1 to 15 PagesPerSec at SF10. This could partly be explained by the loss of data locality at SF10, but it could also indicate some query optimization problems. Namely, in the Virtuoso V7 quad implementation the plans do not have a heavy computational load (as this has been precomputed), and in principle the complexity of all queries should be logarithmic in the data size. In this sense, a drop by a factor 3 may indicate that the query optimizer does not find the optimal plans yet.
Figure 6: LOD2 GeoBench score per cost ("bang for the buck") on 8 servers, SF10, 8 query streams
Finally, we computed the PagesPerSec/K$ score for all benchmarked products, both single-server and 8-node cluster setups, excluding OWLIM 5.3 (for which we lack pricing information). It turns out that from a monetary point of view V7 quad is the best deal; whereas for the clustered setup we have the $25K software (plus $100K hardware) Virtuoso V7 Cluster Edition beating a setup of 8 * $8K hardware nodes with the open-source version replicated.
Figure 7: LOD2 GeoBench scores at SF100 with 8 servers (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
The last overall benchmark results are the scores at SF100. Here the trends continue, though the performance of the different query variants is stable and the different results are nearer to each other.

In future experiments, the cluster experiments (or maybe all experiments) should also be performed under a high query load with many more than 8 streams, to ensure that all cores are fully busy under peak load.
3.5 Detailed Query Performance Results

We ran the LOD2 GeoBench at scale factors (SF) 1, 10 and 100 (130M, 1.3G, 13G triples), using 1, 2, 4 and 8 concurrent query streams. Not all systems were tested using all parameters:
- We tested the non-clustered systems only on SF1 and SF10; and Virtuoso V7 Cluster Edition was not tested at SF1, only at SF10 and SF100.

- A result of the different data scaling method in the LOD2 GeoBench v2.0 (vs. v1.0) is that Virtuoso V6 can no longer load the scale factor 10 dataset. It hits a bug, and given that V7 is out, this bug may never get fixed. Therefore Virtuoso V6 is only tested on scale factor 1.
Figure 8: Page-per-second performance for each of the 12 steps (SF1, #8/1)
SF1 #8/1: At SF1 on a single server with 8 concurrent query streams (which should keep the 8 cores busy at least), the results show that for Virtuoso V7 the highest performance is achieved with the new RTree functionality (rtree++); however, the performance improves linearly with higher zoom level, and is quite poor at the lower zoom levels. At the lower zoom levels, the approximate "quad" approach is much better. Interestingly, V6 achieves better performance than V7.

For OWLIM, the quad performance is bad due to it getting bad query plans (because of the disjunctive filters and unions). The RTree support in OWLIM is quite good (rtree++), usually exceeding the RTree support of Virtuoso V6 (but not V7).
Figure 9: Page-per-second performance for each of the 12 steps (SF10, #8/1 and #8/8)
SF10 #8/1: At SF10 on a single server with 8 query streams, the advantage of the quad approach in Virtuoso increases considerably. Different from SF1, at SF10 on V7 the quad performance exceeds that of V7 rtree++ and of V6 quad considerably, the latter because of improvements made in the query optimizer that favour the Instance Retrieval Query. At SF10, the rtree++ performance is no longer very good. This can be explained by the fact that inside the RTree, instances belonging to all facets are stored, not just the four facets required by the Instance Retrieval Query. At a larger scale factor, the increased data size causes I/O to start playing a role. In this experiment at SF10, Virtuoso V6 could not be tested due to the data loading bug mentioned earlier.

In this experiment we also see Virtuoso V7 Cluster Edition results on eight identical machines. It is striking that the basic variant performs very close to rtree. If we further compare basic between cluster (8 machines) and a single machine, scalability for the low zoom levels is near linear (factor 8). These queries perform a lot of work, which gets parallelized. However, the gains in the high zoom levels, which access less data, are more limited. It is an open question why rtree does not provide much benefit in a cluster setting.
SF10 #8/8: We now move to experiments at SF10 with 8 query streams on 8 servers. In the following we compare the Virtuoso V7 Cluster Edition approach with simple replication. The former, following a "true" cluster approach, partitions all data across all servers, hence each server stores 1/8th of the data, and queries get spread out over all servers (parallelized). Simple replication, in contrast, loads the same data independently on eight different machines, and then executes the 8-stream benchmark test by running the single-stream test independently on all 8 machines. As such, the results of this experiment are roughly 8-fold higher than the single-stream single-server test. Note that the hardware is more than 8 times as expensive ($8K vs $100K, due to the cost of the Infiniband switch, one Infiniband network card per server, and cabling). This added price difference makes the replication strategy less attractive in the PagesPerSecond/$ scores presented later. The replication experiments are marked with a star in the legend.
Figure 10: Page-per-second performance for each of the 12 steps in the benchmark (SF10, #8/8)
Replication: This experiment shows replicated OWLIM (owlim5.3*) competing on the higher zoom levels with its RTree support. The replicated Virtuoso V7 (virtuoso7*) with the quad approach scores high, though it is vulnerable in the Instance Retrieval Query at the lower zoom levels where it is used (steps 7-9) and when there are many query results. The overall winner in terms of performance is Cluster Edition with quads (the green dashes), thanks to more reliable performance at steps 7-9, even though it loses out to replication at steps 10-12.
SF100 #8/8: When we scale the dataset to factor 100, we only have results on Virtuoso V7 Cluster Edition on 8 server nodes. At the low zoom levels, the quad approach races ahead, with basic and rtree behaving identically. At higher zoom levels, all approaches improve gradually, but rtree clearly beats basic, and quad clearly beats rtree.
Figure 11: Page-per-second performance for each of the 12 steps (SF100, #8/8)
In the experiments until now, we showed the performance per step; however, please recall that each step is the combination of two queries. For steps 1-6, it is a Facet Count Query with an Instance Aggregation Query, and for steps 7-12 it is a Facet Count Query followed by an Instance Retrieval Query. Also, steps 6, 8, 10 and 12 just pan (to a partially overlapping area at the same zoom level), whereas the other queries zoom in. In the above, this is visible in steps 6, 8, 10 and 12 scoring above the two trend lines that one can construct in the step 1-6 and 7-12 segments.

Analysis of Individual Query Performance. However, it is also interesting to look at the individual queries. Each query stream consists of 24 queries, two per step: first the Facet Count Query, then the Instance Aggregation or Retrieval Query.
The accompanying figures show, at SF1 #8/1, on the right the queries-per-second achieved by the Facet Count Query, and on the left by the Instance Aggregation Query (steps 1-6) and the Instance Retrieval Query (steps 7-12). If we examine the scale, at each step the Instance Queries are the bottleneck. In fact, on Virtuoso 7 the Facet Count Query does not need the quad approximation, as rtree++ is among the best. Here we confirm that the performance dip at steps 7-9 is caused by the Instance Retrieval Query. The reason for this is the large number of instances at these zoom levels. As such, these results point to the fact that in the benchmark the switch-over from Instance Aggregation to Instance Retrieval should better be made at a deeper zoom level.
[Figures: SF1 #8/1 — left: Instance Aggregation Query (steps 1-6) and Instance Retrieval Query (steps 7-12); right: Facet Count Query; series include owlim5.3 (basic, quad, rtree++) and virtuoso6 (basic, rtree)]
At SF10 #8/1, the cost balance between the Instance Queries (left graph) and Facet Count Queries (right graph) shifts, as they become more comparable. Still, the bottleneck is in the first three Instance Retrieval Queries (steps 7-9, left). It is remarkable in these results that for the Instance Retrieval Queries, owlim5.3 does a good job in steps 8-12 (left), in fact beating the Virtuoso 7 rtree++ approach.
At SF100 #8/8, the Facet Count Queries (right graph) and Instance Queries (left graph) have roughly the same cost. Here, the quad approach really wins in the Facet Count Query (right). The Instance Queries (left) generally have lower performance, especially in steps 7-12 (i.e., the Instance Retrieval Query).
[Figures: per-step Facet Count Query performance — SF10 #8/1 with owlim5.3 (basic, quad, rtree++), virtuoso7 (basic, rtree, quad, rtree++) and v7cluster (basic, rtree, quad); SF100 #8/8 with v7cluster (basic, rtree, quad)]
3.6 Full Query Performance Results

In these results, the red numbers are those produced by the slowest runs, whereas the green numbers are the fastest runs.
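The LSCORE, HSCORE and SCORE rows in the tables below are consistent with geometric means of the per-step results: LSCORE over the low-zoom steps 1-6, HSCORE over the high-zoom steps 7-12, and SCORE over all 12 steps. The sketch below reproduces the SF1 #1/1 owlim5.3 basic scores under that assumption; it is our reconstruction, not the official scoring code:

```python
import math

def geomean(values):
    """Geometric mean, appropriate here because per-step throughputs
    span orders of magnitude across zoom levels."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Per-step pages-per-second for SF1 #1/1, owlim5.3 "basic" (from the table).
steps = [0.0346, 0.0556, 0.0780, 0.1237, 0.2079, 0.2028,   # steps 1-6
         0.3517, 0.3610, 0.7034, 0.6758, 1.3709, 1.2764]   # steps 7-12

lscore = geomean(steps[:6])   # low-zoom score,  ~0.096
hscore = geomean(steps[6:])   # high-zoom score, ~0.688
score  = geomean(steps)       # overall score,   ~0.257
```

For this column, the computed values match the tabulated LSCORE (0.0961), HSCORE (0.6876) and SCORE (0.2571) to within rounding of the per-step inputs.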
3.6.1 Scale factor 1, 1 query stream, 1 server

SF1 #1/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0346 0.1176 0.0832 0.0905 4.7641 0.0387 0.1358 4.4662 0.2301 0.6190
STEP02 0.0556 0.1458 0.2586 0.1603 5.3937 0.1134 0.3253 3.7481 0.5640 1.4253
STEP03 0.0780 0.2968 0.7925 0.2619 6.9348 0.2772 1.0380 4.6468 1.8761 3.0703
STEP04 0.1237 0.2347 1.6066 0.3784 7.9176 0.6577 1.2910 5.3821 2.1349 5.0100
STEP05 0.2079 0.9257 2.1772 0.5676 15.3678 1.0667 2.0250 6.4850 2.3651 6.8446
STEP06 0.2028 0.9109 2.2862 0.5628 14.6413 1.0730 2.4520 8.9126 2.5886 6.7294
STEP07 0.3517 0.7462 3.2216 0.8517 1.1614 1.5647 2.6040 2.5953 2.5819 8.8261
STEP08 0.3610 0.5692 3.3288 0.8486 1.4243 1.5382 2.8520 5.0403 2.9967 9.0171
STEP09 0.7034 0.8040 6.2853 1.4900 2.3596 3.5868 4.2123 6.2972 4.4762 17.2414
STEP10 0.6758 0.7642 5.1177 1.4007 2.1092 2.4515 4.4111 7.8003 3.8051 14.0647
STEP11 1.3709 0.9497 9.5602 4.0144 5.0735 6.2383 6.4892 9.1996 6.6711 26.5252
STEP12 1.2764 2.9002 6.6181 2.2123 4.5558 4.1493 6.2422 9.6711 5.2966 18.0832
LSCORE 0.0961 0.3168 0.7176 0.2780 8.1604 0.3119 0.8156 5.1935 1.2128 2.9228
LSCORE/$ 0.0347 1.0200 0.0389 0.1019 0.6491 0.1516 0.3653
HSCORE 0.6876 0.9466 5.2829 1.5409 2.3974 2.8593 4.2104 6.2021 4.0841 14.4740
HSCORE/$ 0.1926 0.2996 0.3574 0.5263 0.7752 0.5105 1.8093
SCORE 0.2571 0.5476 1.9471 0.6545 4.4231 0.9444 1.8532 5.6755 2.2256 6.5044
SCORE/$ 0.0818 0.5528 0.1180 0.2316 0.7094 0.2782 0.8130
3.6.2 Scale factor 1, 2 query streams, 1 server

SF1 #2/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0438 0.2046 0.1377 0.1450 8.7857 0.0592 0.2899 9.3350 0.5868 1.0153
STEP02 0.0760 0.3324 0.3922 0.2556 11.8142 0.1581 0.9248 10.0023 1.5272 2.1669
STEP03 0.1010 0.4646 1.0689 0.4358 12.5180 0.4040 2.7962 6.2473 4.2260 4.5340
STEP04 0.1579 0.7876 2.2714 0.6660 18.5405 0.8659 2.7117 9.9469 4.0895 8.4835
STEP05 0.2746 0.8481 4.6420 0.9965 25.4895 1.7024 3.4837 13.1304 4.8692 13.6153
STEP06 0.2704 0.9686 4.4050 0.9913 26.2031 1.7180 3.6731 19.0339 5.8937 13.4819
STEP07 0.4914 0.9496 6.5400 1.6060 2.0462 3.3723 5.5321 6.6725 5.8411 19.9092
STEP08 0.4976 0.9409 6.0975 1.6073 2.3136 3.3690 6.0500 10.2762 6.7234 19.6660
STEP09 1.0367 2.5329 12.4141 2.9654 5.3145 8.2270 9.8168 12.8568 11.5399 39.1348
STEP10 0.8869 2.6699 10.3294 3.0043 4.2300 5.9599 9.5017 16.0707 10.4757 29.2999
STEP11 1.8653 5.0793 20.1726 6.3119 8.6424 13.6262 13.5619 17.2091 17.0043 55.0651
STEP12 1.7027 3.8357 14.5142 5.3378 7.5906 9.8195 13.5811 17.8367 12.7472 38.4264
LSCORE 0.1258 0.5230 1.17876 0.4690 15.8713 0.4610 1.7210 10.6290 2.7614 4.9919
LSCORE/$ 0.0586 1.9839 0.0576 0.2151 1.3286 0.3451 0.6239
HSCORE 0.9454 2.2132 10.6857 3.0293 4.2201 6.4824 9.1109 12.7630 10.0386 31.3102
HSCORE/$ 0.3786 0.5275 0.8103 1.1388 1.5953 1.2548 3.9137
SCORE 0.3449 1.0759 3.5490 1.1920 8.1840 1.7287 3.9598 11.6472 5.2651 12.5020
SCORE/$ 0.1490 1.0230 0.2160 0.4949 1.4559 0.6581 1.5627
3.6.3 Scale factor 1, 4 query streams, 1 server

SF1 #4/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0703 0.3726 0.2693 0.2875 18.7083 0.1195 0.5566 21.3604 1.5452 2.0748
STEP02 0.1121 0.4790 0.8264 0.5028 25.3611 0.3937 1.5829 18.1676 2.8779 5.3924
STEP03 0.1498 1.00556 2.3242 0.7441 31.2361 0.8582 4.6161 13.0845 8.6183 9.5042
STEP04 0.22611 1.4740 4.7571 1.1240 46.2466 1.9203 5.0335 21.9932 8.2786 16.5111
STEP05 0.3634 2.7782 7.6465 1.6884 74.7055 3.2683 6.9151 30.5185 9.1625 23.0400
STEP06 0.3694 2.6684 7.9984 1.6681 74.7607 3.2947 6.6436 45.7031 10.3364 24.2530
STEP07 0.7162 2.7846 12.2841 3.1894 4.4121 5.7132 11.4062 16.1533 10.6061 33.4092
STEP08 0.6863 2.8749 12.0616 2.9065 4.4726 5.7027 12.4466 23.5253 10.956 33.4019
STEP09 1.4399 4.4919 26.8988 6.7532 8.3481 13.7416 19.6487 28.2182 19.1360 65.3226
STEP10 1.3783 5.8125 19.5487 6.6083 8.09174 10.0076 19.2529 36.7007 17.3637 52.3030
STEP11 2.8112 12.0010 43.3150 14.7862 18.6510 27.0136 29.8168 40.1119 26.7478 96.3516
STEP12 2.4259 9.2902 27.8078 13.0194 15.15110 16.2686 30.6850 43.2767 23.4443 60.7765
LSCORE 0.1817 1.1187 2.3055 0.8357 39.2294 0.9703 3.1287 23.1667 5.5719 9.6245
LSCORE/$ 0.1044 4.9036 0.1212 0.3910 2.8958 0.6964 1.2030
HSCORE 1.3712 5.3409 21.2913 6.5543 8.4910 11.1847 19.1156 29.6368 16.9895 52.9802
HSCORE/$ 0.8192 1.0613 1.3980 2.3894 3.7046 2.1236 6.6225
SCORE 0.4992 2.4444 7.0063 2.34052 18.2510 3.2944 7.7335 26.2028 9.7295 22.5812
SCORE/$ 0.2925 2.2813 0.4118 0.9666 3.2753 1.2162 2.8226
3.6.4 Scale factor 1, 8 query streams, 1 server

SF1 #8/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD quad basic rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.5448 0.0563 0.2743 0.4113 36.2784 0.1811 0.9454 40.7706 2.7136 3.4060
STEP02 0.7765 0.0892 0.9029 0.7124 46.3338 0.5223 2.9656 36.3412 7.7649 6.6037
STEP03 1.4639 0.1222 2.3402 1.1828 58.2323 1.3596 8.3323 31.9119 13.4467 13.7075
STEP04 2.5405 0.1910 5.1597 1.8460 90.0222 2.9270 9.1631 47.4474 13.7959 25.0132
STEP05 3.4629 0.3169 8.3595 2.7192 132.4990 5.2606 12.2425 59.0893 13.7280 38.6629
STEP06 4.6291 0.3096 8.8595 2.7852 139.9900 5.3247 13.1893 85.2281 15.7198 38.5826
STEP07 4.5387 0.5878 14.1144 4.9718 7.7641 10.0048 23.0038 33.8289 20.6775 55.1534
STEP08 4.6977 0.6517 15.3833 4.7890 7.9362 10.0792 23.8945 45.0062 21.1386 59.0653
STEP09 7.6525 1.3616 38.3478 10.2151 13.7468 24.4637 40.1600 55.6296 37.3605 105.0690
STEP10 9.7112 1.2118 25.6037 9.2489 13.3397 17.5706 37.1036 72.2938 31.9566 83.1055
STEP11 19.0270 2.2852 66.4700 21.7008 30.4268 45.2330 54.0770 77.3047 51.9836 156.4850
STEP12 13.2889 2.0098 34.6726 18.9416 25.7544 29.1453 54.146 79.2604 45.7914 105.1910
LSCORE 1.7121 0.1503 2.4589 1.3007 73.8155 1.4807 5.7035 47.2966 9.7115 15.0085
LSCORE/$ 0.1625 9.2269 0.1850 0.7129 5.9120 1.2139 1.8760
HSCORE 8.5786 1.1943 27.7408 9.8613 15.3680 19.6025 36.5335 57.7652 32.7411 87.9627
HSCORE/$ 1.2326 1.7960 2.4503 4.5666 7.22066 4.0926 10.9953
SCORE 3.8325 0.4238 8.2590 3.5814 32.5666 5.3876 14.4350 52.2695 17.8317 36.3344
SCORE/$ 0.4476 4.07083 0.6734 1.8043 6.5336 2.2289 4.5418
3.6.5 Scale factor 1, 8 query streams, 8 replicated servers

SF1 #8/8 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic rtree quad basic rtree quad rtree++
STEP01 0.2772 0.9410 0.6660 0.7247 0.3102 38.1134 1.0867 1.8415 35.7302 4.9520
STEP02 0.4450 1.1669 2.0693 1.2831 0.9078 43.1499 2.6029 4.5126 29.9850 11.4025
STEP03 0.6244 2.3751 6.3406 2.0958 2.2179 55.4785 8.3064 15.0094 37.1740 24.5625
STEP04 0.9902 1.8779 12.8534 3.0279 5.2617 63.3413 10.3306 17.0794 35.0570 40.0802
STEP05 1.6636 7.4060 17.4178 4.5413 8.5342 114.942 16.2042 18.9214 51.8806 54.7570
STEP06 1.6230 7.2879 18.2899 4.5029 8.5846 117.1300 19.622 20.7093 71.3013 53.8358
STEP07 2.8137 5.9697 25.7732 6.8143 12.5176 9.2915 20.833 20.6558 20.7630 70.6090
STEP08 2.8886 4.5539 26.6311 6.7888 12.3058 11.3944 22.8180 23.9736 40.3226 72.1370
STEP09 5.6274 6.4365 50.2829 11.9207 28.6944 18.8768 33.6984 35.8102 50.3778 137.9310
STEP10 5.4065 6.1138 40.9417 11.2061 19.6126 16.8741 35.2890 30.4414 62.4025 112.5180
STEP11 10.9679 7.5980 76.4818 32.1156 49.9064 40.5886 51.9143 53.3689 73.5970 212.2020
STEP12 10.2119 23.2018 52.9450 17.6991 33.1950 36.4465 49.9376 42.3729 77.3694 144.6660
LSCORE 0.7686 2.5324 5.7364 2.2223 2.4934 65.2294 6.5201 9.6946 41.5142 23.3633
LSCORE/$ 0.0317 0.0356 0.9318 0.0931 0.1384 0.5930 0.333762
HSCORE 5.4967 7.5666 42.2283 12.31 22.8554 19.1639 33.6555 32.6461 49.5762 115.7030
HSCORE/$ 0.1759 0.3265 0.2737 0.4807 0.4663 0.7082 1.6529
SCORE 2.0554 5.3774 15.5641 5.2318 7.5490 35.3561 14.8135 17.7902 45.3665 51.9923
SCORE/$ 0.0747 0.1078 0.5050 0.2116 0.2541 0.6480 0.74274
3.6.6 Scale factor 10, 1 query stream, 1 server (8 servers for v7cluster)

SF10 #1/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0050 0.0198 0.0104 0.4749 3.8095 0.5610 0.1166 0.8274 0.0947 0.1757
STEP02 0.0109 0.0207 0.0323 0.7364 3.2862 0.6941 0.1037 1.6655 0.0854 0.4154
STEP03 0.0208 0.0456 0.1077 0.9268 4.5829 0.6060 0.1109 1.7379 0.1068 0.6811
STEP04 0.0346 0.0272 0.2104 1.0624 4.4822 0.9128 0.0942 1.1002 0.0928 0.7865
STEP05 0.0500 0.1210 0.3210 1.3428 5.8858 1.1353 0.1786 2.5316 0.1802 1.0473
STEP06 0.0492 0.1136 0.3207 1.8733 6.9979 1.4039 0.2710 3.9169 0.2731 1.0511
STEP07 0.0667 0.0961 0.4799 1.0271 1.0001 0.7162 0.2425 0.2147 0.2204 0.9768
STEP08 0.0683 0.0788 0.4983 1.1344 1.7382 0.7238 0.3339 0.4377 0.2771 1.0534
STEP09 0.1121 0.0931 1.4551 1.2110 1.6526 0.9791 0.4087 0.6935 0.3241 2.1173
STEP10 0.1049 0.0941 0.9657 1.2301 1.8821 0.8685 0.7598 3.3512 0.4157 1.5506
STEP11 0.1920 0.1191 3.1938 1.3126 1.9704 1.4876 0.9669 3.6563 0.7057 3.6496
STEP12 0.1678 0.6825 1.4376 1.3282 1.9142 1.0537 0.9318 5.3705 0.6004 1.9149
LSCORE 0.0215 0.0438 0.0963 0.9763 4.6834 0.8368 0.1353 1.7222 0.1258 0.5921
LSCORE/$ 0.0088 0.0426 0.0076 0.0169 0.2152 0.0157 0.0740
HSCORE 0.1096 0.1325 1.0749 1.2026 1.6526 0.9403 0.5321 1.2746 0.3895 1.6934
HSCORE/$ 0.0171 0.0236 0.0134 0.0665 0.1593 0.0487 0.2116
SCORE 0.0485 0.0762 0.3217 1.0836 2.7820 0.8870 0.2684 1.4816 0.2214 1.0014
SCORE/$ 0.0098 0.0253 0.008 0.0335 0.1852 0.0276 0.1251
3.6.7 Scale factor 10, 2 query streams, 1 server (8 servers for v7cluster)

SF10 #2/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0065 0.0339 0.0100 0.8428 6.9037 1.0140 0.1989 2.5406 0.1479 0.2817
STEP02 0.0140 0.0475 0.0196 1.3476 8.2174 1.2511 0.2304 4.9220 0.1974 0.6734
STEP03 0.0280 0.0672 0.0912 2.1892 6.2783 1.2983 0.2715 3.9435 0.2591 1.1448
STEP04 0.0451 0.1049 0.1901 2.4971 9.3195 1.7669 0.1964 5.1139 0.1986 1.6744
STEP05 0.0689 0.0953 0.3302 2.7946 11.242 1.8100 0.2883 6.3940 0.2929 2.1664
STEP06 0.0691 0.1240 0.4815 3.5152 14.435 2.0111 0.5444 5.4518 0.5457 2.2474
STEP07 0.0995 0.1169 0.7798 2.0170 2.1313 1.4603 0.5686 0.5905 0.4803 1.7302
STEP08 0.1002 0.1180 0.4186 2.3943 3.7048 1.2957 0.7306 1.1359 0.5880 1.6374
STEP09 0.1738 0.3818 0.6771 2.3188 3.2702 2.1333 1.0197 1.6843 0.6839 3.7533
STEP10 0.1573 0.4045 0.7272 2.3477 3.6686 1.9933 1.3206 7.5586 0.8095 2.5963
STEP11 0.2782 1.3700 6.5831 2.5421 3.4931 2.2806 1.6689 7.5430 1.5672 6.9030
STEP12 0.2377 0.6441 3.3194 2.6716 3.8395 2.1771 1.6871 8.9356 1.1288 3.7002
LSCORE 0.0286 0.0716 0.0905 1.9835 9.0125 1.4817 0.2697 4.5402 0.2494 1.0999
LSCORE/$ 0.0180 0.0818 0.0134 0.0337 0.5675 0.0311 0.1374
HSCORE 0.1621 0.3515 1.2328 2.3721 3.2894 1.8485 1.0786 2.8829 0.8073 2.9821
HSCORE/$ 0.0215 0.0299 0.0168 0.1348 0.3603 0.1009 0.3727
SCORE 0.0682 0.1587 0.3341 2.1691 5.4448 1.6550 0.5394 3.6179 0.4488 1.8111
SCORE/$ 0.0197 0.0495 0.0150 0.0674 0.4522 0.0561 0.2263
3.6.8 Scale factor 10, 4 query streams, 1 server (8 servers for v7cluster)

SF10 #4/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0105 0.0652 0.0242 1.6013 14.0383 1.8670 0.3959 5.8348 0.2644 0.5549
STEP02 0.0212 0.0702 0.0820 2.4171 14.2772 2.1800 0.3698 6.9096 0.3257 1.4373
STEP03 0.0427 0.1441 0.2278 3.4893 10.7229 2.6489 0.4208 10.2357 0.3958 2.2133
STEP04 0.0657 0.1946 0.4703 4.4887 17.2670 3.6181 0.3787 13.6636 0.3721 2.7233
STEP05 0.0992 0.3381 0.7094 5.2234 18.9309 3.2839 0.6232 22.4468 0.6240 3.5838
STEP06 0.1000 0.3374 0.8064 6.6113 29.4153 3.7726 1.0406 27.9520 1.1231 3.4744
STEP07 0.1538 0.3841 1.2721 3.8237 4.4956 2.2509 1.2260 1.7904 0.9262 2.6010
STEP08 0.1479 0.4572 1.1595 4.0627 6.5959 2.5554 1.7328 4.2651 1.2789 2.4533
STEP09 0.2576 0.6944 3.4924 4.7477 5.8628 3.0736 1.6984 4.9390 1.2811 5.3958
STEP10 0.2424 0.9163 2.7187 4.8525 6.6822 3.0457 2.2505 16.0512 1.4027 3.9986
STEP11 0.4361 2.2408 9.6816 4.4008 6.3772 4.1216 3.1623 18.0522 3.1385 10.445
STEP12 0.4076 1.7863 3.3537 4.5142 7.0365 3.7629 2.7759 23.5145 1.9080 4.7708
LSCORE 0.0429 0.1565 0.2228 3.5748 16.5469 2.8002 0.4975 12.3316 0.4553 1.9773
LSCORE/$ 0.0325 0.1505 0.0254 0.0621 1.5414 0.0569 0.2471
HSCORE 0.2515 0.8745 2.7719 5.3825 6.1075 3.0673 2.0357 7.9669 1.5281 5.3566
HSCORE/$ 0.0398 0.0555 0.0279 0.2544 0.9958 0.1910 0.54457
SCORE 0.1039 0.3700 0.7859 3.9581 10.0529 2.9307 1.0063 9.9118 0.8341 2.9350
SCORE/$ 0.0360 0.0914 0.0266 0.1257 1.2389 0.1042 0.3668
3.6.9 Scale factor 10, 8 query streams, 1 server (8 servers for v7cluster)

SF10 #8/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0111 0.1058 0.0179 2.4963 17.8207 3.3792 0.6484 12.6956 0.4822 0.7251
STEP02 0.0244 0.1041 0.0554 3.6211 16.9718 4.4317 0.6449 17.6573 0.5667 1.7568
STEP03 0.0491 0.1936 0.1601 5.1298 17.7603 5.1870 0.7271 22.5014 0.7020 2.9094
STEP04 0.0782 0.3317 0.3211 6.6643 23.7782 6.1971 0.6475 28.0677 0.6225 3.8872
STEP05 0.1172 0.4156 0.5781 7.6602 27.0735 6.3392 0.9747 35.9309 0.9837 5.1091
STEP06 0.1174 0.5800 0.5869 11.6547 42.7366 7.4973 1.9773 47.5651 2.2304 4.8659
STEP07 0.1784 0.5611 1.1142 6.3838 8.0113 3.9240 2.4544 3.6566 1.5296 2.5534
STEP08 0.1741 0.5952 1.1449 6.9567 11.1574 4.1500 2.8321 6.4509 1.9406 2.6415
STEP09 0.3091 1.0202 3.6585 6.7299 9.2481 5.5324 3.3316 8.8511 2.2520 5.4293
STEP10 0.2730 1.3726 2.5702 7.6770 10.8405 5.2976 3.9193 29.6419 2.4232 4.1596
STEP11 0.5210 3.3685 9.4047 6.3636 10.1749 6.7232 5.1006 31.5107 4.5335 9.8784
STEP12 0.4747 2.6020 3.7997 7.1536 10.7785 6.4210 4.5376 40.4639 3.4996 5.3200
LSCORE 0.0493 0.2356 0.1609 5.4932 22.9646 5.3245 0.8509 24.9306 0.8000 2.6639
LSCORE/$ 0.0500 0.2089 0.0484 0.1063 3.1163 0.1000 0.3329
HSCORE 0.2943 1.2650 2.7448 6.8573 9.9619 5.2325 3.5769 14.0949 2.5205 4.4699
HSCORE/$ 0.0623 0.0906 0.0476 0.4471 1.7618 0.3150 0.5587
SCORE 0.1205 0.5460 0.6647 6.1375 15.1252 5.2783 1.7446 18.7455 1.4200 3.4507
SCORE/$ 0.0830 0.0558 0.1376 0.0480 0.2180 2.3431 0.17751 0.4313
3.6.10 Scale factor 10, 8 query streams, 8 replicated servers

SF10 #8/8 owlim5.3* owlim5.3* owlim5.3* virtuoso7* virtuoso7* virtuoso7* virtuoso7*
METHOD basic quad rtree++ basic rtree quad rtree++
STEP01 0.0406 0.1590 0.0839 0.9335 0.7577 6.6192 1.4058
STEP02 0.0871 0.1657 0.2590 0.8303 0.6837 13.3240 3.3235
STEP03 0.1666 0.3649 0.8623 0.8872 0.8547 13.9034 5.4495
STEP04 0.2774 0.2179 1.6838 0.7539 0.7429 8.8018 6.2927
STEP05 0.4000 0.9684 2.5683 1.4292 1.4420 20.2530 8.3787
STEP06 0.3942 0.9093 2.5656 2.1685 2.1855 31.3357 8.4095
STEP07 0.5339 0.7689 3.8395 1.9402 1.7636 1.7182 7.8147
STEP08 0.5470 0.6305 3.9868 2.6718 2.2173 3.5018 8.4272
STEP09 0.8974 0.7452 11.6414 3.2697 2.5930 5.5482 16.9384
STEP10 0.8394 0.7531 7.7257 6.0785 3.3257 26.8097 12.4050
STEP11 1.5367 0.9532 25.5510 7.7354 5.6457 29.2505 29.1971
STEP12 1.3430 5.4607 11.5009 7.4550 4.8039 42.9646 15.3198
LSCORE 0.1720 0.3504 0.7699 1.0822 1.0060 13.7665 4.7334
LSCORE/$ 0.0154 0.0143 0.1966 0.0676
HSCORE 0.8767 1.0597 8.5926 4.2533 3.1142 10.1884 13.5361
HSCORE/$ 0.0607 0.0444 0.1455 0.19337
SCORE 0.3884 0.6093 2.5720 2.1455 1.7700 11.8431 8.0045
SCORE/$ 0.0306 0.0252 0.1691 0.1143
3.6.11 Scale factor 100, 1 and 2 query streams, 8 partitioned servers

SF100 #1/1 v7cluster v7cluster v7cluster SF100 #2/1 v7cluster v7cluster v7cluster
METHOD basic quad rtree METHOD basic quad rtree
STEP01 0.1058 0.4640 0.0821 STEP01 0.1683 0.6229 0.1695
STEP02 0.2303 0.4118 0.3703 STEP02 0.3586 0.8682 0.5513
STEP03 0.3943 0.9395 0.5208 STEP03 0.5829 0.5552 1.0560
STEP04 0.5587 1.0636 0.4454 STEP04 0.8644 2.5406 1.0788
STEP05 0.7482 2.1896 0.6861 STEP05 1.1468 4.5236 1.5198
STEP06 0.9933 2.9455 0.7742 STEP06 1.4029 6.4307 1.5022
STEP07 0.8034 0.5584 0.2992 STEP07 1.1142 0.6679 0.6052
STEP08 0.8792 1.0936 0.3157 STEP08 1.4387 2.1125 0.6102
STEP09 0.9900 1.0012 0.4326 STEP09 1.6271 1.7392 0.6401
STEP10 0.9835 1.1958 0.2905 STEP10 1.8525 2.0997 0.5107
STEP11 1.0821 1.2656 0.5427 STEP11 1.7895 2.1142 1.0567
STEP12 1.1686 1.3336 0.5325 STEP12 1.7769 2.2917 0.8551
LSCORE 0.3984 1.0353 0.3943 LSCORE 0.6049 1.6760 0.7901
LSCORE/$ 0.0036 0.0094 0.0035 LSCORE/$ 0.0055 0.0152 0.0071
HSCORE 0.9770 1.0356 0.3885 HSCORE 1.5764 1.7092 0.6914
HSCORE/$ 0.0088 0.0094 0.0035 HSCORE/$ 0.0143 0.0155 0.0062
SCORE 0.6239 1.0355 0.3914 SCORE 0.9765 1.6925 0.7391
SCORE/$ 0.0056 0.0094 0.0035 SCORE/$ 0.0088 0.0154 0.0067
3.6.12 Scale factor 100, 4 and 8 query streams, 8 partitioned servers

SF100 #4/1 v7cluster v7cluster v7cluster SF100 #8/1 v7cluster v7cluster v7cluster
METHOD basic quad rtree METHOD basic quad rtree
STEP01 0.2992 1.3469 0.3275 STEP01 0.2387 1.6907 0.3302
STEP02 0.5750 0.9592 0.7332 STEP02 0.5327 1.5570 0.5674
STEP03 0.9540 2.5549 1.4347 STEP03 0.7490 2.6810 1.2959
STEP04 1.1772 4.1456 1.4000 STEP04 1.1318 4.1801 0.9172
STEP05 1.2345 6.470 1.4869 STEP05 1.3073 6.1331 1.3168
STEP06 1.5035 12.354 2.0413 STEP06 1.7921 9.9537 1.6515
STEP07 1.8901 1.8784 1.3785 STEP07 1.6881 2.1458 0.7626
STEP08 2.3643 3.2899 1.3904 STEP08 2.1298 3.0693 0.9078
STEP09 1.9110 2.4358 0.9731 STEP09 1.8109 2.6453 0.9261
STEP10 2.5792 2.7515 1.0189 STEP10 2.2441 3.2538 0.8878
STEP11 2.2809 2.8921 1.0824 STEP11 1.7612 3.0709 0.9556
STEP12 3.2223 3.2312 1.1578 STEP12 1.7843 3.4796 1.0949
LSCORE 0.8429 3.2084 1.0656 LSCORE 0.7951 3.4863 0.8862
LSCORE/$ 0.0076 0.0291 0.0097 LSCORE/$ 0.0072 0.0317 0.0080
HSCORE 2.3338 2.6985 1.1555 HSCORE 1.8918 2.9076 0.9173
HSCORE/$ 0.0212 0.0245 0.0105 HSCORE/$ 0.0172 0.0264 0.0083
SCORE 1.4026 2.9424 1.1096 SCORE 1.2265 3.1838 0.9016
SCORE/$ 0.0127 0.0267 0.0101 SCORE/$ 0.0111 0.0289 0.0082
4. Conclusions

In this report, we have described an evaluation of the LOD2 GeoBench on a variety of system configurations. We now draw conclusions on the following issues:
The benchmark itself. The LOD2 GeoBench is a challenging benchmark; specifically, the Instance Aggregation and Retrieval Queries pose an intense workload on the system. We see that exact implementations (i.e. basic, rtree, rtree++, but not quad) have a hard time scaling the Instance Aggregation Query well at the higher zoom levels. We also see that the Instance Retrieval Query at the first zoom levels where it is used (7-9) causes a dip in performance, because such retrieval queries yield many instances and access many data pages in the database subsystem. On the one hand this tells us that the benchmark is interesting. Publishing about this benchmark will put emphasis on finding better solutions to e.g. the Instance Retrieval Query, for instance by pushing the envelope in query optimization. Further, the inherent problems at the lower zoom levels may help the RDF server vendors to provide better hooks to perform indexing and pre-computation. As ideas for a v3.0 of the benchmark, we should consider moving the switch-over point from Instance Aggregation Query to Instance Retrieval Query to a deeper zoom level. This would be a natural reaction in a real-life application to ensure dependable latencies across queries. Further, in the future we need to test on larger data, and with many more concurrent query streams. Finally, a better analysis of the performance stability of the results is needed. Because we are working on real data, the cardinalities of the selections are not fully predictable and can vary considerably, potentially introducing noise in the benchmark scores. This could be addressed by making the query generator even more intelligent in generating query patterns, such that it generates properly and evenly balanced parameter bindings.
The state of RDF database technology. The three rightmost result groups in Figure 3 are an example of the achievements in the LOD2 project, where academic research performed by CWI on columnar and vectorized query execution has measurably improved the performance of the OpenLink Virtuoso product from V6 to V7, by a factor 7 in this case, creating a competitive advantage. In general, geographical index technology is shown by the LOD2 GeoBench to be quite effective at the high zoom levels. The plans do show some unexpected results, with certain quad Virtuoso V7 query plans becoming slower than in V6, which is likely down to query optimizer issues. Query optimization remains one of the biggest challenges in SPARQL query execution, which in the LOD2 GeoBench shows in faults in properly handling the disjunctive queries (the four facet selections) and the complex quad expressions.
Even though the thinking in the RDF community may be that RDF database support is closing in on the industry readiness of relational technology, the LOD2 GeoBench shows some very significant conceptual holes. For instance, relational technology offers important physical design concepts, such as materialized views and clustered indexes, explicitly created for certain predicates. These concepts cannot be expressed in the RDF world. For instance, in a multi-resolution map situation, a relational DBA or database designer would likely develop multiple tables at multiple resolutions, and create separate (RTree) indexes for these. Such tables are materialized views that store precomputed expressions (like facet counts at a certain granularity). This means that queries at a low zoom level would only access the materialized view relevant for them, which at a low zoom level could have pruned most of the detailed data (the individual lamp posts). Accessing that materialized view through its RTree index will be efficient. If all materialized views for the different resolutions were unified into one big data structure, all information for other zoom levels would end up intermingled in the same disk blocks of the RTree, such that most of the data scanned would be irrelevant (because it is for a different resolution). This unifying of all data in one big bucket is what the RDF model does. What is needed are mechanisms to create materialized views (maybe by constructing derived data in a special kind of triple graph) and to allow certain indexes (such as RTree) to be built separately for such a triple graph. That way the RTree will only contain relevant information. Currently, RDF database technology does not offer such database design concepts.
RDF geographical browsing application design: faceted browsing on large datasets needs pre-computation. There is no way a Google Maps experience can be created straight from the raw base data (triples) in a dataset. The quad approach described and benchmarked here specifically transforms the application database needs in such a way that precomputation of expressions becomes possible. In this case, the quad approach precomputes facet instance counts for all tiles, at multiple different granularities. Queries then use these precomputed counts to avoid having to go to the base data. It cannot be stressed enough that without precomputation, queries at a high zoom level would never perform well, nor would they ever produce nice-looking results (just millions of lamp posts that cannot be sensibly drawn on a map). There is also little hope that such precomputation and indexing could be arranged fully automatically. This means that application designers need to take the database design issue very seriously.
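To illustrate the kind of precomputation this relies on, the sketch below assigns each point to a map tile at several zoom levels and aggregates facet counts per (zoom, tile, facet) key. It uses the standard Web-Mercator ("slippy map") tile scheme as a stand-in; the tile numbering actually used by the quad approach may differ, and all names here are our own:

```python
import math
from collections import Counter

def tile_of(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Standard Web-Mercator (slippy-map) tile coordinates at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def precompute_facet_counts(points, zoom_levels):
    """points: iterable of (lon, lat, facet). Returns counts keyed by
    (zoom, tile, facet), so a low-zoom Facet Count Query reads a handful
    of aggregates instead of scanning millions of base triples."""
    counts = Counter()
    for lon, lat, facet in points:
        for z in zoom_levels:
            counts[(z, tile_of(lon, lat, z), facet)] += 1
    return counts
```

A Facet Count Query for a viewport at a given zoom level then sums the precomputed counts of the tiles intersecting the viewport, rather than touching the instance geometries themselves.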
The latest version of the LOD2 Geographical Browser adds significant new features that make the association between geographical information and RDF data flexible to specify. The older version, of which a screenshot was shown in Figure 1, just assumed that the geographical literal (point, polyline, polygon) would be a direct property of a facet instance subject. It is, however, also possible to associate facet instances over long(er) join paths to geographical literals. The consequence of such longer join paths is that geographical queries will experience less locality from the RTree join path; more importantly, an interface where such join paths could be varied flexibly at run-time would make it much more difficult to generate materialized views (such as our pre-generated quad triples). Creating a Browsing Interface that flexibly allows users to specify these associations, yet renders result pages in interactive time on very large-scale data, is extremely challenging (if not impossible). Another issue is whether ordinary users, accessing (RDF) data via graphical interfaces, are looking for the flexibility to associate join paths through a complex data model that is likely unknown to them. It seems more probable that if relevant complex join paths between instances and their geography exist, it would be the task of an application designer to identify these. In such a case, the aforementioned materialized view functionality that is called for in RDF database systems would come in handy to pre-materialize these geographies as direct properties and accelerate them in separate RTree indexes.
5. Appendix: Configuration Details

5.1 Software
Virtuoso 6: Version 06.04.3132-pthreads for Linux, as of May 14, 2012
Virtuoso 7 (for both the single-server and the cluster version): we used a development version of OpenLink Virtuoso Universal Server: Version 07.00.3203-pthreads for Linux, as of Aug 18, 2013
Owlim: Owlim-SE Version 5.3.6156; Tomcat Version 7.0.30
5.2 Hardware

We used the CWI Scilens (www.scilens.org) cluster for the benchmark experiments. This cluster is designed for high I/O bandwidth and consists of multiple layers of machines. In order to get large amounts of RAM, we used only the "bricks" layer, which contains its most powerful machines. The machines were connected by Mellanox MCX353A-QCBT ConnectX3 VPI HCA cards (QDR IB 40Gb/s and 10GigE) through an InfiniScale IV QDR InfiniBand switch (Mellanox MIS5025Q). Each machine has the following specification.
Hardware (8 machines):
- Processors: 2x Intel(R) Xeon(R) CPU E5-2650, 2.00GHz (8 cores & hyperthreading), Sandy Bridge architecture
- Memory: 256GB
- Hard disks: 3x 1.8TB (7,200rpm) SATA in RAID 0 (180MB/s sequential throughput)
Software:
- Operating system: Linux version 3.3.4-3.fc16.x86_64
- Filesystem: ext4
- Java version and JVM: Version 1.6.0_31, 64-Bit Server VM (build 20.6-b01)
The total cost of this configuration was EUR 70,000 when acquired in 2012.
5.3 Configuration files

Virtuoso 6 & Virtuoso 7 & V7 cluster

Each database has a virtuoso.ini file as the configuration file. For the cluster version, in addition to the virtuoso.ini file, there are three other configuration files in each node: cluster.ini, virtuoso.global.ini, clusterglobal.ini.
- The virtuoso.ini file reads:
[Database]
DatabaseFile = virtuoso.db
TransactionFile = virtuoso.trx
ErrorLogFile = virtuoso.log
ErrorLogLevel = 7
FileExtend = 200
Striping = 0
Syslog = 0
;
; Server parameters
;
TempStorage = TempDatabase
[Parameters]
ServerPort = 1113
ServerThreads = 100
AsyncQueueMaxThreads = 50
ThreadsPerQuery = 32
CheckpointInterval = 120
NumberOfBuffers = 6000000
MaxDirtyBuffers = 450000
MaxCheckpointRemap = 2500000
DefaultIsolation = 2
MaxMemPoolSize = 40000000
StopCompilerWhenXOverRunTime = 1
AdjustVectorSize = 1
IndexTreeMaps = 64
FDsPerFile = 4
UnremapQuota = 0
CaseMode = 2
AllowOSCalls = 1
SafeExecutables = ../../bin/isql
Debug = 0
SQLOptimizer = 1
CallstackOnException = 0
PlDebug = 0
DirsAllowed = /,., ../../vad,../../dataset
MaxVectorSize = 500000
[HTTPServer]
ServerPort = 8892
ServerThreads = 30
ServerRoot = .
FTPServerPort = 10565
FTPServerAnonymousLogin = 1
FTPServerTimeout = 1200
[AutoRepair]
BadParentLinks = 0
BadDTP = 0
[Client]
SQL_QUERY_TIMEOUT = 0
SQL_TXN_TIMEOUT = 0
SQL_PREFETCH_ROWS = 100
SQL_PREFETCH_BYTES = 16000
[VDB]
ArrayOptimization = 0
NumArrayParameters = 10
[TempDatabase]
DatabaseFile = virtuoso.tdb
TransactionFile = virtuoso.ttr
FileExtend = 200
[Replication]
ServerName = virt6565
ServerEnable = 1
QueueMax = 50000
[URIQA]
DefaultHost = localhost.localdomain:13565
LocalHostNames = localhost:13565, master:13565, 10.1.1.1:13565
LocalHostMasks = master_.iv.dev.null:13565, master_:13565
Note that in each cluster node, the server port is different.
- The virtuoso.global.ini in node 1 (master node) reads:
[Parameters]
MaxQueryMem = 30G
MaxVectorSize = 1000000
Affinity = 1-7 16-23
ListenerAffinity = 0
[Flags]
enable_subscore = 0
dfg_empty_more_pause_msec = 100
dfg_max_empty_mores = 100000
qp_thread_min_usec = 100
cl_dfg_batch_bytes = 100000000
enable_high_card_part = 1
enable_vec_reuse = 1
mp_local_rc_sz = 0
dbf_explain_level = 3
enable_feed_other_dfg = 1
enable_cll_nb_read = 1
dbf_no_sample_timeout = 1
Note that most of the parameters in virtuoso.global.ini are the same for every node, except "Affinity", which is 9-15 24-31 for the nodes at an even index (e.g., node 2, node 4, etc.).
- The cluster.ini in node 1 (master node) reads:
[Cluster]
Threads = 200
Master = Host1
ThisHost = Host1
ReqBatchSize = 10000
BatchesPerRPC = 4
BatchBufferBytes = 20000
LocalOnly = 2
MaxKeepAlivesMissed = 3000
[ELASTIC]
Slices = 16
Segment1 = 1024, cl1/cl1.db = q1
Note that only the "ThisHost" parameter is changed for other nodes. "Slices = 16" appears only in the master node.
- The clusterglobal.ini reads:
[Cluster]
Threads = 200
Master = Host1
ReqBatchSize = 10000
BatchesPerRPC = 4
BatchBufferBytes = 20000
LocalOnly = 2
MaxKeepAlivesMissed = 2000
Host1 = 192.168.64.203:22201
Host2 = 192.168.64.203:22202
Host3 = 192.168.64.204:22203
Host4 = 192.168.64.204:22204
Host5 = 192.168.64.209:22205
Host6 = 192.168.64.209:22206
Host7 = 192.168.64.207:22207
Host8 = 192.168.64.207:22208
Host9 = 192.168.64.205:22209
Host10 = 192.168.64.205:22210
Host11 = 192.168.64.211:22211
Host12 = 192.168.64.211:22212
Host13 = 192.168.64.212:22213
Host14 = 192.168.64.212:22214
Host15 = 192.168.64.213:22215
Host16 = 192.168.64.213:22216
Owlim
- The getting-started application was used for bulk loading. The owlim.ttl file in the getting-started application reads:
# Sesame configuration template for a owlim repository
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
rep:repositoryID "owlim" ;
rdfs:label "OWLIM Getting Started" ;
rep:repositoryImpl [
rep:repositoryType "openrdf:SailRepository" ;
sr:sailImpl [
sail:sailType "owlim:Sail" ;
owlim:owlim-license "OWLIM_SE_01092013_128cores.license" ;
owlim:entity-index-size "500000000" ;
owlim:repository-type "file-repository" ;
owlim:ruleset "empty" ;
owlim:storage-folder "owlim-storage" ;
owlim:transaction-mode "fast" ;
# OWLIM-SE parameters
owlim:cache-memory "120G" ;
# OWLIM-Lite parameters
owlim:noPersist "false" ;
]
].
- The example.sh script in the getting-started application reads:
foo=`pwd`
cd ..
. ./setvars.sh
cd $foo
#$JAVA_HOME/bin/java -Xmx512m -cp "bin:$CP_TESTS" GettingStarted $*
$JAVA_HOME/bin/java -d64 -Xmx200G -Xms160G -Dcache-memory=100G -Ddisable-plugins=rdfpriming -cp "bin:$CP_TESTS" GettingStarted context= $*
‐ Owlim databases have a Sesame template file in ~/.aduna/openrdf-sesame-console/templates/. The Sesame template file reads:
#
# Sesame configuration template for an OWLIM-SE repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
rep:repositoryID "olgeo10" ;
rdfs:label "OWLIM Geo 10" ;
rep:repositoryImpl [
rep:repositoryType "openrdf:SailRepository" ;
sr:sailImpl [
sail:sailType "owlim:Sail" ;
owlim:owlim-license "OWLIM_SE_01092013_128cores.license" ;
owlim:base-URL "{%Base URL|http://example.org/owlim#%}" ;
owlim:defaultNS "{%Default namespaces for imports(';' delimited)%}" ;
owlim:entity-index-size "{%Entity index size|200000%}" ;
owlim:entity-id-size "{%Entity ID bit-size|32%}" ;
owlim:imports "{%Imported RDF files(';' delimited)%}" ;
owlim:repository-type "{%Repository type|file-repository%}" ;
owlim:ruleset "{%Rule-set|owl-horst-optimized%}" ;
owlim:storage-folder "{%Storage folder|storage%}" ;
owlim:enable-context-index "{%Use context index|false%}" ;
owlim:cache-memory "50G" ;
owlim:tuple-index-memory "{%Main index memory|80m%}" ;
owlim:enablePredicateList "{%Use predicate indices|false%}" ;
owlim:predicate-memory "{%Predicate index memory|0%}" ;
owlim:fts-memory "{%Full-text search memory|0%}" ;
owlim:ftsIndexPolicy "{%Full-text search indexing policy|never%}" ;
owlim:ftsLiteralsOnly "{%Full-text search literals only|true%}" ;
owlim:in-memory-literal-properties "{%Cache literal language tags|false%}" ;
owlim:enable-literal-index "{%Enable literal index|true%}" ;
owlim:index-compression-ratio "{%Index compression ratio|-1%}" ;
owlim:check-for-inconsistencies "{%Check for inconsistencies|false%}" ;
owlim:disable-sameAs "{%Disable OWL sameAs optimisation|false%}" ;
owlim:enable-optimization "{%Enable query optimisation|true%}" ;
owlim:transaction-mode "{%Transaction mode|safe%}" ;
owlim:transaction-isolation "{%Transaction isolation|true%}" ;
owlim:query-timeout "{%Query time-out (seconds)|0%}" ;
owlim:query-limit-results "{%Limit query results|0%}" ;
owlim:throw-QueryEvaluationException-on-timeout "{%Throw exception on query time-out|false%}" ;
owlim:useShutdownHooks "{%Enable shutdown hooks|true%}" ;
owlim:read-only "{%Read-only|false%}" ;
]
].
5.4 Bulk Load
Virtuoso 6
‐ Bulk-loading was run with only a single loading process. Bulk-loading for scale 1 with Virtuoso 6 takes 3h 35m.
11:41:47 PL LOG: Loader started
13:43:31 Checkpoint started
13:45:27 Checkpoint finished, log reused
15:16:20 PL LOG: No more files to load. Loader has finished
Virtuoso 7 ‐ Bulk-loading was run with 14 loading processes in parallel. For example, bulk-loading for scale 10 with Virtuoso 7 takes 1h and 32 minutes.
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
02:30:24 PL LOG: No more files to load. Loader has finished,
02:31:02 PL LOG: No more files to load. Loader has finished,
02:31:58 PL LOG: No more files to load. Loader has finished,
02:32:29 PL LOG: No more files to load. Loader has finished,
02:33:47 PL LOG: No more files to load. Loader has finished,
02:36:08 PL LOG: No more files to load. Loader has finished,
02:39:10 PL LOG: No more files to load. Loader has finished,
02:40:21 PL LOG: No more files to load. Loader has finished,
02:40:21 PL LOG: No more files to load. Loader has finished,
02:45:59 PL LOG: No more files to load. Loader has finished,
02:46:06 PL LOG: No more files to load. Loader has finished,
02:47:06 PL LOG: No more files to load. Loader has finished,
02:47:39 PL LOG: No more files to load. Loader has finished,
03:10:47 PL LOG: No more files to load. Loader has finished,
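The 1h32m figure follows from the first "Loader started" and the last "Loader has finished" timestamps above. A small sketch of that arithmetic, assuming all timestamps fall on the same day:

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds since midnight; 10# forces base-10 so
# leading zeros (e.g. "09") are not misread as octal.
to_secs() {
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

start=$(to_secs 01:38:19)   # first "Loader started"
end=$(to_secs 03:10:47)     # last "Loader has finished"
echo "elapsed: $(( (end - start) / 60 )) minutes"   # prints: elapsed: 92 minutes
```

The same arithmetic on the scale-1 Virtuoso 6 log above (11:41:47 to 15:16:20) yields the 3h35m quoted there.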
V7 cluster ‐ Bulk-loading was run with 2 loading processes on each node (thus, 32 loading processes across all 16 nodes). For example, bulk-loading for scale 100 with the V7 cluster takes 5h and 11 minutes on the master node.
17:13:50 PL LOG: Loader started
17:16:07 PL LOG: Loader started
22:24:00 PL LOG: No more files to load. Loader has finished
22:24:02 PL LOG: No more files to load. Loader has finished
Owlim ‐ Bulk-loading was run using the getting-started application. The dataset is copied to the preload directory and then loaded into the Owlim repository by running the script example.sh. For example, bulk-loading for scale 1 with Owlim takes 1h 25m.
18:34:40 ===== Load Files (from the 'preload' parameter) ==========
18:34:40 Loading files from: /scratch/duc/lod2/GeoBench/owlim/owlim-se-5.3.6156/getting-started/./preload
Loading FacetCount12.nt 373566 statements
Loading FacetCount14.nt . 731376 statements
Loading FacetCount16.nt .. 1380933 statements
Loading FacetCount18.nt ..... 2563677 statements
Loading FacetCount20.nt ......... 4681485 statements
Loading FacetCount22.nt ................ 8342163 statements
Loading FacetCount24.nt ............................ 14383152 statements
Loading FacetMap12.nt ......... 4627860 statements
Loading FacetMap14.nt ................ 8364460 statements
Loading FacetMap16.nt ............................. 14697532 statements
Loading FacetMap18.nt ................................................. 24647112 statements
Loading FacetTile.nt ........................................................ 28258360 statements
Loading LGD-Dump-Ontology.nt 8721 statements
Loading refined_LGD-Dump-RelevantNodes.sorted.nt .......................................................................................................................................... 69495409 statements
Loading refined_LGD-Dump-RelevantWays.sorted.nt ........................................................................................................................................ 68475661 statements
19:59:54 TOTAL: 251031467 statements loaded
5.4.1 Sizing
Virtuoso 6 & Virtuoso 7
The database size is computed by measuring the size of the virtuoso.* files in each database directory. For example, the database of scale 10 is measured:
[duc@bricks05 10gindex]$ ls -al -h virtuoso.*
-rw-r--r-- 1 duc ins1 108G Aug 21 12:03 virtuoso.db
-rwxrwxr-x 1 duc ins1 1.8K Aug 6 16:58 virtuoso.ini
-rw-r--r-- 1 duc ins1 14K Aug 21 12:03 virtuoso.log
-rw-r--r-- 1 duc ins1 0 Aug 1 01:34 virtuoso.pxa
-rw-r--r-- 1 duc ins1 14M Aug 21 02:54 virtuoso.tdb
-rw-r--r-- 1 duc ins1 0 Aug 21 12:03 virtuoso.trx
V7 Cluster
The database size is computed by summing the size of each database directory on each node. For example, the database of scale 100 is measured:
du -s -h /scratch/duc/lod2/cg100/*/cl*/
Database size at the node bricks03
88G /scratch/duc/lod2/cg100/01/cl1/
81G /scratch/duc/lod2/cg100/02/cl2/
Database size at the node bricks04
74G /scratch/duc/lod2/cg100/03/cl3/
73G /scratch/duc/lod2/cg100/04/cl4/
Database size at the node bricks09
71G /scratch/duc/lod2/cg100/05/cl5/
71G /scratch/duc/lod2/cg100/06/cl6/
Database size at the node bricks07
69G /scratch/duc/lod2/cg100/07/cl7/
93G /scratch/duc/lod2/cg100/08/cl8/
Database size at the node bricks05
67G /scratch/duc/lod2/cg100/09/cl9/
69G /scratch/duc/lod2/cg100/10/cl10/
Database size at the node bricks11
66G /scratch/duc/lod2/cg100/11/cl11/
68G /scratch/duc/lod2/cg100/12/cl12/
Database size at the node bricks12
69G /scratch/duc/lod2/cg100/13/cl13/
76G /scratch/duc/lod2/cg100/14/cl14/
Database size at the node bricks13
69G /scratch/duc/lod2/cg100/15/cl15/
72G /scratch/duc/lod2/cg100/16/cl16/
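The per-slice du -s -h listings above can be totalled with a short awk pipeline; a sketch assuming all sizes carry a G suffix, as they do here:

```shell
#!/bin/bash
# Sum "NNG path" lines (the du -s -h output above) into a total in GB.
total_gb() {
  awk '{ sub(/G$/, "", $1); sum += $1 } END { print sum "G" }'
}

printf '88G cl1\n81G cl2\n' | total_gb   # prints: 169G
```

Applied to all 16 slices listed above, this gives 1176G, i.e. roughly 1.15 TB for the scale-100 cluster database.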
Owlim
The database size is computed by measuring the size of the created repository. For example, the database size of scale 1 is measured:
[duc@bricks13 data]$ du -s -h openrdf-sesame/repositories/olgeo1
24G openrdf-sesame/repositories/olgeo1
5.4.2 Bulk Load Script
Virtuoso 6 and Virtuoso 7
The bulk loading script for Virtuoso is applied to an empty database. First, register_load_files.sql is run to register the list of files to load. Then the loading process is run using the script rdfload.sh. For Virtuoso 7, 14 "rdf_loader_run()" calls were executed in parallel.
isql 1113 dba dba < register_load_files.sql
./rdfload.sh
[duc@bricks13 10gindex]$ cat register_load_files.sql
ld_dir ('/scratch/duc/lod2/GeoBench/datasets/10geoindex/', '%.gz', 'http://GeoBench.org');
[duc@bricks13 10gindex]$ cat rdfload.sh
echo "Start loading "
date
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
wait
isql 1113 dba dba exec="checkpoint;"
echo "end loading"
date
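The 14 identical isql lines in rdfload.sh can equivalently be written as a loop with the loader count as a parameter; a sketch (run_parallel is a hypothetical helper, the port 1113 and dba/dba credentials are those used above):

```shell
#!/bin/bash
# Launch N copies of a command in parallel and wait for all of them.
run_parallel() {
  local n=$1; shift
  for i in $(seq "$n"); do
    "$@" &
  done
  wait
}

# Against a live Virtuoso server this would be:
#   run_parallel 14 isql 1113 dba dba exec="rdf_loader_run();"
#   isql 1113 dba dba exec="checkpoint;"
```

The final checkpoint after wait matters: it persists the loaded data, as in the original script.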
V7 Cluster
The dataset files are divided equally among the machines. On each machine, register_load_files_GEO.sql is used to register the list of files to load on that machine.
ssh bricks03 "isql 1113 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks04 "isql 12203 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks09 "isql 12205 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks07 "isql 12207 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks05 "isql 12209 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks11 "isql 12211 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks12 "isql 12213 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks13 "isql 12215 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
[duc@bricks05 /]$ cat /scratch/duc/lod2/cg100/register_load_files_GEO.sql
ld_dir ('/scratch/duc/lod2/cg100/datasetg100_5', '%.gz', 'http://linkedgeodata.org');
Then the master node starts the loading process on all the nodes.
cl_exec (' rdf_ld_srv ()' ) &
cl_exec (' rdf_ld_srv ()' ) &
Owlim
The bulk loading is done by calling example.sh in the getting-started application.
cd /scratch/duc/lod2/GeoBench/owlim/owlim-se-5.3.6156/getting-started
./example.sh