Collaborative Project
LOD2 – Creating Knowledge out of Interlinked Data

Deliverable 5.1.4
LOD2 GeoBench v2.0 Evaluation

Dissemination Level: Public
Due Date of Deliverable: Month 36, 31/08/2013
Actual Submission Date: Month 36, 31/08/2013
Work Package: WP5 – Linked Data Browsing, Visualization and Authoring Interfaces
Task: T5.1
Type: Report
Approval Status: Approved
Version: 1.0
Number of Pages: 49
Filename: LOD2_D5_1_4_GEO_Benchmark_Evaluation.pdf

Abstract: This report describes the evaluation of the LOD2 GeoBenchmark, developed to ensure that RDF storage engines provide the proper level of functionality and performance to facilitate the needs of Linked Data Browsing, Visualization and Authoring Interfaces.

The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Project co-funded by the European Commission within the Seventh Framework Programme (2007–2013)
Project Number: 257943   Start Date of Project: 01/09/2010   Duration: 48 months
D5.1.4 – v1.0
History

Version  Date        Reason                            Revised by
0.2      07/08/2013  Initial Draft (incomplete)        Peter Boncz
0.9      25/08/2013  Complete Draft                    Peter Boncz
1.0      29/08/2013  Complete version after comments   Duc Minh Pham
1.1      30/08/2013  Minor edits, correction of typos  Peter Boncz
Author List

Organisation  Name         Contact Information
CWI           Peter Boncz  [email protected]
Executive Summary

This report gives an account of evaluating the LOD2 GeoBench – as previously developed in D5.1.2 – on a number of different RDF database systems. The LOD2 GeoBench aims to test the functionality and performance of RDF stores used in Linked Data Browsing, Visualization and Authoring Interfaces.

This benchmark is not intended as a purely scientific deliverable; it is rather focused on addressing practical challenges in the Geo Browsing components, as developed by the University of Leipzig (browser.linkedgeodata.org). In particular, it highlights performance problems encountered when laying out linked objects on a map, which may have highly different zoom levels. The performance challenge is making sure that performance always remains interactive, irrespective of the zoom level or facet selections.

This report coincides with the open-source release of v2.0 of the LOD2 GeoBench. The evaluation presented here goes beyond the one at the initial specification in D5.1.2, which was run on just one system (an alpha pre-release version of Virtuoso 7). Here we add benchmarking on multiple systems, on large data sizes (scale factor 100) and using cluster hardware, instead of just a single machine.

The overall message coming out of these experiments is that to create high-performance (interactive) geospatial faceted browsing interfaces, specific pre-computation and indexing effort is needed (this is embodied by the "quad" implementation). This means that, on the one hand, application designers need to think about their data access strategy. On the other hand, more hooks for physical tuning are needed in RDF database systems to make this possible.
Tool             Purpose                  Address
SPARQL endpoint  Execute SPARQL queries   http://lod.openlinksw.com/sparql
Web Service API  REST Interface           http://lod.openlinksw.com/fct/service
Facet Browser    Text Search and lookups  http://lod.openlinksw.com/fct
Abbreviations and Acronyms

Acronym   Explanation
LOD       Linked Open Data
GeoJSON   Geographic JavaScript Object Notation
GFM       General Feature Model (as defined in ISO 19109)
GML       Geography Markup Language
KML       Keyhole Markup Language
OWL       Web Ontology Language
RCC       Region Connection Calculus
RDF       Resource Description Framework
RDFS      RDF Schema
RIF       Rule Interchange Format
SPARQL    SPARQL Protocol and RDF Query Language
WKT       Well Known Text (as defined by Simple Features or ISO 19125)
W3C       World Wide Web Consortium (http://www.w3.org/)
XML       eXtensible Markup Language
OGC       Open Geospatial Consortium
LGD       Linked Geodata Browser (http://browser.linkedgeodata.org)
OSM       OpenStreetMap
LGB       LOD2 GeoBench (defined in this document)
Table of Contents

1. Introduction
   1.1 Outline
2. Benchmark
   2.1 Goals
   2.2 Dataset
   2.2.2 Query Workload
   2.2.3 Benchmark Metrics
   2.2.4 Benchmark Programs
   2.3 Benchmark Implementations
   2.3.1 Basic Implementation
   2.3.2 RTree and RTree++ Implementations
   2.3.3 Quad Implementation
3. Evaluation
   3.1 Hardware Platform
   3.2 RDF Database Systems Tested
   3.3 Loading Results
   3.4 Overall Benchmark Results
   3.5 Detailed Query Performance Results
   3.6 Full Query Performance Results
4. Conclusions
5. Appendix: Configuration Details
1. Introduction

Geographic information management is a generally well-understood task in data management. Relational database systems technologically support geographical data, sometimes by incorporating multi-dimensional indexing structures like the RTree, or by using simple uni-dimensional BTrees (in conjunction with a space-filling curve). In RDF data management, many RDF stores support spatial data management, providing functions to test geospatial predicates, sometimes technologically supported by data structures such as the RTree. These system-specific extensions are being replaced by general adoption of the GeoSPARQL standard proposed by the Open Geospatial Consortium. As such, application development and deployment where the data involves geography should be supportable with RDF database systems. This activity in LOD2 puts that to the test.

In the past deliverable D5.1.2, a new database and application benchmark for faceted geographic querying was introduced, called the LOD2 GeoBench (v1.0). The underlying goal for creating this benchmark is to improve the user experience for the Geospatial Browser developed by AKSW in the context of the LOD2 project (browser.linkedgeodata.org), both by influencing the design of the application and by measuring and improving the raw power of geographical query execution in RDF database systems.

In this deliverable we report on a series of experiments running the LOD2 GeoBench on four different systems: OWLIM 5.3, OpenLink Virtuoso V6 (open source), OpenLink Virtuoso V7 (open source) and OpenLink Virtuoso V7 Cluster Edition. The hardware platform used was the SCILENS database compute cluster at CWI. This hand-built cluster consists of three different layers of nodes, of which we used the highest "bricks" layer, built out of 16 large servers (16 cores, 256GB RAM). This same platform was used to create the record-breaking runs with 150 billion triples on the BSBM Explore and Business Intelligence benchmarks (see deliverable D2.1.4 and footnote 1).
1.1 Outline

In Section 2, we describe the LOD2 GeoBench benchmark in its v2.0 version, released in open source in conjunction with this deliverable. The benchmark can currently be implemented by RDF database systems in four different ways (basic, rtree, rtree++ and quad), which we describe in detail.

In Section 3, we provide and discuss the results of running the benchmark at scale factors 1, 10 and 100 on the platforms described above. When using the "quad" implementation, which provides imprecise answers, RDF database systems turn out to be capable of sustaining tens of concurrent client requests simultaneously on a single machine. Considering that real users of the Geospatial Browser would use significant think time in between queries, this means that a single machine could support hundreds of concurrent users. If precise answers are required, these experiments show that RDF-based geographical support ("rtree++") provides high performance on queries that are moderately to strongly zoomed in, while queries on large geographical areas (zoomed out) would still have low performance – though it is evident that this problem cannot be eliminated inside RDF database systems; only application redesign can overcome it. In all, the experimental results show clear improvements over the situation 18 months ago, as documented in D5.1.2.

In Section 4 we make some forward-looking statements and recommendations, both for application design in geographical faceted browsing, as well as on the side of RDF database technology. In short, application designers should think ahead and create additional (indexing) data structures, in order to ensure interactive performance at all times. Such physical database design is very common in relational database systems, but almost completely undeveloped in RDF database systems. On their part, RDF systems should expose more features to enable such additional (indexing) opportunities.
1 http://lod2.eu/BlogPost/1584-big-data-rdf-store-benchmarking-experiences.html
2. Benchmark

2.1 Goals

The LOD2 GeoBench is an RDF database/application benchmark for faceted geographical querying. In particular, its queries use a combination of geographical selection and grouping and counting by facets. Such faceted querying in its mainstream use (outside RDF, e.g. using relational technology) is known to be a hard problem. The problem is that grouping and counting by the facet requires a lot of computational effort if there are many facet instances qualifying the selection, yet due to the infinite number of possible selection predicates it is hard to prepare the system for this. Thus, queries involving millions of instances must really group and count millions of tuples (or triples), and making this part of an interactive system that should render a result screen within 0.2 seconds is a challenge. Also, faceted browsing servers on the web may be used by many clients simultaneously. As such, the database system answering the queries should be capable of providing this interactive experience to many users at the same time.

The goal of the LOD2 GeoBench result metric (queries per second per $) is to highlight the performance and architecture problems faced by the Linked Geodata Browser application (browser.linkedgeodata.org), which is being developed at the University of Leipzig as part of the LOD2 project. Specifically, it is intended (i) to stimulate technical progress in RDF database technology, improving both the query execution and query optimization support for geographical queries in SPARQL backends, and (ii) to stimulate thinking about a possible redesign of RDF-based applications like the Linked Geodata Browser. This suggestion for redesign points toward an opportunity for physical RDF database design, where for specific access patterns and queries, the application architect and DBA could decide to pre-create certain indexes and materialized views (note that this is phrased in relational database terms; in practice this could take the form of additional synthetic triples).

The LOD2 GeoBench was developed as deliverable D5.1.2 in the LOD2 project, 18 months earlier. Coinciding with this report, we have released version v2.0 of the benchmark, whose software and documentation is available in open source:

http://svn.aksw.org/lod2/LOD2-GeoBench

We therefore continue with a recap of the benchmark design and description, including a description of what has changed in v2.0.
2.2 Dataset

The dataset used by LOD2 GeoBench is the RDF-ized OpenStreetMap (OSM) dataset provided by the AKSW group at the University of Leipzig. The bulk of this dataset consists of 6M points (Relevant Nodes) and 3.8M polygons (Relevant Ways).

http://downloads.linkedgeodata.org/releases/2011-04-06/

10M dataset statistics  ASCII size  #triples  #points  #polygons
Ontology                1.2MB       8K
Relevant Nodes          10GB        66M       6M
Relevant Ways           10GB        65M       60M      3.8M
DBpedia Interlinks      14MB        101K
GeoNames Interlinks     60MB        487K

We call this core dataset the SF1 dataset. It contains roughly 10M geographic objects. The number of triples (130M) is significantly higher, and the uncompressed size in bytes is 20GB. For practical benchmarking, we opt for synthetic scaling of this core dataset. Not only does this not depend on the availability of additional resources at AKSW, or on the question whether creating larger subsets of OSM in RDF makes sense, but it also makes sure the geographical characteristics of the data remain equal at all benchmark scales. This makes it easier to interpret benchmark results at different scales.
The benchmark therefore scales this core of real data to any cardinal factor x*SF by copying all triples in all datasets x times, appending the string "_y" (for all y: 0<y<x) to all URIs starting with http://linkedgeodata.org/. This means we get many more facets in the Ontology, and every facet is duplicated x times in the dataset, belonging to new copies of the instances. This kind of scaling is highly similar to the one proposed in the DBpedia benchmark, and mimics what would happen if more properties of OpenStreetMap were included in the http://linkedgeodata.org/ dump.
The v1.0 version of the LOD2 GeoBench would just make the y copies of the same data instance with different subject URIs, replicating the data. The geographic feature (point, polygon, polyline) would be identical among the copies. This replication strategy backfires in systems that only create RTree geographical search accelerator structures on the unique set of literals – Virtuoso being such an example. That is, because the geographic features were copied and remained equal, the unique set of geographic literals would not grow, and hence the size of the RTree would not grow.
The v2.0 version of the LOD2 GeoBench, now released, changes the scaling procedure to shift each replicated geographical feature by a tiny random (lat,long) delta (encompassing a few meters). This way, all geographical features are unique, yet the set of such features is still realistic in its size and position distribution. This was the main reason to start with a "real" core dataset in the first place, since it is very hard to create synthetic randomly generated geographical data that "makes sense" and conforms to real-world distributions.
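The v2.0 scaling procedure can be sketched as follows. This is an illustrative Python rendering only, not the actual data generator (which is the `geoscale.sh` tool described later); the function name, the N-Triples line-based approach, and the jitter magnitude of roughly 3e-5 degrees (a few meters) are assumptions for the sake of the example.

```python
import random
import re

def scale_triple(line, y, jitter_deg=3e-5):
    """Produce copy y of one N-Triples line, following the v2.0 scheme:
    suffix linkedgeodata.org URIs with _y and shift coordinate literals
    by a tiny random delta so every replicated feature becomes unique."""
    # Append "_y" to every URI starting with http://linkedgeodata.org/
    line = re.sub(r'<(http://linkedgeodata\.org/[^>]*)>',
                  lambda m: '<%s_%d>' % (m.group(1), y), line)
    # Shift lat/long literals by a random delta of a few meters
    def shift(m):
        return '"%f"' % (float(m.group(1)) + random.uniform(-jitter_deg, jitter_deg))
    if 'wgs84_pos#lat' in line or 'wgs84_pos#long' in line:
        line = re.sub(r'"(-?\d+\.\d+)"', shift, line)
    return line
```

Running this for all y in 1..x-1 over all core dataset files yields the scaled dataset, while keeping the positional distribution of the real OSM data intact.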
Since April 2011, there have been new releases of the core dataset in April and August 2013, which contain roughly the same data, but actualized from OpenStreetMap, and split in the raw triple data files by data facet category (they used to be together). However, in the LOD2 GeoBench v2.0 we have not moved to this new core dataset. The rationale has been to keep v1.0 and v2.0 of LOD2 GeoBench as compatible as possible. Having (only slightly) more triples and having them actualized from OpenStreetMap is of limited value for our purposes here. It is, however, possible that a future version of this benchmark will start using new LinkedGeoData dataset releases, if alone for the reason that the benchmark specification relies on the data release being online and downloadable.
2.2.1.1 Bulk Load

The benchmark starts by creating a new database, starting up the database server, and loading the full dataset into the database system (including possibly added triples in the data preparation step).

The full disclosure of a LOD2 GeoBench result consists of:

1. the elapsed time until all bulk-loading has finished;
2. the size in megabytes of the resulting database files on disk;
3. all relevant DBMS configuration files;
4. scripts containing all commands used for bulk-loading.
2.2.2 Query Workload

The LOD2 GeoBench workload mimics a browsing user in a query run. A query run, based on a random seed, deterministically picks 10 center points, and executes 12 steps, each step consisting of two queries: the Facet Count Query (FCQ) and an Instance Retrieval Query (IRQ) or an Instance Aggregation Query (IAQ). Thus the workload in total consists of 240 queries. The sequence of 12 steps is as follows:

1. display map at zoom level 0 at a center point (FCQ1 + IAQ1)
2. zoom to level 1 at the same center point (FCQ2 + IAQ2)
3. zoom to level 2 at the same center point (FCQ3 + IAQ3)
4. zoom to level 3 at the same center point (FCQ4 + IAQ4)
5. zoom to level 4 at the same center point (FCQ5 + IAQ5)
6. pan 1/8 width east at zoom level 4 (FCQ6 + IAQ6)
7. zoom to level 5 at the same center (FCQ7 + IRQ1)
8. pan 1/4 height north at zoom level 5 (FCQ8 + IRQ2)
9. zoom to level 6 (FCQ9 + IRQ3)
10. pan 1/2 width west at zoom level 6 (FCQ10 + IRQ4)
11. zoom to level 7 (FCQ11 + IRQ5)
12. pan one height south at zoom level 7 (FCQ12 + IRQ6)
The power workload executes a query run directly after data load. It is immediately followed by the throughput workload. In the power workload, the queries in the query run are executed purely one after the other. In the throughput workload, multiple query runs (generated with different parameters) run concurrently on the system. The typical concurrency levels to test are 2, 4, 8 and 16.
2.2.2.1 Facet Count Query (FCQ)

The Linked Geodata Browser displays an overview with the count per facet of the objects in the visible window. This is an aggregation query that counts all occurrences of each facet in the query window, be it a currently selected (active) facet or not. The query parameters here are the query center point (LATITUDE, LONGITUDE) and the window HEIGHT and WIDTH in degrees.
2.2.2.2 Instance Retrieval Query (IRQ)

The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets. This is a pure selection query (rectangular geographic window and facets); there is no grouping or aggregation involved. In addition to the parameters LATITUDE, LONGITUDE, HEIGHT and WIDTH, this query hence also receives four URI parameters FACET1, FACET2, FACET3, FACET4 identifying the facets of interest.
Figure 1: The Linked Geodata Browser mis-handling situations with too many results: queries get disabled (info windows) and certain parts of the screen exhibit information overflow.
2.2.2.3 Instance Aggregation Query (IAQ)

At the lower zoom levels, when a very large area fits in the window, the sheer amount of results can cause performance and usability problems. For instance, try to imagine visualizing all street lights in all of Germany as markers on a map on a computer screen. This would mean that millions of lamppost icons need to be placed on the screen, which does not even have enough pixels for that. The resulting drawing is bound to be judged as convoluted by average users. Further, even to arrive at such a drawn map is a performance challenge, since the query returns many results, which need to be processed (and, depending on the architecture of the application, might also need to be sent to the client, e.g. a web browser).

The Instance Aggregation Query deals with the problem of too many instances by summarizing the instances geographically. This query is used in the LOD2 GeoBench instead of the Instance Retrieval Query on the first four zoom levels (the first six steps). For this purpose, it divides the map into 40x20 conceptual square tiles, and allows just one marker per active facet inside one tile. It counts how many instances fall in a tile, and it displays the most relevant marker in a tile (in the benchmark, we do not really choose the most relevant marker, but choose the one with the largest subject URI – i.e. a random one) and a count of occurrences.

Note that the Instance Aggregation Query delivers something that could alternatively be shown as a "heat map", rather than the summary markers with an occurrence count inside them, as suggested.
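The per-tile summarization described above can be sketched in a few lines of Python. This is only an illustration of the IAQ semantics (in the benchmark the aggregation happens inside the database via SPARQL); the function name and the in-memory instance tuples are assumptions for the example.

```python
from collections import defaultdict

def aggregate_instances(instances, lat, lon, height, width, rows=20, cols=40):
    """Summarize (uri, facet, lat, lon) instances into per-tile markers, as
    the IAQ does: the window is split into 40x20 tiles; per (tile, facet)
    we keep a count and one marker, the lexically largest subject URI."""
    tiles = defaultdict(lambda: (0, ''))   # (row, col, facet) -> (count, marker)
    lat0, lon0 = lat - height / 2, lon - width / 2
    for uri, facet, a, o in instances:
        if not (lat0 <= a <= lat0 + height and lon0 <= o <= lon0 + width):
            continue                        # instance outside the query window
        row = min(rows - 1, int((a - lat0) / height * rows))
        col = min(cols - 1, int((o - lon0) / width * cols))
        cnt, marker = tiles[(row, col, facet)]
        tiles[(row, col, facet)] = (cnt + 1, max(marker, uri))
    return dict(tiles)
```

The result contains at most 40x20 markers per facet regardless of how many instances qualify, which is what keeps the low-zoom steps renderable.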
2.2.2.4 Query Parameters

Center Point. The randomly generated queries in the LOD2 GeoBench workload use bounding boxes centered near (not exactly in the center – a random distance off) a randomly chosen major city in Europe.
The cities we choose from are {Paris, Essen, Madrid, Milan, Barcelona, Berlin, Athens, Birmingham, Rome, Düsseldorf, Cologne, Katowice, Hamburg, Naples, Warsaw, Frankfurt, Munich, Brussels, Lisbon, Vienna, Manchester, Budapest, Amsterdam, Leeds, Stuttgart, Liverpool, Stockholm, Bucharest, Rotterdam, Copenhagen, Prague, Lyon, Zürich, Turin, Newcastle, Sheffield, Southampton, Nottingham, Marseille, Dublin}. These center points were chosen because OSM provides a very high level of detail for these areas, such that even at zoom level 7 there will be a lot of data per selection window.
Width and Height. The zoom level Z at scale factor SF corresponds to a longitude width of 9/2^Z degrees and a latitude height of 4.5/2^Z degrees. Note that the lowest zoom level = 0 selects 9 degrees longitude and 4.5 degrees latitude, which roughly corresponds to an area like Germany minus Bavaria. At zoom level 7, the window is down to 0.07 by 0.03 degrees, a small downtown area.
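The window-size formula above is trivially computable; a small sketch (the function name is ours, not from the benchmark code):

```python
def window(zoom):
    """Query-window size in degrees at zoom level Z: a longitude width of
    9/2^Z and a latitude height of 4.5/2^Z, per the benchmark definition."""
    return 9.0 / 2 ** zoom, 4.5 / 2 ** zoom

# Print the window dimensions for all eight zoom levels used in the workload
for z in range(8):
    w, h = window(z)
    print('zoom %d: %.4f x %.4f degrees' % (z, w, h))
```

At zoom level 7 this gives 9/128 = 0.0703 by 4.5/128 = 0.0352 degrees, matching the "0.07 by 0.03 degrees" figure quoted in the text.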
Facets. The map drawing query only visualizes geographic instances for 4 randomly chosen facets from four restricted sets (one facet http://linkedgeodata.org/ontology/FACET from each):

1. Place, Parking, Village (1M)
2. School, PlaceOfWorship, Leisure (700K)
3. Peak, Restaurant, Tourism (360K)
4. Sport, PostBox, Supermarket (200K)
These facet categories were chosen by analyzing the frequency of the various facets in OSM. Concretely, the above facets are chosen from the facets that have the highest frequency of occurrence. These were chosen in order (i) to make the queries challenging when zoomed out, as they will select many instances, and (ii) to guarantee that at the highest zoom level a nonzero number of instances is still in the window.
Further, from the set of very frequent facets (which is larger than the above), we selected groups of facets that have quite similar frequencies and put them in the above four groups. That is, there are roughly 1 million places, parkings and villages, and 200,000 sport, postbox and supermarket features. Each query in the LOD2 GeoBench workload picks one from each category, e.g. (Parking, School, Tourism, Sport). That way, the queries always have a highly similar frequency characteristic. This in turn helps to create more stable performance results among the runs of the same query with different parameter bindings (this is something that e.g. BSBM does not do, making it very hard to understand how good or bad a system behaves on a certain query – as this may vary enormously with the chosen parameters).
At scale factor x*SF (with x>1), these facets are suffixed with a random "_y", with y: 0<=y<x. Recall that the LOD2 GeoBench, when scaling the dataset to a larger size, not only creates copies of all geographic features with a different subject URI, but also uses different property URIs, i.e. suffixed with _y. As mentioned, the facets used are relatively frequent facets; their frequency in the core dataset is indicated in parentheses. At zoom level 0 we expect roughly 70K instances in total belonging to any of the four selected facets; the expected amount decreases at each zoom level, to just a hundred at zoom level 7. Note that as we are focusing on high-density areas (European city centers), the number of instances in a 4x smaller sub-window (zoom-in) is in fact less than 4x smaller.
2.2.3 Benchmark Metrics

2.2.3.1 Page Per Second

The basic result metric is PagePerSec, based on the average time to render a page of the LinkedGeoData Browser, which is the sum of the latencies of the facet count query and the instance (aggregation) query; this is reported in the inverse, hence PagePerSec. From a benchmark run, which executes each step 10 times, we derive an overall PagePerSec score at that step by averaging the 10 results (query latency in seconds). For multi-stream runs, we add the PagePerSec metric results for each stream to get a combined PagePerSec result.
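The metric derivation above can be sketched directly; the function names are ours, introduced only for illustration:

```python
def page_per_sec(fcq_latencies, iq_latencies):
    """PagePerSec at one step: page time = FCQ latency plus instance(-aggregation)
    query latency, averaged over the 10 executions of the step, then inverted."""
    page_times = [f + i for f, i in zip(fcq_latencies, iq_latencies)]
    return len(page_times) / sum(page_times)

def combined_page_per_sec(per_stream_scores):
    """Multi-stream runs simply add up the per-stream PagePerSec results."""
    return sum(per_stream_scores)
```

For example, if both queries of a step consistently take 0.1 seconds, the page time is 0.2 seconds and the step scores 5 pages per second.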
2.2.3.2 Page Per Second Per $1000 (PagePerSec/K$)

To take into account the cost of the hardware used in various implementations, we divide the PagePerSec metric by the monetary cost (in thousands of dollars) of the hardware and software used: PagePerSec/K$. If the RDF system is a commercial software product, the price for software must be the dollar list price (no discounts). The price quoted for hardware must be the publicly available end-user price of the hardware at an online merchant at the date the benchmark was run.
2.2.3.3 Low Zoom, High Zoom and Total Score

We expect database systems to perform quite differently at low zoom levels when compared to high zoom levels. For this reason, two different sub-metrics are reported, where the LowZoomScore is derived from steps 1-6 and the HighZoomScore from steps 7-12. We use the geometric mean as the method to combine the PagePerSec scores from the various steps, because this rewards relative improvements at any step equally in the overall score, even if the individual scores at the various steps are quite diverse. Similarly, the LOD2 GeoBench Total Score (LGB-TS) is the geometric mean of the LowZoomScore (LGB-LS) and HighZoomScore (LGB-HS).
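The score combination can be sketched as follows. The function names are ours, and the exact step split (1-6 low, 7-12 high) is our reading of the specification:

```python
import math

def geomean(xs):
    """Geometric mean: a relative improvement at any step counts equally."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def lgb_scores(step_scores):
    """Combine the 12 per-step PagePerSec scores into LGB-LS (steps 1-6),
    LGB-HS (steps 7-12) and LGB-TS (geometric mean of the two)."""
    assert len(step_scores) == 12
    low = geomean(step_scores[:6])    # LGB-LS
    high = geomean(step_scores[6:])   # LGB-HS
    return low, high, geomean([low, high])
```

Note how the geometric mean behaves: a system scoring 4 PagePerSec on every low-zoom step and 1 on every high-zoom step gets a total score of 2, so halving the latency of either half of the workload improves the total by the same factor.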
2.2.4 Benchmark Programs

Data Generator. The benchmark comes with a data generator (geoscale.sh) that reads one input file and produces x output files (0<=y<x) with _y suffixes in the URIs. It should be used on all core dataset files. These files can then be imported into the RDF database system. Generating the copies of the core dataset files should not be included in database load time.
Query Generator. The benchmark comes with a query generator (geoqgen.c) that, given a run number and a scale factor (SF), generates 240 textual queries. The run number is:

- 0 for a warmup run. It generates one subdirectory 01/ with one stream of 240 queries.
- 1 for the power run, which tests how the system behaves when it handles one user at a time. It generates one subdirectory 01/ with one stream of 240 queries.
- 2, 4, 8, 16 for the throughput runs, which test how the system behaves when it handles multiple users at a time. They generate multiple subdirectories 01/, .. xx/, each with one stream of 240 queries.
The queries generated are different between the 10 runs inside a stream, and between multiple streams. Therefore, the selectivities (result set sizes) of different queries from the same template also differ. The currently selected random seed numbers used to generate the queries have been chosen such that the overall size of intermediate results is similar, though (within 10% of each other in terms of the sum of result sizes in a query stream).
2.3 Benchmark Implementations

There are different ways an application can be designed (specifically, extra "indexing" triples could be pre-generated), and different ways in which queries to an RDF system could be formulated.

The LOD2 GeoBench v2.0 currently supports four different implementations: basic, rtree, rtree++ and quad (tiles).
2.3.1 Basic Implementation

Each step in the LOD2 GeoBench workload consists of two queries. The Facet Count Query counts the number of facet instances in the rectangular query window. The basic strategy is not to assume any geographical support in the RDF backend and to perform the selection on the (lat,long) values, which leads to the following SPARQL 1.1 text:
select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o;
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f.
filter (?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2) }
group by ?f
order by desc(?cnt)
limit 50
Typically, RDF stores will evaluate this query using range scans on the POS or OPS index for respectively the latitude and longitude predicates, and intersect the resulting triple streams on subject. This means that if (say) the selectivity of the query is 1/10 of the full latitude range and 1/10 of the full longitude range, and hence (say) 1/100 of the total database, the intermediate result before the intersection is in the range of 1/10 of the dataset. Hence, it is 10x larger than strictly necessary. Still, this approach is simple and portable (it will work on any SPARQL 1.1 backend).
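The intersection step mentioned above can be illustrated with a simple merge-intersection over two sorted subject streams. This is only a conceptual sketch of what the store does internally after the two range scans, not actual database code:

```python
def intersect_sorted(lat_subjects, lon_subjects):
    """Merge-intersect two sorted subject-id streams, as an RDF store might
    do with the results of POS/OPS range scans on the latitude and longitude
    predicates. Both inputs must be sorted; output is their intersection."""
    out, i, j = [], 0, 0
    while i < len(lat_subjects) and j < len(lon_subjects):
        a, b = lat_subjects[i], lon_subjects[j]
        if a == b:
            out.append(a)           # subject qualifies on both coordinates
            i += 1
            j += 1
        elif a < b:
            i += 1                  # advance the stream that is behind
        else:
            j += 1
    return out
```

Both input streams must be fully scanned even though the output is much smaller, which is exactly the 10x-larger-than-necessary intermediate result the text describes.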
The second query in each step is the Instance Retrieval Query or the Instance Aggregation Query. We start with the Instance Retrieval Query. This query retrieves all the facet instances inside (or overlapping with) the query window, for four chosen facets.

The map displayed by the Linked Geodata Browser shows markers for all instances of the selected facets. To render a screen, the benchmark will always select 4 facets, so there are four different FACET parameters, FACET1, FACET2, FACET3, FACET4:
sparql select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
filter (?f = <http://linkedgeodata.org/ontology/Village>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f
filter (?f = <http://linkedgeodata.org/ontology/Leisure>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
?f filter (?f = <http://linkedgeodata.org/ontology/Tourism>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
?f filter (?f = <http://linkedgeodata.org/ontology/Supermarket>)
} #union-end
.
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
filter (?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2)
Arguably, the selection on any of the four facets could also be done in a single filter with a disjunctive expression – however, it is believed that the current syntax and the one with disjunctive expressions would usually lead to the same physical query plan anyway. It should be noted that, if desired, such an alternative yet equivalent query syntax would be permissible in a LOD2 GeoBench result.
2.3.2 RTree and RTree++ Implementations

If an RDF database system supports efficient evaluation of geographical predicates (e.g. by creating an RTree index in advance), this is very relevant for the LOD2 GeoBench. We allow reasonable query variants; for instance, if the RDF database system being tested has specific geographic support, this can be used.

For instance, Virtuoso v6 provides RTree-based indexing that allows testing spatial intersection within a radius. It is possible to draw a circle around the query window and use the radius of this circle and the center point of the window in this syntax. This was the first RTree syntax variant implemented by LOD2 GeoBench (in v1.0) and therefore carries the name "rtree":
select ?f as ?facet count(?s) as ?cnt
where { ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o;
<http://www.w3.org/2003/01/geo/wgs84_pos#geometry> ?p;
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f.
filter (bif:st_intersects(bif:st_geomfromtext('POINT(LONGITUDE LATITUDE)',2000), ?p, RADIUS) &&
?a >= LATITUDE-HEIGHT/2 && ?a <= LATITUDE+HEIGHT/2 &&
?o >= LONGITUDE-WIDTH/2 && ?o <= LONGITUDE+WIDTH/2) }
group by ?f
order by desc(?cnt)
limit 50
We now also allow query variants that exploit the geographic capabilities of other RDF database systems. For instance, both OWLIM 5.3 and Virtuoso v7 support the predicate:

bif:st_intersects(bif:st_geomfromtext("BOX(lat1 lon1, lat2 lon2)"), ?p)

This allows direct translation of the LOD2 GeoBench window queries into a geographical predicate. Note that the previous query for Virtuoso v6 would combine a query with a radius (circle query) with a subsequent (lat, lon) filter. In LOD2 GeoBench, this direct BOX comparison, supported from v2.0 on, is denoted "rtree++".
We omit detailed descriptions of the Instance Retrieval Query and Instance Aggregation Query for the rtree and rtree++ variants, as these are natural variants of the basic queries, with as only difference the use of the appropriate geographical filters mentioned above.
2.3.3 Quad Implementation

An application like the Linked Geodata Browser, with strong interactivity demands, challenges not only RDF database technology, but also the application design itself. Taking the analogy of Google Maps, one can be assured that rather than querying a single data collection for all zoom settings, the result screens are rendered from a (pre-generated) separate dataset for each different zoom level. Even though Google Maps likely does not rely on relational database technology, this approach would be like having different tables store the geographical data of the various zoom levels. The advantage is that these tables can be designed such that when the zoom window is very large (low zoom level), irrelevant data that would be too big to show is pruned, or frequency counts are summarized (e.g. keep the number of lampposts in Germany for each zipcode, rather than all individual lampposts). This way, the lower zoom levels operate on much less data, allowing the application to always exhibit interactive performance.
The quad approach, described here, is formally not a valid implementation of the LOD2 GeoBench, as it provides slightly incorrect query answers, but it has the potential to achieve much better performance, with only a minor quality reduction in the query answers provided. Its performance can still be measured with the LOD2 GeoBench.
The main idea is to create additional indexing triples that (i) accelerate geospatial data access at multiple zoom resolutions, even on systems that do not provide specific geospatial support, and (ii) precompute certain subquery results in order to accelerate query evaluation, for all three types of queries (facet count, instance, and instance aggregation).
QuadTiles. The geospatial acceleration comes from partitioning the 2D space according to QuadTiles, which is a Z-ordering of the (LONGITUDE, LATITUDE) space into 32-bit numbers, where LONGITUDE and LATITUDE are discretized from their normal double-precision ranges [-180,180] resp. [-90,90] to the 16-bit integer range [0,65536). The pictures below, from the OpenStreetMap wiki, illustrate this:
Hence, a single number identifies a rectangle in the two-dimensional space. In fact, we can create such numbers at any even bit granularity: the leftmost image shows the four rectangles identified by 2-bit quadtile numbers: A=00, B=01, C=10, D=11.
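The discretize-and-interleave step can be sketched as follows (an illustrative Morton encoding in Python; the exact bit order and quadrant labeling used by the OSM QuadTiles scheme are assumptions here):

```python
def quadtile(lon: float, lat: float, bits: int = 32) -> int:
    """Discretize (lon, lat) to 16-bit cell coordinates and interleave the
    bits (Z-order / Morton code) into a single QuadTile number."""
    x = int((lon + 180.0) / 360.0 * 65536) & 0xFFFF  # longitude -> [0, 65536)
    y = int((lat + 90.0) / 180.0 * 65536) & 0xFFFF   # latitude  -> [0, 65536)
    tile = 0
    for i in range(16):
        tile |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        tile |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return tile >> (32 - bits)                   # keep only the top 'bits' bits
```

With bits=2, the function returns one of the four 2-bit quadrant numbers; with the full 32 bits, a 65536x65536 grid cell.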
QuadTile annotations can be exploited by adding extra RDF triples that annotate a subject that has a geography with those rectangles it overlaps (one QuadTile triple for each). Each such annotation for a geographical subject adds one triple with a property, e.g. http://linkedgeodata.org/intersects/quadtile, and a value which is the integer QuadTile number. It is to be remarked that this works fine for points, but large polygons might need many triples if their surface is large. In the OSM core dataset, this does not seem to be an issue, though.
FacetTiles & TileFacets. Further elaborating on this idea, we can also use 64-bit integers, with the lower 32 bits being the previously described QuadTile number, while the higher bits are used to store a facet identifier. In our current implementation, we use 52-bit integers consisting of a 20-bit facet number and a 32-bit QuadTile number. This facet identifier is a 20-bit number identifying the facet URI as denoted by the http://www.w3.org/1999/02/22-rdf-syntax-ns#type property. In fact, we restrict ourselves to the 1024 most frequent facets (more than 10 instances worldwide), for which 10 bits are needed. The higher 10 bits of the 20-bit facet number are used for dataset scaling.
So we have a 52-bit approach with a 32-bit QuadTile number in the minor bits and a 20-bit facet integer in the major bits. We baptize these combinations of QuadTile and facet numbers "FacetTiles". In case of an equi-selection on FACET, such as found in the Instance Queries, the number range will have the same major bits (facet part) in the Min and Max values of all ranges and only vary in the lower bits (QuadTiles). Hence, in such situations FacetTiles share all the nice geospatial locality aspects of QuadTiles.
The Facet Count Query, however, must count the instances of all facets that fall into the query window. This would lead to no range restriction at all using FacetTile numbers. Therefore, for such queries we are more interested in having the facet numbers in the minor bits and the QuadTile numbers in the major bits. Let us call such a numbering scheme "TileFacets".
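Both layouts amount to simple bit packing, which can be sketched as follows (illustrative Python; the constants follow the field widths in the text, the function names are ours):

```python
TILE_BITS = 32   # width of the QuadTile number
FACET_BITS = 20  # 10-bit facet id + 10 bits used for dataset scaling

def facettile(facet_id: int, tile: int) -> int:
    """FacetTile: facet number in the major bits, QuadTile in the minor bits.
    An equi-selection on one facet yields contiguous number ranges, keeping
    the geospatial locality of the QuadTile part."""
    return (facet_id << TILE_BITS) | tile

def tilefacet(facet_id: int, tile: int) -> int:
    """TileFacet: QuadTile in the major bits, facet number in the minor bits,
    so a window (tile range) selection covers all facets inside it."""
    return (tile << FACET_BITS) | facet_id
```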
We can add FacetTile annotations to all geospatial objects (one for each RDF subtype they have). Such triples consist of the subject, a http://linkedgeodata.org/intersects/facettile property, and the 64-bit integer literal value. These triples can be leveraged by the Instance Query of the LOD2 GeoBench. This query asks for all instances of four selected FACETs that fall in a certain query window.
It is relatively easy to map a geospatial query window into a (series of) range restrictions on the QuadTile numbers. This usually gives a limited number of conjunctive ranges, but it is still often a good idea to use the SPARQL 1.1 subquery feature and enclose in the SPARQL query a subquery that simply has a single range consisting of the MIN and MAX values of the multiple ranges we are after. This idea to present a basic query with only one selection range is a workaround for weaknesses in SPARQL query optimizers, which would otherwise not recognize the opportunity to use the POS index on the http://linkedgeodata.org/intersects/facettile property. Similarly, given that we query for four FACETs, it may work best to use the above query model to retrieve all data for one facet, and write a query that unions four such subqueries.
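A brute-force sketch of this translation for a small window of cells is below (illustrative Python, not the generator actually used): it merges consecutive Z-order numbers into conjunctive ranges, and also returns the enclosing MIN/MAX pair that the subquery workaround puts in the single-range filter.

```python
def morton(x: int, y: int, bits_per_dim: int) -> int:
    """Interleave x (even positions) and y (odd positions) bits."""
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def window_to_ranges(x0, y0, x1, y1, bits_per_dim):
    """Map the cell window [x0..x1] x [y0..y1] onto Z-order ranges by merging
    consecutive tile numbers; also return the enclosing (MIN, MAX) pair used
    as the single range restriction in the SPARQL 1.1 subquery."""
    tiles = sorted(morton(x, y, bits_per_dim)
                   for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    ranges, lo = [], tiles[0]
    for a, b in zip(tiles, tiles[1:]):
        if b != a + 1:          # gap in Z-order: close the current range
            ranges.append((lo, a))
            lo = b
    ranges.append((lo, tiles[-1]))
    return ranges, (tiles[0], tiles[-1])
```

Real generators avoid enumerating every cell, but the output shape (a few conjunctive ranges plus one enclosing range) is the same.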
Note that in principle, given that the Instance Query is used only at the high zoom levels, where result sets are not very large, this will lead to four local index lookups in the POS index. This may work better than a normal RTree would, because the RTree contains all instances of all facets, not only the four facets of interest. This means that an RTree selection query will find only a low percentage of the data in the leaf nodes it visits to be relevant for the query. One would need a kind of partitioned RTree (partitioned on facet) to get the same kind of locality as FacetTiles. An example Instance Retrieval Query is below, shortened by having it query only two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:
select ?s as ?instance ?f as ?facet ?a as ?lat ?o as ?lon
where
{ #where-start
{ #union-start
{ #subquery-start
select ?s <http://linkedgeodata.org/ontology/Village> as ?f
where
{ #where-start
filter((?g >= 2554700922112 && ?g <= 2554700922367)
|| (?g >= 2554700922624 && ?g <= 2554700922879)
|| (?g >= 2554700944640 && ?g <= 2554700944895)
|| (?g >= 2554700945152 && ?g <= 2554700945407)
|| (?g >= 2554700965888 && ?g <= 2554700967167)
|| (?g >= 2554700967424 && ?g <= 2554700967679)
|| (?g >= 2554700988416 && ?g <= 2554700989695)
|| (?g >= 2554700989952 && ?g <= 2554700990207)) .
{ #subquery-start
select ?s ?g
where
{ #where-start
?s <http://linkedgeodata.org/intersects/facettile> ?g .
filter (?g >= 2554700922112 && ?g <= 2554700990207)
} #where-end
} #subquery-end
} #where-end
} #subquery-end
} #union-end
union
{ #union-start
{ #subquery-start
select ?s <http://linkedgeodata.org/ontology/Supermarket> as ?f
where
{ #where-start
filter((?g >= 2808103992576 && ?g <= 2808103992831)
|| (?g >= 2808103993088 && ?g <= 2808103993343)
|| (?g >= 2808104015104 && ?g <= 2808104015359)
|| (?g >= 2808104015616 && ?g <= 2808104015871)
|| (?g >= 2808104036352 && ?g <= 2808104037631)
|| (?g >= 2808104037888 && ?g <= 2808104038143)
|| (?g >= 2808104058880 && ?g <= 2808104060159)
|| (?g >= 2808104060416 && ?g <= 2808104060671)) .
{ #subquery-start
select ?s ?g
where
{ #where-start
?s <http://linkedgeodata.org/intersects/facettile> ?g .
filter (?g >= 2808103992576 && ?g <= 2808104060671)
} #where-end
} #subquery-end
} #where-end
} #subquery-end
} #union-end
.
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
filter(?a >= 45.6938 && ?a <= 45.8344 && ?o >= 4.77089 && ?o <= 5.05214)
} #where-end
The Facet Count Query, as said, does not have locality on facet, so it can better exploit the TileFacet numbering than a FacetTile numbering. We could thus also add TileFacet annotations to all instances they intersect with. This speeds up the query, certainly on systems without built-in geospatial support (RTrees), as the geographical predicate can now be translated into a range restriction that works well on a POS index. Furthermore, we could pre-aggregate the retrieved tuples on the facet number (lower bits) before even joining them to other triples.
However, especially at the lower zoom levels, where areas the size of Germany fall in the visible window, such queries have to aggregate hundreds of thousands of triples, even at the smallest SF=1, and linearly more at higher scale factors. Aggregating this much data, even if delivered fast by a POS index, is still heavy CPU work that can take several seconds at least, and which makes this query non-interactive at higher scale factors.
Therefore we do not add TileFacet annotations to instances, but use pre-computation for the Facet Count Query. We do this at various resolutions in the range of 12-26 bits, because the lowest zoom level selects 1/40^2 of the data (40 being roughly 2^6, so corresponding to 6 bits for both dimensions, i.e. 12 bits), whereas the deepest zoom level is 7 steps deeper, so at 26 bits. Hence, we propose TileFacet count pre-computation at 7 granularities: 12, 14, 16, 18, 20, 22 and 24 bits.
It is now a matter of determining a proper bit granularity for evaluating a query, depending on the zoom level. A good heuristic is to use the lowest granularity level at which at least one tile is fully enclosed by the query window (and if no such level exists, use the highest bit granularity); and then translate the window selection predicate into a series of range predicates on TileFacets, like before.
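Such a heuristic could be sketched as follows (illustrative Python; the "two tile-widths" criterion is our simplification of the enclosure test described above, since a window at least two tiles wide and two tiles high always fully encloses one tile, however it is aligned):

```python
PRECOMPUTED_LEVELS = (12, 14, 16, 18, 20, 22, 24)  # total bits, both dimensions

def pick_granularity(width_deg: float, height_deg: float,
                     levels=PRECOMPUTED_LEVELS) -> int:
    """Pick the coarsest pre-computed level whose tiles the query window is
    guaranteed to fully enclose; fall back to the finest level otherwise."""
    for bits in levels:
        k = bits // 2                     # bits per dimension
        tile_w = 360.0 / (1 << k)         # tile width in degrees of longitude
        tile_h = 180.0 / (1 << k)         # tile height in degrees of latitude
        if width_deg >= 2 * tile_w and height_deg >= 2 * tile_h:
            return bits
    return levels[-1]
```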
The extra triples we keep hold the pre-computed counts at the various resolutions, for each rectangle and each TileFacet at that resolution (e.g. 16 bits). For all facet instances, we generate two triples with a subject in the form of http://linkedgeodata.org/facetcount/0000XXXXXX and as:

- property http://linkedgeodata.org/facetcount/tilefacet16, with as value its TileFacet number, with the QuadTile number part truncated to 16 bits in this case. This thus represents a certain rectangle in the 2D space.

- property http://linkedgeodata.org/facetcount/count, with as value the number of occurrences of a facet. Note that we only need to generate http://linkedgeodata.org/facetcount triples for facets that have a non-zero count in a certain rectangle. As such, the number of these pre-computed triples is always significantly lower than the number of TileFacet annotations we added before.
- property http://linkedgeodata.org/facetcount/facet, which stores the facet URI (i.e. the http://www.w3.org/1999/02/22-rdf-syntax-ns#type value). It could be derived from the tilefacet16 number, but having this as a triple simplifies application development.
The Facet Count Query can now be formulated by selecting all tiles at some granularity that overlap with the query window, and summing up their pre-computed counts. Here is an example:
select ?f as ?facet xsd:integer(sum(?c * 0.512)) as ?cnt
where
{ #where-start
?s <http://linkedgeodata.org/facetcount/count> ?c ;
<http://linkedgeodata.org/facetcount/facet> ?f .
filter((?g >= 3471158208888832 && ?g <= 3472257719468032)
|| (?g >= 3473357232144384 && ?g <= 3474456742723584)
|| (?g >= 3659174697238528 && ?g <= 3660549085724672)
|| (?g >= 3660823964680192 && ?g <= 3661098841538560)
|| (?g >= 3661373720494080 && ?g <= 3662748108980224)
|| (?g >= 3663022987935744 && ?g <= 3663297864794112)) .
{ #subquery-start
select ?s ?g
where
{ #where-start
?s <http://linkedgeodata.org/facetcount/tilefacet14> ?g .
filter (?g >= 3471158208888832 && ?g <= 3663297864794112)
} #where-end
} #subquery-end
} #where-end
group by ?f
order by desc(?cnt)
limit 50
The downside of this approach is that the facet counts provided are an overestimation of the real facet counts, since the tiles from which the precomputed counts originate may (and will) extend beyond the visible window. However, users may tolerate such inaccuracies; though especially for the lower counts, it might be annoying. One could envision a system where, when a user wants the real count for a non-frequent facet, the exact value is computed (with a separate query exploiting the FacetTile annotations, as in the previous section).
The current query generator tries to correct for overestimation by normalizing the precomputed result to the size of the query box, dividing by the size of the box used for answering the query (which is equal or larger). In the above example, this leads to the 0.512 constant in the first line, as only slightly over half of the re-used precomputed result area is inside the query box.
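The correction is a plain area ratio; a minimal sketch (illustrative Python; the 51.2/100 figures below are made up purely to reproduce the 0.512 constant of the example):

```python
def count_correction(window_area: float, tiles_area: float) -> float:
    """Factor applied to pre-computed counts: the fetched tiles cover at
    least the query window, so counts are scaled by the fraction of the
    tiles' total area that lies inside the window (always <= 1)."""
    return window_area / tiles_area
```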
The problem of large query windows at low zoom levels also occurs in the Instance Aggregation Query. Recall that this query tackles the information overload problem of way too many markers by combining markers that are near to each other into a single marker, and visualizes a count of how many instances fall under it. Similar to pre-computing counts per tile, we observe that this aggregation per facet per tile can also be pre-computed. Note that here we again need markers for only a few facets, so using the FacetTile numbers works best. Since the Instance Aggregation Query is only used at the lower zoom levels, we can just index this at granularities 12, 14, 16 and 18 bits. Thus, for each tile at all granularities (e.g. 16 bits) in which a facet occurs at least once, we generate an artificial new subject http://linkedgeodata.org/facetmap/0000YYYYYYYY in three triples with as:

- property http://linkedgeodata.org/facetmap/facettile16, with as value its FacetTile number (identifying a rectangle in which the clustered marker lies).

- properties http://linkedgeodata.org/facetmap/latitude and http://linkedgeodata.org/facetmap/longitude, holding the position of the marker.
- property http://linkedgeodata.org/facetmap/count, with as value the number of occurrences of a facet in that 16x8 cell inside the tile. Again we only add such pre-computed triples if a facet occurs in a certain cell, so the number of generated http://linkedgeodata.org/facetmap/ triples is significantly lower than the number of TileFacet annotations we added before.

An implementation of the Instance Aggregation Query exploiting these pre-computed triples first chooses an appropriate bit granularity for the zoom level. Then all such tiles that overlap with the query window are fetched; next, the real (latitude, longitude) values of the example markers in them are fetched and filtered again with the query window. This map is then presented. The problem was having to aggregate hundreds of thousands of instances; because we use pre-aggregated data, just like in the case of the Facet Count Query, this pre-computation is guaranteed to avoid that, as any tile maximally contains 128 points, and we access only a few tiles.

An example Instance Aggregation Query is below, shortened by having it query only two facets (http://linkedgeodata.org/ontology/Village, http://linkedgeodata.org/ontology/Supermarket) rather than four:
select ?f as ?facet ?latlon ?cnt
where
{ #where-start
{ #subquery-start
select ?f ?x ?y max(concat(xsd:string(?a)," ",xsd:string(?o))) as ?latlon count(*) as ?cnt
where
{ #where-start
{ #subquery-start
select ?f ?a ?o xsd:integer(20*(?a - 43.5141)/4.5) as ?y
xsd:integer(40*(?o - 0.3412)/9) as ?x
where
{ #where-start
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?f .
filter (?f = <http://linkedgeodata.org/ontology/Village>)
} #union-end
union
{ #union-start
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
?f . filter (?f = <http://linkedgeodata.org/ontology/Supermarket>)
} #union-end
.
?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?a ;
<http://www.w3.org/2003/01/geo/wgs84_pos#long> ?o .
filter(?a >= 43.5141 && ?a <= 48.0141 && ?o >= 0.3412 && ?o <= 9.3412)
} #where-end
} #subquery-end
} #where-end
group by ?f ?x ?y
order by ?f ?x ?y
} #subquery-end
} #where-end
Since the query window will not perfectly align with QuadTile boundaries at the resolution used, and for the marker combination in the pre-computed tiles we use fewer cells (16x8, because a query will be answered from multiple cells), the cluster combination will give different results than the official LOD2 GeoBench Instance Aggregation Query, even if we later re-aggregate markers on the desired 40x20 grid. For the user experience, the effect of this is likely to be of minor importance.
3. Evaluation

3.1 Hardware Platform

In order to run the LOD2 GeoBench at scale, we used a cluster of compute nodes; in particular, the SCILENS cluster installed at CWI.
SCILENS is a new kind of hardware cluster that has been designed from the ground up to serve large-scale data management. The machines in the SCILENS cluster are organized in three different levels, called 'pebbles', 'rocks', and 'bricks'. Each level decreases in number of nodes, but the individual machines used in the level increase in computational and disk resources (and price tag). The SCILENS cluster uses cheap consumer hardware, optimized to pack as much power in as little space as possible, making use of consumer home-theater mini-PC cases ('Shuttlebox'), connected by a high-performance Infiniband network.
Due to the negative performance impact of network traffic during SPARQL query processing on large clusters (where joins tend to be 'communicating' joins in which all machines need to exchange data), and where network usage volume increases super-linearly with more nodes, it is generally better in RDF stores to work with fewer nodes with more (RAM) resources than with many nodes with few resources. Thus, we chose as our experimental platform the 'bricks' layer of SCILENS, which consists of sixteen 256GB RAM machines, each with 16 cores running at 2.4GHz (dual-socket Intel servers, worth $8K). The cluster runs Fedora Linux. The price tag of the eight machines involved in the experiments, including the Infiniband network infrastructure, is roughly $100K.
The SCILENS cluster contains many more I/O resources per CPU core than usual in compute clusters. The relation between CPU power and I/O resources is captured by the Amdahl number: the amount of I/O bytes per CPU core cycle the system can deliver. In the case of the SCILENS cluster this number is close to 1.0, whereas typical clusters at supercomputing facilities, such as LISA at SARA, only get to 0.2 (1 byte per 5 cycles). We do confess that, while all this I/O power is interesting, in the workloads presented so far most data is RAM resident. One reason was that the high-performance multi-SSD I/O subsystem of the bricks layer was not yet operational at the time of testing. This provides ground for a follow-up experiment using this fast I/O layer. We expect this to accelerate the load phase, and also to allow addressing even larger datasets efficiently on the same hardware.
3.2 RDF Database Systems Tested

For our evaluation of the LOD2 GeoBench, we used four different software configurations:
- OWLIM-SE v5.3: we used the non-cluster version of Ontotext's OWLIM, which efficiently supports geographical querying, as it stores geographic features in an RTree. OWLIM 5.3 with the geographic extension is proprietary software, but we had no hard information on the cost of OWLIM at the time of preparation of this document, so we omitted the scores per $ for this system.

Figure 2: The 'rocks' and 'pebbles' layers of the SCILENS cluster are hand-built from 384 Shuttleboxes, packing CPU and ample disk resources in little space.
- Virtuoso V6 open source is still the most widely used RDF store around (V7 open source binary builds have only been distributed since August 2013). This OpenLink product has specific support for geographical predicates, albeit somewhat limited. As discussed, direct BOX (rectangular window) selections on latitude, longitude are not possible, so we use the RADIUS pre-filtering approach.
- Virtuoso V7 open source: this major new release has been strongly influenced by the LOD2 project, wherein CWI advised OpenLink on the introduction of numerous architectural enhancements. Specifically, V7 introduces columnar storage for RDF triples as well as vectorized execution, patterned after CWI research database system prototypes. Virtuoso V7 was released in 2013 and generally offers significant storage savings, reduced memory usage, and improved computational performance over V6.
- Virtuoso V7 Cluster Edition: this major new release of the (non open-source) cluster edition has been documented in D2.1.6 and introduces a new vectorized cluster-based execution paradigm that allows parallelizing any (complex) SPARQL query over a cluster of compute nodes. As a result, it can handle complex SPARQL queries, such as the Business Intelligence workload of BSBM, but also the LOD2 GeoBench, much more efficiently (or at all) than the cluster version of V6 ever did. The monetary cost at the time of writing of the enterprise version for department servers is $25K.
3.3 Loading Results

Before loading, we first generated the extra triples for the "quad" approach². This roughly doubles the data size. These triples are then bulk-loaded into the systems. In case of OWLIM, after bulk loading, the RTree geographical index needs to be created. The time needed for this is included in the table below (and is always a small part of the total load time).
METHOD      SF1 (224M triples)    SF10 (2.24G triples)    SF100 (22.4G triples)
            time       size       time        size        time        size
owlim5.3    5257sec    24GB       102075sec   185GB       -           -
virtuoso6   12900sec   23GB       -           -           -           -
virtuoso7   780sec     12GB       5520sec     108GB       -           -
v7cluster   -          -          2280sec     156GB       18840sec    1.1TB
Loading Virtuoso 6 was done using a single loading process, since parallel loading would consistently hang the system. We already mentioned that loading the scaled data sizes also hit an error message in the RTree loading code (a bug), even in single-threaded mode, which prevented us from testing Virtuoso V6 on the larger data sizes.
Virtuoso 7 used the native parallel loading procedure, run with 14 loading processes in parallel. Loading Virtuoso V7 Cluster Edition was done by running 2 processes per node, giving in total 32 loading processes (2 x 2 nodes/machine x 8 machines).
² Alternatively, we could have created separate databases for the experiments: one with the extra generated triples for quad, and one without (to run basic, rtree and rtree++). This was not done for manageability reasons, as the performance effects are deemed to be minor.
3.4 Overall Benchmark Results

We now present the overall benchmark scores of the experiments. The LOD2 GeoBench main pure result metric (regardless of cost) is PagesPerSec. In order to measure the system under load, i.e. with all 8 cores busy, we derive this SCORE from the workload under 8 concurrent query streams; this is known as the "throughput metric"³. The score is computed as the geometric mean over all 12 steps. We also split it out into the geometric mean over the queries at low zoom (zoom levels 0-4, steps 1-6) and high zoom (zoom levels >4, steps 7-12). These are called LSCORE and HSCORE.
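The score computation can be sketched as follows (illustrative Python; the input is one pages-per-second value per benchmark step):

```python
import math

def geomean(values):
    """Geometric mean of positive per-step results."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def bench_scores(pages_per_sec):
    """(SCORE, LSCORE, HSCORE): geometric means over all 12 steps, the
    low-zoom steps 1-6, and the high-zoom steps 7-12 respectively."""
    assert len(pages_per_sec) == 12
    return (geomean(pages_per_sec),
            geomean(pages_per_sec[:6]),
            geomean(pages_per_sec[6:]))
```

The geometric mean is a natural choice here: it rewards systems that are consistently fast across steps rather than extremely fast on a few.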
Figure 3: LOD2 GeoBench scores at SF1 with one server (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
The results at SF1 have the approximate implementation "quad" in front, especially in Virtuoso V7; however, the improved RTree support "rtree++" comes quite close. Notably, the high-zoom quad queries worked better in V6, so there seems to be either an optimizer performance regression, or an undesired effect of the vectorized columnar execution in V7. Because the low-zoom (compute-intensive) quad queries are much faster on V7, its overall score is higher.

Another interesting comparison is the impact of the LOD2 R&D activities of the past few years, at least in this benchmark, for Virtuoso versus its strongest competitor, OWLIM. Whereas OWLIM generally was equivalent to or faster than Virtuoso V6 (compare owlim rtree++ with V6 rtree), the rtree-based score in Virtuoso improved by a factor 7, creating a significant performance advantage. Besides improvements to the RTree functionality, this is very likely caused by the columnar vectorized execution model that Virtuoso adopted in V7, inspired by CWI research in that area.

Moving to SF10, though throughput drops by a factor 3, we see the relative advantage of the quad approach improve dramatically:

Figure 4: LOD2 GeoBench scores at SF10 with one server (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
³ For the "Power" metric that tests all queries in isolation, see tables 3.6.1, 3.6.6 and the left part of 3.6.11.
Figure 5: LOD2 GeoBench scores at SF10 with 8 servers (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
Whereas the previous experiments used a single server, we now move to results obtained with 8 servers. One strategy that is applicable to any technology in a read-only workload like this is to replicate the database on multiple servers, and divide the queries among them. We keep using just 8 query streams, such that each server gets a single stream. In order to use all cores, the systems must now parallelize the individual queries to make use of the CPU resources. This explains the lack of linear scale-up in all systems. Note that replication still is a very powerful technique in heavy read-only workloads: we can safely expect that when using 8 replicated servers with 64 concurrent query streams, the results will be 8-fold those in Figure 4 (for example, 8*virtuoso7 would then reach a throughput score of 128 instead of just 12).
We also tested the "true" cluster database system provided by OpenLink, i.e. Virtuoso V7 Cluster Edition. Data here is not replicated on all servers, but partitioned among them at load time. This causes all queries to be parallelized, which explains the superior scores (with quad overall being the best) obtained on this system under light load. Having in mind a theoretical peak usage of at least 128, the platform utilization of v7cluster at 15 can still be significantly optimized.
Overall, the absolute performance of the peak throughput drops from 50 PagesPerSec at SF1 to 15 PagesPerSec at SF10. This could partly be explained by the loss of data locality at SF10, but it could also indicate some query optimization problems. Namely, in the Virtuoso V7 quad implementation the plans do not have a heavy computational load (as this has been precomputed), and in principle the complexity of all queries should be logarithmic in the data size. In this sense, a drop by a factor 3 may indicate that the query optimizer does not find the optimal plans yet.
Figure 6: LOD2 GeoBench score per cost ("bang for the buck") on 8 servers, SF10, 8 query streams
Finally, we computed the PagesPerSec/K$ score for all benchmarked products, both single-server and 8-node cluster setups, excluding OWLIM 5.3 (for which we lack pricing information). It turns out that from a monetary point of view V7 quad is the best deal; whereas for the clustered setup we have the $25K software (plus $100K hardware) Virtuoso V7 Cluster Edition beating a setup of 8 * $8K hardware nodes with the open-source version replicated.
Figure 7: LOD2 GeoBench scores at SF100 with 8 servers (8 cores, 2.4GHz, 256GB RAM) - 8 query streams
The last overall benchmark results are the scores at SF100. Here the trends continue, though the performance of the different query variants is stable and the different results are nearer to each other.

In future experiments, the cluster experiments (or maybe all experiments) should also be performed under a high query load with many more than 8 streams, to ensure that all cores are fully busy under peak load.
3.5 Detailed Query Performance Results

We ran the LOD2 GeoBench at scale factors (SF) 1, 10 and 100 (130M, 1.3G, 13G triples), using 1, 2, 4 and 8 concurrent query streams. Not all systems were tested using all parameters:
- We tested the non-clustered systems only on SF1 and SF10; and Virtuoso V7 Cluster Edition was not tested at SF1, only at SF10 and SF100.

- A result of the different data scaling method in the LOD2 GeoBench v2.0 (vs. v1.0) is that Virtuoso V6 can no longer load the scale factor 10 dataset. It hits a bug, and given that V7 is out, this bug may never get fixed. Therefore Virtuoso V6 is only tested on scale factor 1.
Figure 8: Page-per-second performance for each of the 12 steps (SF1, #8/1)
SF1 #8/1: At SF1 on a single server with 8 concurrent query streams (which should keep the 8 cores busy at least), the results show that for Virtuoso V7 the highest performance is achieved with the new RTree functionality (rtree++); however, the performance improves linearly with higher zoom level, and is quite poor at the lower zoom levels. At the lower zoom levels, the approximate "quad" approach is much better. Interestingly, V6 achieves better performance than V7.

For OWLIM, the quad performance is bad due to it getting bad query plans (because of the disjunctive filters and unions). The RTree support in OWLIM is quite good (rtree++), usually exceeding the RTree support of Virtuoso V6 (but not V7).
Figure 9: Page-per-second performance for each of the 12 steps (SF10, #8/1 and #8/8)
SF10 #8/1: At SF10 on a single server with 8 query streams, the advantage of the quad approach in Virtuoso increases considerably. Different from SF1, at SF10 on V7 the quad performance exceeds that of V7 rtree++ and of V6 quad considerably, the latter because of improvements made in the query optimizer that favour the Instance Retrieval Query. At SF10, the rtree++ performance is no longer very good. This can be explained by the fact that inside the RTree, instances belonging to all facets are stored, not just the four facets required by the Instance Retrieval Query. At a larger scale factor, the increased data size causes I/O to start playing a role. In this experiment at SF10, Virtuoso V6 could not be tested due to the data loading bug mentioned earlier.

In this experiment we also see Virtuoso V7 Cluster Edition results on eight identical machines. It is striking that the basic variant performs very close to rtree. If we further compare basic between cluster (8 machines) and a single machine, scalability for the low zoom levels is near linear (factor 8). These queries perform a lot of work, which gets parallelized. However, the gains in the high zoom levels, which access less data, are more limited. It is an open question why rtree does not provide much benefit in a cluster setting.
SF10 #8/8: We now move to experiments at SF10 with 8 query streams on 8 servers. In the following we compare the Virtuoso V7 Cluster Edition approach with simple replication. The former, following a "true" cluster approach, partitions all data across all servers, hence each server stores 1/8th of the data, and queries get spread out over all servers (parallelized). Simple replication, in contrast, loads the same data independently on eight different machines, and then executes the 8-stream benchmark test by running the single-stream test independently on all 8 machines. As such, the results of this experiment are roughly 8-fold higher than the single-stream single-server test. Note that the hardware is more than 8 times as expensive ($8K vs $100K, due to the cost of the Infiniband switch, one Infiniband network card per server, and cabling). This added price difference makes the replication strategy less attractive in the PagesPerSecond/$ scores presented later. The replication experiments are marked with a star in the legend.
Figure 10: Page-per-second performance for each of the 12 steps in the benchmark (SF10, #8/8)
Replication: This experiment shows replicated OWLIM (owlim5.3*) competing on the higher zoom levels with its RTree support. The replicated Virtuoso V7 (virtuoso7*) with the quad approach scores high, though it is vulnerable in the Instance Retrieval Query at the lower zoom levels where it is used (steps 7-9) and when there are many query results. The overall winner in terms of performance is Cluster Edition with quads (the green dashes), thanks to more reliable performance at steps 7-9, even though it loses out to replication at steps 10-12.
SF100 #8/8: When we scale the dataset to factor 100, we only have results on Virtuoso V7 Cluster Edition on 8 server nodes. At the low zoom levels, the quad approach races ahead, with basic and rtree behaving identically. At higher zoom levels, all approaches improve gradually, but rtree clearly beats basic, and quad clearly beats rtree.
Figure 11: Page-per-second performance for each of the 12 steps (SF100, #8/8)
In the experiments until now, we showed the performance per step; however, please recall that each step is the combination of two queries. For steps 1-6, it is a Facet Count Query with an Instance Aggregation Query, and for steps 7-12 it is a Facet Count Query followed by an Instance Retrieval Query. Also, steps 6, 8, 10 and 12 just pan (to a partially overlapping area at the same zoom level), whereas the other queries zoom in. In the above, this is visible in steps 6, 8, 10 and 12 scoring above the two trend lines that one can construct in the step 1-6 and 7-12 segments.

Analysis of Individual Query Performance. However, it is also interesting to look at the individual queries. Each query stream consists of 24 queries, two per step: first the Facet Count Query, then the Instance Aggregation or Retrieval Query.
The accompanying figures show, at SF1 #8/1, on the right the queries-per-second achieved by the Facet Count Query, and on the left by the Instance Aggregation Query (steps 1-6) and the Instance Retrieval Query (steps 7-12). If we examine the scale, at each step the Instance Queries are the bottleneck. In fact, on Virtuoso 7 the Facet Count Query does not need the quad approximation, as rtree++ is among the best. Here we confirm that the performance dip at steps 7-9 is caused by the Instance Retrieval Query. The reason for this is the large number of instances at these zoom levels. As such, these results point to the fact that in the benchmark the switch-over from Instance Aggregation to Instance Retrieval should better be made at a deeper zoom level.
[Figures: SF1 #8/1 — left: Instance Aggregation Query (steps 1-6) and Instance Retrieval Query (steps 7-12); right: Facet Count Query; series include owlim5.3 (basic, quad, rtree++) and virtuoso6 (basic, rtree)]
At SF10 #8/1, the cost balance between the Instance Queries (left graph) and Facet Count Queries (right graph) shifts, as they become more comparable. Still, the bottleneck is in the first three Instance Retrieval Queries (steps 7-9, left). It is remarkable in these results that for the Instance Retrieval Queries, owlim5.3 does a good job in steps 8-12 (left), in fact beating the Virtuoso 7 rtree++ approach.
At SF100 #8/8, the Facet Count Queries (right graph) and Instance Queries (left graph) have roughly the same cost. Here, the quad approach really wins in the Facet Count Query (right). The Instance Queries (left) generally have lower performance, especially in steps 7-12 (i.e., the Instance Retrieval Query).
[Figures: per-step Facet Count Query performance — SF10 #8/1 with owlim5.3 (basic, quad, rtree++), virtuoso7 (basic, rtree, quad, rtree++) and v7cluster (basic, rtree, quad); SF100 #8/8 with v7cluster (basic, rtree, quad)]
3.6 Full Query Performance Results

In these results, the red numbers are those produced by the slowest runs, whereas the green numbers are the fastest runs.
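The LSCORE, HSCORE and SCORE rows in the tables below are consistent with geometric means of the per-step results: LSCORE over the low-zoom steps 1-6, HSCORE over the high-zoom steps 7-12, and SCORE over all 12 steps. The sketch below reproduces the SF1 #1/1 owlim5.3 basic scores under that assumption; it is our reconstruction, not the official scoring code:

```python
import math

def geomean(values):
    """Geometric mean, appropriate here because per-step throughputs
    span orders of magnitude across zoom levels."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Per-step pages-per-second for SF1 #1/1, owlim5.3 "basic" (from the table).
steps = [0.0346, 0.0556, 0.0780, 0.1237, 0.2079, 0.2028,   # steps 1-6
         0.3517, 0.3610, 0.7034, 0.6758, 1.3709, 1.2764]   # steps 7-12

lscore = geomean(steps[:6])   # low-zoom score,  ~0.096
hscore = geomean(steps[6:])   # high-zoom score, ~0.688
score  = geomean(steps)       # overall score,   ~0.257
```

For this column, the computed values match the tabulated LSCORE (0.0961), HSCORE (0.6876) and SCORE (0.2571) to within rounding of the per-step inputs.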
3.6.1 Scale factor 1, 1 query stream, 1 server

SF1 #1/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0346 0.1176 0.0832 0.0905 4.7641 0.0387 0.1358 4.4662 0.2301 0.6190
STEP02 0.0556 0.1458 0.2586 0.1603 5.3937 0.1134 0.3253 3.7481 0.5640 1.4253
STEP03 0.0780 0.2968 0.7925 0.2619 6.9348 0.2772 1.0380 4.6468 1.8761 3.0703
STEP04 0.1237 0.2347 1.6066 0.3784 7.9176 0.6577 1.2910 5.3821 2.1349 5.0100
STEP05 0.2079 0.9257 2.1772 0.5676 15.3678 1.0667 2.0250 6.4850 2.3651 6.8446
STEP06 0.2028 0.9109 2.2862 0.5628 14.6413 1.0730 2.4520 8.9126 2.5886 6.7294
STEP07 0.3517 0.7462 3.2216 0.8517 1.1614 1.5647 2.6040 2.5953 2.5819 8.8261
STEP08 0.3610 0.5692 3.3288 0.8486 1.4243 1.5382 2.8520 5.0403 2.9967 9.0171
STEP09 0.7034 0.8040 6.2853 1.4900 2.3596 3.5868 4.2123 6.2972 4.4762 17.2414
STEP10 0.6758 0.7642 5.1177 1.4007 2.1092 2.4515 4.4111 7.8003 3.8051 14.0647
STEP11 1.3709 0.9497 9.5602 4.0144 5.0735 6.2383 6.4892 9.1996 6.6711 26.5252
STEP12 1.2764 2.9002 6.6181 2.2123 4.5558 4.1493 6.2422 9.6711 5.2966 18.0832
LSCORE 0.0961 0.3168 0.7176 0.2780 8.1604 0.3119 0.8156 5.1935 1.2128 2.9228
LSCORE/$ 0.0347 1.0200 0.0389 0.1019 0.6491 0.1516 0.3653
HSCORE 0.6876 0.9466 5.2829 1.5409 2.3974 2.8593 4.2104 6.2021 4.0841 14.4740
HSCORE/$ 0.1926 0.2996 0.3574 0.5263 0.7752 0.5105 1.8093
SCORE 0.2571 0.5476 1.9471 0.6545 4.4231 0.9444 1.8532 5.6755 2.2256 6.5044
SCORE/$ 0.0818 0.5528 0.1180 0.2316 0.7094 0.2782 0.8130
3.6.2 Scale factor 1, 2 query streams, 1 server

SF1 #2/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0438 0.2046 0.1377 0.1450 8.7857 0.0592 0.2899 9.3350 0.5868 1.0153
STEP02 0.0760 0.3324 0.3922 0.2556 11.8142 0.1581 0.9248 10.0023 1.5272 2.1669
STEP03 0.1010 0.4646 1.0689 0.4358 12.5180 0.4040 2.7962 6.2473 4.2260 4.5340
STEP04 0.1579 0.7876 2.2714 0.6660 18.5405 0.8659 2.7117 9.9469 4.0895 8.4835
STEP05 0.2746 0.8481 4.6420 0.9965 25.4895 1.7024 3.4837 13.1304 4.8692 13.6153
STEP06 0.2704 0.9686 4.4050 0.9913 26.2031 1.7180 3.6731 19.0339 5.8937 13.4819
STEP07 0.4914 0.9496 6.5400 1.6060 2.0462 3.3723 5.5321 6.6725 5.8411 19.9092
STEP08 0.4976 0.9409 6.0975 1.6073 2.3136 3.3690 6.0500 10.2762 6.7234 19.6660
STEP09 1.0367 2.5329 12.4141 2.9654 5.3145 8.2270 9.8168 12.8568 11.5399 39.1348
STEP10 0.8869 2.6699 10.3294 3.0043 4.2300 5.9599 9.5017 16.0707 10.4757 29.2999
STEP11 1.8653 5.0793 20.1726 6.3119 8.6424 13.6262 13.5619 17.2091 17.0043 55.0651
STEP12 1.7027 3.8357 14.5142 5.3378 7.5906 9.8195 13.5811 17.8367 12.7472 38.4264
LSCORE 0.1258 0.5230 1.17876 0.4690 15.8713 0.4610 1.7210 10.6290 2.7614 4.9919
LSCORE/$ 0.0586 1.9839 0.0576 0.2151 1.3286 0.3451 0.6239
HSCORE 0.9454 2.2132 10.6857 3.0293 4.2201 6.4824 9.1109 12.7630 10.0386 31.3102
HSCORE/$ 0.3786 0.5275 0.8103 1.1388 1.5953 1.2548 3.9137
SCORE 0.3449 1.0759 3.5490 1.1920 8.1840 1.7287 3.9598 11.6472 5.2651 12.5020
SCORE/$ 0.1490 1.0230 0.2160 0.4949 1.4559 0.6581 1.5627
3.6.3 Scale factor 1, 4 query streams, 1 server

SF1 #4/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0703 0.3726 0.2693 0.2875 18.7083 0.1195 0.5566 21.3604 1.5452 2.0748
STEP02 0.1121 0.4790 0.8264 0.5028 25.3611 0.3937 1.5829 18.1676 2.8779 5.3924
STEP03 0.1498 1.00556 2.3242 0.7441 31.2361 0.8582 4.6161 13.0845 8.6183 9.5042
STEP04 0.22611 1.4740 4.7571 1.1240 46.2466 1.9203 5.0335 21.9932 8.2786 16.5111
STEP05 0.3634 2.7782 7.6465 1.6884 74.7055 3.2683 6.9151 30.5185 9.1625 23.0400
STEP06 0.3694 2.6684 7.9984 1.6681 74.7607 3.2947 6.6436 45.7031 10.3364 24.2530
STEP07 0.7162 2.7846 12.2841 3.1894 4.4121 5.7132 11.4062 16.1533 10.6061 33.4092
STEP08 0.6863 2.8749 12.0616 2.9065 4.4726 5.7027 12.4466 23.5253 10.956 33.4019
STEP09 1.4399 4.4919 26.8988 6.7532 8.3481 13.7416 19.6487 28.2182 19.1360 65.3226
STEP10 1.3783 5.8125 19.5487 6.6083 8.09174 10.0076 19.2529 36.7007 17.3637 52.3030
STEP11 2.8112 12.0010 43.3150 14.7862 18.6510 27.0136 29.8168 40.1119 26.7478 96.3516
STEP12 2.4259 9.2902 27.8078 13.0194 15.15110 16.2686 30.6850 43.2767 23.4443 60.7765
LSCORE 0.1817 1.1187 2.3055 0.8357 39.2294 0.9703 3.1287 23.1667 5.5719 9.6245
LSCORE/$ 0.1044 4.9036 0.1212 0.3910 2.8958 0.6964 1.2030
HSCORE 1.3712 5.3409 21.2913 6.5543 8.4910 11.1847 19.1156 29.6368 16.9895 52.9802
HSCORE/$ 0.8192 1.0613 1.3980 2.3894 3.7046 2.1236 6.6225
SCORE 0.4992 2.4444 7.0063 2.34052 18.2510 3.2944 7.7335 26.2028 9.7295 22.5812
SCORE/$ 0.2925 2.2813 0.4118 0.9666 3.2753 1.2162 2.8226
3.6.4 Scale factor 1, 8 query streams, 1 server

SF1 #8/1 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD quad basic rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.5448 0.0563 0.2743 0.4113 36.2784 0.1811 0.9454 40.7706 2.7136 3.4060
STEP02 0.7765 0.0892 0.9029 0.7124 46.3338 0.5223 2.9656 36.3412 7.7649 6.6037
STEP03 1.4639 0.1222 2.3402 1.1828 58.2323 1.3596 8.3323 31.9119 13.4467 13.7075
STEP04 2.5405 0.1910 5.1597 1.8460 90.0222 2.9270 9.1631 47.4474 13.7959 25.0132
STEP05 3.4629 0.3169 8.3595 2.7192 132.4990 5.2606 12.2425 59.0893 13.7280 38.6629
STEP06 4.6291 0.3096 8.8595 2.7852 139.9900 5.3247 13.1893 85.2281 15.7198 38.5826
STEP07 4.5387 0.5878 14.1144 4.9718 7.7641 10.0048 23.0038 33.8289 20.6775 55.1534
STEP08 4.6977 0.6517 15.3833 4.7890 7.9362 10.0792 23.8945 45.0062 21.1386 59.0653
STEP09 7.6525 1.3616 38.3478 10.2151 13.7468 24.4637 40.1600 55.6296 37.3605 105.0690
STEP10 9.7112 1.2118 25.6037 9.2489 13.3397 17.5706 37.1036 72.2938 31.9566 83.1055
STEP11 19.0270 2.2852 66.4700 21.7008 30.4268 45.2330 54.0770 77.3047 51.9836 156.4850
STEP12 13.2889 2.0098 34.6726 18.9416 25.7544 29.1453 54.146 79.2604 45.7914 105.1910
LSCORE 1.7121 0.1503 2.4589 1.3007 73.8155 1.4807 5.7035 47.2966 9.7115 15.0085
LSCORE/$ 0.1625 9.2269 0.1850 0.7129 5.9120 1.2139 1.8760
HSCORE 8.5786 1.1943 27.7408 9.8613 15.3680 19.6025 36.5335 57.7652 32.7411 87.9627
HSCORE/$ 1.2326 1.7960 2.4503 4.5666 7.22066 4.0926 10.9953
SCORE 3.8325 0.4238 8.2590 3.5814 32.5666 5.3876 14.4350 52.2695 17.8317 36.3344
SCORE/$ 0.4476 4.07083 0.6734 1.8043 6.5336 2.2289 4.5418
3.6.5 Scale factor 1, 8 query streams, 8 replicated servers

SF1 #8/8 owlim5.3 owlim5.3 owlim5.3 virtuoso6 virtuoso6 virtuoso6 virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic rtree quad basic rtree quad rtree++
STEP01 0.2772 0.9410 0.6660 0.7247 0.3102 38.1134 1.0867 1.8415 35.7302 4.9520
STEP02 0.4450 1.1669 2.0693 1.2831 0.9078 43.1499 2.6029 4.5126 29.9850 11.4025
STEP03 0.6244 2.3751 6.3406 2.0958 2.2179 55.4785 8.3064 15.0094 37.1740 24.5625
STEP04 0.9902 1.8779 12.8534 3.0279 5.2617 63.3413 10.3306 17.0794 35.0570 40.0802
STEP05 1.6636 7.4060 17.4178 4.5413 8.5342 114.942 16.2042 18.9214 51.8806 54.7570
STEP06 1.6230 7.2879 18.2899 4.5029 8.5846 117.1300 19.622 20.7093 71.3013 53.8358
STEP07 2.8137 5.9697 25.7732 6.8143 12.5176 9.2915 20.833 20.6558 20.7630 70.6090
STEP08 2.8886 4.5539 26.6311 6.7888 12.3058 11.3944 22.8180 23.9736 40.3226 72.1370
STEP09 5.6274 6.4365 50.2829 11.9207 28.6944 18.8768 33.6984 35.8102 50.3778 137.9310
STEP10 5.4065 6.1138 40.9417 11.2061 19.6126 16.8741 35.2890 30.4414 62.4025 112.5180
STEP11 10.9679 7.5980 76.4818 32.1156 49.9064 40.5886 51.9143 53.3689 73.5970 212.2020
STEP12 10.2119 23.2018 52.9450 17.6991 33.1950 36.4465 49.9376 42.3729 77.3694 144.6660
LSCORE 0.7686 2.5324 5.7364 2.2223 2.4934 65.2294 6.5201 9.6946 41.5142 23.3633
LSCORE/$ 0.0317 0.0356 0.9318 0.0931 0.1384 0.5930 0.333762
HSCORE 5.4967 7.5666 42.2283 12.31 22.8554 19.1639 33.6555 32.6461 49.5762 115.7030
HSCORE/$ 0.1759 0.3265 0.2737 0.4807 0.4663 0.7082 1.6529
SCORE 2.0554 5.3774 15.5641 5.2318 7.5490 35.3561 14.8135 17.7902 45.3665 51.9923
SCORE/$ 0.0747 0.1078 0.5050 0.2116 0.2541 0.6480 0.74274
3.6.6 Scale factor 10, 1 query stream, 1 server (8 servers for v7cluster)

SF10 #1/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0050 0.0198 0.0104 0.4749 3.8095 0.5610 0.1166 0.8274 0.0947 0.1757
STEP02 0.0109 0.0207 0.0323 0.7364 3.2862 0.6941 0.1037 1.6655 0.0854 0.4154
STEP03 0.0208 0.0456 0.1077 0.9268 4.5829 0.6060 0.1109 1.7379 0.1068 0.6811
STEP04 0.0346 0.0272 0.2104 1.0624 4.4822 0.9128 0.0942 1.1002 0.0928 0.7865
STEP05 0.0500 0.1210 0.3210 1.3428 5.8858 1.1353 0.1786 2.5316 0.1802 1.0473
STEP06 0.0492 0.1136 0.3207 1.8733 6.9979 1.4039 0.2710 3.9169 0.2731 1.0511
STEP07 0.0667 0.0961 0.4799 1.0271 1.0001 0.7162 0.2425 0.2147 0.2204 0.9768
STEP08 0.0683 0.0788 0.4983 1.1344 1.7382 0.7238 0.3339 0.4377 0.2771 1.0534
STEP09 0.1121 0.0931 1.4551 1.2110 1.6526 0.9791 0.4087 0.6935 0.3241 2.1173
STEP10 0.1049 0.0941 0.9657 1.2301 1.8821 0.8685 0.7598 3.3512 0.4157 1.5506
STEP11 0.1920 0.1191 3.1938 1.3126 1.9704 1.4876 0.9669 3.6563 0.7057 3.6496
STEP12 0.1678 0.6825 1.4376 1.3282 1.9142 1.0537 0.9318 5.3705 0.6004 1.9149
LSCORE 0.0215 0.0438 0.0963 0.9763 4.6834 0.8368 0.1353 1.7222 0.1258 0.5921
LSCORE/$ 0.0088 0.0426 0.0076 0.0169 0.2152 0.0157 0.0740
HSCORE 0.1096 0.1325 1.0749 1.2026 1.6526 0.9403 0.5321 1.2746 0.3895 1.6934
HSCORE/$ 0.0171 0.0236 0.0134 0.0665 0.1593 0.0487 0.2116
SCORE 0.0485 0.0762 0.3217 1.0836 2.7820 0.8870 0.2684 1.4816 0.2214 1.0014
SCORE/$ 0.0098 0.0253 0.008 0.0335 0.1852 0.0276 0.1251
3.6.7 Scale factor 10, 2 query streams, 1 server (8 servers for v7cluster)

SF10 #2/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0065 0.0339 0.0100 0.8428 6.9037 1.0140 0.1989 2.5406 0.1479 0.2817
STEP02 0.0140 0.0475 0.0196 1.3476 8.2174 1.2511 0.2304 4.9220 0.1974 0.6734
STEP03 0.0280 0.0672 0.0912 2.1892 6.2783 1.2983 0.2715 3.9435 0.2591 1.1448
STEP04 0.0451 0.1049 0.1901 2.4971 9.3195 1.7669 0.1964 5.1139 0.1986 1.6744
STEP05 0.0689 0.0953 0.3302 2.7946 11.242 1.8100 0.2883 6.3940 0.2929 2.1664
STEP06 0.0691 0.1240 0.4815 3.5152 14.435 2.0111 0.5444 5.4518 0.5457 2.2474
STEP07 0.0995 0.1169 0.7798 2.0170 2.1313 1.4603 0.5686 0.5905 0.4803 1.7302
STEP08 0.1002 0.1180 0.4186 2.3943 3.7048 1.2957 0.7306 1.1359 0.5880 1.6374
STEP09 0.1738 0.3818 0.6771 2.3188 3.2702 2.1333 1.0197 1.6843 0.6839 3.7533
STEP10 0.1573 0.4045 0.7272 2.3477 3.6686 1.9933 1.3206 7.5586 0.8095 2.5963
STEP11 0.2782 1.3700 6.5831 2.5421 3.4931 2.2806 1.6689 7.5430 1.5672 6.9030
STEP12 0.2377 0.6441 3.3194 2.6716 3.8395 2.1771 1.6871 8.9356 1.1288 3.7002
LSCORE 0.0286 0.0716 0.0905 1.9835 9.0125 1.4817 0.2697 4.5402 0.2494 1.0999
LSCORE/$ 0.0180 0.0818 0.0134 0.0337 0.5675 0.0311 0.1374
HSCORE 0.1621 0.3515 1.2328 2.3721 3.2894 1.8485 1.0786 2.8829 0.8073 2.9821
HSCORE/$ 0.0215 0.0299 0.0168 0.1348 0.3603 0.1009 0.3727
SCORE 0.0682 0.1587 0.3341 2.1691 5.4448 1.6550 0.5394 3.6179 0.4488 1.8111
SCORE/$ 0.0197 0.0495 0.0150 0.0674 0.4522 0.0561 0.2263
3.6.8 Scale factor 10, 4 query streams, 1 server (8 servers for v7cluster)

SF10 #4/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0105 0.0652 0.0242 1.6013 14.0383 1.8670 0.3959 5.8348 0.2644 0.5549
STEP02 0.0212 0.0702 0.0820 2.4171 14.2772 2.1800 0.3698 6.9096 0.3257 1.4373
STEP03 0.0427 0.1441 0.2278 3.4893 10.7229 2.6489 0.4208 10.2357 0.3958 2.2133
STEP04 0.0657 0.1946 0.4703 4.4887 17.2670 3.6181 0.3787 13.6636 0.3721 2.7233
STEP05 0.0992 0.3381 0.7094 5.2234 18.9309 3.2839 0.6232 22.4468 0.6240 3.5838
STEP06 0.1000 0.3374 0.8064 6.6113 29.4153 3.7726 1.0406 27.9520 1.1231 3.4744
STEP07 0.1538 0.3841 1.2721 3.8237 4.4956 2.2509 1.2260 1.7904 0.9262 2.6010
STEP08 0.1479 0.4572 1.1595 4.0627 6.5959 2.5554 1.7328 4.2651 1.2789 2.4533
STEP09 0.2576 0.6944 3.4924 4.7477 5.8628 3.0736 1.6984 4.9390 1.2811 5.3958
STEP10 0.2424 0.9163 2.7187 4.8525 6.6822 3.0457 2.2505 16.0512 1.4027 3.9986
STEP11 0.4361 2.2408 9.6816 4.4008 6.3772 4.1216 3.1623 18.0522 3.1385 10.445
STEP12 0.4076 1.7863 3.3537 4.5142 7.0365 3.7629 2.7759 23.5145 1.9080 4.7708
LSCORE 0.0429 0.1565 0.2228 3.5748 16.5469 2.8002 0.4975 12.3316 0.4553 1.9773
LSCORE/$ 0.0325 0.1505 0.0254 0.0621 1.5414 0.0569 0.2471
HSCORE 0.2515 0.8745 2.7719 5.3825 6.1075 3.0673 2.0357 7.9669 1.5281 5.3566
HSCORE/$ 0.0398 0.0555 0.0279 0.2544 0.9958 0.1910 0.54457
SCORE 0.1039 0.3700 0.7859 3.9581 10.0529 2.9307 1.0063 9.9118 0.8341 2.9350
SCORE/$ 0.0360 0.0914 0.0266 0.1257 1.2389 0.1042 0.3668
3.6.9 Scale factor 10, 8 query streams, 1 server (8 servers for v7cluster)

SF10 #8/1 owlim5.3 owlim5.3 owlim5.3 v7cluster v7cluster v7cluster virtuoso7 virtuoso7 virtuoso7 virtuoso7
METHOD basic quad rtree++ basic quad rtree basic quad rtree rtree++
STEP01 0.0111 0.1058 0.0179 2.4963 17.8207 3.3792 0.6484 12.6956 0.4822 0.7251
STEP02 0.0244 0.1041 0.0554 3.6211 16.9718 4.4317 0.6449 17.6573 0.5667 1.7568
STEP03 0.0491 0.1936 0.1601 5.1298 17.7603 5.1870 0.7271 22.5014 0.7020 2.9094
STEP04 0.0782 0.3317 0.3211 6.6643 23.7782 6.1971 0.6475 28.0677 0.6225 3.8872
STEP05 0.1172 0.4156 0.5781 7.6602 27.0735 6.3392 0.9747 35.9309 0.9837 5.1091
STEP06 0.1174 0.5800 0.5869 11.6547 42.7366 7.4973 1.9773 47.5651 2.2304 4.8659
STEP07 0.1784 0.5611 1.1142 6.3838 8.0113 3.9240 2.4544 3.6566 1.5296 2.5534
STEP08 0.1741 0.5952 1.1449 6.9567 11.1574 4.1500 2.8321 6.4509 1.9406 2.6415
STEP09 0.3091 1.0202 3.6585 6.7299 9.2481 5.5324 3.3316 8.8511 2.2520 5.4293
STEP10 0.2730 1.3726 2.5702 7.6770 10.8405 5.2976 3.9193 29.6419 2.4232 4.1596
STEP11 0.5210 3.3685 9.4047 6.3636 10.1749 6.7232 5.1006 31.5107 4.5335 9.8784
STEP12 0.4747 2.6020 3.7997 7.1536 10.7785 6.4210 4.5376 40.4639 3.4996 5.3200
LSCORE 0.0493 0.2356 0.1609 5.4932 22.9646 5.3245 0.8509 24.9306 0.8000 2.6639
LSCORE/$ 0.0500 0.2089 0.0484 0.1063 3.1163 0.1000 0.3329
HSCORE 0.2943 1.2650 2.7448 6.8573 9.9619 5.2325 3.5769 14.0949 2.5205 4.4699
HSCORE/$ 0.0623 0.0906 0.0476 0.4471 1.7618 0.3150 0.5587
SCORE 0.1205 0.5460 0.6647 6.1375 15.1252 5.2783 1.7446 18.7455 1.4200 3.4507
SCORE/$ 0.0830 0.0558 0.1376 0.0480 0.2180 2.3431 0.17751 0.4313
3.6.10 Scale factor 10, 8 query streams, 8 replicated servers

SF10 #8/8 owlim5.3* owlim5.3* owlim5.3* virtuoso7* virtuoso7* virtuoso7* virtuoso7*
METHOD basic quad rtree++ basic rtree quad rtree++
STEP01 0.0406 0.1590 0.0839 0.9335 0.7577 6.6192 1.4058
STEP02 0.0871 0.1657 0.2590 0.8303 0.6837 13.3240 3.3235
STEP03 0.1666 0.3649 0.8623 0.8872 0.8547 13.9034 5.4495
STEP04 0.2774 0.2179 1.6838 0.7539 0.7429 8.8018 6.2927
STEP05 0.4000 0.9684 2.5683 1.4292 1.4420 20.2530 8.3787
STEP06 0.3942 0.9093 2.5656 2.1685 2.1855 31.3357 8.4095
STEP07 0.5339 0.7689 3.8395 1.9402 1.7636 1.7182 7.8147
STEP08 0.5470 0.6305 3.9868 2.6718 2.2173 3.5018 8.4272
STEP09 0.8974 0.7452 11.6414 3.2697 2.5930 5.5482 16.9384
STEP10 0.8394 0.7531 7.7257 6.0785 3.3257 26.8097 12.4050
STEP11 1.5367 0.9532 25.5510 7.7354 5.6457 29.2505 29.1971
STEP12 1.3430 5.4607 11.5009 7.4550 4.8039 42.9646 15.3198
LSCORE 0.1720 0.3504 0.7699 1.0822 1.0060 13.7665 4.7334
LSCORE/$ 0.0154 0.0143 0.1966 0.0676
HSCORE 0.8767 1.0597 8.5926 4.2533 3.1142 10.1884 13.5361
HSCORE/$ 0.0607 0.0444 0.1455 0.19337
SCORE 0.3884 0.6093 2.5720 2.1455 1.7700 11.8431 8.0045
SCORE/$ 0.0306 0.0252 0.1691 0.1143
3.6.11 Scale factor 100, 1 and 2 query streams, 8 partitioned servers

SF100 #1/1 v7cluster v7cluster v7cluster SF100 #2/1 v7cluster v7cluster v7cluster
METHOD basic quad rtree METHOD basic quad rtree
STEP01 0.1058 0.4640 0.0821 STEP01 0.1683 0.6229 0.1695
STEP02 0.2303 0.4118 0.3703 STEP02 0.3586 0.8682 0.5513
STEP03 0.3943 0.9395 0.5208 STEP03 0.5829 0.5552 1.0560
STEP04 0.5587 1.0636 0.4454 STEP04 0.8644 2.5406 1.0788
STEP05 0.7482 2.1896 0.6861 STEP05 1.1468 4.5236 1.5198
STEP06 0.9933 2.9455 0.7742 STEP06 1.4029 6.4307 1.5022
STEP07 0.8034 0.5584 0.2992 STEP07 1.1142 0.6679 0.6052
STEP08 0.8792 1.0936 0.3157 STEP08 1.4387 2.1125 0.6102
STEP09 0.9900 1.0012 0.4326 STEP09 1.6271 1.7392 0.6401
STEP10 0.9835 1.1958 0.2905 STEP10 1.8525 2.0997 0.5107
STEP11 1.0821 1.2656 0.5427 STEP11 1.7895 2.1142 1.0567
STEP12 1.1686 1.3336 0.5325 STEP12 1.7769 2.2917 0.8551
LSCORE 0.3984 1.0353 0.3943 LSCORE 0.6049 1.6760 0.7901
LSCORE/$ 0.0036 0.0094 0.0035 LSCORE/$ 0.0055 0.0152 0.0071
HSCORE 0.9770 1.0356 0.3885 HSCORE 1.5764 1.7092 0.6914
HSCORE/$ 0.0088 0.0094 0.0035 HSCORE/$ 0.0143 0.0155 0.0062
SCORE 0.6239 1.0355 0.3914 SCORE 0.9765 1.6925 0.7391
SCORE/$ 0.0056 0.0094 0.0035 SCORE/$ 0.0088 0.0154 0.0067
3.6.12 Scale factor 100, 4 and 8 query streams, 8 partitioned servers

SF100 #4/1 v7cluster v7cluster v7cluster SF100 #8/1 v7cluster v7cluster v7cluster
METHOD basic quad rtree METHOD basic quad rtree
STEP01 0.2992 1.3469 0.3275 STEP01 0.2387 1.6907 0.3302
STEP02 0.5750 0.9592 0.7332 STEP02 0.5327 1.5570 0.5674
STEP03 0.9540 2.5549 1.4347 STEP03 0.7490 2.6810 1.2959
STEP04 1.1772 4.1456 1.4000 STEP04 1.1318 4.1801 0.9172
STEP05 1.2345 6.470 1.4869 STEP05 1.3073 6.1331 1.3168
STEP06 1.5035 12.354 2.0413 STEP06 1.7921 9.9537 1.6515
STEP07 1.8901 1.8784 1.3785 STEP07 1.6881 2.1458 0.7626
STEP08 2.3643 3.2899 1.3904 STEP08 2.1298 3.0693 0.9078
STEP09 1.9110 2.4358 0.9731 STEP09 1.8109 2.6453 0.9261
STEP10 2.5792 2.7515 1.0189 STEP10 2.2441 3.2538 0.8878
STEP11 2.2809 2.8921 1.0824 STEP11 1.7612 3.0709 0.9556
STEP12 3.2223 3.2312 1.1578 STEP12 1.7843 3.4796 1.0949
LSCORE 0.8429 3.2084 1.0656 LSCORE 0.7951 3.4863 0.8862
LSCORE/$ 0.0076 0.0291 0.0097 LSCORE/$ 0.0072 0.0317 0.0080
HSCORE 2.3338 2.6985 1.1555 HSCORE 1.8918 2.9076 0.9173
HSCORE/$ 0.0212 0.0245 0.0105 HSCORE/$ 0.0172 0.0264 0.0083
SCORE 1.4026 2.9424 1.1096 SCORE 1.2265 3.1838 0.9016
SCORE/$ 0.0127 0.0267 0.0101 SCORE/$ 0.0111 0.0289 0.0082
4. Conclusions

In this report, we have described an evaluation of the LOD2 GeoBench on a variety of system configurations. We now draw conclusions on the following issues:
The benchmark itself. The LOD2 GeoBench is a challenging benchmark; specifically, the Instance Aggregation and Retrieval Queries pose an intense workload on the system. We see that exact implementations (i.e. basic, rtree, rtree++, but not quad) have a hard time scaling the Instance Aggregation Query well at the higher zoom levels. We also see that the Instance Retrieval Query at the first zoom levels where it is used (7-9) causes a dip in performance, because such retrieval queries yield many instances and access many data pages in the database subsystem. On the one hand this tells us that the benchmark is interesting. Publishing about this benchmark will put emphasis on finding better solutions to e.g. the Instance Retrieval Query, for instance by pushing the envelope in query optimization. Further, the inherent problems at the lower zoom levels may help the RDF server vendors to provide better hooks to perform indexing and pre-computation. As ideas for a v3.0 of the benchmark, we should consider moving the switch-over point from Instance Aggregation Query to Instance Retrieval Query to a deeper zoom level. This would be a natural reaction in a real-life application to ensure dependable latencies across queries. Further, in the future we need to test on larger data, and with many more concurrent query streams. Finally, a better analysis of the performance stability of the results is needed. Because we are working on real data, the cardinalities of the selections are not fully predictable and can vary considerably, potentially introducing noise in the benchmark scores. This could be addressed by making the query generator even more intelligent in generating query patterns, such that it generates properly and evenly balanced parameter bindings.
The state of RDF database technology. The three rightmost result groups in Figure 3 are an example of the achievements in the LOD2 project, where academic research performed by CWI on columnar and vectorized query execution has measurably improved the performance of the OpenLink Virtuoso product from V6 to V7, by a factor 7 in this case, creating a competitive advantage. In general, geographical index technology is shown by the LOD2 GeoBench to be quite effective at the high zoom levels. The plans do show some unexpected results, with certain quad Virtuoso V7 query plans becoming slower than in V6, which is likely down to query optimizer issues. Query optimization remains one of the biggest challenges in SPARQL query execution, which in the LOD2 GeoBench shows in faults in properly handling the disjunctive queries (the four facet selections) and the complex quad expressions.
Even though the thinking in the RDF community may be that RDF database support is closing in on the industry readiness of relational technology, the LOD2 GeoBench shows some very significant conceptual holes. For instance, relational technology offers important physical design concepts, such as materialized views and clustered indexes, explicitly created for certain predicates. These concepts cannot be expressed in the RDF world. For instance, in a multi-resolution map situation, a relational DBA or database designer would likely develop multiple tables at multiple resolutions, and create separate (RTree) indexes for these. Such tables are materialized views that store precomputed expressions (like facet counts at a certain granularity). This means that queries at a low zoom level would only access the materialized view relevant for them, which at a low zoom level could have pruned most of the detailed data (the individual lamp posts). Accessing that materialized view through its RTree index will be efficient. If all materialized views for the different resolutions were unified into one big data structure, all information for other zoom levels would end up intermingled in the same disk blocks of the RTree, such that most of the data scanned would be irrelevant (because it is for a different resolution). This unifying of all data in one big bucket is what the RDF model does. What is needed are mechanisms to create materialized views (maybe by constructing derived data in a special kind of triple graph) and to allow certain indexes (such as RTree) to be built separately for such a triple graph. That way the RTree will only contain relevant information. Currently, RDF database technology does not offer such database design concepts.
RDF geographical browsing application design: faceted browsing on large datasets needs pre-computation. There is no way a Google Maps experience can be created straight from the raw base data (triples) in a dataset. The quad approach described and benchmarked here specifically transforms the application database needs in such a way that precomputation of expressions becomes possible. In this case, the quad approach precomputes facet instance counts for all tiles, at multiple different granularities. Queries then use these precomputed counts to avoid having to go to the base data. It cannot be stressed enough that without precomputation, queries at a high zoom level would never perform well, nor would they ever produce nice-looking results (just millions of lamp posts that cannot be sensibly drawn on a map). There is also little hope that such precomputation and indexing could be arranged fully automatically. This means that application designers need to take the database design issue very seriously.
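To illustrate the kind of precomputation this relies on, the sketch below assigns each point to a map tile at several zoom levels and aggregates facet counts per (zoom, tile, facet) key. It uses the standard Web-Mercator ("slippy map") tile scheme as a stand-in; the tile numbering actually used by the quad approach may differ, and all names here are our own:

```python
import math
from collections import Counter

def tile_of(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Standard Web-Mercator (slippy-map) tile coordinates at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def precompute_facet_counts(points, zoom_levels):
    """points: iterable of (lon, lat, facet). Returns counts keyed by
    (zoom, tile, facet), so a low-zoom Facet Count Query reads a handful
    of aggregates instead of scanning millions of base triples."""
    counts = Counter()
    for lon, lat, facet in points:
        for z in zoom_levels:
            counts[(z, tile_of(lon, lat, z), facet)] += 1
    return counts
```

A Facet Count Query for a viewport at a given zoom level then sums the precomputed counts of the tiles intersecting the viewport, rather than touching the instance geometries themselves.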
The latest version of the LOD2 Geographical Browser adds significant new features that make the association between geographical information and RDF data flexible to specify. The older version, of which a screenshot was shown in Figure 1, just assumed that the geographical literal (point, polyline, polygon) would be a direct property of a facet instance subject. It is, however, also possible to associate facet instances over long(er) join paths to geographical literals. The consequence of such longer join paths is that geographical queries will experience less locality from the RTree join path; more importantly, an interface where such join paths could be varied flexibly at run-time would make it much more difficult to generate materialized views (such as our pre-generated quad triples). Creating a Browsing Interface that flexibly allows users to specify these associations, yet renders result pages in interactive time on very large-scale data, is extremely challenging (if not impossible). Another issue is whether ordinary users, accessing (RDF) data via graphical interfaces, are looking for the flexibility to associate join paths through a complex data model that is likely unknown to them. It seems more probable that if relevant complex join paths between instances and their geography exist, it would be the task of an application designer to identify these. In such a case, the aforementioned materialized view functionality that is called for in RDF database systems would come in handy to pre-materialize these geographies as direct properties and accelerate them in separate RTree indexes.
5. Appendix: Configuration Details

5.1 Software
Virtuoso 6: Version 06.04.3132-pthreads for Linux, as of May 14, 2012
Virtuoso 7 (for both the single-server and the cluster version): we used a development version of OpenLink Virtuoso Universal Server: Version 07.00.3203-pthreads for Linux, as of Aug 18, 2013
Owlim: Owlim-SE Version 5.3.6156; Tomcat Version 7.0.30
5.2 Hardware

We used the CWI Scilens (www.scilens.org) cluster for the benchmark experiments. This cluster is designed for high I/O bandwidth and consists of multiple layers of machines. In order to get large amounts of RAM, we used only the "bricks" layer, which contains its most powerful machines. The machines were connected by Mellanox MCX353A-QCBT ConnectX3 VPI HCA cards (QDR IB 40Gb/s and 10GigE) through an InfiniScale IV QDR InfiniBand switch (Mellanox MIS5025Q). Each machine has the following specification.
Hardware (8 machines):
- Processors: 2x Intel(R) Xeon(R) CPU E5-2650, 2.00GHz (8 cores & hyperthreading), Sandy Bridge architecture
- Memory: 256GB
- Hard disks: 3x 1.8TB (7,200rpm) SATA in RAID 0 (180MB/s sequential throughput)
Software:
- Operating system: Linux version 3.3.4-3.fc16.x86_64
- Filesystem: ext4
- Java version and JVM: Version 1.6.0_31, 64-Bit Server VM (build 20.6-b01)
The total cost of this configuration was EUR 70,000 when acquired in 2012.
5.3 Configuration files

Virtuoso 6 & Virtuoso 7 & V7 cluster

Each database has a virtuoso.ini file as the configuration file. For the cluster version, in addition to the virtuoso.ini file, there are three other configuration files in each node: cluster.ini, virtuoso.global.ini, clusterglobal.ini.
- The virtuoso.ini file reads:
[Database]
DatabaseFile = virtuoso.db
TransactionFile = virtuoso.trx
ErrorLogFile = virtuoso.log
ErrorLogLevel = 7
FileExtend = 200
Striping = 0
Syslog = 0
;
; Server parameters
;
TempStorage = TempDatabase
[Parameters]
ServerPort = 1113
ServerThreads = 100
AsyncQueueMaxThreads = 50
ThreadsPerQuery = 32
CheckpointInterval = 120
NumberOfBuffers = 6000000
MaxDirtyBuffers = 450000
MaxCheckpointRemap = 2500000
DefaultIsolation = 2
MaxMemPoolSize = 40000000
StopCompilerWhenXOverRunTime = 1
AdjustVectorSize = 1
IndexTreeMaps = 64
FDsPerFile = 4
UnremapQuota = 0
CaseMode = 2
AllowOSCalls = 1
SafeExecutables = ../../bin/isql
Debug = 0
SQLOptimizer = 1
CallstackOnException = 0
PlDebug = 0
DirsAllowed = /,., ../../vad,../../dataset
MaxVectorSize = 500000
[HTTPServer]
ServerPort = 8892
ServerThreads = 30
ServerRoot = .
FTPServerPort = 10565
FTPServerAnonymousLogin = 1
FTPServerTimeout = 1200
[AutoRepair]
BadParentLinks = 0
BadDTP = 0
[Client]
SQL_QUERY_TIMEOUT = 0
SQL_TXN_TIMEOUT = 0
SQL_PREFETCH_ROWS = 100
SQL_PREFETCH_BYTES = 16000
[VDB]
ArrayOptimization = 0
NumArrayParameters = 10
[TempDatabase]
DatabaseFile = virtuoso.tdb
TransactionFile = virtuoso.ttr
FileExtend = 200
[Replication]
ServerName = virt6565
ServerEnable = 1
QueueMax = 50000
[URIQA]
DefaultHost = localhost.localdomain:13565
LocalHostNames = localhost:13565, master:13565, 10.1.1.1:13565
LocalHostMasks = master_.iv.dev.null:13565, master_:13565
Note that in each cluster node, the server port is different.
- The virtuoso.global.ini in node 1 (master node) reads:
[Parameters]
MaxQueryMem = 30G
MaxVectorSize = 1000000
Affinity = 1-7 16-23
ListenerAffinity = 0
[Flags]
enable_subscore = 0
dfg_empty_more_pause_msec = 100
dfg_max_empty_mores = 100000
qp_thread_min_usec = 100
cl_dfg_batch_bytes = 100000000
enable_high_card_part = 1
enable_vec_reuse = 1
mp_local_rc_sz = 0
dbf_explain_level = 3
enable_feed_other_dfg = 1
enable_cll_nb_read = 1
dbf_no_sample_timeout = 1
Note that most of the parameters in virtuoso.global.ini are the same for every node, except "Affinity", which is 9-15 24-31 for the nodes at an even index (e.g., node 2, node 4, etc.).
- The cluster.ini in node 1 (master node) reads:
[Cluster]
Threads = 200
Master = Host1
ThisHost = Host1
ReqBatchSize = 10000
BatchesPerRPC = 4
BatchBufferBytes = 20000
LocalOnly = 2
MaxKeepAlivesMissed = 3000
[ELASTIC]
Slices = 16
Segment1 = 1024, cl1/cl1.db = q1
Note that only the "ThisHost" parameter is changed for other nodes. "Slices = 16" appears only in the master node.
- The clusterglobal.ini reads:
[Cluster]
Threads = 200
Master = Host1
ReqBatchSize = 10000
BatchesPerRPC = 4
BatchBufferBytes = 20000
LocalOnly = 2
MaxKeepAlivesMissed = 2000
Host1 = 192.168.64.203:22201
Host2 = 192.168.64.203:22202
Host3 = 192.168.64.204:22203
Host4 = 192.168.64.204:22204
Host5 = 192.168.64.209:22205
Host6 = 192.168.64.209:22206
Host7 = 192.168.64.207:22207
Host8 = 192.168.64.207:22208
Host9 = 192.168.64.205:22209
Host10 = 192.168.64.205:22210
Host11 = 192.168.64.211:22211
Host12 = 192.168.64.211:22212
Host13 = 192.168.64.212:22213
Host14 = 192.168.64.212:22214
Host15 = 192.168.64.213:22215
Host16 = 192.168.64.213:22216
Owlim
- The getting-started application was used for bulk loading. The owlim.ttl file in the getting-started application reads:
# Sesame configuration template for a owlim repository
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
rep:repositoryID "owlim" ;
rdfs:label "OWLIM Getting Started" ;
rep:repositoryImpl [
rep:repositoryType "openrdf:SailRepository" ;
sr:sailImpl [
sail:sailType "owlim:Sail" ;
owlim:owlim-license "OWLIM_SE_01092013_128cores.license" ;
owlim:entity-index-size "500000000" ;
owlim:repository-type "file-repository" ;
owlim:ruleset "empty" ;
owlim:storage-folder "owlim-storage" ;
owlim:transaction-mode "fast" ;
# OWLIM-SE parameters
owlim:cache-memory "120G" ;
# OWLIM-Lite parameters
owlim:noPersist "false" ;
]
].
- The example.sh script in the getting-started application reads:
foo=`pwd`
cd ..
. ./setvars.sh
cd $foo
#$JAVA_HOME/bin/java -Xmx512m -cp "bin:$CP_TESTS" GettingStarted $*
$JAVA_HOME/bin/java -d64 -Xmx200G -Xms160G -Dcache-memory=100G -Ddisable-plugins=rdfpriming -cp "bin:$CP_TESTS" GettingStarted context= $*
‐ Owlim databases have a Sesame template file in ~/.aduna/openrdf-sesame-console/templates/. The Sesame template file reads:
#
# Sesame configuration template for an OWLIM-SE repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
rep:repositoryID "olgeo10" ;
rdfs:label "OWLIM Geo 10" ;
rep:repositoryImpl [
rep:repositoryType "openrdf:SailRepository" ;
sr:sailImpl [
sail:sailType "owlim:Sail" ;
owlim:owlim-license "OWLIM_SE_01092013_128cores.license" ;
owlim:base-URL "{%Base URL|http://example.org/owlim#%}" ;
owlim:defaultNS "{%Default namespaces for imports(';' delimited)%}" ;
owlim:entity-index-size "{%Entity index size|200000%}" ;
owlim:entity-id-size "{%Entity ID bit-size|32%}" ;
owlim:imports "{%Imported RDF files(';' delimited)%}" ;
owlim:repository-type "{%Repository type|file-repository%}" ;
owlim:ruleset "{%Rule-set|owl-horst-optimized%}" ;
owlim:storage-folder "{%Storage folder|storage%}" ;
owlim:enable-context-index "{%Use context index|false%}" ;
owlim:cache-memory "50G" ;
owlim:tuple-index-memory "{%Main index memory|80m%}" ;
owlim:enablePredicateList "{%Use predicate indices|false%}" ;
owlim:predicate-memory "{%Predicate index memory|0%}" ;
owlim:fts-memory "{%Full-text search memory|0%}" ;
owlim:ftsIndexPolicy "{%Full-text search indexing policy|never%}" ;
owlim:ftsLiteralsOnly "{%Full-text search literals only|true%}" ;
owlim:in-memory-literal-properties "{%Cache literal language tags|false%}" ;
owlim:enable-literal-index "{%Enable literal index|true%}" ;
owlim:index-compression-ratio "{%Index compression ratio|-1%}" ;
owlim:check-for-inconsistencies "{%Check for inconsistencies|false%}" ;
owlim:disable-sameAs "{%Disable OWL sameAs optimisation|false%}" ;
owlim:enable-optimization "{%Enable query optimisation|true%}" ;
owlim:transaction-mode "{%Transaction mode|safe%}" ;
owlim:transaction-isolation "{%Transaction isolation|true%}" ;
owlim:query-timeout "{%Query time-out (seconds)|0%}" ;
owlim:query-limit-results "{%Limit query results|0%}" ;
owlim:throw-QueryEvaluationException-on-timeout "{%Throw exception on query time-out|false%}" ;
owlim:useShutdownHooks "{%Enable shutdown hooks|true%}" ;
owlim:read-only "{%Read-only|false%}" ;
]
].
5.4 Bulk Load
Virtuoso 6
‐ Bulk-loading was run with only a single loading process. Bulk-loading for scale 1 with Virtuoso 6 takes 3h 35m.
11:41:47 PL LOG: Loader started
13:43:31 Checkpoint started
13:45:27 Checkpoint finished, log reused
15:16:20 PL LOG: No more files to load. Loader has finished
Virtuoso 7 ‐ Bulk-loading was run with 14 loading processes in parallel. For example, bulk-loading for scale 10 with Virtuoso 7 takes 1h and 32 minutes.
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
01:38:19 PL LOG: Loader started
02:30:24 PL LOG: No more files to load. Loader has finished,
02:31:02 PL LOG: No more files to load. Loader has finished,
02:31:58 PL LOG: No more files to load. Loader has finished,
02:32:29 PL LOG: No more files to load. Loader has finished,
02:33:47 PL LOG: No more files to load. Loader has finished,
02:36:08 PL LOG: No more files to load. Loader has finished,
02:39:10 PL LOG: No more files to load. Loader has finished,
02:40:21 PL LOG: No more files to load. Loader has finished,
02:40:21 PL LOG: No more files to load. Loader has finished,
02:45:59 PL LOG: No more files to load. Loader has finished,
02:46:06 PL LOG: No more files to load. Loader has finished,
02:47:06 PL LOG: No more files to load. Loader has finished,
02:47:39 PL LOG: No more files to load. Loader has finished,
03:10:47 PL LOG: No more files to load. Loader has finished,
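The 1h32m figure follows from the first "Loader started" and the last "Loader has finished" timestamps above. A small sketch of that arithmetic, assuming all timestamps fall on the same day:

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds since midnight; 10# forces base-10 so
# leading zeros (e.g. "09") are not misread as octal.
to_secs() {
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

start=$(to_secs 01:38:19)   # first "Loader started"
end=$(to_secs 03:10:47)     # last "Loader has finished"
echo "elapsed: $(( (end - start) / 60 )) minutes"   # prints: elapsed: 92 minutes
```

The same arithmetic on the scale-1 Virtuoso 6 log above (11:41:47 to 15:16:20) yields the 3h35m quoted there.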
V7 cluster ‐ Bulk-loading was run with 2 loading processes on each node (thus, 32 loading processes across all 16 nodes). For example, bulk-loading for scale 100 with the V7 cluster takes 5h and 11 minutes on the master node.
17:13:50 PL LOG: Loader started
17:16:07 PL LOG: Loader started
22:24:00 PL LOG: No more files to load. Loader has finished
22:24:02 PL LOG: No more files to load. Loader has finished
Owlim ‐ Bulk-loading was run using the getting-started application. The dataset is copied to the preload directory and then loaded into the Owlim repository by running the script example.sh. For example, bulk-loading for scale 1 with Owlim takes 1h 25m.
18:34:40 ===== Load Files (from the 'preload' parameter) ==========
18:34:40 Loading files from: /scratch/duc/lod2/GeoBench/owlim/owlim-se-5.3.6156/getting-started/./preload
Loading FacetCount12.nt 373566 statements
Loading FacetCount14.nt . 731376 statements
Loading FacetCount16.nt .. 1380933 statements
Loading FacetCount18.nt ..... 2563677 statements
Loading FacetCount20.nt ......... 4681485 statements
Loading FacetCount22.nt ................ 8342163 statements
Loading FacetCount24.nt ............................ 14383152 statements
Loading FacetMap12.nt ......... 4627860 statements
Loading FacetMap14.nt ................ 8364460 statements
Loading FacetMap16.nt ............................. 14697532 statements
Loading FacetMap18.nt ................................................. 24647112 statements
Loading FacetTile.nt ........................................................ 28258360 statements
Loading LGD-Dump-Ontology.nt 8721 statements
Loading refined_LGD-Dump-RelevantNodes.sorted.nt .......................................................................................................................................... 69495409 statements
Loading refined_LGD-Dump-RelevantWays.sorted.nt ........................................................................................................................................ 68475661 statements
19:59:54 TOTAL: 251031467 statements loaded
5.4.1 Sizing
Virtuoso 6 & Virtuoso 7
The database size is computed by measuring the size of the virtuoso.* files in each database directory. For example, the database of scale 10 is measured:
[duc@bricks05 10gindex]$ ls -al -h virtuoso.*
-rw-r--r-- 1 duc ins1 108G Aug 21 12:03 virtuoso.db
-rwxrwxr-x 1 duc ins1 1.8K Aug 6 16:58 virtuoso.ini
-rw-r--r-- 1 duc ins1 14K Aug 21 12:03 virtuoso.log
-rw-r--r-- 1 duc ins1 0 Aug 1 01:34 virtuoso.pxa
-rw-r--r-- 1 duc ins1 14M Aug 21 02:54 virtuoso.tdb
-rw-r--r-- 1 duc ins1 0 Aug 21 12:03 virtuoso.trx
V7 Cluster
The database size is computed by summing the size of each database directory on each node. For example, the database of scale 100 is measured:
du -s -h /scratch/duc/lod2/cg100/*/cl*/
Database size at the node bricks03
88G /scratch/duc/lod2/cg100/01/cl1/
81G /scratch/duc/lod2/cg100/02/cl2/
Database size at the node bricks04
74G /scratch/duc/lod2/cg100/03/cl3/
73G /scratch/duc/lod2/cg100/04/cl4/
Database size at the node bricks09
71G /scratch/duc/lod2/cg100/05/cl5/
71G /scratch/duc/lod2/cg100/06/cl6/
Database size at the node bricks07
69G /scratch/duc/lod2/cg100/07/cl7/
93G /scratch/duc/lod2/cg100/08/cl8/
Database size at the node bricks05
67G /scratch/duc/lod2/cg100/09/cl9/
69G /scratch/duc/lod2/cg100/10/cl10/
Database size at the node bricks11
66G /scratch/duc/lod2/cg100/11/cl11/
68G /scratch/duc/lod2/cg100/12/cl12/
Database size at the node bricks12
69G /scratch/duc/lod2/cg100/13/cl13/
76G /scratch/duc/lod2/cg100/14/cl14/
Database size at the node bricks13
69G /scratch/duc/lod2/cg100/15/cl15/
72G /scratch/duc/lod2/cg100/16/cl16/
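The per-slice du -s -h listings above can be totalled with a short awk pipeline; a sketch assuming all sizes carry a G suffix, as they do here:

```shell
#!/bin/bash
# Sum "NNG path" lines (the du -s -h output above) into a total in GB.
total_gb() {
  awk '{ sub(/G$/, "", $1); sum += $1 } END { print sum "G" }'
}

printf '88G cl1\n81G cl2\n' | total_gb   # prints: 169G
```

Applied to all 16 slices listed above, this gives 1176G, i.e. roughly 1.15 TB for the scale-100 cluster database.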
Owlim
The database size is computed by measuring the size of the created repository. For example, the database size of scale 1 is measured:
[duc@bricks13 data]$ du -s -h openrdf-sesame/repositories/olgeo1
24G openrdf-sesame/repositories/olgeo1
5.4.2 Bulk Load Script
Virtuoso 6 and Virtuoso 7
The bulk loading script for Virtuoso is applied to an empty database. First, register_load_files.sql is run to register the list of files to load. Then the loading process is run using the script rdfload.sh. For Virtuoso 7, 14 "rdf_loader_run()" calls were executed in parallel.
isql 1113 dba dba < register_load_files.sql
./rdfload.sh
[duc@bricks13 10gindex]$ cat register_load_files.sql
ld_dir ('/scratch/duc/lod2/GeoBench/datasets/10geoindex/', '%.gz', 'http://GeoBench.org');
[duc@bricks13 10gindex]$ cat rdfload.sh
echo "Start loading "
date
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
isql 1113 dba dba exec="rdf_loader_run();" &
wait
isql 1113 dba dba exec="checkpoint;"
echo "end loading"
date
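The 14 identical isql lines in rdfload.sh can equivalently be written as a loop with the loader count as a parameter; a sketch (run_parallel is a hypothetical helper, the port 1113 and dba/dba credentials are those used above):

```shell
#!/bin/bash
# Launch N copies of a command in parallel and wait for all of them.
run_parallel() {
  local n=$1; shift
  for i in $(seq "$n"); do
    "$@" &
  done
  wait
}

# Against a live Virtuoso server this would be:
#   run_parallel 14 isql 1113 dba dba exec="rdf_loader_run();"
#   isql 1113 dba dba exec="checkpoint;"
```

The final checkpoint after wait matters: it persists the loaded data, as in the original script.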
V7 Cluster
The dataset files are divided equally among the machines. On each machine, register_load_files_GEO.sql is used to register the list of files to load on that machine.
ssh bricks03 "isql 1113 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks04 "isql 12203 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks09 "isql 12205 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks07 "isql 12207 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks05 "isql 12209 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks11 "isql 12211 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks12 "isql 12213 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
ssh bricks13 "isql 12215 dba dba < /scratch/duc/lod2/cg100/register_load_files_GEO.sql"
[duc@bricks05 /]$ cat /scratch/duc/lod2/cg100/register_load_files_GEO.sql
ld_dir ('/scratch/duc/lod2/cg100/datasetg100_5', '%.gz', 'http://linkedgeodata.org');
Then the master node starts the loading process on all the nodes.
cl_exec (' rdf_ld_srv ()' ) &
cl_exec (' rdf_ld_srv ()' ) &
Owlim
The bulk loading is done by calling example.sh in the getting-started application.
cd /scratch/duc/lod2/GeoBench/owlim/owlim-se-5.3.6156/getting-started
./example.sh