Semantic Search Over The Web

By ALİ ERKAN

Introduction

• Semantic search aims to improve the accuracy of the search process by understanding the context and limiting ambiguity.
• It does so by making the semantics of Web content machine-understandable.
• The semantic Web creates associations between different representations of the same real-world entity.
• These associations allow data from many different sources to be interlinked (the linked open data cloud).
• Existing solutions are either search engines that simply index semantic Web data, or traditional search engines enhanced with some basic form of synonym usage, as supported by Google and Bing.
• The semantic Web is a huge distributed database that we can query to get information coming from different sources.

Nature of Semantic Data

Resource Description Framework (RDF)

• All data items in RDF are uniformly represented as triples of the form (subject, predicate, object) or, equivalently, (subject, property, value).

• RDF extends the linking structure of the Web by using URIs to name the relationship between things as well as the two ends of the link.

• This linking structure forms a directed, labeled graph.

• The graph view is the easiest possible mental model for RDF.
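
The graph view can be sketched in a few lines of Python. This is only an illustration: the triples below reuse the Peter/Paula example URIs that appear later in the slides, and the helper name build_graph is an assumption, not part of any RDF library.

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) triples.
triples = [
    ("http://example.com/Peter", "foaf:name", "Peter Smith"),
    ("http://example.com/Peter", "foaf:knows", "http://example.com/People/Paula"),
    ("http://example.com/People/Paula", "foaf:name", "Paula Jones"),
]

def build_graph(triples):
    """Map each subject node to its outgoing (edge label, target node) pairs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

graph = build_graph(triples)
for label, target in graph["http://example.com/Peter"]:
    print(f"Peter --{label}--> {target}")
```

Subjects and objects become nodes and each predicate becomes the label of a directed edge, which is exactly the directed, labeled graph described above.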

Advantages of RDF

• RDF offers a standardized and flexible framework for publishing structured data on the Web such that:
• (1) data can be linked, incorporated, extended, and reused by other RDF data across the Web;
• (2) heterogeneous data from independent sources can be automatically integrated by software agents;
• (3) the meaning of data can be well defined using ontologies.

Web of Data

• Today, most Web sites are generated from structured data that is stored in relational databases.
• The main benefit of using an ontology is that the corresponding data are clean and well structured.
• A lot of Web sites embed structured data into HTML pages.
• Google, Yahoo!, and Microsoft have jointly agreed on a set of vocabularies for describing over 200 different types of entities.
• Question: “How can we embed structured data into HTML pages and link them to each other?”

Topology of the Web of Data

• Microformats: a technique for marking up structured data about specific types of entities.

• RDFa: W3C started in 2004 to standardize RDFa as an alternative.

• Microdata: an alternative proposal for embedding structured data into Web pages, initially presented as part of the HTML5 standardization effort in 2009.

• Linked Data: the term Linked Data refers to a set of best practices for publishing structured data directly on the Web.

Microformats

• Designed for humans first, machines second.
• Microformats require the development of specialized parsers for each format.
• Microformats are used to address specific use cases.
• Microformats consist of a definition of a vocabulary (names for classes and properties), as well as a set of rules (e.g., required properties, correct nesting of elements).
• HTML/XHTML attributes are used for inserting markup.
• The microformats community encourages mixing microformats and reusing existing formats when creating new ones.

Microformats Syntax

• The figure shows the Microformat representation of the example data for Peter Smith.
• The vcard is a root class name indicating the presence of an hCard.
• The properties are url (Peter’s home page) and fn (full name).
• The markup also states that Peter knows Paula, using the XFN relationship values met and acquaintance.

Microformats Deployment on the Web

• Yahoo! Search indexes semantic markup including hCard, hCalendar, hReview, hAtom, and XFN.
• Google parses the hCard, hReview, and hProduct microformats and uses them to populate search result pages.
• Facebook publishes event pages annotated with hCalendar.
• Yelp.com adds hReview and hCard to all of its listings.
• Wikipedia templates are able to automatically generate microformats such as geo, hCard, and hCalendar markup.

RDFa Syntax

• RDFa allows one to embed RDF triples within the HTML document object model (DOM).
• The RDFa syntax specifies how HTML elements may be annotated with entity identifiers, entity types, string properties, and relationship properties.
• The HTML attribute @about indicates that the markup describes the entity identified by the URI reference http://example.com/Peter.
• The HTML attribute @rel specifies a relationship property between the HTML element and the target URL.
• The property foaf:knows is used to state that Peter knows Paula.
• For string properties, the attribute @property (here foaf:name) is used to express Peter’s name.
• A central idea of RDFa is the support for multiple, decentralized, independent, extensible vocabularies, in contrast to the community-driven centralized management of microformats.

Microdata

• Microdata is an attempt to provide a simpler alternative to RDFa and Microformats.
• It defines five new HTML attributes (as compared to zero for Microformats and eight for RDFa).
• It provides a unified syntax (in contrast to Microformats).
• It allows for the usage of any vocabularies (similarly to RDFa).
• W3C currently has two draft specifications (Microdata and RDFa) with the same objective.

Microdata Syntax

• Microdata consists of a group of name–value pairs.
• The groups are called items, and each name–value pair is a property.
• In order to mark up an item, the itemscope attribute is applied to an HTML element.
• To add a property to an item, the itemprop attribute is used.
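
To make the item/property structure concrete, the following sketch reads the name–value pairs of one item out of a small HTML fragment using Python's standard html.parser module. The page fragment, the use of the schema.org Person type, and the class name MicrodataParser are assumptions made for illustration; the parser handles only a single, non-nested item and is not a complete Microdata implementation.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; vocabulary URL and values are illustrative only.
PAGE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Peter Smith</span>
  <a itemprop="url" href="http://example.com/Peter">home page</a>
</div>
"""

class MicrodataParser(HTMLParser):
    """Collects the itemprop name-value pairs of a single, non-nested item."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._pending = None  # itemprop name whose value is the element text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if "href" in attrs:
            # Link elements carry their property value in href.
            self.properties[name] = attrs["href"]
        else:
            # Otherwise the value is the text content that follows.
            self._pending = name

    def handle_data(self, data):
        if self._pending and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None

parser = MicrodataParser()
parser.feed(PAGE)
print(parser.properties)
```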

Linked Data

• The term Linked Data refers to a set of best practices for publishing structured data directly on the Web.
• Linked Data uses hyperlinks to connect disparate data into a single global data space.
• A Linked Data application that has looked up a URI and retrieved RDF data can discover further data by following links.
• In a Linked Data context, if an RDF link connects URIs in different namespaces, it ultimately connects resources in different datasets.

Linked Data Principles

1. Use HTTP URIs as names for things.

2. When someone looks up a URI, provide useful information, using recommended standards (RDF, SPARQL).

3. Include links to other URIs, so that they can discover more things.

4. Whenever a Linked Data client looks up an HTTP URI over the HTTP protocol, the corresponding Web server returns an RDF description of the identified object using the RDF/XML syntax.

Linked Data (RDF/XML) Syntax

• FOAF is a vocabulary for describing people.
• The URI http://example.com/Peter is of type foaf:Person.
• foaf:name states that this thing has the name Peter Smith.
• foaf:knows states that Peter Smith knows Paula Jones, who is identified by the URI reference http://example.com/People/Paula.

Evaluation Data for Search Engines

• A number of publicly available evaluation datasets have been crawled from the Web and can be used for evaluating semantic search applications:
• ClueWeb09
• TREC Entity
• CommonCrawl
• WebDataCommons
• Sindice
• Billion Triple Challenge
• SemSearch

• Alternatively, to obtain Web data, use publicly available software for crawling the Web, such as Nutch for crawling Web pages and LDSpider for crawling Linked Data.

Challenges of the “Web of Data”

• Applications that want to exploit the Web of Data are facing two main challenges today:
• Semantic heterogeneity. The different techniques that are used to publish data on the Web lead to a certain degree of syntactic heterogeneity.
• Data quality. The Web is an open medium and everybody can publish data on the Web. Thus, the Web will always contain data that is outdated, conflicting, or intentionally wrong (spam).

Storing and Indexing Structured Data

Perspectives on the storage and indexing of RDF datasets

• The relational perspective: an RDF graph is just a particular type of relational data, so techniques developed for storing, indexing, and answering queries on relational data can be reused.

• The entity perspective: resources in the RDF graph are interpreted as “objects” or “entities”. Each entity is determined by a set of attribute–value pairs.

• The graph-based perspective: the focus is on supporting navigation in the RDF graph when viewed as a classical graph in which subjects and objects form the nodes, and triples specify directed, labeled edges.

Storing and Indexing Under the Relational Perspective

There are two different approaches for storing RDF data in relational databases.

• The vertical representation stores all triples in an RDF graph as a single table over the relation schema (subject, predicate, object). Due to the large size of RDF graphs and the potentially large number of self-joins required to answer queries, this layout depends on careful physical design and indexing.

• The horizontal representation interprets triple predicate values as column names, and stores RDF graphs in one or more wide tables.

Horizontal Representation

• RDF data are conceptually stored in a single table.
• The table has one column for each predicate value that occurs in the RDF graph and one row for each subject value.
• For each (s, p, o) triple, the object o is placed in the p column of row s.
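
The pivot from the vertical triple table to the horizontal wide table can be sketched as follows. The data and the function name to_horizontal are hypothetical, and the sketch assumes at most one object per (subject, predicate) pair.

```python
# Hypothetical triples, stored vertically as (subject, predicate, object) rows.
vertical = [
    ("Peter", "name", "Peter Smith"),
    ("Peter", "knows", "Paula"),
    ("Paula", "name", "Paula Jones"),
]

def to_horizontal(triples):
    """Pivot triples into one wide row per subject, one column per predicate."""
    columns = sorted({p for _, p, _ in triples})
    rows = {}
    for s, p, o in triples:
        # Assumption: at most one object per (subject, predicate) pair;
        # multi-valued predicates would need a list or a separate table.
        rows.setdefault(s, dict.fromkeys(columns))[p] = o
    return columns, rows

columns, rows = to_horizontal(vertical)
print(["subject"] + columns)
for subject, row in rows.items():
    print([subject] + [row[c] for c in columns])
```

Cells for predicates a subject does not use stay empty (None here), which hints at why wide tables tend to be sparse and why adding a new predicate means adding a column.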

Disadvantages and Advantages

• There is a weakness when answering queries that do not specify the predicate value.
• The relational schema must be changed whenever a new predicate value is added to the RDF graph.

• On the positive side, the horizontal representation makes it easy to support typing of object values.
• It is easy to integrate existing relational data with RDF data.

Storing and Indexing Under the Entity Perspective

• Resources in the RDF graph are interpreted as “objects” or “entities”.
• Each entity is determined by a set of attribute–value pairs.
• This perspective makes heavy use of the inverted index data structure.
• Typically, the following two general types of queries are to be supported:
• Simple keyword queries: a keyword query returns all entities that contain an attribute, relationship, and/or value relevant to a given keyword.
• Conditional entity-centric queries: a conditional entity-centric query returns all known entities that satisfy some given conditions on a combination of attributes, relationships, and values at the same time.
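
The two query types can be illustrated with a toy inverted index. Everything here (the entity data and the function names build_index, keyword_query, and conditional_query) is a hypothetical sketch, not the design of any particular system.

```python
from collections import defaultdict

# Hypothetical entities described by attribute-value pairs.
entities = {
    "e1": {"name": "Peter Smith", "city": "London"},
    "e2": {"name": "Paula Jones", "city": "London"},
    "e3": {"name": "Peter Jones", "city": "Paris"},
}

def build_index(entities):
    """Inverted index: lowercase term -> set of (entity id, attribute)."""
    index = defaultdict(set)
    for entity_id, attributes in entities.items():
        for attribute, value in attributes.items():
            for term in value.lower().split():
                index[term].add((entity_id, attribute))
    return index

def keyword_query(index, keyword):
    """Simple keyword query: entities mentioning the keyword in any attribute."""
    return {entity_id for entity_id, _ in index.get(keyword.lower(), set())}

def conditional_query(index, conditions):
    """Conditional query: entities satisfying every (attribute, keyword) pair."""
    result = None
    for attribute, keyword in conditions:
        hits = {e for e, a in index.get(keyword.lower(), set()) if a == attribute}
        result = hits if result is None else result & hits
    return result or set()

index = build_index(entities)
print(keyword_query(index, "Peter"))
print(conditional_query(index, [("name", "jones"), ("city", "london")]))
```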

Storing and Indexing Under the Graph-Based Perspective

• The focus is on supporting navigation in the RDF graph, in which subjects and objects form the nodes, and predicates specify directed, labeled edges.
• Typical query patterns are graph-theoretic queries such as reachability between nodes.
• The major issue under this perspective is how to explicitly and efficiently store and index the implicit graph structure.
• A structural index is used to obtain a reduced version of this graph where certain nodes have been merged while maintaining all edges.
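
A reachability query over the graph view can be answered, in the simplest case, by a breadth-first search without any index. The triples and the function name reachable below are hypothetical, and edge labels are ignored for simplicity.

```python
from collections import defaultdict, deque

# Hypothetical triples; subjects and objects are nodes, predicates label edges.
triples = [
    ("Peter", "knows", "Paula"),
    ("Paula", "worksFor", "Acme"),
    ("Acme", "locatedIn", "London"),
    ("Mary", "knows", "Peter"),
]

def reachable(triples, source, target):
    """Breadth-first search along directed edges, ignoring edge labels."""
    successors = defaultdict(set)
    for s, _, o in triples:
        successors[s].add(o)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(reachable(triples, "Peter", "London"))
print(reachable(triples, "London", "Peter"))
```

A structural index aims to avoid this kind of full traversal on large graphs by running the search over a smaller, merged version of the graph.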

Further Index Research

• A major open issue is the incorporation of schema and ontology reasoning (e.g., RDFS and OWL) in storage and indexing.
• Little work exists on the impact of reasoning on disk-based data structures.
• Efficient maintenance of storage and indexing structures as datasets evolve.
• In the entity perspective, investigation of support for richer query languages and integration with techniques from the other two perspectives.
• Study of richer structural indexing techniques and related query processing strategies.

Semantic Wiki

• Semantic wikis are wikis that add machine-processable annotations to wiki pages.
• Annotations exist for data items, most frequently wiki pages and tags, but also smaller portions of text.
• The annotations may be freely chosen tags; more formal mechanisms such as RDF backed by (imported) RDFS or OWL ontologies are offered as well.
• The annotations may be used for several purposes: consistency checking, improved navigation, search, querying, personalization, context-dependent presentation, and reasoning.

Semantic Wiki Queries

• Annotations are often represented in RDF, so they can be queried with SPARQL.
• Semantic wikis usually provide simple full-text search for querying textual content or RDF literals.
• A standard RDF query language such as SPARQL or RDQL can often be used for querying the annotations.
• A number of semantic wikis also come with their own language for querying annotations (e.g., KWQL in KiWi).

DBpedia

• DBpedia is structured content extracted from Wikipedia.
• This structured information is made available on the World Wide Web.
• DBpedia allows users to semantically query relationships and properties of Wikipedia resources.
• DBpedia includes links to other related datasets.
• Complex queries can be posed to DBpedia through its SPARQL endpoint.

DBpedia SPARQL

• Suppose we are interested in knowing the movies in which Hugh Grant and Colin Firth starred together. We could ask DBpedia the following SPARQL query:

SELECT ?movie WHERE {
  ?movie <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> .
  ?movie <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Hugh_Grant> .
  ?movie <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Colin_Firth>
}
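
How such a query is evaluated can be sketched over a handful of in-memory triples. The film identifiers (ex:Film_A, ex:Film_B) are made up and do not reflect actual DBpedia contents, prefixed names stand in for the full URIs, and the helper subjects is hypothetical. A real SPARQL engine evaluates basic graph patterns in a far more general way.

```python
# Hypothetical in-memory triples standing in for DBpedia data.
TYPE, FILM, STARRING = "rdf:type", "dbo:Film", "dbo:starring"
triples = {
    ("ex:Film_A", TYPE, FILM),
    ("ex:Film_A", STARRING, "dbr:Hugh_Grant"),
    ("ex:Film_A", STARRING, "dbr:Colin_Firth"),
    ("ex:Film_B", TYPE, FILM),
    ("ex:Film_B", STARRING, "dbr:Hugh_Grant"),
}

def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the data."""
    return {s for s, p, o in triples if p == predicate and o == obj}

# Each triple pattern yields candidate bindings for ?movie; because the
# three patterns share that variable, the result is their intersection.
movies = (
    subjects(triples, TYPE, FILM)
    & subjects(triples, STARRING, "dbr:Hugh_Grant")
    & subjects(triples, STARRING, "dbr:Colin_Firth")
)
print(movies)
```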

Keyword-based Search Systems

Keyword-based search systems address the following key steps:

• Composing a valid semantic query: for a user it is difficult to master a query language (e.g., SPARQL) and to acquire sufficient knowledge about the ontology or the schema of the data source.

• Identifying (substructures holding) data matching the input keywords, by using an indexing system or a database engine. Indexing may be based on shortest paths to root nodes.

• Linking identified data (substructures) into solutions, since data is usually scattered across multiple places, e.g., in different tables or XML elements.

• Ranking solutions according to a relevance criterion (i.e., a suitable scoring function). A specific implementation of TF/IDF may be used for scoring keyword elements.

• Only the top-k solutions with the highest scores are returned to the user as query answers.
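
The ranking and top-k steps can be sketched as below. The candidate solutions are hypothetical bags of terms, and the scoring function is one common TF/IDF variant chosen for illustration; the systems discussed may use different formulas.

```python
import heapq
import math
from collections import Counter

# Hypothetical candidate solutions, each flattened to a bag of terms.
solutions = {
    "s1": "hugh grant film london".split(),
    "s2": "colin firth film".split(),
    "s3": "hugh grant colin firth film".split(),
}

def tf_idf_score(terms, keywords, solutions):
    """Sum of tf * idf over the query keywords (an assumed TF/IDF variant)."""
    counts = Counter(terms)
    n = len(solutions)
    score = 0.0
    for kw in keywords:
        df = sum(1 for t in solutions.values() if kw in t)
        if df:
            score += (counts[kw] / len(terms)) * math.log(1 + n / df)
    return score

def top_k(keywords, solutions, k):
    """Return the identifiers of the k highest-scoring solutions."""
    scored = ((tf_idf_score(t, keywords, solutions), sid) for sid, t in solutions.items())
    return [sid for _, sid in heapq.nlargest(k, scored)]

print(top_k(["grant", "firth"], solutions, k=2))
```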

QUICK creates semantic queries from keywords. Its interactive query construction interface consists of three parts:

• A search field (at the top),
• The construction pane showing query construction options (on the left),
• The query pane showing semantic queries (on the right).

Performance Measurements of Search

• Three measures have been proposed to evaluate performance:
• Exhaustivity measures the relevance of a solution in terms of the number of keywords it contains.
• Specificity measures the precision of a solution in terms of the number of keywords it contains with respect to other irrelevant terms occurring in the solution.
• Overlap measures the information content of a solution in terms of its intersection with other solutions.

• Clearly, the best ranking strategy balances exhaustivity and specificity while reducing overlap.
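
The slides define the three measures only informally. One possible way to turn them into numbers is sketched below. The concrete formulas (keyword coverage, keyword share of distinct terms, and Jaccard overlap) are assumptions chosen for illustration, not the definitions used in the literature.

```python
def exhaustivity(solution_terms, keywords):
    """Assumed form: fraction of query keywords that occur in the solution."""
    return len(set(keywords) & set(solution_terms)) / len(set(keywords))

def specificity(solution_terms, keywords):
    """Assumed form: fraction of the solution's distinct terms that are keywords."""
    terms = set(solution_terms)
    return len(terms & set(keywords)) / len(terms)

def overlap(solution_terms, other_terms):
    """Assumed form: Jaccard overlap between the term sets of two solutions."""
    a, b = set(solution_terms), set(other_terms)
    return len(a & b) / len(a | b)

# Hypothetical query and solutions.
keywords = ["grant", "firth"]
s1 = ["hugh", "grant", "film", "london"]
s3 = ["hugh", "grant", "colin", "firth"]
print(exhaustivity(s1, keywords), specificity(s1, keywords), overlap(s1, s3))
```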

Semantic Web Search Engines

• Hidden Web / Deep Web Approaches
• RDF-Centric Search Engines
• Distributed Web Search Architectures

Hidden Web / Deep Web Approaches

• A vast amount of the information available on the Web is hidden behind sites with heavy dynamic content, usually backed by relational databases.
• Manually constructed, site-specific wrappers are used to extract structured data from HTML pages or to communicate directly with the underlying database of such sites.
• Automatic crawlers exist; however, this approach is “task specific” and not appropriate for general crawling.
• The Semantic Web may represent a future direction for bringing Deep Web information to the surface by using RDF as a common and flexible data model.

RDF-Centric Search Engines

• Early prototypes are Ontobroker and SHOE, which use the concepts of ontologies and semantics on the Web.
• Swoogle offers search over RDF documents by means of an inverted keyword index and a relational database.
• Watson also provides keyword search facilities over Semantic Web documents but additionally provides search over entities.
• Sindice is a registry and lookup service for RDF files based on Lucene and a MapReduce framework.
• The Falcons search engine offers entity-centric searching for entities (and concepts) over RDF data. It ranks entities using a logarithm of the count of documents in which they are mentioned.
• The GoWeb system demonstrates the benefit of searching structured data for the biomedical domain.

Distributed Web Search Architectures

• Distributed architectures have long been common in traditional Web search engines.

• The system architecture includes an incremental crawler, ranker and storage manager, indexer, and query processor.

• Some systems use a distributed inverted index (based on an embedded database system) over a large corpus of Web pages, for subsequent analysis and query processing.

Semantic Web Search Engine (SWSE) System Architecture

SWSE

• SWSE consists of crawling, data enhancing, indexing, and a user interface for search, browsing, and retrieval of information; it operates over RDF Web data (Linked Data).
• SWSE allows users to specify keyword queries in an input box and responds with a ranked list of result snippets.
• The results refer to entities, not documents (entity search over instance data).
• Users can subsequently navigate to related entities, thereby browsing the Web of Data.

SWSE Preprocessing

• The crawler accepts a set of seed URIs and retrieves a large set of RDF data from the Web.
• The consolidation component tries to find synonymous (i.e., equivalent) identifiers in the data, and combines the data according to the equivalences found.
• The ranking component performs links-based analysis over the crawled data and derives scores indicating the importance of individual elements in the data (PageRank).
• The reasoning component produces new data which is implied by the inherent semantics of the input data.
• The indexing component prepares an index which supports the information retrieval tasks required by the user interface (inverted index).
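
As an illustration of the consolidation step only, the sketch below merges equivalent identifiers with a small union-find structure and rewrites triples to use one canonical identifier per equivalence class. The identifiers, the equivalence pairs, and the choice of the lexicographically smallest identifier as representative are all assumptions; this is not a description of how SWSE itself implements consolidation.

```python
# Hypothetical equivalence statements (e.g., owl:sameAs pairs) found in a crawl.
same_as = [("ex:Bill", "dbr:Bill_Clinton"), ("dbr:Bill_Clinton", "fb:clinton")]
triples = [
    ("ex:Bill", "foaf:name", "Bill Clinton"),
    ("fb:clinton", "ex:bornIn", "ex:Hope"),
]

def canonical_ids(pairs):
    """Union-find: map every identifier to one representative per class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the lexicographically smaller root is the representative.
            low, high = sorted((root_a, root_b))
            parent[high] = low
    return {x: find(x) for x in parent}

def consolidate(triples, pairs):
    """Rewrite subjects and objects to their canonical identifiers."""
    canon = canonical_ids(pairs)
    return [(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples]

print(consolidate(triples, same_as))
```

After consolidation, statements that were made about different identifiers of the same entity end up attached to a single node.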

SWSE Query Processing

• With the distributed index built and prepared on the slave machines, the query processor is able to accept user queries.

• For a top-k keyword query, the coordinating machine requests k result identifiers and ranks from each of the slave machines.

• The coordinating machine then computes the aggregated top-k hits.

• To provide the raw data required, the master machine directly requests data from the respective slave machine (focus view).
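
The top-k aggregation can be sketched as follows, under the assumption that each result is indexed on exactly one machine. The per-machine data and the function names local_top_k and aggregate_top_k are hypothetical.

```python
import heapq

# Hypothetical per-machine indexes: result identifier -> rank score.
slave_indexes = [
    {"e1": 0.9, "e4": 0.2, "e7": 0.5},
    {"e2": 0.8, "e5": 0.7},
    {"e3": 0.95, "e6": 0.1},
]

def local_top_k(index, k):
    """What each slave machine returns: its k best (score, id) pairs."""
    return heapq.nlargest(k, ((score, rid) for rid, score in index.items()))

def aggregate_top_k(slave_indexes, k):
    """Coordinator: merge the local top-k lists and keep the global top k."""
    candidates = [hit for index in slave_indexes for hit in local_top_k(index, k)]
    return [rid for _, rid in heapq.nlargest(k, candidates)]

print(aggregate_top_k(slave_indexes, k=3))
```

Requesting k results from every machine is sufficient under that assumption, because any result in the global top k must also be in the top k of the machine that holds it.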

[Figures: results view for keyword query “bill Clinton”; focus view for entity “Bill Clinton”]

SWSE Search

Watson (http://watson.kmi.open.ac.uk/WatsonWUI/)

A Recommender System for Linked Data: MORE (MORE than Movie Recommendation)

• A system is needed to recommend items based on user preferences.

• The system should allow an easy and friendly exploration of the information/data related to a particular domain of interest.

• New challenges arise with the huge amount of interlinked data coming from the semantic Web.

Semantic Vector Space Model (MORE)

• In the vector space model (VSM), weights are assigned to index terms in queries and in documents (sets of terms).
• The weights are used to compute the degree of similarity between each document in the collection and the query.
• The whole RDF graph may be represented as a three-dimensional tensor where each two-dimensional slice refers to an ontology property.
• Given a property, each movie is seen as a vector whose components are TF-IDF weights (resource frequency-inverse movie frequency).
• For a particular property, the degree of similarity between two movies is represented by the correlation between the two vectors.
• To obtain the global correlation between two movies, a weighted sum of the per-property correlations is calculated.
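
A minimal sketch of this computation follows. The movie data, the property weights, and the use of cosine similarity as the per-property "correlation" are assumptions for illustration; the exact TF-IDF formula and similarity measure used in MORE may differ.

```python
import math

# Hypothetical movie descriptions: property -> set of linked resources.
movies = {
    "m1": {"starring": {"a1", "a2"}, "director": {"d1"}},
    "m2": {"starring": {"a1", "a3"}, "director": {"d1"}},
    "m3": {"starring": {"a3"}, "director": {"d2"}},
}
# Hypothetical user-chosen importance weights per property.
weights = {"starring": 0.7, "director": 0.3}

def tf_idf_vector(movie, prop):
    """TF-IDF weights of the resources linked to a movie via one property."""
    n = len(movies)
    vector = {}
    for resource in movies[movie][prop]:
        df = sum(1 for m in movies.values() if resource in m[prop])
        vector[resource] = 1.0 * math.log(n / df)  # tf is 1 for a present resource
    return vector

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[r] * v[r] for r in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similarity(a, b):
    """Weighted sum of the per-property cosine similarities."""
    return sum(w * cosine(tf_idf_vector(a, p), tf_idf_vector(b, p)) for p, w in weights.items())

print(round(similarity("m1", "m2"), 3), round(similarity("m1", "m3"), 3))
```

Each property corresponds to one slice of the tensor; changing the entries of weights changes how much each slice contributes to the overall similarity.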

[Figure: tensor representation of the RDF graph]

Importance weights of the properties

The properties involved in the similarity detection process do not have the same importance. Each property can have a different importance for the user, which can be specified through a weight in MORE.

Sample of RDF graph related to the movie domain

The figure shows a sketch of the RDF graph on movies. It contains 2 movies, 3 actors, 2 directors, 3 categories, 1 genre, and 5 different predicates.

Exploratory Search Applications

• They are designed to satisfy the needs of users with specific aims.
• They support the publishing and integration of data sources for vertical domains.
• The user will be able to select sources based on individual or collective trust.
• Systems will be able to route queries to such sources and to provide easy-to-use interfaces for combining them within search strategies.

Deployment Architecture

• The deployment of exploratory Web applications integrating data sources requires a number of software components and sophisticated interactions between them:
• The processing modules are in charge of invoking services that query the data sources.
• The execution engine is a data- and control-driven query engine specifically designed to handle multidomain queries.
• The control layer is the controller of the architecture; it is designed to handle several system interactions.
• The repository contains the set of components and data storages used by the system.

Exploratory Search Applications Examples

• Night Planner
• Weekend Browser
• Real-Estate Browser
• Job-House Combination Browser

Night Planner

• A night planner is a short-term Web application presenting several geolocalized services, describing restaurants, shows, movies, family events, music concerts, and the like.
• Selected restaurants are ranked by distance from the user and possibly by their score.

Weekend Browser

• A weekend browser is a short-term Web application presenting to users the events which are occurring in one or more selected cities of interest.
• Once the user is considering a particular location, she/he is offered additional services for completing the weekend plan.

Real-Estate Browser

• A real-estate browser is a long-lived, hierarchical application.
• It is centered around real-estate offers.
• A user may select some house offers and evaluate them according to some search dimensions (e.g., distance from work, school).
• The designer may simplify the interaction by combining several services into one query (e.g., walkability and vicinity to markets and parks).

Job-House Combination Browser

• A job-house combination browser is a long-lived, hierarchical application with two hierarchical roots, one centered on work offers and one on house offers.

• The application is designed for applicants to PhD programs, where openings are linked to doctoral schools, then to their professors, then to their research programs, and to on-campus housing.

References

• Semantic Search over the Web, Roberto De Virgilio, Francesco Guerra, Yannis Velegrakis, Springer, 2012.
• https://rdfa.info/
• http://microformats.org/wiki/Main_Page
• https://schema.org/docs/gs.html
• http://wiki.dbpedia.org/