15/05/2022 Mondeca’s approach to enriching search engines using business knowledge Mondeca [email protected]

Enriching search engines - thomas francart - english version


DESCRIPTION

Thomas Francart/Mondeca - How to add value to search engines using semantic information.



08/04/2023

Mondeca’s approach to enriching search engines using business knowledge

Mondeca [email protected]


The intersection points of several domains

[Diagram: three overlapping domains (Content, Search, Knowledge) and their intersections]
• Knowledge-based enhanced search
• Content semantic annotation
• Smart structure-based indexing of content


LERUDI use case

[Architecture diagram; labels: Content, Knowledge, CMS, WCM, AFSITM, SDK, Search]


What knowledge are we talking about?

• Internal business/reference vocabularies:
– Thesauri (multilingual)
– Dictionaries
– Named entities lists
– Classification rules
– Thesaurus alignments
– …
• Structured data - Always
• Linked Data:
– E.g.: GEMET thesaurus, subset of DBPedia named entities, etc.


At which level do we bring value?

• At 2 different levels:
– when indexing content
  • via index enrichment
– when users perform search
  • by assisting them in the query (re)formulation
• The preferred / most useful technique is to enrich content during the indexing phase
– but this implies that content must be reindexed every time business knowledge evolves or changes


The search engine we used to demonstrate this

• Lucene SolR:
– Open-source
– Has advanced plain text search capabilities
– Allows faceted search
– Offers a highlight feature
– Has spellchecker capabilities
– Includes a « More Like This » (find related content) feature
– Is UIMA compliant
– … full feature list available at: http://lucene.apache.org/solr/features.html
• Principles discussed in the next slides may be applied to other search engines


SolR explorer : a test interface

• SolR returns an XML feed in response to an HTTP request
– http://localhost:8080/solr/select?q=lac&start=0&rows=10
• SolR explorer:
– A web interface to visualize / navigate / test the returned XML feed
– Definitely not meant for end users!
– https://issues.apache.org/jira/browse/SOLR-1163
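To make the request/response cycle concrete, here is a minimal Python sketch that builds a select URL with Solr's standard `q`, `start` and `rows` parameters and reads the documents out of an XML response. The base URL, the sample response and its field names (`id`, `name`) are assumptions made for illustration; the parser only handles flat, single-valued fields.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_select_url(base_url, query, start=0, rows=10):
    """Build a Solr select URL (base_url is an assumed local instance)."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    return f"{base_url}/select?{params}"


def parse_docs(xml_text):
    """Return (numFound, docs) from a Solr XML response with flat fields."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        # Each child (<str>, <int>, ...) carries the field name in its "name" attribute.
        docs.append({field.get("name"): field.text for field in doc})
    return int(result.get("numFound")), docs


# Hypothetical response, shaped like Solr's XML writer output.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">hotel-42</str>
      <str name="name">Hôtel du Lac</str>
    </doc>
  </result>
</response>"""

if __name__ == "__main__":
    print(build_select_url("http://localhost:8080/solr", "lac"))
    print(parse_docs(SAMPLE_RESPONSE))
```

In a real setup the XML would come from an HTTP call to the URL; it is kept as a string here so that the sketch runs without a Solr instance.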


The data set

• Structured catalogue of an e-tourism portal
– Hotels
– Restaurants
– Activities
– Contacts
– Etc.
• Each resource is linked to a web site


Starting point: simple web indexing, without enrichment


Plan

1 - Synonyms
2 - Translations
3 - Specific terms
4 - Spelling mistakes
5 - Facets
6 - Dynamic classification
7 - Vocabulary alignments
8 - Disambiguation


1 : enrichment using synonyms

• Why?
– Increase recall, expand a request using similar terms
• How?
– By providing a list of equivalent terms to the search engine
– SolR configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- in this example, we will only use synonyms at index time -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <!-- ... -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <!-- ... -->
  </analyzer>
</fieldType>


Thomas Francart - Enrichissement des moteurs de recherche à partir de connaissances métier

Format of the synonyms file

• One line for each group of equivalent terms
• Option 1: use all equivalent terms
– If one term is found, all equivalent terms are added to the index

déréglementation,libéralisation,dérégulation
croisière,croisière de plaisance,croisière maritime
spectacle,attraction,show
vacances familiales,tourisme familial
pêche,pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique
office de tourisme,otsi,office municipal de tourisme,syndicat d'initiative

• Option 2: use controlled term only
– If one of the terms is found, only the controlled term is added to the index

libéralisation,dérégulation => déréglementation
croisière de plaisance,croisière maritime => croisière
attraction,show => spectacle
tourisme familial => vacances familiales
pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique => pêche
otsi,office municipal de tourisme,syndicat d'initiative => office de tourisme
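The relation between the two options can be sketched in a few lines of Python: an option 1 line is rewritten as an option 2 rule. This assumes that the first term of each line is the controlled term, which matches the examples here but is not stated as a rule; the function name is hypothetical.

```python
def to_controlled_rule(line):
    """Turn 'controlled,variant1,variant2' into 'variant1,variant2 => controlled'.

    Assumption: the first term of the line is the controlled term.
    """
    terms = [term.strip() for term in line.split(",") if term.strip()]
    if len(terms) < 2:
        return None  # a single term has nothing to map
    controlled, variants = terms[0], terms[1:]
    return f"{','.join(variants)} => {controlled}"


if __name__ == "__main__":
    equivalents = [
        "croisière,croisière de plaisance,croisière maritime",
        "spectacle,attraction,show",
    ]
    for line in equivalents:
        print(to_controlled_rule(line))
```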


Generation of a synonyms file

• Generation of « synonyms.txt » from a SKOS file
– E.g.: using the World Tourism Organisation thesaurus

<skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
  <skos:altLabel xml:lang="en">long stays</skos:altLabel>
  <skos:altLabel xml:lang="fr">marché des vacances</skos:altLabel>
  <skos:altLabel xml:lang="fr">genre de vacances</skos:altLabel>
  <skos:altLabel xml:lang="fr">long séjour</skos:altLabel>
  <skos:altLabel xml:lang="en">holiday markets</skos:altLabel>
  <skos:altLabel xml:lang="en">vacations</skos:altLabel>
  <skos:altLabel xml:lang="es">mercado de vacaciones</skos:altLabel>
  <skos:altLabel xml:lang="fr">activité de vacances</skos:altLabel>
  <skos:altLabel xml:lang="fr">type de vacances</skos:altLabel>
  <skos:altLabel xml:lang="en">holiday tourism</skos:altLabel>
  <skos:altLabel xml:lang="fr">congés payés</skos:altLabel>
  <skos:altLabel xml:lang="es">estancia larga</skos:altLabel>
  <skos:altLabel xml:lang="fr">06.09</skos:altLabel>
  <skos:broader rdf:resource="http://thes.world-tourism.org#FLUX_TOURISTIQUE" />
  <skos:inScheme rdf:resource="http://thes.world-tourism.org#_06_FLUX_TOURISTIQUE" />
  <!-- … -->
  <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_D'HIVER" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_D'ETE" />
  <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
  <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
  <skos:prefLabel xml:lang="es">VACACIONES</skos:prefLabel>
</skos:Concept>

Activités nautiques
HOLIDAYS,VACANCES,long stays,marché des vacances,genre de vacances,long séjour,holiday markets,vacations,activité de vacances,type de vacances,holiday tourism,congés payés,06.09
KOREA DPR,COREE RDP,20.03.05.03
TOURISM IN NATIONAL ECONOMIES,TOURISME DANS L'ECONOMIE NATIONALE,04.04.04,place du tourisme dans l'économie
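The generation step is not shown in the slides; one possible way to do it, using only Python's standard library, is sketched below. It assumes an RDF/XML file with `skos:Concept` elements, emits preferred labels first and alternative labels after, and keeps only English and French labels (an assumption based on the sample output, which contains no Spanish terms). The function name and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET

SKOS = "{http://www.w3.org/2004/02/skos/core#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def synonym_lines(rdf_xml, languages=("en", "fr")):
    """Return one comma-separated line of labels per SKOS concept."""
    root = ET.fromstring(rdf_xml)
    lines = []
    for concept in root.iter(SKOS + "Concept"):
        labels = []
        # Preferred labels first, then alternative labels, in document order.
        for tag in ("prefLabel", "altLabel"):
            for label in concept.findall(SKOS + tag):
                if label.get(XML_LANG) in languages and label.text:
                    labels.append(label.text.strip())
        if len(labels) > 1:  # a single label has no synonym to offer
            lines.append(",".join(labels))
    return lines


# Shortened, hypothetical extract of a thesaurus file.
SAMPLE_SKOS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
    <skos:altLabel xml:lang="en">vacations</skos:altLabel>
    <skos:altLabel xml:lang="fr">congés payés</skos:altLabel>
    <skos:altLabel xml:lang="es">estancia larga</skos:altLabel>
    <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

if __name__ == "__main__":
    for line in synonym_lines(SAMPLE_SKOS):
        print(line)
```

Labels that themselves contain a comma would need escaping before being written to the file; the sketch does not handle that case.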


Result


Handle synonyms at index-time or query-time ?

• In most cases, it is recommended to handle synonyms at index-time
– A synonym composed of several words (e.g.: « nautical activities ») is tokenised at query time and will not be correctly identified
  • Even when using quotes?
– Query-time expansion impacts the search engine’s scoring algorithms (IDF)
– Prefix queries (« naut* ») or fuzzy queries (« activities~ ») are not analysed at query time and will not be extended to synonyms
• But:
– The index grows larger
– If synonyms change, reindexing must be done


To expand, or not to expand queries…?

• One possible solution to avoid inflating the index:
– Avoid expanding from a list of synonyms…

spectacle,attraction,show
pêche,pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique
office de tourisme,otsi,office municipal de tourisme,syndicat d'initiative

– …but rather restrict expansion to one controlled value…

attraction,show => spectacle
pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique => pêche
otsi,office municipal de tourisme,syndicat d'initiative => office de tourisme

– … which could be the URI of a concept

attraction,show,spectacle => http://thes.world-tourism.org#SPECTACLE
pêche, pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique => http://thes.world-tourism.org#PECHE
office de tourisme, otsi,office municipal de tourisme,syndicat d'initiative => http://thes.world-tourism.org#OFFICE_DE_TOURISME

• Advantages:
– Index size does not inflate
– No impact on scoring algorithms
• But it requires analysis when indexing and querying
• Does not solve the issue of synonyms composed of several words


Mixed approaches

• Use two synonym lists:
– One tailored for indexing
– Another one tailored for search expansion at query-time
• When new synonyms are needed:
– Add them to the synonym list tailored for search
  • They can be leveraged in real time, no need for reindexing
  • Does not solve the question of synonyms composed of several words
– Add them to the synonym list tailored for indexing too
  • They will be leveraged at the next indexing phase
• At the next indexing phase:
– Empty the synonyms list tailored for search
• Another mixed approach:
– Process all the synonyms of a given single word when searching
– Process all the synonyms composed of several words at indexing phase
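The second mixed approach can be sketched as a small pre-processing step that splits one synonym file into a query-time list and an index-time list. The rule used here (a group goes to the index-time list as soon as one of its terms contains a space) is an assumption chosen for illustration, and the function name is hypothetical.

```python
def split_synonym_groups(lines):
    """Split synonym lines into (query_time, index_time) lists.

    Assumption: a group containing at least one multi-word term is
    handled at index time; groups of single words go to query time.
    """
    query_time, index_time = [], []
    for line in lines:
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if not terms:
            continue  # skip empty lines
        has_multiword = any(" " in term for term in terms)
        target = index_time if has_multiword else query_time
        target.append(",".join(terms))
    return query_time, index_time


if __name__ == "__main__":
    groups = [
        "spectacle,attraction,show",
        "croisière,croisière de plaisance,croisière maritime",
    ]
    query_list, index_list = split_synonym_groups(groups)
    print("query-time:", query_list)
    print("index-time:", index_list)
```

Each list would then be written to its own file and referenced from the corresponding `<analyzer>` section.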



2 : enrichment using translations

• Why?
– Add multilingual capabilities to the search engine / allow searching for content in a different language than the one used in the query
• Same methodology as for synonyms
– Translations are declared as equivalent synonyms
• Example
– Using the GEMET thesaurus (sustainable development)
– Can be downloaded in SKOS at http://www.eionet.europa.eu/gemet

…
achat,purchase,compra
mosaïque,mosaic,mosaico
station de montagne,mountain resort,centro turístico de montaña
…

<rdf:Description rdf:about="concept/10910">
  <skos:prefLabel xml:lang="fr">station de montagne</skos:prefLabel>
  <skos:prefLabel xml:lang="en">mountain resort</skos:prefLabel>
  <skos:prefLabel xml:lang="es">centro turistico de montana</skos:prefLabel>
</rdf:Description>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="gemet.txt" ignoreCase="true" expand="true" />
  </analyzer>
  <!-- … -->
</fieldType>


Result


!?

• Why would a search using « mosaic » match « poterie » and « vitrail »?
• In GEMET, the only information available is:

…
achat,purchase,compra
mosaïque,mosaic,mosaico
station de montagne,mountain resort,centro turístico de montaña
…

• BUT, in WTO, we also find the following information:

ARTISANAT,vitrail,orfèvrerie,mécanique,dentelle,plomberie,tapisserie,ébénisterie,mosaïque,modélisme,tissage,porcelaine,crafts,artisanat d'art,menuiserie,cristallerie,joaillerie,émaux,peinture sur soie,poterie

• As we are using GEMET and WTO dictionaries of synonyms, the result when indexing is:
– « Poterie » → « mosaïque » → « mosaic »
• We are exploiting both WTO synonyms and translations from GEMET
– Beware of any unwanted interactions!
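The interaction can be reproduced outside of SolR with a small simulation of two chained expansion steps. This is not SolR's filter implementation, only a simplified single-token model of `expand="true"`; the WTO line is shortened and the function name is hypothetical.

```python
def expand(tokens, synonym_lines):
    """Add, for every token, all terms of the synonym groups it belongs to."""
    groups = [[term.strip().lower() for term in line.split(",")] for line in synonym_lines]
    expanded = list(tokens)
    for token in tokens:
        for group in groups:
            if token.lower() in group:
                new_terms = [term for term in group if term not in expanded]
                expanded.extend(new_terms)
    return expanded


# Shortened extracts of the two dictionaries.
WTO = ["ARTISANAT,vitrail,mosaïque,crafts,poterie"]
GEMET = ["mosaïque,mosaic,mosaico"]

if __name__ == "__main__":
    after_wto = expand(["poterie"], WTO)
    after_gemet = expand(after_wto, GEMET)
    # A document that only mentions "poterie" ends up indexed under "mosaic".
    print(after_gemet)
```

Applying GEMET alone to the same document would leave « poterie » untouched, which is why the effect only shows up when both files are chained.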



3 : enrichment using specific terms

• Why?
– To increase recall. Allows searching on generic notions
• The GEMET and WTO thesauri rely on a hierarchy of terms
– Loisirs > loisirs de plein air > randonnée > randonnées cycliste
– Loisirs > sorties > spectacle > cirque
• A search on « sorties » should find documents containing « spectacle » or « cirque »
– A search on « Loisirs » (leisure) should find documents containing « randonnée » (trek) or « spectacle » (show)
– Etc.


Generation of the specific terms file

• How?
– Same methodology as for the synonyms
– Translation of specific terms is performed when indexing, so as to translate a specific term into all of its corresponding generic terms
  • If done at search time, we would translate from generic to specific
• If « peinture » (painting) is in the text, then we must add « loisirs culturels » and « loisirs », which are the generic terms of that specific one

<skos:Concept rdf:about="http://thes.world-tourism.org#LOISIRS">
  <skos:narrower rdf:resource="http://thes.world-tourism.org#LOISIRS_DE_PLEIN_AIR" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#SORTIE" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#LOISIRS_D'INTERIEUR" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#OISIVETE" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#LOISIRS_CULTURELS" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#JEU" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#ARTISANAT" />
  <skos:prefLabel xml:lang="fr">LOISIRS</skos:prefLabel>
</skos:Concept>
<skos:Concept rdf:about="http://thes.world-tourism.org#LOISIRS_CULTURELS">
  <skos:altLabel xml:lang="fr">loisirs artistiques</skos:altLabel>
  <skos:broader rdf:resource="http://thes.world-tourism.org#LOISIRS" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#PEINTURE" />
  <skos:prefLabel xml:lang="fr">LOISIRS CULTURELS</skos:prefLabel>
</skos:Concept>
<skos:Concept rdf:about="http://thes.world-tourism.org#PEINTURE">
  <skos:altLabel xml:lang="fr">09.03.07</skos:altLabel>
  <skos:broader rdf:resource="http://thes.world-tourism.org#LOISIRS_CULTURELS" />
  <skos:prefLabel xml:lang="fr">PEINTURE</skos:prefLabel>
</skos:Concept>

RESEAU => TRAFIC,TRANSPORT
PEINTURE => LOISIRS CULTURELS,LOISIRS
FETE => MANIFESTATION CULTURELLE,MANIFESTATION TOURISTIQUE
TRANSPORT FLUVIAL => MODE DE TRANSPORT,TRANSPORT
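As an illustration of how such a file might be produced, the sketch below walks the `skos:broader` links of each concept up to the top of the hierarchy and writes one `specific => generic,…` line per concept. It relies on Python's standard library only; the function name, the restriction to French preferred labels and the shortened sample are assumptions.

```python
import xml.etree.ElementTree as ET

SKOS = "{http://www.w3.org/2004/02/skos/core#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def generic_term_lines(rdf_xml, lang="fr"):
    """Return 'SPECIFIC => GENERIC1,GENERIC2' lines from skos:broader links."""
    root = ET.fromstring(rdf_xml)
    labels, parents = {}, {}
    for concept in root.iter(SKOS + "Concept"):
        uri = concept.get(RDF + "about")
        for label in concept.findall(SKOS + "prefLabel"):
            if label.get(XML_LANG) == lang and label.text:
                labels[uri] = label.text.strip()
        parents[uri] = [b.get(RDF + "resource") for b in concept.findall(SKOS + "broader")]

    def ancestors(uri, seen):
        # Depth-first walk up the hierarchy; 'seen' also guards against cycles.
        for parent in parents.get(uri, []):
            if parent not in seen:
                seen.append(parent)
                ancestors(parent, seen)
        return seen

    lines = []
    for uri, label in labels.items():
        generic = [labels[a] for a in ancestors(uri, []) if a in labels]
        if generic:
            lines.append(f"{label} => {','.join(generic)}")
    return lines


# Shortened, hypothetical extract of the thesaurus.
SAMPLE_HIERARCHY = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://thes.world-tourism.org#LOISIRS">
    <skos:prefLabel xml:lang="fr">LOISIRS</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://thes.world-tourism.org#LOISIRS_CULTURELS">
    <skos:broader rdf:resource="http://thes.world-tourism.org#LOISIRS" />
    <skos:prefLabel xml:lang="fr">LOISIRS CULTURELS</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://thes.world-tourism.org#PEINTURE">
    <skos:broader rdf:resource="http://thes.world-tourism.org#LOISIRS_CULTURELS" />
    <skos:prefLabel xml:lang="fr">PEINTURE</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

if __name__ == "__main__":
    for line in generic_term_lines(SAMPLE_HIERARCHY):
        print(line)
```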


Result



4 : spell checking

• Why?
– Provide users with similar terms for a wrong entry
  • e.g.: « retsaurant » → « Did you mean ‘restaurant’? »
• How? Here are the two ways to build smart spellchecking:
– By using the index as a dictionary
  • Spelling corrections are in fact existing entries in the index
    – Hence an almost 100% chance of finding results, except when spellchecked terms are combined with other terms from the query
  • But not all of the controlled/business terms are necessarily available for spell checking
    – If they do not exist in the indexed content
– By using a list of controlled terms
  • The suggested spelling corrections will not necessarily trigger results
    – There is no guarantee that any of the indexed documents contains the proposed terms
  • But all business terms are available for controlled searches
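The second option (suggesting from a list of controlled terms) can be illustrated without SolR. The sketch below uses Python's `difflib.get_close_matches`, which is a different similarity measure from the one SolR's spellchecker uses; the term list and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical authority list; in practice it would come from the thesaurus.
CONTROLLED_TERMS = ["restaurant", "hôtel", "camping", "randonnée"]


def suggest(entry, terms, cutoff=0.8):
    """Return the closest controlled term for a user entry, or None."""
    matches = difflib.get_close_matches(entry.lower(), terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    suggestion = suggest("retsaurant", CONTROLLED_TERMS)
    if suggestion:
        print(f"Did you mean '{suggestion}'?")
```

As noted above, a suggestion taken from the authority list says nothing about whether any indexed document actually contains the term.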


Spellchecking using an authority list

• SolR configuration: solrconfig.xml

<config>
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">name</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
    <lst name="spellchecker">
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="name">file</str>
      <str name="sourceLocation">spellcheck.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="accuracy">0.8</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.extendedResults">false</str>
      <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>
</config>


Generation of spellcheck.txt

• Generation of spellcheck.txt from the WTO SKOS

<skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
  <skos:altLabel xml:lang="en">long stays</skos:altLabel>
  <skos:altLabel xml:lang="fr">marché des vacances</skos:altLabel>
  <skos:altLabel xml:lang="fr">genre de vacances</skos:altLabel>
  <skos:altLabel xml:lang="fr">long séjour</skos:altLabel>
  <skos:altLabel xml:lang="en">holiday markets</skos:altLabel>
  <skos:altLabel xml:lang="en">vacations</skos:altLabel>
  <skos:altLabel xml:lang="es">mercado de vacaciones</skos:altLabel>
  <skos:altLabel xml:lang="fr">activité de vacances</skos:altLabel>
  <skos:altLabel xml:lang="fr">type de vacances</skos:altLabel>
  <skos:altLabel xml:lang="en">holiday tourism</skos:altLabel>
  <skos:altLabel xml:lang="fr">congés payés</skos:altLabel>
  <skos:altLabel xml:lang="es">estancia larga</skos:altLabel>
  <skos:altLabel xml:lang="fr">06.09</skos:altLabel>
  <skos:broader rdf:resource="http://thes.world-tourism.org#FLUX_TOURISTIQUE" />
  <skos:inScheme rdf:resource="http://thes.world-tourism.org#_06_FLUX_TOURISTIQUE" />
  <!-- … -->
  <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_D'HIVER" />
  <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_D'ETE" />
  <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
  <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
  <skos:prefLabel xml:lang="es">VACACIONES</skos:prefLabel>
</skos:Concept>

ORGANISMO DE CREDITO
14.11.02
Activités nautiques
HOLIDAYS
VACANCES
VACACIONES
marché des vacances
genre de vacances
long séjour
activité de vacances
type de vacances
congés payés
06.09
KOREA DPR
…


Result



5 : content semantic structuring

• « Smarter content = smarter index »
• It takes content semantic structuring to enhance the search experience
– Associate meaningful metadata to content
– Meaningful metadata bring unambiguous values from reference vocabularies (identification using URIs)
• Associating structured metadata to content enables faceted navigation
• This is a wide-ranging process which we will not describe in detail in this presentation
– E.g.: use of Text-Mining and/or integration middleware such as Mondeca’s CA Manager
  • SolR supports UIMA integration in its indexing chain to add text mining tools
– E.g.: manual tagging in the case of tourism catalogues


Structured catalogue in RDF


Index schema configuration

• 1 index field for each metadata
– In conf/schema.xml

<field name="Mot_Cle_103696" multiValued="true" type="string" indexed="true" stored="true" />
<field name="animaux_acceptes" multiValued="false" type="string" indexed="true" stored="true" />
<field name="bassin_touristique_at" multiValued="true" type="string" indexed="true" stored="true" />
<field name="bordereau_Tourinfrance_103952" multiValued="true" type="string" indexed="true" stored="true" />
<field name="commune_at" multiValued="true" type="string" indexed="true" stored="true" />
<field name="zone_geographique_at" multiValued="true" type="string" indexed="true" stored="true" />
<field name="paiement_accepte" multiValued="true" type="string" indexed="true" stored="true" />
<field name="label_at" multiValued="true" type="string" indexed="true" stored="true" />
<field name="langue_parlee" multiValued="true" type="string" indexed="true" stored="true" />
<field name="type_h" multiValued="true" type="string" indexed="true" stored="true" />
<field name="classement" multiValued="true" type="string" indexed="true" stored="true" />
<field name="tarif_nuit_mini" multiValued="true" type="string" indexed="true" stored="true" />
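Once these fields are indexed, facets are requested with SolR's `facet` and `facet.field` parameters. The short Python sketch below builds such a request URL; the base URL is an assumed local instance, and the choice of `type_h` and `classement` as facet fields is only an example.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_fields, rows=10):
    """Build a select URL asking SolR for facet counts on the given fields."""
    params = [("q", query), ("rows", rows), ("facet", "true")]
    # facet.field is repeated once per field, hence a list of pairs, not a dict.
    params += [("facet.field", name) for name in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_facet_url("http://localhost:8080/solr", "lac", ["type_h", "classement"]))
```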


Result: facets



6 : dynamic content classification

• Why?
– The classification plan used in the catalogue is not meant to be understood by end users
  • « objective » vs. « subjective » vision of the content
– There is a need to adapt the classification plan:
  • To different types of audiences
  • For different channels
– The same catalogue needs to be presented according to different perspectives
– To increase content repurposing
• Looking for a place to stay?
  • Simple?
  • Classic?
  • Elegant?


ITM-rules: rule creation


Rules definitions: format

• New hierarchical classifications are in SKOS
• A SPARQL classification rule (generated from ITM Rule Editor) is associated to each entry in the SKOS file

<skos:Concept rdf:about="itm:n#_migration_taxo_106544">
  <skos:prefLabel xml:lang="fr">Raffiné</skos:prefLabel>
  <skos:definition>
    PREFIX r: <itm:n#>
    PREFIX q: <http://www.nievre-tourisme.com/onto#>
    CONSTRUCT { ?SEARCHED_TOPIC <http://purl.org/dc/terms/subject> r:_migration_taxo_106544 . }
    WHERE { ?SEARCHED_TOPIC a q:Hebergement . ?SEARCHED_TOPIC q:classement q:class_CAT4 . }
  </skos:definition>
  <skos:definition>
    PREFIX j: <itm:n#>
    PREFIX i: <http://www.nievre-tourisme.com/onto#>
    CONSTRUCT { ?SEARCHED_TOPIC <http://purl.org/dc/terms/subject> j:_migration_taxo_106544 . }
    WHERE { ?SEARCHED_TOPIC a i:Hebergement . ?SEARCHED_TOPIC i:classement i:class_4EP . }
  </skos:definition>
</skos:Concept>
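To show what such a rule does, here is a toy emulation in plain Python of the first CONSTRUCT rule. It is not a SPARQL engine: triples are simple tuples, a rule is a list of (predicate, object) conditions that a subject must all satisfy, and the catalogue entries `hotel1` and `hotel2` (as well as `q:class_CAT2`) are hypothetical.

```python
RDF_TYPE = "rdf:type"
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def classify(triples, rule):
    """Return the (subject, dcterms:subject, category) triples produced by one rule."""
    facts_by_subject = {}
    for subject, predicate, obj in triples:
        facts_by_subject.setdefault(subject, set()).add((predicate, obj))
    return {
        (subject, DCTERMS_SUBJECT, rule["category"])
        for subject, facts in facts_by_subject.items()
        if all(condition in facts for condition in rule["conditions"])
    }


# Emulates: WHERE { ?x a q:Hebergement . ?x q:classement q:class_CAT4 . }
RULE_RAFFINE = {
    "category": "itm:n#_migration_taxo_106544",
    "conditions": [(RDF_TYPE, "q:Hebergement"), ("q:classement", "q:class_CAT4")],
}

# Hypothetical catalogue extract.
CATALOGUE = {
    ("hotel1", RDF_TYPE, "q:Hebergement"),
    ("hotel1", "q:classement", "q:class_CAT4"),
    ("hotel2", RDF_TYPE, "q:Hebergement"),
    ("hotel2", "q:classement", "q:class_CAT2"),
}

if __name__ == "__main__":
    for triple in classify(CATALOGUE, RULE_RAFFINE):
        print(triple)
```

In this toy data set only `hotel1` matches both conditions, so only `hotel1` receives the additional subject.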


Content Classifier : rules execution

[Diagram labels: Taxonomy (Classification Rules), SKOS + SPARQL; Terminology, SKOS + RDF; RDF Content Metadata; Classification engine; Classification Metadata]

Classification engine:
• Based on RDF triplestore
• Loads terminology and metadata
• Infers on terminology
  • OWL & SKOS inference
  • Custom rules
• Applies SPARQL classification rules
• Optionally, simplifies RDF structure

Example rules:
?x is a <Hotel> and price(?x) < 50
?x is a <Camping> and size(?x) > 300

Content classified with additional dcterms:subject and dc:subject properties


Catalogue classified with additional metadata


Additional index fields for the new classifications

• In conf/schema.xml

<field name="taxo_confort" multiValued="true" type="string" indexed="true" stored="true" />
<field name="taxo_generale" multiValued="true" type="string" indexed="true" stored="true" />


Dynamic classification: result




7 : using reference vocabulary alignments

• Why?
– What if content A is annotated using thesaurus A, and users want to search content using thesaurus B?
– Allows queries on a corpus annotated with a thesaurus different from the one used to control queries

[Diagram: thesaurus alignment between WTO and GEMET]


ITM-align: creation of alignments


Alignment formats

<map>
  <Cell rdf:about="150046">
    <entity1>
      <edoal:Class rdf:about="http://eurlex-directory-codes.europa.eu/0350" />
    </entity1>
    <entity2>
      <edoal:Class rdf:about="http://eurovoc.europa.eu/2897" />
    </entity2>
    <relation>fr.inrialpes.exmo.align.impl.rel.EquivRelation</relation>
    <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</measure>
  </Cell>
</map>
<map>
  <Cell rdf:about="152849">
    <entity1>
      <edoal:Class rdf:about="http://eurlex-directory-codes.europa.eu/0350" />
    </entity1>
    <entity2>
      <edoal:Class rdf:about="http://eurovoc.europa.eu/2479" />
    </entity2>
    <relation>fr.inrialpes.exmo.align.impl.rel.EquivRelation</relation>
    <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</measure>
  </Cell>
</map>

Callouts: aligned concepts (entity1, entity2), relation type (relation), score (measure)

« EDOAL » format from INRIA: http://alignapi.gforge.inria.fr/edoal.html


Using alignments

• When indexing
  • The original document annotations are translated using the alignment
    – from thesaurus A to thesaurus B
  • The index is enriched with concepts from thesaurus B
    – The index now contains annotations based on thesaurus A and thesaurus B
• One can then search the corpus using concepts from thesaurus B
• The alignment is interpreted by specific code in the indexing chain; there is no specific configuration in SolR
– except to specify a dedicated field which will be used for the result of the alignment translation
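The « specific code in the indexing chain » is not shown in the slides. A minimal sketch of what it could look like is given below, assuming the alignment has already been loaded as (entity1, entity2, score) tuples and that documents are plain dictionaries. The field names `keywords_eurovoc` and `keywords_eurlex` and both function names are hypothetical.

```python
# Alignment cells as (entity1, entity2, score), using the two EDOAL cells shown earlier.
ALIGNMENT = [
    ("http://eurlex-directory-codes.europa.eu/0350", "http://eurovoc.europa.eu/2897", 1.0),
    ("http://eurlex-directory-codes.europa.eu/0350", "http://eurovoc.europa.eu/2479", 1.0),
]


def translate_annotations(annotations, alignment, min_score=1.0):
    """Map source concepts (entity2 side) to their aligned concepts (entity1 side)."""
    translated = []
    for concept in annotations:
        for entity1, entity2, score in alignment:
            if entity2 == concept and score >= min_score and entity1 not in translated:
                translated.append(entity1)
    return translated


def enrich_document(doc, source_field, target_field, alignment):
    """Return a copy of the document with a dedicated field holding the translation."""
    enriched = dict(doc)
    enriched[target_field] = translate_annotations(doc.get(source_field, []), alignment)
    return enriched


if __name__ == "__main__":
    document = {"id": "doc1", "keywords_eurovoc": ["http://eurovoc.europa.eu/2897"]}
    print(enrich_document(document, "keywords_eurovoc", "keywords_eurlex", ALIGNMENT))
```

The original annotations stay in their own field, so the index can be searched with either thesaurus.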


Reference vocabulary alignments: result

• Keywords from the source thesaurus (eurovoc)
• Keywords from concepts translated using alignments (from eurovoc to eurlex)



Disambiguation

• Why?
– Match a user’s searched term to a controlled entity
  • « loisirs » → http://thes.world-tourism.org#LOISIRS
• Disambiguating entities when searching only makes sense if the same entities have been disambiguated when indexing
– Either the document was explicitly categorized using a controlled entity (its URI)
– Or the entity was extracted using text mining tools
• Disambiguation of an entity from a controlled vocabulary by the search engine is possible only if the controlled vocabulary has itself been indexed by the search engine


Disambiguation: principle

1. Use reference vocabulary when indexing
   http://www.z.fr/e1 doc1
   http://www.z.fr/e1 doc2
2. Indexing of reference vocabulary
   venus http://www.z.fr/e1
   cupidon http://www.z.fr/e2
3. Keyword disambiguation using a controlled entity
4. Search on controlled entity id
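The four steps can be sketched with two in-memory dictionaries standing in for the two indexes. This is only a toy model of the principle, not how the search engine stores its data; the function names are hypothetical, and the sample values are the ones used in the diagram.

```python
# Step 1: documents indexed by the URI of the controlled entity they mention.
DOCUMENT_INDEX = {"http://www.z.fr/e1": ["doc1", "doc2"]}

# Step 2: the reference vocabulary itself is indexed (label -> entity URI).
VOCABULARY_INDEX = {
    "venus": "http://www.z.fr/e1",
    "cupidon": "http://www.z.fr/e2",
}


def disambiguate(keyword, vocabulary_index):
    """Step 3: match the user's keyword to a controlled entity, if any."""
    return vocabulary_index.get(keyword.strip().lower())


def search(keyword, vocabulary_index, document_index):
    """Step 4: search documents on the controlled entity id."""
    entity = disambiguate(keyword, vocabulary_index)
    if entity is None:
        return []
    return document_index.get(entity, [])


if __name__ == "__main__":
    print(search("Venus", VOCABULARY_INDEX, DOCUMENT_INDEX))
```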


Disambiguation: result


Thank you for your attention !

[email protected]